Top Related Projects
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Quick Overview
FairScale is a PyTorch extension library for high-performance and large-scale training. It provides tools for distributed training, memory-efficient training, and model parallelism. FairScale is designed to work seamlessly with PyTorch and offers various optimization techniques to improve training efficiency for large models and datasets.
Pros
- Seamless integration with PyTorch ecosystem
- Supports various distributed training techniques (e.g., FullyShardedDataParallel, OffloadModel)
- Provides memory-efficient training methods for large models
- Actively maintained by Facebook AI Research team
Cons
- Requires advanced knowledge of distributed training concepts
- May have a steeper learning curve for beginners
- Limited documentation compared to more established libraries
- Some features may be experimental or in beta stage
Code Examples
- Using FullyShardedDataParallel for efficient distributed training:
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes a torch.distributed process group is already initialized and that
# YourModel, criterion, and dataloader are defined elsewhere.
model = YourModel()
model = FSDP(model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
- Implementing OffloadModel for memory-efficient training:
import torch
from fairscale.experimental.nn.offload import OffloadModel

# YourLargeModel (an nn.Sequential), criterion, and dataloader are placeholders.
model = YourLargeModel()
offload_model = OffloadModel(
    model=model,
    device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    num_slices=3,
)
optimizer = torch.optim.SGD(offload_model.parameters(), lr=0.001)

for data, target in dataloader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    output = offload_model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
- Using Pipe for model parallelism:
import torch
import torch.nn as nn
from fairscale.nn import Pipe

# Pipe expects an nn.Sequential plus a balance that assigns layers to partitions;
# here the first two layers form one partition and the last layer another.
model = nn.Sequential(
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
)
model = Pipe(model, balance=[2, 1], chunks=8)

# Move the input to the first partition's device if the partitions are on GPUs.
output = model(torch.randn(32, 100))
Getting Started
To get started with FairScale, follow these steps:
- Install FairScale:
pip install fairscale
- Import and use FairScale components in your PyTorch code:
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
model = YourModel()
model = FSDP(model)
# Continue with your regular PyTorch training loop
For more detailed usage and advanced features, refer to the FairScale documentation and examples in the GitHub repository.
Competitor Comparisons
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Larger community and ecosystem, with more resources and third-party libraries
- More comprehensive documentation and tutorials
- Broader range of features and use cases beyond distributed training
Cons of PyTorch
- Steeper learning curve for distributed training
- Less optimized for large-scale distributed scenarios out-of-the-box
- More complex setup for distributed training across multiple nodes
Code Comparison
PyTorch distributed training setup:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... wrap the model in DistributedDataParallel and run the training loop

def main():
    world_size = torch.cuda.device_count()
    mp.spawn(train, nprocs=world_size, args=(world_size,))
FairScale distributed training setup:
import torch
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

# Build the sharded (OSS) optimizer first, then wrap the model with it.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
model = ShardedDataParallel(model, optimizer)
FairScale provides a more streamlined approach for distributed training, with built-in support for advanced techniques like sharded data parallelism and optimized memory usage. PyTorch offers more flexibility but requires more manual configuration for advanced distributed scenarios.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Pros of Horovod
- Broader framework support (TensorFlow, PyTorch, MXNet, Keras)
- More mature and battle-tested in production environments
- Better support for distributed training across multiple nodes
Cons of Horovod
- Steeper learning curve for beginners
- Less integrated with PyTorch ecosystem
- Requires more manual configuration for advanced use cases
Code Comparison
Horovod:
import torch
import horovod.torch as hvd
hvd.init()
torch.cuda.set_device(hvd.local_rank())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
FairScale:
import torch
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam)
model = ShardedDataParallel(model, optimizer)
Both Horovod and FairScale are powerful libraries for distributed deep learning, but they have different strengths. Horovod offers broader framework support and is more established, while FairScale provides tighter integration with PyTorch and easier setup for some use cases. The choice between them depends on your specific requirements and the frameworks you're using.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- More comprehensive optimization techniques, including ZeRO-Offload and ZeRO-Infinity
- Better integration with Azure and other Microsoft cloud services
- Extensive documentation and tutorials for ease of use
Cons of DeepSpeed
- Steeper learning curve due to more complex features
- Less flexible for non-PyTorch frameworks compared to FairScale
Code Comparison
FairScale example:
from fairscale.nn.data_parallel import ShardedDataParallel

# `optimizer` is expected to be a fairscale.optim.OSS instance.
model = ShardedDataParallel(model, optimizer)
DeepSpeed example:
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=params
)
Both libraries aim to improve large-scale model training, but DeepSpeed offers more advanced features and optimizations. FairScale is more lightweight and easier to integrate into existing PyTorch projects. DeepSpeed provides better performance for very large models and distributed training scenarios, while FairScale offers simpler implementation for basic distributed training needs.
DeepSpeed's integration with Azure can be advantageous for users already in the Microsoft ecosystem. However, FairScale's simplicity and focus on PyTorch compatibility make it a solid choice for many deep learning projects.
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Pros of Apex
- Optimized for NVIDIA GPUs, providing better performance on NVIDIA hardware
- Includes CUDA-specific optimizations and kernels
- More mature project with longer development history
Cons of Apex
- Limited compatibility with non-NVIDIA hardware
- Less focus on distributed training and large-scale models
- Slower release cycle and updates compared to FairScale
Code Comparison
Apex:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
FairScale:
from fairscale.nn.data_parallel import ShardedDataParallel

# `optimizer` is expected to be a fairscale.optim.OSS instance.
model = ShardedDataParallel(model, sharded_optimizer=optimizer)
FairScale focuses more on distributed training and large-scale models, offering features like Sharded Data Parallel and Fully Sharded Data Parallel. Apex, on the other hand, provides mixed precision training and CUDA-specific optimizations.
FairScale is more hardware-agnostic and actively developed, with a focus on scaling to large models and distributed setups. Apex excels in NVIDIA-specific optimizations but may have limited compatibility with other hardware.
Both libraries aim to improve training efficiency and performance, but they approach it from different angles. The choice between them depends on your specific hardware setup, model size, and training requirements.
README
Description
FairScale is a PyTorch extension library for high performance and large scale training. This library extends basic PyTorch capabilities while adding new SOTA scaling techniques. FairScale makes available the latest distributed training techniques in the form of composable modules and easy to use APIs. These APIs are a fundamental part of a researcher's toolbox as they attempt to scale models with limited resources.
FairScale was designed with the following values in mind:
- Usability - Users should be able to understand and use FairScale APIs with minimum cognitive overload.
- Modularity - Users should be able to combine multiple FairScale APIs as part of their training loop seamlessly.
- Performance - FairScale APIs provide the best performance in terms of scaling and efficiency.
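As a rough illustration of this composability, the sketch below combines two FairScale APIs, OSS (optimizer state sharding) and ShardedDataParallel, in one ordinary training loop. It is a minimal sketch, assuming the script is launched with a standard distributed launcher such as torchrun; the toy model, loss, and random data are placeholders:
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

dist.init_process_group(backend="nccl")  # rank/world size come from the launcher's env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(100, 10).cuda()
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)  # shards optimizer state across ranks
model = ShardedDataParallel(model, optimizer)  # reduces each gradient to the rank that owns its shard

criterion = torch.nn.MSELoss()
for _ in range(10):
    data, target = torch.randn(32, 100).cuda(), torch.randn(32, 10).cuda()
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()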
Watch Introductory Video
Installation
To install FairScale, please see the following instructions. You should be able to install a package with pip or conda, or build directly from source.
Getting Started
The full documentation contains instructions for getting started, deep dives and tutorials about the various FairScale APIs.
FSDP
FullyShardedDataParallel (FSDP) is the recommended method for scaling to large NN models. This library has been upstreamed to PyTorch. The version of FSDP here remains for historical reference and for experimenting with new ideas in scaling techniques. Please see the following blog post for how to use FairScale FSDP and how it works.
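For reference, wrapping a model with FairScale's FSDP looks like the sketch below. It is a minimal sketch that assumes one GPU per process and a process group initialized by a launcher such as torchrun; in current PyTorch, the upstreamed torch.distributed.fsdp.FullyShardedDataParallel is the recommended equivalent:
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")  # rank/world size come from the launcher's env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(torch.nn.Linear(100, 10).cuda())  # parameters and gradients are sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# The rest of the training loop is unchanged: forward, loss, backward, optimizer.step().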
Testing
We use CircleCI to test FairScale with the following PyTorch versions (with CUDA 11.2):
- the latest stable release (e.g. 1.10.0)
- the latest LTS release (e.g. 1.8.1)
- a recent nightly release (e.g. 1.11.0.dev20211101+cu111)
Please create an issue if you are having trouble with installation.
Contributors
We welcome contributions! Please see the CONTRIBUTING instructions for how you can contribute to FairScale.
License
FairScale is licensed under the BSD-3-Clause License.
fairscale.nn.pipe is forked from torchgpipe, Copyright 2019, Kakao Brain, licensed under Apache License.
fairscale.nn.model_parallel is forked from Megatron-LM, Copyright 2020, NVIDIA CORPORATION, licensed under Apache License.
fairscale.optim.adascale is forked from AdaptDL, Copyright 2020, Petuum, Inc., licensed under Apache License.
fairscale.nn.misc.flatten_params_wrapper is forked from PyTorch-Reparam-Module, Copyright 2018, Tongzhou Wang, licensed under MIT License.
Citing FairScale
If you use FairScale in your publication, please cite it by using the following BibTeX entry.
@Misc{FairScale2021,
  author = {{FairScale authors}},
  title = {FairScale: A general purpose modular PyTorch library for high performance and large scale training},
  howpublished = {\url{https://github.com/facebookresearch/fairscale}},
  year = {2021}
}