
facebookresearch/fairscale

PyTorch extensions for high performance and large scale training.


Top Related Projects

PyTorch (82,049 stars)

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Horovod (14,147 stars)

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

DeepSpeed (34,658 stars)

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

NVIDIA Apex (8,290 stars)

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch

Quick Overview

FairScale is a PyTorch extension library for high-performance and large-scale training. It provides tools for distributed training, memory-efficient training, and model parallelism. FairScale is designed to work seamlessly with PyTorch and offers various optimization techniques to improve training efficiency for large models and datasets.

Pros

  • Seamless integration with PyTorch ecosystem
  • Supports various distributed training techniques (e.g., FullyShardedDataParallel, OffloadModel)
  • Provides memory-efficient training methods for large models
  • Actively maintained by Facebook AI Research team

Cons

  • Requires advanced knowledge of distributed training concepts
  • May have a steeper learning curve for beginners
  • Limited documentation compared to more established libraries
  • Some features may be experimental or in beta stage

Code Examples

  1. Using FullyShardedDataParallel for efficient distributed training:
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Wrap the model so parameters, gradients, and optimizer state are sharded
# across data-parallel workers (assumes an initialized process group).
model = YourModel()
model = FSDP(model)

# Construct the optimizer after wrapping so it sees the sharded parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
  2. Implementing OffloadModel for memory-efficient training:
import torch
from fairscale.experimental.nn.offload import OffloadModel

# The wrapped model should be expressible as a sequence of layers
# (e.g. nn.Sequential) so it can be sliced between CPU and GPU.
model = YourLargeModel()
offload_model = OffloadModel(
    model=model,
    device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
)

# Updates are driven by a regular PyTorch optimizer.
optimizer = torch.optim.SGD(offload_model.parameters(), lr=0.001)

for data, target in dataloader:
    optimizer.zero_grad()
    output = offload_model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
  3. Using Pipe for model parallelism:
import torch
import torch.nn as nn
from fairscale.nn import Pipe

# Pipe expects an nn.Sequential; `balance` gives the number of layers placed
# on each pipeline partition, and `chunks` splits each batch into micro-batches.
model = nn.Sequential(
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
)
model = Pipe(model, balance=[2, 1], chunks=8)

output = model(torch.randn(32, 100))

Getting Started

To get started with FairScale, follow these steps:

  1. Install FairScale:
pip install fairscale
  2. Import and use FairScale components in your PyTorch code:
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = YourModel()
model = FSDP(model)

# Continue with your regular PyTorch training loop
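
Note that FSDP relies on torch.distributed being initialized before the model is wrapped. Below is a minimal single-node sketch; the spawn helper, backend choice, and address/port are illustrative assumptions, not FairScale APIs.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def train(rank, world_size):
    # Illustrative single-node rendezvous; adjust address/port for your setup.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = FSDP(YourModel().to(rank))  # YourModel is the same placeholder as above
    # ... regular training loop as above ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, nprocs=world_size, args=(world_size,))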

For more detailed usage and advanced features, refer to the FairScale documentation and examples in the GitHub repository.

Competitor Comparisons

PyTorch (82,049 stars)

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • Larger community and ecosystem, with more resources and third-party libraries
  • More comprehensive documentation and tutorials
  • Broader range of features and use cases beyond distributed training

Cons of PyTorch

  • Steeper learning curve for distributed training
  • Less optimized for large-scale distributed scenarios out-of-the-box
  • More complex setup for distributed training across multiple nodes

Code Comparison

PyTorch distributed training setup:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(rank, world_size):
    # Each spawned process joins the same process group.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, run the loop ...

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, nprocs=world_size, args=(world_size,))

FairScale distributed training setup (inside each worker, once the process group is initialized):

import torch
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

# OSS shards optimizer state across workers; ShardedDataParallel then
# coordinates gradient communication with that sharded optimizer.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
model = ShardedDataParallel(model, optimizer)

FairScale provides a more streamlined approach for distributed training, with built-in support for advanced techniques like sharded data parallelism and optimized memory usage. PyTorch offers more flexibility but requires more manual configuration for advanced distributed scenarios.

Horovod (14,147 stars)

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Pros of Horovod

  • Broader framework support (TensorFlow, PyTorch, MXNet, Keras)
  • More mature and battle-tested in production environments
  • Better support for distributed training across multiple nodes

Cons of Horovod

  • Steeper learning curve for beginners
  • Less integrated with PyTorch ecosystem
  • Requires more manual configuration for advanced use cases

Code Comparison

Horovod:

import horovod.torch as hvd
hvd.init()
torch.cuda.set_device(hvd.local_rank())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

FairScale:

import torch
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

# Build the sharded optimizer first, then hand it to ShardedDataParallel.
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam)
model = ShardedDataParallel(model, optimizer)

Both Horovod and FairScale are powerful libraries for distributed deep learning, but they have different strengths. Horovod offers broader framework support and is more established, while FairScale provides tighter integration with PyTorch and easier setup for some use cases. The choice between them depends on your specific requirements and the frameworks you're using.

DeepSpeed (34,658 stars)

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More comprehensive optimization techniques, including ZeRO-Offload and ZeRO-Infinity
  • Better integration with Azure and other Microsoft cloud services
  • Extensive documentation and tutorials for ease of use

Cons of DeepSpeed

  • Steeper learning curve due to more complex features
  • More invasive integration: training runs through DeepSpeed's engine and configuration files rather than plain PyTorch modules, whereas FairScale components drop into an existing PyTorch loop

Code Comparison

FairScale example:

from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam)  # sharded optimizer state
model = ShardedDataParallel(model, optimizer)

DeepSpeed example:

import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=params
)

Both libraries aim to improve large-scale model training, but DeepSpeed offers more advanced features and optimizations. FairScale is more lightweight and easier to integrate into existing PyTorch projects. DeepSpeed provides better performance for very large models and distributed training scenarios, while FairScale offers simpler implementation for basic distributed training needs.

DeepSpeed's integration with Azure can be advantageous for users already in the Microsoft ecosystem. However, FairScale's simplicity and focus on PyTorch compatibility make it a solid choice for many deep learning projects.

NVIDIA Apex (8,290 stars)

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch

Pros of Apex

  • Optimized for NVIDIA GPUs, providing better performance on NVIDIA hardware
  • Includes CUDA-specific optimizations and kernels
  • More mature project with longer development history

Cons of Apex

  • Limited compatibility with non-NVIDIA hardware
  • Less focus on distributed training and large-scale models
  • Slower release cycle and updates compared to FairScale

Code Comparison

Apex:

from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

FairScale:

from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
model = ShardedDataParallel(model, sharded_optimizer=optimizer)

FairScale focuses more on distributed training and large-scale models, offering features like Sharded Data Parallel and Fully Sharded Data Parallel. Apex, on the other hand, provides mixed precision training and CUDA-specific optimizations.

FairScale is more hardware-agnostic and actively developed, with a focus on scaling to large models and distributed setups. Apex excels in NVIDIA-specific optimizations but may have limited compatibility with other hardware.
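Within FairScale itself, mixed precision is exposed as an option on FSDP rather than as a separate AMP-style API. The following is a minimal sketch under that assumption; YourModel is a placeholder and an initialized process group is assumed.

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Shard parameters and run forward/backward in reduced precision; FSDP keeps
# full-precision master shards for the optimizer update.
model = FSDP(YourModel().cuda(), mixed_precision=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)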

Both libraries aim to improve training efficiency and performance, but they approach it from different angles. The choice between them depends on your specific hardware setup, model size, and training requirements.


README


Description

FairScale is a PyTorch extension library for high performance and large scale training. This library extends basic PyTorch capabilities while adding new SOTA scaling techniques. FairScale makes available the latest distributed training techniques in the form of composable modules and easy to use APIs. These APIs are a fundamental part of a researcher's toolbox as they attempt to scale models with limited resources.

FairScale was designed with the following values in mind:

  • Usability - Users should be able to understand and use FairScale APIs with minimum cognitive overload.

  • Modularity - Users should be able to combine multiple FairScale APIs as part of their training loop seamlessly (see the sketch after this list).

  • Performance - FairScale APIs provide the best performance in terms of scaling and efficiency.
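
As a small illustration of that modularity, optimizer state sharding (OSS) composes with ShardedDataParallel inside an otherwise ordinary PyTorch loop. This is a hypothetical sketch: MyModel, dataloader, and loss_fn are placeholders, and an initialized process group is assumed.

import torch
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

model = MyModel().cuda()                        # placeholder model
optimizer = OSS(params=model.parameters(),      # shards optimizer state across ranks
                optim=torch.optim.SGD, lr=0.1)
model = ShardedDataParallel(model, optimizer)   # syncs gradients against the sharded optimizer

for data, target in dataloader:                 # placeholder dataloader / loss_fn
    optimizer.zero_grad()
    loss = loss_fn(model(data.cuda()), target.cuda())
    loss.backward()
    optimizer.step()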

Watch Introductory Video

Explain Like I’m 5: FairScale

Installation

To install FairScale, please see the following instructions. You should be able to install the package with pip or conda, or build it directly from source.

Getting Started

The full documentation contains instructions for getting started, deep dives and tutorials about the various FairScale APIs.

FSDP

FullyShardedDataParallel (FSDP) is the recommended method for scaling to large NN models. This library has been upstreamed to PyTorch. The version of FSDP here is kept for historical reference and for experimenting with new ideas in scaling-technique research. Please see the following blog post for how to use FairScale FSDP and how it works.
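Since FSDP has been upstreamed, new projects typically import it from PyTorch itself, while the FairScale version remains importable for existing code. A minimal sketch of the two import paths; no behavioral equivalence is implied beyond what the blog post describes.

# FairScale's original implementation
from fairscale.nn.data_parallel import FullyShardedDataParallel as FairscaleFSDP

# The version upstreamed into PyTorch (available from torch 1.11 onward)
from torch.distributed.fsdp import FullyShardedDataParallel as TorchFSDP

# Either wrapper is applied the same way once a process group exists:
# model = FairscaleFSDP(model)   or   model = TorchFSDP(model)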

Testing

We use CircleCI to test FairScale with the following PyTorch versions (with CUDA 11.2):

  • the latest stable release (e.g. 1.10.0)
  • the latest LTS release (e.g. 1.8.1)
  • a recent nightly release (e.g. 1.11.0.dev20211101+cu111)

Please create an issue if you are having trouble with installation.

Contributors

We welcome contributions! Please see the CONTRIBUTING instructions for how you can contribute to FairScale.

License

FairScale is licensed under the BSD-3-Clause License.

fairscale.nn.pipe is forked from torchgpipe, Copyright 2019, Kakao Brain, licensed under Apache License.

fairscale.nn.model_parallel is forked from Megatron-LM, Copyright 2020, NVIDIA CORPORATION, licensed under Apache License.

fairscale.optim.adascale is forked from AdaptDL, Copyright 2020, Petuum, Inc., licensed under Apache License.

fairscale.nn.misc.flatten_params_wrapper is forked from PyTorch-Reparam-Module, Copyright 2018, Tongzhou Wang, licensed under MIT License.

Citing FairScale

If you use FairScale in your publication, please cite it by using the following BibTeX entry.

@Misc{FairScale2021,
  author =       {{FairScale authors}},
  title =        {FairScale:  A general purpose modular PyTorch library for high performance and large scale training},
  howpublished = {\url{https://github.com/facebookresearch/fairscale}},
  year =         {2021}
}