huggingface/accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support

Top Related Projects

  • DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • Horovod: a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • fairseq: Facebook AI Research's sequence-to-sequence toolkit, written in Python.
  • AllenNLP: an open-source NLP research library built on PyTorch.
  • pytorch-lightning: pretrain, finetune, and deploy AI models on multiple GPUs and TPUs with zero code changes.

Quick Overview

Hugging Face's Accelerate is a library designed to simplify the process of training and using PyTorch models across various distributed environments and hardware setups. It provides a unified API for running machine learning code on different platforms, including CPUs, GPUs, and TPUs, without requiring significant code changes.

Pros

  • Easy integration with existing PyTorch code
  • Supports multiple distributed training paradigms (DataParallel, DistributedDataParallel, etc.)
  • Automatic mixed precision and gradient accumulation (see the sketch after this list)
  • Seamless integration with Hugging Face's ecosystem (Transformers, Datasets, etc.)
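
For example, here is a minimal sketch of how mixed precision and gradient accumulation are switched on (the Accelerator keyword arguments are real; model, optimizer, and dataloader are placeholders assumed to be defined elsewhere):

from accelerate import Accelerator

# Enable fp16 mixed precision and accumulate gradients over 4 steps
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    # accumulate() decides when to actually sync gradients and step
    with accelerator.accumulate(model):
        outputs = model(batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()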

Cons

  • Primarily focused on PyTorch, limiting its use with other frameworks
  • May introduce a small overhead in some cases
  • Learning curve for users unfamiliar with distributed training concepts
  • Some advanced features may require additional configuration

Code Examples

  1. Basic training loop with Accelerate (model, optimizer, train_dataloader, scheduler, and num_epochs are assumed to be defined as usual):
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(batch)
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        scheduler.step()
  2. Launching a script with the Accelerate CLI:
accelerate launch --multi_gpu train.py
  3. Using Accelerate with Hugging Face Transformers:
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, AutoTokenizer

accelerator = Accelerator()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = accelerator.prepare(model)

inputs = tokenizer("Hello, world!", return_tensors="pt")
inputs = inputs.to(accelerator.device)  # move inputs to the same device as the prepared model
outputs = model(**inputs)

Getting Started

To get started with Accelerate, first install it using pip:

pip install accelerate

Then, modify your PyTorch training script to use Accelerate:

from accelerate import Accelerator

accelerator = Accelerator()

# model, optimizer, and train_dataloader are your usual PyTorch objects
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()

Finally, run your script using the Accelerate CLI:

accelerate config  # Configure your environment
accelerate launch your_script.py
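
Once training runs, you will likely want to checkpoint and resume. A minimal sketch using Accelerate's state utilities (the directory name is illustrative):

# Saves the model, optimizer, RNG states, and anything else Accelerate tracks
accelerator.save_state("checkpoints/step_1000")

# ...later, resume from the same directory
accelerator.load_state("checkpoints/step_1000")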

Competitor Comparisons

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More advanced optimization techniques like ZeRO-3 and pipeline parallelism
  • Better support for very large models and distributed training
  • Offers more fine-grained control over training optimizations

Cons of DeepSpeed

  • Steeper learning curve and more complex setup
  • Less seamless integration with Hugging Face ecosystem
  • Requires more manual configuration for optimal performance

Code Comparison

Accelerate:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    outputs = model(batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
DeepSpeed:

import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)

for batch in dataloader:
    outputs = model_engine(batch)
    loss = outputs.loss
    model_engine.backward(loss)
    model_engine.step()

Both libraries aim to simplify distributed training, but Accelerate focuses on ease of use and seamless integration with PyTorch, while DeepSpeed offers more advanced features for large-scale model training at the cost of increased complexity.

Horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Pros of Horovod

  • Designed specifically for distributed deep learning, offering better performance for large-scale training
  • Supports multiple deep learning frameworks (TensorFlow, PyTorch, MXNet)
  • Provides advanced features like gradient compression and hierarchical allreduce

Cons of Horovod

  • Steeper learning curve and more complex setup compared to Accelerate
  • Requires more manual configuration for different distributed training scenarios
  • Less integrated with popular libraries like Transformers

Code Comparison

Horovod:

import horovod.torch as hvd
hvd.init()
torch.cuda.set_device(hvd.local_rank())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

Accelerate:

from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, training_dataloader = accelerator.prepare(model, optimizer, training_dataloader)

Accelerate offers a more straightforward API for distributed training, while Horovod provides more fine-grained control and optimization options for advanced users. Accelerate integrates seamlessly with the Hugging Face ecosystem, making it easier to use with popular NLP models and datasets.

fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • More comprehensive toolkit for sequence modeling tasks
  • Supports a wider range of architectures and models
  • Offers more advanced features for research and experimentation

Cons of fairseq

  • Steeper learning curve and more complex setup
  • Less focus on ease of use and quick deployment
  • May be overkill for simpler projects or beginners

Code Comparison

fairseq:

from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('/path/to/model')
tokens = model.encode('Hello world')
output = model.decode(tokens)

accelerate:

from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
with accelerator.accumulate(model):
    loss = model(batch)
    accelerator.backward(loss)

Summary

fairseq is a more comprehensive toolkit for sequence modeling, offering a wide range of features and models. It's ideal for advanced research and complex projects. accelerate, on the other hand, focuses on simplifying distributed training and deployment, making it more accessible for quick projects and beginners. The choice between them depends on the specific needs of your project and your level of expertise in the field.

AllenNLP

An open-source NLP research library, built on PyTorch.

Pros of AllenNLP

  • More comprehensive NLP-specific toolkit with pre-built models and datasets
  • Extensive documentation and tutorials for NLP tasks
  • Stronger focus on research-oriented features and experimentation

Cons of AllenNLP

  • Steeper learning curve for beginners
  • Less flexible for general deep learning tasks outside of NLP
  • Slower development cycle and updates compared to Accelerate

Code Comparison

AllenNLP:

from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer

class MyDatasetReader(DatasetReader):
    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as f:
            for line in f:
                yield self.text_to_instance(line.strip())

Accelerate:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()

pytorch-lightning

Pretrain, finetune, and deploy AI models on multiple GPUs and TPUs with zero code changes.

Pros of pytorch-lightning

  • More comprehensive framework with built-in support for distributed training, logging, and checkpointing
  • Highly modular and customizable architecture
  • Extensive ecosystem with plugins and integrations

Cons of pytorch-lightning

  • Steeper learning curve due to its more opinionated structure
  • May introduce overhead for simpler projects
  • Requires adapting existing code to fit the Lightning paradigm

Code Comparison

pytorch-lightning:

class MyModel(LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.model(batch)
        self.log('train_loss', loss)
        return loss

accelerate:

model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)
for batch in train_dataloader:
    loss = model(batch)
    accelerator.backward(loss)

Both libraries aim to simplify distributed training and device management in PyTorch. Accelerate focuses on providing a lightweight wrapper for existing PyTorch code, making it easier to scale training across devices with minimal changes. pytorch-lightning offers a more structured approach, encouraging best practices and providing a full-featured training framework.

The choice between the two depends on project requirements, existing codebase, and desired level of abstraction. Accelerate is ideal for quick scaling of existing code, while pytorch-lightning shines in larger, more complex projects that benefit from its extensive features and ecosystem.


README




Run your *raw* PyTorch training script on any kind of device

Easy to integrate

🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.

🤗 Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16 and leaves the rest of your code unchanged.

Here is an example:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

As you can see in this example, by adding five lines to any standard PyTorch training script, you can now run on any kind of single-node or distributed setup (single CPU, single GPU, multi-GPU, and TPU), with or without mixed precision (fp8, fp16, bf16).

In particular, the same code can then be run without modification on your local machine for debugging or in your training environment.

🤗 Accelerate even handles the device placement for you (which requires a few more changes to your code, but is safer in general), so you can even simplify your training loop further:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

- device = 'cpu'
+ accelerator = Accelerator()

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
-         source = source.to(device)
-         targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

Want to learn more? Check out the documentation or have a look at our examples.

Launching script

🤗 Accelerate also provides an optional CLI tool that allows you to quickly configure and test your training environment before launching the scripts. No need to remember how to use torch.distributed.run or to write a specific launcher for TPU training! On your machine(s) just run:

accelerate config

and answer the questions asked. This will generate a config file that will be used automatically to properly set the default options when doing

accelerate launch my_script.py --args_to_my_script

For instance, here is how you would run the GLUE example on the MRPC task (from the root of the repo):

accelerate launch examples/nlp_example.py

This CLI tool is optional, and you can still use python my_script.py or torchrun my_script.py at your convenience.

You can also pass the arguments you would give torchrun directly to accelerate launch if you prefer to skip accelerate config.

For example, here is how to launch on two GPUs:

accelerate launch --multi_gpu --num_processes 2 examples/nlp_example.py

To learn more, check the CLI documentation available here.

Or view the configuration zoo here

Launching multi-CPU run using MPI

🤗 Accelerate also supports launching multi-CPU runs with MPI. You can learn how to install Open MPI on this page; you can use Intel MPI or MVAPICH as well. Once you have MPI set up on your cluster, just run:

accelerate config

Answer the questions that are asked, selecting to run using multi-CPU, and answer "yes" when asked if you want accelerate to launch mpirun. Then, use accelerate launch with your script like:

accelerate launch examples/nlp_example.py

Alternatively, you can use mpirun directly, without using the CLI, like:

mpirun -np 2 python examples/nlp_example.py

Launching training using DeepSpeed

🤗 Accelerate supports training on single/multiple GPUs using DeepSpeed. To use it, you don't need to change anything in your training code; you can set everything using just accelerate config. However, if you want to tweak DeepSpeed-related args from your Python script, we provide you with the DeepSpeedPlugin.

from accelerate import Accelerator, DeepSpeedPlugin

# deepspeed needs to know your gradient accumulation steps beforehand, so don't forget to pass it
# Remember you still need to do gradient accumulation by yourself, just like you would have done without deepspeed
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(mixed_precision='fp16', deepspeed_plugin=deepspeed_plugin)

# How to save your 🤗 Transformer?
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(save_dir, save_function=accelerator.save, state_dict=accelerator.get_state_dict(model))

Note: DeepSpeed support is experimental for now. If you run into problems, please open an issue.

Launching your training from a notebook

🤗 Accelerate also provides a notebook_launcher function you can use in a notebook to launch distributed training. This is especially useful for Colab or Kaggle notebooks with a TPU backend. Just define your training loop in a training_function, then, in your last cell, add:

from accelerate import notebook_launcher

notebook_launcher(training_function)

An example can be found in this notebook.
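
notebook_launcher also accepts arguments for your function and the number of processes to spawn. For example (the argument values below are illustrative; num_processes=8 matches the 8 cores of a TPU v3-8):

from accelerate import notebook_launcher

# args is passed straight through to training_function;
# model and dataset are placeholders for whatever your function expects
notebook_launcher(training_function, args=(model, dataset), num_processes=8)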

Why should I use 🤗 Accelerate?

You should use 🤗 Accelerate when you want to easily run your training scripts in a distributed environment without having to renounce full control over your training loop. This is not a high-level framework above PyTorch, just a thin wrapper so you don't have to learn a new library. In fact, the whole API of 🤗 Accelerate is in one class, the Accelerator object.

Why shouldn't I use 🤗 Accelerate?

You shouldn't use 🤗 Accelerate if you don't want to write a training loop yourself. There are plenty of high-level libraries above PyTorch that will offer you that; 🤗 Accelerate is not one of them.

Frameworks using 🤗 Accelerate

If you like the simplicity of 🤗 Accelerate but would prefer a higher-level abstraction around its capabilities, some frameworks and libraries that are built on top of 🤗 Accelerate are listed below:

  • Amphion is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
  • Animus is a minimalistic framework to run machine learning experiments. Animus highlights common "breakpoints" in ML experiments and provides a unified interface for them within IExperiment.
  • Catalyst is a PyTorch framework for Deep Learning Research and Development. It focuses on reproducibility, rapid experimentation, and codebase reuse so you can create something new rather than write yet another train loop. Catalyst provides a Runner to connect all parts of the experiment: hardware backend, data transformations, model training, and inference logic.
  • fastai is a PyTorch framework for Deep Learning that simplifies training fast and accurate neural nets using modern best practices. fastai provides a Learner to handle the training, fine-tuning, and inference of deep learning algorithms.
  • Finetuner is a service that enables models to create higher-quality embeddings for semantic search, visual similarity search, cross-modal text<->image search, recommendation systems, clustering, duplication detection, anomaly detection, or other uses.
  • InvokeAI is a creative engine for Stable Diffusion models, offering industry-leading WebUI, terminal usage support, and serves as the foundation for many commercial products.
  • Kornia is a differentiable library that allows classical computer vision to be integrated into deep learning models. Kornia provides a Trainer with the specific purpose to train and fine-tune the supported deep learning algorithms within the library.
  • Open Assistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so.
  • pytorch-accelerated is a lightweight training library, with a streamlined feature set centered around a general-purpose Trainer, that places a huge emphasis on simplicity and transparency; enabling users to understand exactly what is going on under the hood, but without having to write and maintain the boilerplate themselves!
  • Stable Diffusion web UI is an open-source, browser-based, easy-to-use interface for Stable Diffusion, built on the Gradio library.
  • torchkeras is a simple tool for training PyTorch models in a Keras style; a dynamic and beautiful plot is provided in the notebook to monitor your loss or metrics.
  • transformers is a tool for training state-of-the-art machine learning models in PyTorch, TensorFlow, and JAX (Accelerate is the backend for the PyTorch side).

Installation

This repository is tested on Python 3.8+ and PyTorch 1.10.0+.

You should install 🤗 Accelerate in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

First, create a virtual environment with the version of Python you're going to use and activate it.

Next, you will need to install PyTorch: refer to the official installation page for the specific install command for your platform. 🤗 Accelerate can then be installed using pip as follows:

pip install accelerate
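
After installing, you can sanity-check your setup with the bundled CLI, which prints your environment details and current default configuration:

accelerate env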

Supported integrations

  • CPU only
  • multi-CPU on one node (machine)
  • multi-CPU on several nodes (machines)
  • single GPU
  • multi-GPU on one node (machine)
  • multi-GPU on several nodes (machines)
  • TPU
  • FP16/BFloat16 mixed precision
  • FP8 mixed precision with Transformer Engine or MS-AMP
  • DeepSpeed support (Experimental)
  • PyTorch Fully Sharded Data Parallel (FSDP) support (Experimental; see the sketch after this list)
  • Megatron-LM support (Experimental)
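
Like DeepSpeed, FSDP can also be configured from Python instead of accelerate config. A minimal sketch with default plugin settings (the classes and keyword argument are real; in practice you will want to tune the sharding and wrapping policies):

from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Default FSDP settings; sharding strategy and auto-wrap policy are configurable
fsdp_plugin = FullyShardedDataParallelPlugin()
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)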

Citing 🤗 Accelerate

If you use 🤗 Accelerate in your publication, please cite it by using the following BibTeX entry.

@Misc{accelerate,
  title =        {Accelerate: Training and inference at scale made simple, efficient and adaptable.},
  author =       {Sylvain Gugger and Lysandre Debut and Thomas Wolf and Philipp Schmid and Zachary Mueller and Sourab Mangrulkar and Marc Sun and Benjamin Bossan},
  howpublished = {\url{https://github.com/huggingface/accelerate}},
  year =         {2022}
}