xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.

9,651

687

9,651

323

View on GitHub

Top Related Projects

pytorch

91,080

Tensors and Dynamic neural networks in Python with strong GPU acceleration

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

DeepSpeed

39,112

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

apex

8,693

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

horovod

14,559

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

bert

39,267

TensorFlow code and pre-trained models for BERT

Quick Overview

xformers is a library developed by Facebook Research that focuses on optimizing and extending Transformer models. It provides a collection of composable Transformer building blocks, efficient attention mechanisms, and various optimizations to improve the performance and flexibility of Transformer-based architectures.

Pros

Highly modular and composable architecture for building custom Transformer models
Implements efficient attention mechanisms and optimizations for improved performance
Supports both PyTorch and JAX/Flax frameworks
Actively maintained and regularly updated by Facebook Research

Cons

Steeper learning curve compared to more straightforward Transformer implementations
Documentation can be sparse or outdated in some areas
May require more setup and configuration compared to simpler libraries
Some features are still experimental and subject to change

Code Examples

Creating a basic Transformer encoder:

import torch
from xformers.factory import xFormerEncoderConfig, xFormerEncoder

config = xFormerEncoderConfig(
    dim_model=512,
    num_layers=6,
    multi_head_config={
        "num_heads": 8,
        "dim_head": 64,
        "residual_dropout": 0.1,
    },
    feedforward_config={
        "dim_feedforward": 2048,
        "activation": "relu",
        "dropout": 0.1,
    },
)

encoder = xFormerEncoder(config)
x = torch.randn(32, 100, 512)  # (batch_size, seq_len, dim_model)
output = encoder(x)

Using an efficient attention mechanism:

from xformers.components import MultiHeadDispatch
from xformers.components.attention import ScaledDotProduct

efficient_attention = MultiHeadDispatch(
    dim_model=512,
    num_heads=8,
    attention_cls=ScaledDotProduct,
    attention_kwargs={"dropout": 0.1},
)

q = k = v = torch.randn(32, 100, 512)
output = efficient_attention(q, k, v)

Applying memory-efficient attention:

from xformers.ops import memory_efficient_attention

q = k = v = torch.randn(32, 100, 512)
output = memory_efficient_attention(q, k, v)

Getting Started

To get started with xformers, follow these steps:

Install xformers:

pip install xformers

Import and use xformers components in your PyTorch project:

import torch
from xformers.components import MultiHeadDispatch
from xformers.components.attention import ScaledDotProduct

# Create an efficient attention mechanism
attention = MultiHeadDispatch(
    dim_model=512,
    num_heads=8,
    attention_cls=ScaledDotProduct,
)

# Use the attention mechanism
x = torch.randn(32, 100, 512)
output = attention(x, x, x)

For more advanced usage and customization, refer to the xformers documentation and examples in the GitHub repository.

Competitor Comparisons

pytorch

91,080

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

Broader scope and functionality, covering a wide range of deep learning tasks
Larger community and ecosystem, with more resources and third-party libraries
More mature and stable, with regular updates and long-term support

Cons of PyTorch

Larger codebase and installation size, potentially slower for specific use cases
Steeper learning curve for beginners due to its comprehensive feature set
May have higher memory usage for certain operations compared to optimized libraries

Code Comparison

PyTorch:

import torch

x = torch.randn(3, 3)
y = torch.matmul(x, x.t())
z = torch.relu(y)

xformers:

import torch
from xformers.ops import memory_efficient_attention

q, k, v = torch.randn(3, 16, 8).chunk(3, dim=-1)
output = memory_efficient_attention(q, k, v)

xformers focuses on efficient transformer implementations, while PyTorch provides a more general-purpose deep learning framework. xformers offers memory-efficient attention operations, which can be beneficial for large-scale transformer models. PyTorch, on the other hand, provides a comprehensive set of tools for various deep learning tasks beyond transformers.

transformers

146,142

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of transformers

Extensive model support: Covers a wide range of transformer architectures and pre-trained models
Rich documentation and community support
Easy-to-use high-level APIs for various NLP tasks

Cons of transformers

Can be slower for certain operations compared to xformers
May have higher memory usage for large models
Less focus on performance optimizations for specific hardware

Code comparison

transformers:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

xformers:

import torch
from xformers.components import MultiHeadDispatch

mha = MultiHeadDispatch(
    dim_model=512,
    num_heads=8,
    attention="scaled_dot_product"
)
q, k, v = torch.rand(3, 1, 4, 512)
output = mha(q, k, v)

DeepSpeed

39,112

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

More comprehensive optimization toolkit for distributed deep learning
Supports a wider range of hardware configurations and cloud platforms
Offers advanced features like ZeRO-Offload and 3D parallelism

Cons of DeepSpeed

Steeper learning curve due to more complex configuration options
May require more setup and tuning for optimal performance
Less focused on specific transformer optimizations

Code Comparison

DeepSpeed:

import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=params
)

xformers:

from xformers.factory import xFormerEncoderConfig, xFormerEncoder
encoder = xFormerEncoder(xFormerEncoderConfig(
    dim_model=512,
    num_layers=6,
    multi_head_config={"num_heads": 8, "dim_head": 64}
))

Key Differences

DeepSpeed focuses on large-scale distributed training and inference optimization
xformers specializes in efficient transformer implementations and memory optimizations
DeepSpeed offers a more comprehensive suite of tools for end-to-end deep learning workflows
xformers provides more flexibility in customizing transformer architectures

Both libraries aim to improve the efficiency of deep learning models, but they approach the problem from different angles. DeepSpeed is better suited for large-scale distributed training, while xformers excels in optimizing transformer-specific operations.

apex

8,693

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Pros of Apex

Mature and well-established library with extensive NVIDIA GPU optimizations
Supports mixed precision training and distributed training out of the box
Offers a wider range of optimization techniques beyond just transformers

Cons of Apex

Limited to NVIDIA GPUs, reducing portability across different hardware
Requires separate installation and setup, which can be complex
Less focused on transformer-specific optimizations compared to xformers

Code Comparison

Apex (Mixed Precision Training):

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

xformers (Memory Efficient Attention):

from xformers.components import Attention

attention = Attention(dim, num_heads, attention_dropout=0.1)
output = attention(query, key, value)

Both libraries aim to improve performance and efficiency in deep learning tasks, but they focus on different aspects. Apex provides a broader set of optimization tools for NVIDIA GPUs, while xformers specializes in transformer-specific optimizations with a focus on memory efficiency and hardware flexibility.

horovod

14,559

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Pros of Horovod

Designed specifically for distributed deep learning, offering excellent scalability across multiple GPUs and nodes
Supports multiple deep learning frameworks (TensorFlow, PyTorch, MXNet) with a unified API
Integrates well with existing codebases, requiring minimal changes to enable distributed training

Cons of Horovod

Primarily focused on data parallelism, lacking built-in support for model parallelism
May have a steeper learning curve for users not familiar with distributed computing concepts
Less emphasis on memory efficiency optimizations compared to xformers

Code Comparison

Horovod (distributed training):

import horovod.torch as hvd
hvd.init()
optimizer = optim.SGD(model.parameters())
optimizer = hvd.DistributedOptimizer(optimizer)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

xformers (memory-efficient attention):

from xformers.components import Attention
attention = Attention(dim, num_heads, attention_mechanism="linear")
output = attention(query, key, value)

bert

39,267

TensorFlow code and pre-trained models for BERT

Pros of BERT

Widely adopted and well-established in the NLP community
Extensive pre-trained models available for various languages and tasks
Comprehensive documentation and numerous tutorials available

Cons of BERT

Less flexible for non-NLP tasks compared to xformers
Higher computational requirements for fine-tuning and inference
Limited support for optimizations and custom attention mechanisms

Code Comparison

BERT example:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

xformers example:

import torch
from xformers.components import MultiHeadDispatch
attention = MultiHeadDispatch(
    dim_model=512,
    num_heads=8,
    attention_dropout=0.1,
    residual_dropout=0.1
)
q, k, v = torch.rand(3, 1, 16, 512)
output = attention(q, k, v)

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot

README

xFormers - Toolbox to Accelerate Research on Transformers

xFormers is:

Customizable building blocks: Independent/customizable building blocks that can be used without boilerplate code. The components are domain-agnostic and xFormers is used by researchers in vision, NLP and more.
Research first: xFormers contains bleeding-edge components, that are not yet available in mainstream libraries like PyTorch.
Built with efficiency in mind: Because speed of iteration matters, components are as fast and memory-efficient as possible. xFormers contains its own CUDA kernels, but dispatches to other libraries when relevant.

Installing xFormers

(RECOMMENDED, linux & win) Install latest stable with pip: Requires PyTorch 2.7.0

# [linux only] cuda 11.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
# [linux & win] cuda 12.6 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu126
# [linux & win] cuda 12.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128
# [linux only] (EXPERIMENTAL) rocm 6.3 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/rocm6.3

Development binaries:

# Same requirements as for the stable version above
pip install --pre -U xformers

Install from source: If you want to use with another version of PyTorch for instance (including nightly-releases)

# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
# NOTE: pytorch must already be installed!
pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)

Benchmarks

Memory-efficient MHA Benchmarks for ViTS Setup: A100 on f16, measured total time for a forward+backward pass

Note that this is exact attention, not an approximation, just by calling xformers.ops.memory_efficient_attention

More benchmarks

xFormers provides many components, and more benchmarks are available in BENCHMARKS.md.

(Optional) Testing the installation

This command will provide information on an xFormers installation, and what kernels are built/available:

python -m xformers.info

Using xFormers

Key Features

Optimized building blocks, beyond PyTorch primitives
1. Memory-efficient exact attention - up to 10x faster
2. sparse attention
3. block-sparse attention
4. fused softmax
5. fused linear layer
6. fused layer norm
7. fused dropout(activation(x+bias))
8. fused SwiGLU

Install troubleshooting

NVCC and the current CUDA runtime match. Depending on your setup, you may be able to change the CUDA runtime with module unload cuda; module load cuda/xx.x, possibly also nvcc
the version of GCC that you're using matches the current NVCC capabilities
the TORCH_CUDA_ARCH_LIST env variable is set to the architectures that you want to support. A suggested setup (slow to build but comprehensive) is export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;7.5;8.0;8.6"
If the build from source OOMs, it's possible to reduce the parallelism of ninja with MAX_JOBS (eg MAX_JOBS=2)

License

xFormers has a BSD-style license, as found in the LICENSE file. It includes code from the triton-lang/kernels repo.

Citing xFormers

If you use xFormers in your publication, please cite it by using the following BibTeX entry.

@Misc{xFormers2022,
  author =       {Benjamin Lefaudeux and Francisco Massa and Diana Liskovich and Wenhan Xiong and Vittorio Caggiano and Sean Naren and Min Xu and Jieru Hu and Marta Tintore and Susan Zhang and Patrick Labatut and Daniel Haziza and Luca Wehrstedt and Jeremy Reizenstein and Grigory Sizov},
  title =        {xFormers: A modular and hackable Transformer modelling library},
  howpublished = {\url{https://github.com/facebookresearch/xformers}},
  year =         {2022}
}

Credits

The following repositories are used in xFormers, either in close to original form or as an inspiration:

Top Related Projects

Convert designs to code with AI

Introducing Visual Copilot: A new AI model to turn Figma designs to high quality code using your components.

Try Visual Copilot