apex
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Top Related Projects
facebookresearch/fairscale: PyTorch extensions for high performance and large scale training.
microsoft/DeepSpeed: DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
horovod/horovod: Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
huggingface/accelerate: 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
intel/intel-extension-for-pytorch: A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
Quick Overview
NVIDIA/apex is a PyTorch extension that provides tools for mixed precision and distributed training. It aims to improve performance and memory efficiency in deep learning workflows, particularly for large-scale models and datasets.
Pros
- Enables mixed precision training, which can significantly speed up computations and reduce memory usage
- Provides optimized CUDA kernels for common operations, enhancing performance on NVIDIA GPUs
- Offers easy-to-use distributed training utilities for multi-GPU and multi-node setups
- Integrates seamlessly with PyTorch, allowing for minimal code changes in existing projects
Cons
- Primarily focused on NVIDIA GPUs, limiting its usefulness for other hardware
- Requires careful tuning and understanding of mixed precision concepts for optimal results
- May introduce additional complexity to the training pipeline
- Some features may not be compatible with the latest PyTorch versions immediately upon release
Code Examples
- Initializing mixed precision training:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
- Using distributed data parallel with Apex:
from apex.parallel import DistributedDataParallel as DDP
model = DDP(model)
- Applying gradient clipping with Apex:
import torch
from apex import amp
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# Clip the unscaled master (FP32) gradients after the scale_loss context exits
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
- Using Apex's optimized layer normalization:
from apex.normalization import FusedLayerNorm
layer_norm = FusedLayerNorm(normalized_shape)
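For context, a minimal sketch of FusedLayerNorm as a drop-in for torch.nn.LayerNorm (the shapes here are arbitrary examples, and the fused kernel requires Apex built with the --cuda_ext option):
import torch
from apex.normalization import FusedLayerNorm

hidden_size = 512  # arbitrary example
layer_norm = FusedLayerNorm(hidden_size).cuda()
x = torch.randn(8, 128, hidden_size, device="cuda")  # (batch, sequence, hidden)
y = layer_norm(x)  # normalized over the last dimension, same shape as x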
Getting Started
To get started with NVIDIA/apex, follow these steps:
- Install Apex:
git clone https://github.com/NVIDIA/apex
cd apex
# pip >= 23.1 (supports multiple --config-settings with the same key)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# older pip
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
- Import and use Apex in your PyTorch code:
import torch
from apex import amp
# Define your model and optimizer (amp.initialize expects the model to already be on the GPU)
model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
# Initialize mixed precision training
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# Train your model using Apex features
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, targets)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
Competitor Comparisons
PyTorch extensions for high performance and large scale training.
Pros of fairscale
- More comprehensive distributed training support, including model parallelism and pipeline parallelism
- Broader compatibility across different hardware platforms, not limited to NVIDIA GPUs
- Active development and regular updates from Facebook AI Research team
Cons of fairscale
- May have a steeper learning curve due to more advanced features
- Potentially slower performance for some operations compared to Apex's CUDA-optimized implementations
- Less focus on mixed-precision training compared to Apex
Code Comparison
Apex (Mixed Precision Training):
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
fairscale (Sharded Data Parallel):
model = ShardedDataParallel(model, sharded_optimizer=optimizer)
output = model(input)
loss = criterion(output, target)
loss.backward()
Both libraries aim to improve training efficiency, but fairscale offers a wider range of distributed training techniques, while Apex focuses more on mixed-precision training and NVIDIA-specific optimizations.
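For context, fairscale's ShardedDataParallel expects the optimizer to be wrapped in its OSS (optimizer state sharding) class; a minimal sketch, assuming a torch.distributed process group is already initialized and model is an existing nn.Module (the SGD choice and learning rate are illustrative):
import torch
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel

# Shard optimizer state across data-parallel workers
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
# Wrap the model so each gradient is reduced to the rank that owns its optimizer shard
model = ShardedDataParallel(model, optimizer)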
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- Offers a wider range of optimization techniques, including ZeRO, pipeline parallelism, and 1-bit Adam
- Provides better memory efficiency, allowing training of larger models on limited hardware
- Supports more flexible distributed training scenarios, including multi-node setups
Cons of DeepSpeed
- Has a steeper learning curve due to its more complex API and configuration options
- May require more setup and tuning to achieve optimal performance compared to Apex
Code Comparison
Apex:
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
DeepSpeed:
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)
loss = model_engine(inputs)
model_engine.backward(loss)
model_engine.step()
Both libraries aim to optimize deep learning training, but DeepSpeed offers more advanced features and flexibility at the cost of increased complexity. Apex is simpler to use but may be limited in some scenarios. The choice between them depends on specific project requirements and hardware constraints.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Pros of Horovod
- Framework-agnostic: Works with TensorFlow, PyTorch, and MXNet
- Supports distributed training across multiple GPUs and nodes
- Easier to scale to large clusters and supercomputers
Cons of Horovod
- Requires more setup and configuration compared to Apex
- May have slightly higher overhead for single-node multi-GPU training
- Less integrated with NVIDIA-specific optimizations
Code Comparison
Apex:
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
Horovod:
hvd.init()
optimizer = hvd.DistributedOptimizer(optimizer)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
loss.backward()
Both libraries aim to improve distributed training performance, but Apex focuses on mixed precision training and NVIDIA GPU optimizations, while Horovod emphasizes scalability across different frameworks and distributed environments. Apex is more tightly integrated with PyTorch and NVIDIA hardware, offering easier setup for single-node multi-GPU scenarios. Horovod provides greater flexibility for large-scale distributed training across various frameworks and hardware configurations.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Broader ecosystem and community support
- More comprehensive documentation and tutorials
- Wider range of built-in features and functionalities
Cons of PyTorch
- Slower performance for certain operations compared to Apex
- Lacks some advanced mixed precision training features
- May require more memory for large-scale models
Code Comparison
PyTorch:
import torch
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
for data, target in dataset:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Apex:
import torch
from apex import amp
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
for data, target in dataset:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
Pros of Accelerate
- Easier to use and more beginner-friendly
- Supports a wider range of hardware and platforms
- Integrates seamlessly with Hugging Face ecosystem
Cons of Accelerate
- May not offer the same level of performance optimization as Apex
- Less fine-grained control over mixed precision training
Code Comparison
Apex:
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
Accelerate:
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)
Apex focuses on mixed precision training and optimization, while Accelerate provides a more general-purpose solution for distributed training and hardware acceleration. Apex offers more advanced features for performance tuning, but Accelerate is easier to integrate into existing projects and works across a broader range of hardware configurations.
Accelerate is designed to be more user-friendly and requires less code modification, making it a good choice for those new to distributed training or working with diverse hardware setups. Apex, on the other hand, may be preferred by users who need fine-grained control over mixed precision training and are willing to invest time in optimizing their models for NVIDIA GPUs.
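To illustrate the smaller code change Accelerate asks for, a minimal training step might look like the sketch below (model, optimizer, dataloader, and loss_fn are assumed to exist; device placement and mixed precision come from the Accelerator configuration):
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    accelerator.backward(loss)  # replaces loss.backward(); handles gradient scaling when AMP is enabled
    optimizer.step()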
A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
Pros of intel-extension-for-pytorch
- Optimized for Intel hardware, including CPUs and GPUs
- Supports a wider range of Intel-specific optimizations and features
- Integrates seamlessly with Intel's oneAPI toolkit for enhanced performance
Cons of intel-extension-for-pytorch
- Limited to Intel hardware, reducing flexibility for users with diverse hardware setups
- May have a smaller community and fewer resources compared to Apex
- Potentially slower adoption of new PyTorch features due to focus on Intel-specific optimizations
Code Comparison
apex:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
intel-extension-for-pytorch:
import intel_extension_for_pytorch as ipex
model = ipex.optimize(model)
Both extensions aim to improve PyTorch performance, but they target different hardware ecosystems. Apex focuses on NVIDIA GPUs and provides mixed precision training, while intel-extension-for-pytorch optimizes for Intel hardware. The code snippets demonstrate the simplicity of integrating these extensions into existing PyTorch projects, with slight differences in syntax and functionality.
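As a slightly fuller sketch of the IPEX path for training (model and optimizer are assumed to already exist; the optimizer and dtype arguments follow IPEX's documented optimize API, but treat the exact signature as an assumption to verify against your installed version):
import torch
import intel_extension_for_pytorch as ipex

model = model.to(memory_format=torch.channels_last)  # commonly recommended for CNNs on Intel hardware
# Passing the optimizer returns an optimized (model, optimizer) pair; bfloat16 enables mixed precision on supported CPUs
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)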
README
Introduction
This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intent of Apex is to make up-to-date utilities available to users as quickly as possible.
Full API Documentation: https://nvidia.github.io/apex
GTC 2019 and Pytorch DevCon 2019 Slides
Contents
1. Amp: Automatic Mixed Precision
Deprecated. Use PyTorch AMP
apex.amp is a tool to enable mixed precision training by changing only 3 lines of your script. Users can easily experiment with different pure and mixed precision training modes by supplying different flags to amp.initialize.
Webinar introducing Amp (The flag cast_batchnorm has been renamed to keep_batchnorm_fp32).
Comprehensive Imagenet example
Moving to the new Amp API (for users of the deprecated "Amp" and "FP16_Optimizer" APIs)
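As an illustration of supplying different flags to amp.initialize, a minimal sketch (the opt_level values and the keep_batchnorm_fp32 flag are standard Amp options; model and optimizer are assumed to already exist on the GPU):
from apex import amp

# Default recommendation: O1 automatically casts the inputs of listed functions to FP16/FP32
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Pure-FP16 baseline with FP32 batchnorm (keep_batchnorm_fp32 is the renamed cast_batchnorm flag):
# model, optimizer = amp.initialize(model, optimizer, opt_level="O3", keep_batchnorm_fp32=True)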
2. Distributed Training
apex.parallel.DistributedDataParallel is deprecated. Use torch.nn.parallel.DistributedDataParallel.
apex.parallel.DistributedDataParallel is a module wrapper, similar to torch.nn.parallel.DistributedDataParallel. It enables convenient multiprocess distributed training, optimized for NVIDIA's NCCL communication library.
The Imagenet example shows use of apex.parallel.DistributedDataParallel along with apex.amp.
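For reference, a minimal sketch of the wrapper in a one-process-per-GPU launch (local_rank is assumed to be provided by the launcher, e.g. torchrun or torch.distributed.launch, and model is an existing nn.Module):
import torch
from apex.parallel import DistributedDataParallel as DDP

torch.cuda.set_device(local_rank)  # local_rank supplied by the launcher (assumed)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = model.cuda()
model = DDP(model)  # gradients are allreduced over NCCL, analogous to torch.nn.parallel.DistributedDataParallel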
Synchronized Batch Normalization
Deprecated. Use torch.nn.SyncBatchNorm
apex.parallel.SyncBatchNorm extends torch.nn.modules.batchnorm._BatchNorm to support synchronized BN. It allreduces stats across processes during multiprocess (DistributedDataParallel) training. Synchronous BN has been used in cases where only a small local minibatch can fit on each GPU. Allreduced stats increase the effective batch size for the BN layer to the global batch size across all processes (which, technically, is the correct formulation). Synchronous BN has been observed to improve converged accuracy in some of our research models.
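A minimal sketch of converting an existing model's BatchNorm layers, using either the Apex utility or the upstream torch.nn.SyncBatchNorm replacement recommended above (model is assumed to be an existing module and the process group already initialized):
import torch
import apex

# Apex utility: replaces torch.nn.BatchNorm* layers with apex.parallel.SyncBatchNorm
model = apex.parallel.convert_syncbn_model(model)

# Upstream equivalent (recommended):
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)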
Checkpointing
To properly save and load your amp training, we introduce amp.state_dict(), which contains all loss_scalers and their corresponding unskipped steps, as well as amp.load_state_dict() to restore these attributes.
In order to get bitwise accuracy, we recommend the following workflow:
# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
# Train your model
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...
# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...
# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])
# Continue training
...
Note that we recommend restoring the model using the same opt_level. Also note that we recommend calling the load_state_dict methods after amp.initialize.
Installation
Each apex.contrib module requires one or more install options other than --cpp_ext and --cuda_ext; the available options are listed in the table under "Custom C++/CUDA Extensions and Install Options" below, and an example command follows that table. Note that contrib modules do not necessarily support stable PyTorch releases.
Containers
NVIDIA PyTorch Containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers come with all the custom extensions available at the moment.
See the NGC documentation for details such as:
- how to pull a container
- how to run a pulled container
- release notes
From Source
To install Apex from source, we recommend using the nightly Pytorch obtainable from https://github.com/pytorch/pytorch.
The latest stable release obtainable from https://pytorch.org should also work.
We recommend installing Ninja to make compilation faster.
Linux
For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions via
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
APEX also supports a Python-only build via
pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
A Python-only build omits:
- Fused kernels required to use apex.optimizers.FusedAdam.
- Fused kernels required to use apex.normalization.FusedLayerNorm and apex.normalization.FusedRMSNorm.
- Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
- Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp.
DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
[Experimental] Windows
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" .
may work if you were able to build Pytorch from source on your system. A Python-only build via
pip install -v --no-cache-dir .
is more likely to work.
If you installed Pytorch in a Conda environment, make sure to install Apex in that same environment.
Custom C++/CUDA Extensions and Install Options
If a requirement of a module is not met, then it will not be built.
Module Name | Install Option | Misc |
---|---|---|
apex_C | --cpp_ext | |
amp_C | --cuda_ext | |
syncbn | --cuda_ext | |
fused_layer_norm_cuda | --cuda_ext | apex.normalization |
mlp_cuda | --cuda_ext | |
scaled_upper_triang_masked_softmax_cuda | --cuda_ext | |
generic_scaled_masked_softmax_cuda | --cuda_ext | |
scaled_masked_softmax_cuda | --cuda_ext | |
fused_weight_gradient_mlp_cuda | --cuda_ext | Requires CUDA>=11 |
permutation_search_cuda | --permutation_search | apex.contrib.sparsity |
bnp | --bnp | apex.contrib.groupbn |
xentropy | --xentropy | apex.contrib.xentropy |
focal_loss_cuda | --focal_loss | apex.contrib.focal_loss |
fused_index_mul_2d | --index_mul_2d | apex.contrib.index_mul_2d |
fused_adam_cuda | --deprecated_fused_adam | apex.contrib.optimizers |
fused_lamb_cuda | --deprecated_fused_lamb | apex.contrib.optimizers |
fast_layer_norm | --fast_layer_norm | apex.contrib.layer_norm, different from fused_layer_norm |
fmhalib | --fmha | apex.contrib.fmha |
fast_multihead_attn | --fast_multihead_attn | apex.contrib.multihead_attn |
transducer_joint_cuda | --transducer | apex.contrib.transducer |
transducer_loss_cuda | --transducer | apex.contrib.transducer |
cudnn_gbn_lib | --cudnn_gbn | Requires cuDNN>=8.5, apex.contrib.cudnn_gbn |
peer_memory_cuda | --peer_memory | apex.contrib.peer_memory |
nccl_p2p_cuda | --nccl_p2p | Requires NCCL >= 2.10, apex.contrib.nccl_p2p |
fast_bottleneck | --fast_bottleneck | Requires peer_memory_cuda and nccl_p2p_cuda, apex.contrib.bottleneck |
fused_conv_bias_relu | --fused_conv_bias_relu | Requires cuDNN>=8.4, apex.contrib.conv_bias_relu |
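For example, building the core C++/CUDA extensions plus one contrib extension from the table above (here --xentropy, using the pip >= 23.1 syntax from the From Source section) would look something like:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" \
  --config-settings "--build-option=--cuda_ext" \
  --config-settings "--build-option=--xentropy" ./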