
NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Top Related Projects

  • fairscale: PyTorch extensions for high performance and large scale training.
  • DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • Horovod: a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration.
  • Accelerate: 🚀 a simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support.
  • intel-extension-for-pytorch: a Python package for extending the official PyTorch that can easily obtain performance on Intel platforms.

Quick Overview

NVIDIA/apex is a PyTorch extension that provides tools for mixed precision and distributed training. It aims to improve performance and memory efficiency in deep learning workflows, particularly for large-scale models and datasets.

Pros

  • Enables mixed precision training, which can significantly speed up computations and reduce memory usage
  • Provides optimized CUDA kernels for common operations, enhancing performance on NVIDIA GPUs
  • Offers easy-to-use distributed training utilities for multi-GPU and multi-node setups
  • Integrates seamlessly with PyTorch, allowing for minimal code changes in existing projects

Cons

  • Primarily focused on NVIDIA GPUs, limiting its usefulness for other hardware
  • Requires careful tuning and understanding of mixed precision concepts for optimal results
  • May introduce additional complexity to the training pipeline
  • Some features may not be compatible with the latest PyTorch versions immediately upon release

Code Examples

  1. Initializing mixed precision training:

from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

  2. Using distributed data parallel with Apex:

from apex.parallel import DistributedDataParallel as DDP

model = DDP(model)

  3. Applying gradient clipping to the master parameters:

import torch
from apex import amp

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)

  4. Using Apex's optimized layer normalization (a fuller drop-in usage sketch follows this list):

from apex.normalization import FusedLayerNorm

layer_norm = FusedLayerNorm(normalized_shape)
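
FusedLayerNorm is intended as a drop-in replacement for torch.nn.LayerNorm backed by a fused CUDA kernel (it requires Apex to be built with --cuda_ext). A minimal, illustrative sketch; the tensor shapes and hidden size below are placeholders:

import torch
from apex.normalization import FusedLayerNorm

hidden_size = 768  # illustrative feature dimension
layer_norm = FusedLayerNorm(hidden_size).cuda()

x = torch.randn(16, 128, hidden_size, device="cuda")  # (batch, sequence, features)
out = layer_norm(x)  # same semantics as torch.nn.LayerNorm(hidden_size)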

Getting Started

To get started with NVIDIA/apex, follow these steps:

  1. Install Apex (for pip >= 23.1, see the --config-settings variant of this command under Installation below):

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

  2. Import and use Apex in your PyTorch code:

import torch
from apex import amp

# Define your model, optimizer, and loss function
model = YourModel()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()

# Initialize mixed precision training
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Train your model using Apex features
for epoch in range(num_epochs):
    for inputs, target in dataloader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, target)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()

Competitor Comparisons

fairscale: PyTorch extensions for high performance and large scale training.

Pros of fairscale

  • More comprehensive distributed training support, including model parallelism and pipeline parallelism
  • Broader compatibility across different hardware platforms, not limited to NVIDIA GPUs
  • Active development and regular updates from Facebook AI Research team

Cons of fairscale

  • May have a steeper learning curve due to more advanced features
  • Potentially slower performance for some operations compared to Apex's CUDA-optimized implementations
  • Less focus on mixed-precision training compared to Apex

Code Comparison

Apex (Mixed Precision Training):

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

fairscale (Sharded Data Parallel):

from fairscale.nn.data_parallel import ShardedDataParallel

model = ShardedDataParallel(model, sharded_optimizer=optimizer)
output = model(input)
loss = criterion(output, target)
loss.backward()

Both libraries aim to improve training efficiency, but fairscale offers a wider range of distributed training techniques, while Apex focuses more on mixed-precision training and NVIDIA-specific optimizations.

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • Offers a wider range of optimization techniques, including ZeRO, pipeline parallelism, and 1-bit Adam
  • Provides better memory efficiency, allowing training of larger models on limited hardware
  • Supports more flexible distributed training scenarios, including multi-node setups

Cons of DeepSpeed

  • Has a steeper learning curve due to its more complex API and configuration options
  • May require more setup and tuning to achieve optimal performance compared to Apex

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

DeepSpeed:

import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)
loss = model_engine(inputs)
model_engine.backward(loss)
model_engine.step()

Both libraries aim to optimize deep learning training, but DeepSpeed offers more advanced features and flexibility at the cost of increased complexity. Apex is simpler to use but may be limited in some scenarios. The choice between them depends on specific project requirements and hardware constraints.

Horovod: a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Pros of Horovod

  • Framework-agnostic: Works with TensorFlow, PyTorch, and MXNet
  • Supports distributed training across multiple GPUs and nodes
  • Easier to scale to large clusters and supercomputers

Cons of Horovod

  • Requires more setup and configuration compared to Apex
  • May have slightly higher overhead for single-node multi-GPU training
  • Less integrated with NVIDIA-specific optimizations

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Horovod:

import horovod.torch as hvd

hvd.init()
optimizer = hvd.DistributedOptimizer(optimizer)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
loss.backward()

Both libraries aim to improve distributed training performance, but Apex focuses on mixed precision training and NVIDIA GPU optimizations, while Horovod emphasizes scalability across different frameworks and distributed environments. Apex is more tightly integrated with PyTorch and NVIDIA hardware, offering easier setup for single-node multi-GPU scenarios. Horovod provides greater flexibility for large-scale distributed training across various frameworks and hardware configurations.

PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration.

Pros of PyTorch

  • Broader ecosystem and community support
  • More comprehensive documentation and tutorials
  • Wider range of built-in features and functionalities

Cons of PyTorch

  • Slower performance for certain operations compared to Apex
  • Lacks some advanced mixed precision training features
  • May require more memory for large-scale models

Code Comparison

PyTorch:

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for data, target in dataset:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Apex:

import torch
from apex import amp

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for data, target in dataset:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

Accelerate: 🚀 a simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support.

Pros of Accelerate

  • Easier to use and more beginner-friendly
  • Supports a wider range of hardware and platforms
  • Integrates seamlessly with Hugging Face ecosystem

Cons of Accelerate

  • May not offer the same level of performance optimization as Apex
  • Less fine-grained control over mixed precision training

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Accelerate:

from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)

Apex focuses on mixed precision training and optimization, while Accelerate provides a more general-purpose solution for distributed training and hardware acceleration. Apex offers more advanced features for performance tuning, but Accelerate is easier to integrate into existing projects and works across a broader range of hardware configurations.

Accelerate is designed to be more user-friendly and requires less code modification, making it a good choice for those new to distributed training or working with diverse hardware setups. Apex, on the other hand, may be preferred by users who need fine-grained control over mixed precision training and are willing to invest time in optimizing their models for NVIDIA GPUs.
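
To make the comparison concrete, here is a rough sketch of a full Accelerate training step once the objects have been passed through accelerator.prepare; the model, dataloader, and loss-function names are placeholders, not part of the Accelerate API:

from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")  # or "bf16"
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)

for inputs, targets in training_dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    accelerator.backward(loss)  # replaces loss.backward() / amp.scale_loss(...)
    optimizer.step()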

intel-extension-for-pytorch: a Python package for extending the official PyTorch that can easily obtain performance on Intel platforms.

Pros of intel-extension-for-pytorch

  • Optimized for Intel hardware, including CPUs and GPUs
  • Supports a wider range of Intel-specific optimizations and features
  • Integrates seamlessly with Intel's oneAPI toolkit for enhanced performance

Cons of intel-extension-for-pytorch

  • Limited to Intel hardware, reducing flexibility for users with diverse hardware setups
  • May have a smaller community and fewer resources compared to Apex
  • Potentially slower adoption of new PyTorch features due to focus on Intel-specific optimizations

Code Comparison

Apex:

from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

intel-extension-for-pytorch:

import intel_extension_for_pytorch as ipex
model = ipex.optimize(model)

Both extensions aim to improve PyTorch performance, but they target different hardware ecosystems. Apex focuses on NVIDIA GPUs and provides mixed precision training, while intel-extension-for-pytorch optimizes for Intel hardware. The code snippets demonstrate the simplicity of integrating these extensions into existing PyTorch projects, with slight differences in syntax and functionality.

README

Introduction

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intent of Apex is to make up-to-date utilities available to users as quickly as possible.

Full API Documentation: https://nvidia.github.io/apex

GTC 2019 and Pytorch DevCon 2019 Slides

Contents

1. Amp: Automatic Mixed Precision

Deprecated. Use PyTorch AMP

apex.amp is a tool to enable mixed precision training by changing only 3 lines of your script. Users can easily experiment with different pure and mixed precision training modes by supplying different flags to amp.initialize.
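
For illustration, the "different flags" are primarily the opt_level argument of amp.initialize; a minimal sketch of the three-line change under this deprecated API (model, optimizer, and loss are assumed to already exist in your script):

from apex import amp                                                 # line 1

# "O0" = pure FP32, "O1" = mixed precision via patching (recommended),
# "O2" = almost-FP16 with FP32 master weights, "O3" = pure FP16
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # line 2

with amp.scale_loss(loss, optimizer) as scaled_loss:                 # line 3: wraps the backward pass
    scaled_loss.backward()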

Webinar introducing Amp (The flag cast_batchnorm has been renamed to keep_batchnorm_fp32).

API Documentation

Comprehensive Imagenet example

DCGAN example coming soon...

Moving to the new Amp API (for users of the deprecated "Amp" and "FP16_Optimizer" APIs)

2. Distributed Training

apex.parallel.DistributedDataParallel is deprecated. Use torch.nn.parallel.DistributedDataParallel

apex.parallel.DistributedDataParallel is a module wrapper, similar to torch.nn.parallel.DistributedDataParallel. It enables convenient multiprocess distributed training, optimized for NVIDIA's NCCL communication library.
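
As a sketch of what the wrapper change looks like in practice (the process-group initialization and local_rank handling are standard torch.distributed boilerplate supplied by your launcher, not Apex-specific):

import torch
from apex.parallel import DistributedDataParallel as ApexDDP

# One process per GPU, launched e.g. via torchrun or torch.distributed.launch
torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)  # local_rank is provided by the launcher

model = model.cuda()
model = ApexDDP(model)
# Preferred, non-deprecated equivalent:
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])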

API Documentation

Python Source

Example/Walkthrough

The Imagenet example shows use of apex.parallel.DistributedDataParallel along with apex.amp.

Synchronized Batch Normalization

Deprecated. Use torch.nn.SyncBatchNorm

apex.parallel.SyncBatchNorm extends torch.nn.modules.batchnorm._BatchNorm to support synchronized BN. It allreduces stats across processes during multiprocess (DistributedDataParallel) training. Synchronous BN has been used in cases where only a small local minibatch can fit on each GPU. Allreduced stats increase the effective batch size for the BN layer to the global batch size across all processes (which, technically, is the correct formulation). Synchronous BN has been observed to improve converged accuracy in some of our research models.
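
A typical migration converts existing BatchNorm layers in place. A minimal sketch using the Apex conversion utility alongside the native PyTorch replacement recommended above (model is assumed to already exist):

import apex

# Apex utility: replaces torch.nn.BatchNorm*d modules with apex.parallel.SyncBatchNorm
model = apex.parallel.convert_syncbn_model(model)

# Deprecated in favor of the native equivalent:
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)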

Checkpointing

To properly save and load your amp training, we introduce the amp.state_dict(), which contains all loss_scalers and their corresponding unskipped steps, as well as amp.load_state_dict() to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow:

# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

# Train your model
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...

# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')

model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])

# Continue training
...

Note that we recommend restoring the model using the same opt_level. Also note that we recommend calling the load_state_dict methods after amp.initialize.

Installation

Each apex.contrib module requires one or more install options other than --cpp_ext and --cuda_ext. Note that contrib modules do not necessarily support stable PyTorch releases.

Containers

NVIDIA PyTorch Containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers come with all the custom extensions available at the moment.

See the NGC documentation for details such as:

  • how to pull a container
  • how to run a pulled container
  • release notes

From Source

To install Apex from source, we recommend using the nightly Pytorch obtainable from https://github.com/pytorch/pytorch.

The latest stable release obtainable from https://pytorch.org should also work.

We recommend installing Ninja to make compilation faster.

Linux

For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions via

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

APEX also supports a Python-only build via

pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./

A Python-only build omits:

  • Fused kernels required to use apex.optimizers.FusedAdam (see the usage sketch after this list).
  • Fused kernels required to use apex.normalization.FusedLayerNorm and apex.normalization.FusedRMSNorm.
  • Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
  • Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp. DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
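
For reference, a minimal sketch of what the first bullet above refers to, assuming Apex was built with --cpp_ext and --cuda_ext (layer size and learning rate are illustrative):

import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()
# FusedAdam requires the fused CUDA kernels; it is not available in a Python-only build
optimizer = FusedAdam(model.parameters(), lr=1e-3)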

[Experimental] Windows

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" . may work if you were able to build Pytorch from source on your system. A Python-only build via pip install -v --no-cache-dir . is more likely to work.
If you installed Pytorch in a Conda environment, make sure to install Apex in that same environment.

Custom C++/CUDA Extensions and Install Options

If a requirement of a module is not met, then it will not be built.

| Module Name | Install Option | Misc |
|-------------|----------------|------|
| apex_C | --cpp_ext | |
| amp_C | --cuda_ext | |
| syncbn | --cuda_ext | |
| fused_layer_norm_cuda | --cuda_ext | apex.normalization |
| mlp_cuda | --cuda_ext | |
| scaled_upper_triang_masked_softmax_cuda | --cuda_ext | |
| generic_scaled_masked_softmax_cuda | --cuda_ext | |
| scaled_masked_softmax_cuda | --cuda_ext | |
| fused_weight_gradient_mlp_cuda | --cuda_ext | Requires CUDA >= 11 |
| permutation_search_cuda | --permutation_search | apex.contrib.sparsity |
| bnp | --bnp | apex.contrib.groupbn |
| xentropy | --xentropy | apex.contrib.xentropy |
| focal_loss_cuda | --focal_loss | apex.contrib.focal_loss |
| fused_index_mul_2d | --index_mul_2d | apex.contrib.index_mul_2d |
| fused_adam_cuda | --deprecated_fused_adam | apex.contrib.optimizers |
| fused_lamb_cuda | --deprecated_fused_lamb | apex.contrib.optimizers |
| fast_layer_norm | --fast_layer_norm | apex.contrib.layer_norm, different from fused_layer_norm |
| fmhalib | --fmha | apex.contrib.fmha |
| fast_multihead_attn | --fast_multihead_attn | apex.contrib.multihead_attn |
| transducer_joint_cuda | --transducer | apex.contrib.transducer |
| transducer_loss_cuda | --transducer | apex.contrib.transducer |
| cudnn_gbn_lib | --cudnn_gbn | Requires cuDNN >= 8.5, apex.contrib.cudnn_gbn |
| peer_memory_cuda | --peer_memory | apex.contrib.peer_memory |
| nccl_p2p_cuda | --nccl_p2p | Requires NCCL >= 2.10, apex.contrib.nccl_p2p |
| fast_bottleneck | --fast_bottleneck | Requires peer_memory_cuda and nccl_p2p_cuda, apex.contrib.bottleneck |
| fused_conv_bias_relu | --fused_conv_bias_relu | Requires cuDNN >= 8.4, apex.contrib.conv_bias_relu |