DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Top Related Projects
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
An open-source NLP research library, built on PyTorch.
Ongoing research training transformer models at scale
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Quick Overview
DeepSpeed is an open-source deep learning optimization library developed by Microsoft. It's designed to make distributed training easy, efficient, and effective, enabling training of large models with billions of parameters on limited hardware resources. DeepSpeed offers various optimization techniques and tools to accelerate AI model training and inference.
Pros
- Enables training of extremely large models (100+ billion parameters) on limited hardware
- Provides significant speedup and memory reduction in model training
- Offers a wide range of optimization techniques, including ZeRO, 3D parallelism, and pipeline parallelism
- Integrates seamlessly with popular deep learning frameworks like PyTorch
Cons
- Steep learning curve for beginners due to its advanced features
- Requires careful configuration and tuning to achieve optimal performance
- Some features may not be compatible with all model architectures
- Documentation can be complex and overwhelming for new users
Code Examples
- Basic DeepSpeed configuration:
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters()
)
- Using ZeRO Stage 3 for memory optimization:
config = {
    "train_batch_size": 32,  # DeepSpeed requires a batch size (or micro-batch size) in the config
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        }
    }
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=config,
    model_parameters=model.parameters()
)
- Pipeline parallelism with DeepSpeed:
import torch
import deepspeed
from deepspeed.pipe import PipelineModule

class ExamplePipelineModule(PipelineModule):
    def __init__(self, num_stages):
        super().__init__(layers=[
            torch.nn.Linear(10, 10) for _ in range(num_stages)
        ], num_stages=num_stages)

model = ExamplePipelineModule(num_stages=4)
engine, _, _, _ = deepspeed.initialize(args=args, model=model, model_parameters=model.parameters())
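Once initialized, a pipeline engine is usually driven with its train_batch method rather than a manual forward/backward loop. A minimal sketch, assuming num_steps and train_iter (an iterator of batches for the first stage) are defined elsewhere:
for step in range(num_steps):
    loss = engine.train_batch(data_iter=train_iter)  # one pipelined forward/backward/optimizer step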
Getting Started
To get started with DeepSpeed:
- Install DeepSpeed:
pip install deepspeed
- Create a DeepSpeed configuration file (e.g., ds_config.json):
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "betas": [0.8, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 1000
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  }
}
- Modify your training script to use DeepSpeed:
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

for step, batch in enumerate(data_loader):
    loss = model_engine(batch)  # assumes the model's forward pass returns the loss
    model_engine.backward(loss)
    model_engine.step()
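DeepSpeed scripts are usually launched with the deepspeed launcher, which spawns one process per GPU. A typical single-node invocation, assuming the script above is saved as train.py (a placeholder name):
deepspeed --num_gpus=2 train.py
If the config path is not passed to deepspeed.initialize directly, it can instead be supplied on the command line via --deepspeed --deepspeed_config ds_config.json, provided the script registers DeepSpeed's arguments with deepspeed.add_config_arguments(parser).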
Competitor Comparisons
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Pros of Horovod
- Simpler API and easier integration with existing TensorFlow, PyTorch, and MXNet code
- Better support for multi-GPU and multi-node training across different frameworks
- More mature and battle-tested in production environments
Cons of Horovod
- Less optimized for large language models and extreme-scale training
- Fewer advanced features for memory optimization and pipeline parallelism
- Limited support for mixed precision training compared to DeepSpeed
Code Comparison
Horovod:
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
DeepSpeed:
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=params)
Both libraries aim to simplify distributed deep learning training, but they have different focuses. Horovod provides a more straightforward approach for distributed training across various frameworks, while DeepSpeed offers more advanced optimizations for large-scale models and extreme-scale training scenarios. The choice between them depends on the specific requirements of your project, such as model size, training scale, and desired optimizations.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More focused on sequence-to-sequence tasks and natural language processing
- Extensive collection of pre-trained models and benchmarks
- Easier to use for researchers in NLP and machine translation
Cons of fairseq
- Less optimized for large-scale distributed training
- More limited in terms of general deep learning applications
- Steeper learning curve for users not familiar with NLP tasks
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel
en2de = TransformerModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt16_en_de_bpe32k'
)
DeepSpeed:
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=params
)
Summary
While fairseq excels in NLP tasks and offers a rich set of pre-trained models, DeepSpeed provides more general-purpose optimization for large-scale deep learning. fairseq is ideal for researchers in machine translation and NLP, whereas DeepSpeed is better suited for those seeking to scale up their deep learning models across various domains.
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- Focused on natural language processing tasks with pre-built models and datasets
- Extensive documentation and tutorials for easier onboarding
- Modular architecture allowing for easy customization of components
Cons of AllenNLP
- Limited to NLP tasks, not as versatile for general deep learning
- May have slower training speeds for large-scale models
- Less emphasis on distributed training and optimization techniques
Code Comparison
AllenNLP:
from allennlp.predictors import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz")
result = predictor.predict(sentence="Did Uriah honestly think he could beat the game in under three hours?")
DeepSpeed:
import deepspeed
import torch
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=params)
outputs = model_engine(inputs)
loss = criterion(outputs, labels)
model_engine.backward(loss)
model_engine.step()
The code snippets demonstrate the different focus areas of each library. AllenNLP provides high-level APIs for NLP tasks, while DeepSpeed offers low-level optimization and distributed training capabilities for general deep learning models.
Ongoing research training transformer models at scale
Pros of Megatron-LM
- Specialized for training large language models (LLMs)
- Optimized for NVIDIA GPUs and hardware
- Includes pre-built model architectures for common LLM tasks
Cons of Megatron-LM
- Less flexible for non-LLM tasks
- Limited to NVIDIA hardware ecosystems
- Steeper learning curve for researchers new to LLMs
Code Comparison
Megatron-LM:
model = get_language_model(args)
model = wrap_with_ddp(model, args.device_id, args.use_distributed)
optimizer = get_megatron_optimizer(model, args)
DeepSpeed:
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters()
)
Megatron-LM focuses on providing pre-built model architectures and optimizations for LLMs, while DeepSpeed offers a more general-purpose distributed training framework. Megatron-LM's code is more specific to language model training, whereas DeepSpeed's initialization is more flexible and can be applied to various model types.
DeepSpeed provides a wider range of optimization techniques and is more adaptable to different hardware setups, making it suitable for a broader range of deep learning tasks. However, Megatron-LM's specialization in LLMs can offer performance advantages for specific language model training scenarios on NVIDIA hardware.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Extensive library of pre-trained models for various NLP tasks
- User-friendly API with high-level abstractions for easy model usage
- Active community and frequent updates with state-of-the-art models
Cons of Transformers
- Less focus on distributed training and optimization techniques
- May require additional libraries for advanced performance tuning
- Can be memory-intensive for large models without optimization
Code Comparison
Transformers:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
DeepSpeed:
import deepspeed
import torch
model = MyModel()
model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
DeepSpeed focuses on optimizing training performance, while Transformers provides a higher-level interface for working with pre-trained models. DeepSpeed offers more advanced distributed training capabilities, whereas Transformers excels in ease of use and model variety. Transformers is ideal for quick prototyping and fine-tuning, while DeepSpeed is better suited for large-scale training and optimization of custom models.
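The two also compose: Hugging Face's Trainer has built-in DeepSpeed support, so an existing Transformers fine-tuning script can hand ZeRO and optimizer settings to DeepSpeed through a config file. A rough sketch, assuming a DeepSpeed config file named ds_config.json and a prepared train_dataset:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
training_args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",  # delegates optimizer/ZeRO settings to DeepSpeed
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()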
README
Latest News
DeepSpeed empowers ChatGPT-like model training with a single click, offering 15x speedup over SOTA RLHF systems with unprecedented cost reduction at all scales; learn how.
- [2024/08] DeepSpeed on Windows [日本語]
- [2024/08] DeepNVMe: Improving DL Applications through I/O Optimizations [日本語]
- [2024/07] DeepSpeed Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training [中文] [日本語]
- [2024/03] DeepSpeed-FP6: The power of FP6-Centric Serving for Large Language Models [English] [中文]
- [2024/01] DeepSpeed-FastGen: Introducing Mixtral, Phi-2, and Falcon support with major performance and feature enhancements.
- [2023/11] Llama 2 Inference on 4th Gen Intel® Xeon® Scalable Processor with DeepSpeed [Intel version]
- [2023/11] DeepSpeed ZeRO-Offload++: 6x Higher Training Throughput via Collaborative CPU/GPU Twin-Flow
- [2023/11] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference [English] [中文] [日本語]
- [2023/10] DeepSpeed-VisualChat: Improve Your Chat Experience with Multi-Round Multi-Image Inputs [English] [中文] [日本語]
- [2023/09] Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies [DeepSpeed4Science website] [Tutorials] [White paper] [Blog] [中文] [日本語]
More news
- [2023/08] DeepSpeed ZeRO-Inference: 20x faster inference through weight quantization and KV cache offloading
- [2023/08] DeepSpeed-Chat: Llama/Llama-2 system support, efficiency boost, and training stability improvements
- [2023/08] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [中文] [日本語]
- [2023/06] ZeRO++: A leap in speed for LLM and chat model training with 4X less communication [English] [中文] [日本語]
Extreme Speed and Scale for DL Training and Inference
DeepSpeed enables the world's most powerful language models, such as MT-530B and BLOOM. It is an easy-to-use deep learning optimization software suite that powers unprecedented scale and speed for both training and inference. With DeepSpeed you can:
- Train/Inference dense or sparse models with billions or trillions of parameters
- Achieve excellent system throughput and efficiently scale to thousands of GPUs
- Train/Inference on resource constrained GPU systems
- Achieve unprecedented low latency and high throughput for inference
- Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs
DeepSpeed's four innovation pillars
DeepSpeed-Training
DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the training pillar. Learn more: DeepSpeed-Training
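In practice, many of these training features are enabled through the DeepSpeed JSON config rather than code changes. As a rough sketch, ZeRO-Infinity-style offloading extends the ZeRO Stage 3 config shown earlier by pointing the offload targets at NVMe storage; the nvme_path below is a placeholder:
{
  "train_batch_size": 32,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" },
    "offload_param": { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}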
DeepSpeed-Inference
DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert and ZeRO-parallelism, and combines them with high performance custom inference kernels, communication optimizations and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, throughput and cost reduction. This systematic composition of system technologies for inference falls under the inference pillar. Learn more: DeepSpeed-Inference
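In code, DeepSpeed-Inference is typically entered through deepspeed.init_inference, which wraps a trained model with tensor parallelism and optimized kernels. A minimal sketch, assuming an already-loaded PyTorch model; keyword names such as mp_size and replace_with_kernel_inject follow published examples and may vary across DeepSpeed versions:
import torch
import deepspeed

# `model` is an already-loaded PyTorch model; the values below are illustrative.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-parallel degree across 2 GPUs
    dtype=torch.half,                 # run inference in fp16
    replace_with_kernel_inject=True,  # use DeepSpeed's fused inference kernels
)
outputs = ds_engine.module(input_ids)  # `input_ids` is a hypothetical input tensor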
DeepSpeed-Compression
To further increase the inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, SoTA innovations on compression like ZeroQuant and XTC are included under the compression pillar. Learn more: DeepSpeed-Compression
DeepSpeed4Science
In line with Microsoft's mission to solve humanity's most pressing challenges, the DeepSpeed team at Microsoft is responding to this opportunity by launching a new initiative called DeepSpeed4Science, aiming to build unique capabilities through AI system technology innovations to help domain experts to unlock today's biggest science mysteries. Learn more: DeepSpeed4Science website and tutorials
DeepSpeed Software Suite
DeepSpeed Library
The DeepSpeed library (this repository) implements and packages the innovations and technologies of the DeepSpeed Training, Inference and Compression pillars into a single easy-to-use, open-sourced repository. It allows for easy composition of a multitude of features within a single training, inference or compression pipeline. The DeepSpeed Library is heavily adopted by the DL community and has been used to enable some of the most powerful models (see DeepSpeed Adoption).
Model Implementations for Inference (MII)
Model Implementations for Inference (MII) is an open-sourced repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out of the box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code while achieving significant latency reduction compared to their vanilla open-sourced versions.
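As a rough illustration of the intended workflow, recent MII releases expose a pipeline-style API in which a model is named, loaded with DeepSpeed-Inference optimizations, and queried in a few lines; the model name and generation arguments below are illustrative:
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")   # downloads and optimizes the model
responses = pipe(["DeepSpeed is"], max_new_tokens=64)
print(responses)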
DeepSpeed on Azure
DeepSpeed users are diverse and have access to different environments. We recommend trying DeepSpeed on Azure, as it is the simplest and easiest method. The recommended way to try DeepSpeed on Azure is through AzureML recipes. The job submission and data preparation scripts have been made available here. For more details on how to use DeepSpeed on Azure, please follow the Azure tutorial.
DeepSpeed Adoption
DeepSpeed is an important part of Microsoft's new AI at Scale initiative to enable next-generation AI capabilities at scale, where you can find more information here.
DeepSpeed has been used to train many different large-scale models, below is a list of several examples that we are aware of (if you'd like to include your model please submit a PR):
- Megatron-Turing NLG (530B)
- Jurassic-1 (178B)
- BLOOM (176B)
- GLM (130B)
- xTrimoPGLM (100B)
- YaLM (100B)
- GPT-NeoX (20B)
- AlexaTM (20B)
- Turing NLG (17B)
- METRO-LM (5.4B)
DeepSpeed has been integrated with several different popular open-source DL frameworks such as:
- Transformers with DeepSpeed
- Accelerate with DeepSpeed
- Lightning with DeepSpeed
- MosaicML with DeepSpeed
- Determined with DeepSpeed
- MMEngine with DeepSpeed
Build Pipeline Status
- NVIDIA
- AMD
- CPU
- Intel Gaudi
- Intel XPU
- PyTorch Nightly
- Integrations
- Misc
Installation
The quickest way to get started with DeepSpeed is via pip; this installs the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops are built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.
Requirements
- PyTorch must be installed before installing DeepSpeed.
- For full feature support we recommend a version of PyTorch that is >= 1.9 and ideally the latest PyTorch stable release.
- A CUDA or ROCm compiler such as nvcc or hipcc is required to compile C++/CUDA/HIP extensions.
- The specific GPUs we develop and test against are listed below. This doesn't mean your GPU will not work if it isn't in this list; it's just that DeepSpeed is most well tested on the following:
- NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
- AMD: MI100 and MI200
Contributed HW support
- DeepSpeed now supports various HW accelerators.
Contributor | Hardware | Accelerator Name | Contributor validated | Upstream validated |
---|---|---|---|---|
Huawei | Huawei Ascend NPU | npu | Yes | No |
Intel | Intel(R) Gaudi(R) 2 AI accelerator | hpu | Yes | Yes |
Intel | Intel(R) Xeon(R) Processors | cpu | Yes | Yes |
Intel | Intel(R) Data Center GPU Max series | xpu | Yes | Yes |
PyPI
We regularly push releases to PyPI and encourage users to install from there in most cases.
pip install deepspeed
After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.
ds_report
If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling) or install pre-compiled ops via PyPI please see our advanced installation instructions.
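For example, pre-building all ops at install time (or a single op) is controlled through environment variables; the commands below assume your CUDA toolchain matches the one your installed PyTorch was built with:
DS_BUILD_OPS=1 pip install deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed   # pre-build only the fused Adam op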
Windows
DeepSpeed is partially supported on Windows. You can build a wheel with the following steps; currently only inference mode is supported.
- Install PyTorch, such as PyTorch 1.8 + CUDA 11.1
- Install Visual C++ build tools, such as the VS2019 C++ x64/x86 build tools
- Launch a cmd console with Administrator privileges to create the required symlink folders
- Run python setup.py bdist_wheel to build the wheel in the dist folder
Features
Please check out the DeepSpeed-Training, DeepSpeed-Inference and DeepSpeed-Compression pages for the full set of features offered along each of these three pillars.
Further Reading
All DeepSpeed documentation, tutorials, and blogs can be found on our website: deepspeed.ai
Article | Description |
---|---|
Getting Started | First steps with DeepSpeed |
DeepSpeed JSON Configuration | Configuring DeepSpeed |
API Documentation | Generated DeepSpeed API documentation |
Tutorials | Tutorials |
Blogs | Blogs |
Contributing
DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.
Thanks so much to all of our amazing contributors!
Contributor License Agreement
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
Code of Conduct
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Publications
- Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
- Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
- Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.
- Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840 and USENIX ATC 2021. [paper] [slides] [blog]
- Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888 and ICML 2021.
- Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857 and SC 2021. [paper] [slides] [blog]
- Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. arXiv:2104.06069 and HiPC 2022.
- Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. arXiv:2108.06084 and NeurIPS 2022.
- Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. arXiv:2202.06009.
- Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv:2201.05596 and ICML 2022. [pdf] [slides] [blog]
- Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model arXiv:2201.11990.
- Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. arXiv:2206.01859 and NeurIPS 2022.
- Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861 and NeurIPS 2022 [slides] [blog]
- Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032 and SC 2022. [paper] [slides] [blog]
- Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. arXiv:2211.11586.
- Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. arXiv:2212.03597 ENLSP2023 Workshop at NeurIPS2023
- Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. (2023) Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. arXiv:2301.12017 and ICML2023.
- Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. (2023) DySR: Adaptive Super-Resolution via Algorithm and System Co-design. ICLR:2023.
- Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He. (2023) Scaling Vision-Language Models with Sparse Mixture of Experts. arXiv:2303.07226 and Finding at EMNLP2023.
- Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda. (2023) MCR-DL: Mix-and-Match Communication Runtime for Deep Learning arXiv:2303.08374 and will appear at IPDPS 2023.
- Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele. (2023) A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training arXiv:2303.06318 and will appear at ICS 2023.
- Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He. (2023) ZeRO++: Extremely Efficient Collective Communication for Giant Model Training arXiv:2306.10209 and ML for Sys Workshop at NeurIPS2023 [blog]
- Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He. (2023) ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation arXiv:2303.08302 and ENLSP2023 Workshop at NeurIPS2023 [slides]
- Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He. (2023) Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important? arXiv:2305.09847
- Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He. (2023) DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales arXiv:2308.01320.
- Xiaoxia Wu, Zhewei Yao, Yuxiong He. (2023) ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats arXiv:2307.09782 and ENLSP2023 Workshop at NeurIPS2023 [slides]
- Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He. (2023) DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention arXiv:2309.14327
- Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, et al. (2023) DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies arXiv:2310.04610 [blog]
- Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He. (2023) ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers arXiv:2310.17723
- Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao (2023) ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks arXiv:2312.08583
- Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. (2024) FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design arXiv:2401.14112
- Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. (2024) System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang. (2024) Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training arXiv:2406.18820
Videos
- DeepSpeed KDD 2020 Tutorial
- Overview
- ZeRO + large model training
- 17B T-NLG demo
- Fastest BERT training + RScan tuning
- DeepSpeed hands on deep dive: part 1, part 2, part 3
- FAQ
- Microsoft Research Webinar
- Registration is free and all videos are available on-demand.
- ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed.
- DeepSpeed on AzureML
- Large Model Training and Inference with DeepSpeed // Samyam Rajbhandari // LLMs in Prod Conference [slides]
- Community Tutorials