DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Top Related Projects
- huggingface/transformers - 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
- facebookresearch/fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
- NVIDIA/Megatron-LM - Ongoing research training transformer models at scale.
- allenai/allennlp - An open-source NLP research library, built on PyTorch.
- EleutherAI/gpt-neox - An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries.
Quick Overview
DeepSpeed-MII (Model Implementations for Inference) is a library that provides optimized implementations of popular AI models for inference. It aims to simplify the deployment of large language models and other AI models while offering high performance and efficiency.
Pros
- Optimized for inference, providing high-performance implementations of popular models
- Supports various model architectures and sizes, including large language models
- Integrates seamlessly with the DeepSpeed ecosystem for enhanced performance
- Offers easy-to-use APIs for quick deployment and integration
Cons
- Limited to specific model architectures and may not support all custom models
- Requires familiarity with the DeepSpeed ecosystem for optimal usage
- Documentation may be less comprehensive compared to more established libraries
- Potential learning curve for users new to DeepSpeed or optimized inference
Code Examples
- Loading and using a pre-trained model:
import mii
# mii.pipeline accepts a Hugging Face model name or local path; Mistral-7B is illustrative
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["Once upon a time,"], max_new_tokens=50)
print(response[0].generated_text)
- Configuring inference parameters:
import mii
client = mii.serve(
    "mistralai/Mistral-7B-v0.1",
    tensor_parallel=2,  # split the model across 2 GPUs
    max_length=1024,    # default maximum prompt + response token length
)
response = client.generate("Once upon a time,", max_new_tokens=50)
- Running text classification through the MII Legacy API (the current MII pipeline targets text generation; encoder models such as BERT, RoBERTa, and DeBERTa are served via the legacy deployment API):
import mii
# Legacy-style deployment; depending on the MII version these APIs may live under mii.legacy.
# The deployment name is illustrative.
mii.deploy(task="text-classification",
           model="microsoft/deberta-v3-base",
           deployment_name="deberta-classification")
generator = mii.mii_query_handle("deberta-classification")
result = generator.query({"query": "This movie was fantastic!"})
print(result)
Getting Started
To get started with DeepSpeed-MII, follow these steps:
- Install the library:
pip install deepspeed-mii
- Import and use a pre-trained model:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["Hello, world!"], max_new_tokens=20)
print(response)
For more advanced usage and configuration options, refer to the official documentation and examples in the GitHub repository.
Competitor Comparisons
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Pros of Transformers
- Extensive model support: Offers a wide range of pre-trained models and architectures
- Active community: Large user base and frequent updates
- Comprehensive documentation: Detailed guides and examples for various use cases
Cons of Transformers
- Performance: May not be as optimized for large-scale inference as DeepSpeed-MII
- Complexity: Can be overwhelming for beginners due to its extensive features
Code Comparison
Transformers:
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
result = generator("DeepSpeed is", max_new_tokens=20)[0]
print(result["generated_text"])
DeepSpeed-MII:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
result = pipe(["DeepSpeed is"], max_new_tokens=20)
print(result[0].generated_text)
Both repositories provide similar high-level APIs for inference tasks. However, DeepSpeed-MII focuses on optimizing large-scale inference performance, while Transformers offers a broader range of models and features for various NLP tasks. DeepSpeed-MII may be more suitable for production deployments requiring high throughput, while Transformers is versatile for research and development across multiple NLP domains.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More comprehensive and feature-rich, supporting a wide range of NLP tasks and models
- Longer development history and larger community, resulting in more extensive documentation and examples
- Offers a broader set of pre-trained models and datasets
Cons of fairseq
- Steeper learning curve due to its extensive feature set
- May be overkill for simpler projects or specific use cases
- Less focus on optimization and efficiency compared to DeepSpeed-MII
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('/path/to/model', 'checkpoint.pt')
tokens = model.encode('Hello world!')
output = model.decode(tokens)
DeepSpeed-MII:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
output = pipe(["Hello world!"], max_new_tokens=50)
Summary
fairseq is a more comprehensive toolkit for NLP tasks, offering a wide range of models and features. It's well-suited for research and complex projects but may be overwhelming for simpler use cases. DeepSpeed-MII, on the other hand, focuses on efficient inference and ease of use, making it more suitable for production environments and specific optimization needs.
Ongoing research training transformer models at scale
Pros of Megatron-LM
- Optimized for NVIDIA GPUs, offering excellent performance on supported hardware
- Supports a wider range of model architectures, including GPT, BERT, and T5
- More mature project with extensive documentation and community support
Cons of Megatron-LM
- Limited to NVIDIA hardware, reducing flexibility for users with different setups
- Steeper learning curve due to its more complex architecture and configuration options
Code Comparison
Megatron-LM (model initialization):
model = get_model(
model_provider=model_provider,
wrap_with_ddp=True,
virtual_pipeline_model_parallel_size=args.virtual_pipeline_model_parallel_size,
cpu_offload=args.cpu_offload)
DeepSpeed-MII (model initialization):
import mii
client = mii.serve(
    "mistralai/Mistral-7B-v0.1",
    deployment_name="mistral-deployment")
DeepSpeed-MII offers a more straightforward API for model initialization and deployment, while Megatron-LM provides more fine-grained control over model configuration and parallelism strategies. DeepSpeed-MII is designed for ease of use and quick deployment, whereas Megatron-LM focuses on performance optimization and scalability for large language models on NVIDIA hardware.
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- Comprehensive library for NLP research with a wide range of pre-built models and tools
- Extensive documentation and tutorials, making it accessible for researchers and practitioners
- Strong community support and regular updates
Cons of AllenNLP
- Steeper learning curve compared to DeepSpeed-MII's more focused approach
- May have more overhead for simple tasks or when fine-tuning is the primary goal
Code Comparison
AllenNLP example:
from allennlp.predictors import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz")
result = predictor.predict(sentence="Did Uriah honestly think he could beat the game in under three hours?")
DeepSpeed-MII example:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
output = pipe(["Once upon a time"], max_new_tokens=50)
Summary
AllenNLP offers a comprehensive NLP toolkit with extensive features and community support, ideal for research and complex NLP tasks. DeepSpeed-MII focuses on efficient inference and fine-tuning, providing a more streamlined approach for specific use cases. The choice between them depends on the project's requirements, complexity, and the user's familiarity with NLP concepts.
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Pros of gpt-neox
- Focused on training large language models, particularly suited for GPT-style architectures
- Offers more flexibility in model architecture and training configurations
- Provides comprehensive documentation and examples for training and fine-tuning
Cons of gpt-neox
- Requires more setup and configuration compared to DeepSpeed-MII's streamlined approach
- May have a steeper learning curve for users new to large-scale language model training
- Less emphasis on inference optimization compared to DeepSpeed-MII
Code Comparison
gpt-neox:
from megatron.neox_arguments import NeoXArgs
from megatron.global_vars import set_global_variables, get_tokenizer
from megatron.training import pretrain
args = NeoXArgs.from_ymls("configs/your_config.yml")
set_global_variables(args)
DeepSpeed-MII:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
output = pipe(["Once upon a time"], max_new_tokens=50)
print(output)
gpt-neox focuses on training and offers more control over the process, while DeepSpeed-MII emphasizes ease of use for inference tasks. gpt-neox requires more setup but provides greater flexibility, whereas DeepSpeed-MII offers a simpler API for quick deployment of pre-trained models.
Pros of t5x
- Focused on T5 model architecture, offering specialized tools and optimizations
- Extensive documentation and examples for T5-specific tasks
- Integrated with JAX and Flax for efficient training on TPUs
Cons of t5x
- Limited to T5 models, less versatile for other architectures
- Steeper learning curve for those unfamiliar with JAX/Flax ecosystem
- Less emphasis on deployment and inference optimization
Code Comparison
t5x example:
import jax
from t5x import models
from t5x import utils
model = models.EncoderDecoderModel(...)
trainer = utils.Trainer(model=model, ...)
trainer.train(...)
DeepSpeed-MII example:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
output = pipe(["Hello, how are you?"], max_new_tokens=50)
print(output)
Key Differences
- t5x is tailored for T5 models and research, while DeepSpeed-MII offers a more general-purpose solution
- DeepSpeed-MII focuses on easy deployment and inference, whereas t5x emphasizes training and experimentation
- t5x leverages JAX/Flax ecosystem, while DeepSpeed-MII builds on PyTorch and Microsoft's DeepSpeed library
README
Latest News
- [2024/01] DeepSpeed-FastGen: Introducing Mixtral, Phi-2, and Falcon support with major performance and feature enhancements.
- [2023/11] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- [2022/11] Stable Diffusion Image Generation under 1 second w. DeepSpeed MII
- [2022/10] Announcing DeepSpeed Model Implementations for Inference (MII)
DeepSpeed Model Implementations for Inference (MII)
Introducing MII, an open-source Python library designed by DeepSpeed to democratize powerful model inference with a focus on high-throughput, low latency, and cost-effectiveness.
- MII features include blocked KV-caching, continuous batching, Dynamic SplitFuse, tensor parallelism, and high-performance CUDA kernels to support fast high throughput text-generation for LLMs such as Llama-2-70B, Mixtral (MoE) 8x7B, and Phi-2. The latest updates in v0.2 add new model families, performance optimizations, and feature enhancements. MII now delivers up to 2.5 times higher effective throughput compared to leading systems such as vLLM. For detailed performance results please see our latest DeepSpeed-FastGen blog and DeepSpeed-FastGen release blog.
- We first announced MII in 2022, which covers all prior releases up to v0.0.9. In addition to language models, we also support accelerating text2image models like Stable Diffusion. For more details on our previous releases please see our legacy APIs.
Key Technologies
MII for High-Throughput Text Generation
MII provides accelerated text-generation inference through the use of four key technologies:
- Blocked KV Caching
- Continuous Batching
- Dynamic SplitFuse
- High Performance CUDA Kernels
For a deeper dive into understanding these features please refer to our blog which also includes a detailed performance analysis.
MII Legacy
In the past, MII introduced several key performance optimizations for low-latency serving scenarios:
- DeepFusion for Transformers
- Multi-GPU Inference with Tensor-Slicing
- ZeRO-Inference for Resource Constrained Systems
- Compiler Optimizations
How does MII work?
Figure 1: MII architecture, showing how MII automatically optimizes OSS models using DS-Inference before deploying them. DeepSpeed-FastGen optimizations in the figure have been published in our blog post.
Under-the-hood MII is powered by DeepSpeed-Inference. Based on the model architecture, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of system optimizations to minimize latency and maximize throughput.
Supported Models
MII currently supports over 37,000 models across ten popular model architectures. We plan to add additional models in the near term; if there are specific model architectures you would like supported, please file an issue and let us know. All current models leverage Hugging Face in our backend to provide both the model weights and the model's corresponding tokenizer. For our current release we support the following model architectures:
model family | size range | ~model count |
---|---|---|
Falcon | 7B - 180B | 600 |
Llama | 7B - 65B | 57,000 |
Llama-2 | 7B - 70B | 1,200 |
Llama-3 | 8B - 405B | 1,600 |
Mistral | 7B | 23,000 |
Mixtral (MoE) | 8x7B | 2,900 |
OPT | 0.1B - 66B | 2,200 |
Phi-2 | 2.7B | 1,500 |
Qwen | 7B - 72B | 500 |
Qwen2 | 0.5B - 72B | 3,700 |
MII Legacy Model Support
MII Legacy APIs support over 50,000 different models, including BERT, RoBERTa, and Stable Diffusion, as well as text-generation models like Bloom and GPT-J. For a full list please see our legacy supported models table.
Getting Started with MII
DeepSpeed-MII allows users to create non-persistent and persistent deployments for supported models in just a few lines of code.
Installation
The fastest way to get started is with our PyPI release of DeepSpeed-MII, which means you can be up and running within minutes via:
pip install deepspeed-mii
For ease of use and a significant reduction in the lengthy compile times that many projects in this space require, we distribute a pre-compiled Python wheel covering the majority of our custom kernels through a new library called DeepSpeed-Kernels. We have found this library to be very portable across environments with NVIDIA GPUs with compute capability 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+. In most cases you shouldn't even need to know this library exists, as it is a dependency of DeepSpeed-MII and will be installed with it. However, if for whatever reason you need to compile our kernels manually, please see our advanced installation docs.
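Once installed, you can quickly verify the installation from Python. This is a minimal sanity check (it uses the standard library's importlib.metadata, not an MII-specific API):
import mii  # confirms the package imports cleanly
from importlib.metadata import version
print(version("deepspeed-mii"))  # prints the installed DeepSpeed-MII version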
Non-Persistent Pipeline
A non-persistent pipeline is a great way to try DeepSpeed-MII. Non-persistent pipelines exist only for the duration of the Python script you are running. The full example for running a non-persistent pipeline deployment is only 4 lines. Give it a try!
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
The returned response is a list of Response objects. We can access several details about the generation (e.g., response[0].prompt_length):
- generated_text: str - Text generated by the model.
- prompt_length: int - Number of tokens in the original prompt.
- generated_length: int - Number of tokens generated.
- finish_reason: str - Reason for stopping generation. stop indicates the EOS token was generated and length indicates the generation reached max_new_tokens or max_length.
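As a quick illustration, these fields can be read directly off the responses returned by the earlier pipeline call (this assumes the pipe and response objects from the example above):
for r in response:
    # Each element is a Response object with the attributes documented above
    print(f"prompt tokens: {r.prompt_length}, generated tokens: {r.generated_length}, finish reason: {r.finish_reason}")
    print(r.generated_text)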
If you want to free device memory and destroy the pipeline, use the destroy method:
pipe.destroy()
Tensor parallelism
Taking advantage of multi-GPU systems for greater performance is easy with MII. When run with the deepspeed
launcher, tensor parallelism is automatically controlled by the --num_gpus
flag:
# Run on a single GPU
deepspeed --num_gpus 1 mii-example.py
# Run on multiple GPUs
deepspeed --num_gpus 2 mii-example.py
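For reference, mii-example.py in the commands above can be as simple as the non-persistent pipeline shown earlier; a minimal sketch (the file name and prompts are illustrative):
import mii
# Tensor parallelism is picked up from the deepspeed launcher's --num_gpus flag
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)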
Pipeline Options
While only the model name or path is required to stand up a non-persistent pipeline deployment, we offer customization options to our users:
mii.pipeline() options:
- model_name_or_path: str - Name or local path to a HuggingFace model.
- max_length: int - Sets the default maximum token length for the prompt + response.
- all_rank_output: bool - When enabled, all ranks return the generated text. By default, only rank 0 will return text.
Users can also control the generation characteristics for individual prompts (i.e., when calling pipe()) with the following options (an example follows the list):
- max_length: int - Sets the per-prompt maximum token length for prompt + response.
- min_new_tokens: int - Sets the minimum number of tokens generated in the response. max_length will take precedence over this setting.
- max_new_tokens: int - Sets the maximum number of tokens generated in the response.
- ignore_eos: bool (defaults to False) - When True, prevents generation from ending when the EOS token is encountered.
- top_p: float (defaults to 0.9) - When set below 1.0, filter tokens and keep only the most probable, where token probabilities sum to ≥ top_p.
- top_k: int (defaults to None) - When None, top-k filtering is disabled. When set, the number of highest-probability tokens to keep.
- temperature: float (defaults to None) - When None, temperature is disabled. When set, modulates token probabilities.
- do_sample: bool (defaults to True) - When True, sample output logits. When False, use greedy sampling.
- return_full_text: bool (defaults to False) - When True, prepends the input prompt to the returned text.
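For example, several of these per-prompt options can be combined in a single call. This sketch reuses the pipe object created earlier; the prompt and values are illustrative:
response = pipe(
    ["DeepSpeed is"],
    max_new_tokens=64,      # cap the response length
    min_new_tokens=8,       # generate at least 8 tokens
    do_sample=True,         # sample instead of greedy decoding
    top_p=0.95,
    top_k=50,
    temperature=0.7,
    return_full_text=True,  # prepend the prompt to the returned text
)
print(response[0].generated_text)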
Persistent Deployment
A persistent deployment is ideal for use with long-running and production applications. The persistent model uses a lightweight gRPC server that can be queried by multiple clients at once. The full example for running a persistent model is only 4 lines. Give it a try!
import mii
client = mii.serve("mistralai/Mistral-7B-v0.1")
response = client.generate(["Deepspeed is", "Seattle is"], max_new_tokens=128)
print(response)
The returned response is a list of Response objects. We can access several details about the generation (e.g., response[0].prompt_length):
- generated_text: str - Text generated by the model.
- prompt_length: int - Number of tokens in the original prompt.
- generated_length: int - Number of tokens generated.
- finish_reason: str - Reason for stopping generation. stop indicates the EOS token was generated and length indicates the generation reached max_new_tokens or max_length.
If we want to generate text from other processes, we can do that too:
client = mii.client("mistralai/Mistral-7B-v0.1")
response = client.generate("Deepspeed is", max_new_tokens=128)
When we no longer need a persistent deployment, we can shutdown the server from any client:
client.terminate_server()
Model Parallelism
Taking advantage of multi-GPU systems for better latency and throughput is also easy with the persistent deployments. Model parallelism is controlled by the tensor_parallel input to mii.serve:
client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)
The resulting deployment will split the model across 2 GPUs to deliver faster inference and higher throughput than a single GPU.
Model Replicas
We can also take advantage of multi-GPU (and multi-node) systems by setting up multiple model replicas and using the load balancing that DeepSpeed-MII provides:
client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2)
The resulting deployment will load 2 model replicas (one per GPU) and load-balance incoming requests between the 2 model instances.
Model parallelism and replicas can also be combined to take advantage of systems with many more GPUs. In the example below, we run 2 model replicas, each split across 2 GPUs on a system with 4 GPUs:
client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2, replica_num=2)
The choice between model parallelism and model replicas for maximum performance will depend on the nature of the hardware, model, and workload. For example, with small models users may find that model replicas provide the lowest average latency for requests. Meanwhile, large models may achieve greater overall throughput when using only model parallelism.
RESTful API
MII makes it easy to set up and run model inference via RESTful APIs by setting enable_restful_api=True when creating a persistent MII deployment. The RESTful API can receive requests at http://{HOST}:{RESTFUL_API_PORT}/mii/{DEPLOYMENT_NAME}. A full example is provided below:
client = mii.serve(
"mistralai/Mistral-7B-v0.1",
deployment_name="mistral-deployment",
enable_restful_api=True,
restful_api_port=28080,
)
Note: While providing a deployment_name is not necessary (MII will autogenerate one for you), it is good practice to provide a deployment_name so that you can ensure you are interfacing with the correct RESTful API.
You can then send prompts to the RESTful gateway with any HTTP client, such as curl:
curl --header "Content-Type: application/json" --request POST -d '{"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}' http://localhost:28080/mii/mistral-deployment
or Python:
import json
import requests
url = f"http://localhost:28080/mii/mistral-deployment"
params = {"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
url, data=json_params, headers={"Content-Type": "application/json"}
)
Persistent Deployment Options
While only the model name or path is required to stand up a persistent deployment, we offer customization options to our users.
mii.serve() options (a combined example follows the list):
- model_name_or_path: str (required) - Name or local path to a HuggingFace model.
- max_length: int (defaults to the maximum sequence length in the model config) - Sets the default maximum token length for the prompt + response.
- deployment_name: str (defaults to f"{model_name_or_path}-mii-deployment") - A unique identifying string for the persistent model. If provided, client objects should be retrieved with client = mii.client(deployment_name).
- tensor_parallel: int (defaults to 1) - Number of GPUs to split the model across.
- replica_num: int (defaults to 1) - The number of model replicas to stand up.
- enable_restful_api: bool (defaults to False) - When enabled, a RESTful API gateway process is launched that can be queried at http://{host}:{restful_api_port}/mii/{deployment_name}. See the section on RESTful APIs for more details.
- restful_api_port: int (defaults to 28080) - The port number used to interface with the RESTful API when enable_restful_api is set to True.
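Putting several of these options together, a fully specified deployment might look like the following sketch (the deployment name and values are illustrative; 2 replicas with 2-way tensor parallelism require 4 GPUs):
client = mii.serve(
    "mistralai/Mistral-7B-v0.1",
    deployment_name="mistral-deployment",
    max_length=4096,
    tensor_parallel=2,
    replica_num=2,
    enable_restful_api=True,
    restful_api_port=28080,
)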
mii.client() options:
- model_or_deployment_name: str - Name of the model or deployment_name passed to mii.serve().
Users can also control the generation characteristics for individual prompts (i.e., when calling client.generate()) with the following options (see the example after this list):
- max_length: int - Sets the per-prompt maximum token length for prompt + response.
- min_new_tokens: int - Sets the minimum number of tokens generated in the response. max_length will take precedence over this setting.
- max_new_tokens: int - Sets the maximum number of tokens generated in the response.
- ignore_eos: bool (defaults to False) - When True, prevents generation from ending when the EOS token is encountered.
- top_p: float (defaults to 0.9) - When set below 1.0, filter tokens and keep only the most probable, where token probabilities sum to ≥ top_p.
- top_k: int (defaults to None) - When None, top-k filtering is disabled. When set, the number of highest-probability tokens to keep.
- temperature: float (defaults to None) - When None, temperature is disabled. When set, modulates token probabilities.
- do_sample: bool (defaults to True) - When True, sample output logits. When False, use greedy sampling.
- return_full_text: bool (defaults to False) - When True, prepends the input prompt to the returned text.
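As with the non-persistent pipeline, these options are passed directly to client.generate. A short sketch, assuming a deployment named mistral-deployment is already running:
client = mii.client("mistral-deployment")
response = client.generate(
    "DeepSpeed is",
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)
print(response[0].generated_text)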
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.