optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
Top Related Projects
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
An open-source NLP research library, built on PyTorch.
PyTorch extensions for high performance and large scale training.
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Quick Overview
Optimum is an extension of the Hugging Face Transformers library, designed to provide hardware-specific optimizations for training and inference of transformer models. It offers a unified API for various hardware accelerators and optimization techniques, enabling users to easily deploy and optimize their models across different platforms.
Pros
- Seamless integration with Hugging Face Transformers ecosystem
- Support for multiple hardware accelerators (e.g., ONNX Runtime, Intel CPUs via OpenVINO and IPEX, NVIDIA GPUs via TensorRT-LLM, AMD GPUs, AWS Trainium/Inferentia, Habana Gaudi)
- Easy-to-use API for model optimization and quantization
- Improved performance and efficiency for transformer models
Cons
- Focused on models from the Hugging Face libraries (Transformers, Diffusers, TIMM, Sentence Transformers)
- May require additional hardware-specific dependencies
- Learning curve for users unfamiliar with hardware optimization techniques
- Some optimizations may not be available for all model architectures
Code Examples
- Loading and optimizing a model for Intel CPUs:
from optimum.intel import IPEXModel
from transformers import AutoTokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = IPEXModel.from_pretrained(model_name)
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
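- Quantizing a model for faster inference:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
# ORTQuantizer works on an ONNX model, so export the checkpoint first
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
quantizer = ORTQuantizer.from_pretrained(onnx_model)
# Dynamic (is_static=False) quantization targeting CPUs with AVX512-VNNI
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# quantize() writes the quantized model to save_dir (example path)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)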
- Using BetterTransformer to speed up PyTorch inference:
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
bt_model = BetterTransformer.transform(model)
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = bt_model(**inputs)
Getting Started
To get started with Optimum, follow these steps:
- Install Optimum:
pip install optimum
- Install hardware-specific dependencies (e.g., for Intel CPUs):
pip install optimum[intel]
- Use Optimum in your code:
from optimum.intel import IPEXModel
from transformers import AutoTokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = IPEXModel.from_pretrained(model_name)
# Your code for inference or fine-tuning
For more detailed information and advanced usage, refer to the Optimum documentation on the Hugging Face website.
Competitor Comparisons
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- More comprehensive optimization techniques, including ZeRO-Infinity for extreme model sizes
- Offers advanced pipeline parallelism and 3D parallelism for distributed training
- Provides more fine-grained control over optimization strategies
Cons of DeepSpeed
- Steeper learning curve and more complex setup compared to Optimum
- Less integrated with Hugging Face ecosystem and transformers library
- May require more manual configuration for optimal performance
Code Comparison
DeepSpeed:
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(args=args,
                                                     model=model,
                                                     model_parameters=params)
Optimum:
from optimum.onnxruntime import ORTModelForSequenceClassification
# export=True converts the PyTorch checkpoint to ONNX when loading
model = ORTModelForSequenceClassification.from_pretrained("bert-base-uncased", export=True)
Summary
DeepSpeed offers more advanced optimization techniques and fine-grained control, making it suitable for large-scale distributed training and extreme model sizes. However, it has a steeper learning curve and requires more manual configuration. Optimum, on the other hand, provides easier integration with the Hugging Face ecosystem and a simpler setup process, but may offer fewer advanced optimization options for extreme scenarios.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More comprehensive toolkit for sequence modeling tasks
- Supports a wider range of architectures and models
- Offers more advanced features for research and experimentation
Cons of fairseq
- Steeper learning curve and more complex setup
- Less focus on optimization and deployment
- May require more manual configuration for specific tasks
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('/path/to/model')
translations = model.translate(['Hello world!'])
optimum:
from optimum.pipelines import pipeline
translator = pipeline("translation", model="t5-small")
# T5 selects the translation direction from a text prefix
result = translator("translate English to French: Hello world!")
Key Differences
- fairseq provides more low-level control and customization options
- optimum focuses on ease of use and optimization for various hardware
- fairseq is better suited for research and advanced NLP tasks
- optimum integrates seamlessly with Hugging Face's ecosystem
- fairseq offers more flexibility in model architecture design
- optimum provides better out-of-the-box performance optimization
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- More focused on research and experimentation in NLP
- Provides a rich set of tools for building and evaluating complex NLP models
- Offers a configuration-based approach for easy model definition and experimentation
Cons of AllenNLP
- Steeper learning curve compared to Optimum
- Less emphasis on optimization and deployment across different hardware
- Smaller ecosystem and community compared to the Hugging Face ecosystem
Code Comparison
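AllenNLP:
from typing import Iterable
from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token
class MyDatasetReader(DatasetReader):
    def _read(self, file_path: str) -> Iterable[Instance]:
        # One example per line of the input file
        with open(file_path, "r") as f:
            for line in f:
                yield self.text_to_instance(line.strip())
    def text_to_instance(self, text: str) -> Instance:
        # Whitespace-split the line into tokens for a single TextField
        tokens = [Token(t) for t in text.split()]
        return Instance({"tokens": TextField(tokens, {"tokens": SingleIdTokenIndexer()})})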
Optimum:
from datasets import load_dataset
from optimum.onnxruntime import ORTModelForSequenceClassification
dataset = load_dataset("glue", "mrpc", split="train")
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)
This comparison highlights the different focus areas of AllenNLP and Optimum. AllenNLP provides more flexibility for custom dataset creation and model architecture, while Optimum emphasizes ease of use and optimization for various hardware platforms within the Hugging Face ecosystem.
PyTorch extensions for high performance and large scale training.
Pros of FairScale
- More focused on large-scale distributed training and optimization techniques
- Offers advanced sharding strategies for model and optimizer states
- Provides implementation of cutting-edge techniques like ZeRO and Fully Sharded Data Parallel
Cons of FairScale
- Less integration with popular model architectures and frameworks
- Steeper learning curve for users not familiar with distributed training concepts
- More limited in scope compared to Optimum's broader optimization features
Code Comparison
FairScale example (Fully Sharded Data Parallel):
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
model = FSDP(model)
Optimum example (IPEX-optimized inference):
from optimum.intel import IPEXModel
# Loads the checkpoint and applies Intel Extension for PyTorch optimizations
model = IPEXModel.from_pretrained("bert-base-uncased")
Both libraries aim to improve model training and inference efficiency, but they approach it differently. FairScale focuses on distributed training and memory optimization, while Optimum provides a broader set of tools for various optimization techniques across different hardware platforms.
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Pros of Apex
- Highly optimized for NVIDIA GPUs, offering better performance on supported hardware
- Provides more low-level control over mixed precision training
- Includes additional optimization techniques like LAMB optimizer and fused CUDA kernels
Cons of Apex
- Limited to NVIDIA GPUs, reducing portability across different hardware
- Requires manual installation and setup, which can be complex
- Less frequently updated compared to Optimum
Code Comparison
Apex:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
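Optimum:
import torch
from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
# Mixed precision comes from PyTorch autocast; model.half() would cast every weight to fp16
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(**batch).loss
loss.backward()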
Both libraries aim to improve performance and efficiency in deep learning training, but they approach it differently. Apex focuses on NVIDIA-specific optimizations and mixed precision training, while Optimum provides a more hardware-agnostic solution with a focus on ease of use and integration with Hugging Face's ecosystem. Optimum offers a higher-level API that simplifies the process of applying optimizations, making it more accessible to users who may not need fine-grained control over the optimization process.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Pros of Horovod
- Specializes in distributed deep learning training across multiple GPUs and machines
- Supports multiple deep learning frameworks (TensorFlow, PyTorch, MXNet)
- Highly scalable and efficient for large-scale training tasks
Cons of Horovod
- Steeper learning curve and more complex setup compared to Optimum
- Limited focus on model optimization and deployment
- Less integration with pre-trained models and datasets
Code Comparison
Horovod:
import tensorflow as tf
import horovod.tensorflow as hvd
hvd.init()
# Scale the learning rate by the number of workers
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
Optimum:
from optimum.onnxruntime import ORTTrainer
trainer = ORTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
Horovod focuses on distributed training across multiple GPUs, while Optimum emphasizes easy integration with Hugging Face's ecosystem and optimization for various hardware. Horovod provides more flexibility for large-scale distributed training, but Optimum offers a more user-friendly approach for optimizing and deploying models, especially those from the Transformers library.
README
Hugging Face Optimum
🤗 Optimum is an extension of 🤗 Transformers and Diffusers, providing a set of optimization tools enabling maximum efficiency to train and run models on targeted hardware, while keeping things easy to use.
Installation
🤗 Optimum can be installed using pip as follows:
python -m pip install optimum
If you'd like to use the accelerator-specific features of 🤗 Optimum, you can install the required dependencies according to the table below:
| Accelerator | Installation |
|---|---|
| ONNX Runtime | pip install --upgrade --upgrade-strategy eager optimum[onnxruntime] |
| Intel Neural Compressor | pip install --upgrade --upgrade-strategy eager optimum[neural-compressor] |
| OpenVINO | pip install --upgrade --upgrade-strategy eager optimum[openvino] |
| NVIDIA TensorRT-LLM | docker run -it --gpus all --ipc host huggingface/optimum-nvidia |
| AMD Instinct GPUs and Ryzen AI NPU | pip install --upgrade --upgrade-strategy eager optimum[amd] |
| AWS Trainium & Inferentia | pip install --upgrade --upgrade-strategy eager optimum[neuronx] |
| Habana Gaudi Processor (HPU) | pip install --upgrade --upgrade-strategy eager optimum[habana] |
| FuriosaAI | pip install --upgrade --upgrade-strategy eager optimum[furiosa] |
The --upgrade --upgrade-strategy eager option is needed to ensure the different packages are upgraded to the latest possible version.
To install from source:
python -m pip install git+https://github.com/huggingface/optimum.git
For the accelerator-specific features, append optimum[accelerator_type] to the above command:
python -m pip install optimum[onnxruntime]@git+https://github.com/huggingface/optimum.git
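The installation rules above can be captured in a small helper. This is only a sketch: the dictionary copies the pip extras from the table, while the names ACCELERATOR_EXTRAS and pip_install_args are hypothetical and not part of Optimum. NVIDIA TensorRT-LLM is left out because the table installs it through Docker rather than pip.

```python
from typing import List

# Accelerator name -> pip extra, copied from the installation table.
ACCELERATOR_EXTRAS = {
    "ONNX Runtime": "onnxruntime",
    "Intel Neural Compressor": "neural-compressor",
    "OpenVINO": "openvino",
    "AMD Instinct GPUs and Ryzen AI NPU": "amd",
    "AWS Trainium & Inferentia": "neuronx",
    "Habana Gaudi Processor (HPU)": "habana",
    "FuriosaAI": "furiosa",
}

SOURCE_URL = "git+https://github.com/huggingface/optimum.git"


def pip_install_args(accelerator: str, from_source: bool = False) -> List[str]:
    """Return the argument list for installing Optimum with one accelerator extra."""
    extra = ACCELERATOR_EXTRAS[accelerator]
    if from_source:
        # Source installs append the extra in front of the repository URL.
        return ["python", "-m", "pip", "install", f"optimum[{extra}]@{SOURCE_URL}"]
    # Release installs upgrade dependencies eagerly, as the README recommends.
    return [
        "python", "-m", "pip", "install",
        "--upgrade", "--upgrade-strategy", "eager",
        f"optimum[{extra}]",
    ]
```

The returned list can be passed to subprocess.run(..., check=True), which avoids shell quoting problems with the square brackets.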
Accelerated Inference
🤗 Optimum provides multiple tools to export and run optimized models on various ecosystems:
- ONNX / ONNX Runtime
- TensorFlow Lite
- OpenVINO
- Habana first-gen Gaudi / Gaudi2, more details here
- AWS Inferentia 2 / Inferentia 1, more details here
- NVIDIA TensorRT-LLM, more details here
The export and optimizations can be done both programmatically and with a command line.
ONNX + ONNX Runtime
Before you begin, make sure you have all the necessary libraries installed:
pip install optimum[exporters,onnxruntime]
It is possible to export 🤗 Transformers and Diffusers models to the ONNX format and perform graph optimization as well as quantization easily.
For more information on the ONNX export, please check the documentation.
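For the command-line route, the export is run with optimum-cli export onnx. The sketch below only assembles the argument list, so it can be inspected before anything is executed; the helper name onnx_export_args and the example model and directory are illustrative assumptions.

```python
from typing import List, Optional


def onnx_export_args(model_id: str, output_dir: str, task: Optional[str] = None) -> List[str]:
    """Build the `optimum-cli export onnx` argument list for a Hub model or local path."""
    args = ["optimum-cli", "export", "onnx", "--model", model_id]
    if task is not None:
        # --task is optional; when omitted, the CLI tries to infer it from the model.
        args += ["--task", task]
    # The output directory is the final positional argument.
    args.append(output_dir)
    return args
```

Passing the result to subprocess.run(args, check=True) performs the export, provided the optimum[exporters,onnxruntime] extras are installed.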
Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seamless manner using ONNX Runtime in the backend.
More details on how to run ONNX models with ORTModelForXXX classes here.
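As a minimal sketch of the programmatic route, the function below exports a checkpoint to ONNX at load time and runs it through ONNX Runtime with one of the ORTModelForXXX classes. It assumes optimum[onnxruntime], transformers and torch are installed; the function name classify_with_onnx and the default checkpoint are illustrative choices.

```python
from typing import List


def classify_with_onnx(
    texts: List[str],
    model_id: str = "distilbert-base-uncased-finetuned-sst-2-english",
) -> List[str]:
    """Classify `texts` with an ONNX Runtime model exported on the fly."""
    # Imported lazily so this module loads even when the optional extras are missing.
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # export=True converts the PyTorch checkpoint to ONNX while loading.
    model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**inputs).logits
    predicted_ids = logits.argmax(dim=-1).tolist()
    # Map class indices back to the label names stored in the model config.
    return [model.config.id2label[i] for i in predicted_ids]
```

Calling classify_with_onnx(["Hello, world!"]) downloads the checkpoint on first use, so it needs network access.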
TensorFlow Lite
Before you begin, make sure you have all the necessary libraries installed:
pip install optimum[exporters-tf]
Just as for ONNX, it is possible to export models to TensorFlow Lite and quantize them. You can find more information in our documentation.
Intel (OpenVINO + Neural Compressor + IPEX)
Before you begin, make sure you have all the necessary libraries installed.
You can find more information on the different integrations in our documentation and in the examples of optimum-intel.
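For the OpenVINO integration, a minimal sketch looks like the following. It assumes optimum[openvino] and transformers are installed; the function name load_openvino_classifier and the default checkpoint are illustrative.

```python
def load_openvino_classifier(
    model_id: str = "distilbert-base-uncased-finetuned-sst-2-english",
):
    """Return a text-classification pipeline backed by an OpenVINO model."""
    # Imported lazily so this module loads even when optimum-intel is not installed.
    from optimum.intel import OVModelForSequenceClassification
    from transformers import AutoTokenizer, pipeline

    # export=True converts the PyTorch checkpoint to the OpenVINO format while loading.
    model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The OpenVINO model plugs into the regular Transformers pipeline API.
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

The returned pipeline is called like any other Transformers pipeline, for example classifier("Hello, world!").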
Quanto
Quanto is a PyTorch quantization backend which allows you to quantize a model using either the Python API or the optimum-cli.
You can see more details and examples in the Quanto repository.
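To make the idea of weight quantization concrete, here is a plain-Python sketch of symmetric per-tensor int8 quantization. It illustrates the general technique only and is not Quanto's implementation; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # One scale per tensor; fall back to 1.0 when every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each restored weight differs from the original by at most half a quantization step (scale / 2), which is the accuracy traded for 8-bit storage.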
Accelerated training
🤗 Optimum provides wrappers around the original 🤗 Transformers Trainer to enable training on powerful hardware easily. We support many providers:
- Habana's Gaudi processors
- AWS Trainium instances, check here
- ONNX Runtime (optimized for GPUs)
Habana
Before you begin, make sure you have all the necessary libraries installed:
pip install --upgrade --upgrade-strategy eager optimum[habana]
You can find examples in the documentation and in the examples.
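As a rough sketch of the Trainer wrapper on Gaudi, the helper below swaps in GaudiTrainer and GaudiTrainingArguments for their Transformers counterparts. It assumes optimum[habana] is installed and Gaudi hardware is available; the function name build_gaudi_trainer, the output directory, and the Habana/bert-base-uncased Gaudi configuration (which presumes a BERT model) are illustrative choices.

```python
def build_gaudi_trainer(model, train_dataset, output_dir: str = "./gaudi_output"):
    """Wrap a Transformers model and dataset in a GaudiTrainer."""
    # Imported lazily so this module loads even when optimum-habana is not installed.
    from optimum.habana import GaudiTrainer, GaudiTrainingArguments

    training_args = GaudiTrainingArguments(
        output_dir=output_dir,
        use_habana=True,       # run on the HPU instead of CPU/GPU
        use_lazy_mode=True,    # accumulate ops into graphs before execution
        gaudi_config_name="Habana/bert-base-uncased",  # assumes a BERT model
    )
    return GaudiTrainer(model=model, args=training_args, train_dataset=train_dataset)
```

The returned object is used like a regular Trainer, for example build_gaudi_trainer(model, dataset).train().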
ONNX Runtime
Before you begin, make sure you have all the necessary libraries installed:
pip install optimum[onnxruntime-training]
You can find examples in the documentation and in the examples.