optimum
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
Top Related Projects
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
An open-source NLP research library, built on PyTorch.
PyTorch extensions for high performance and large scale training.
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Quick Overview
Optimum is an extension of the Hugging Face Transformers library, designed to provide hardware-specific optimizations for training and inference of transformer models. It offers a unified API for various hardware accelerators and optimization techniques, enabling users to easily deploy and optimize their models across different platforms.
Pros
- Seamless integration with Hugging Face Transformers ecosystem
- Support for multiple hardware accelerators (e.g., NVIDIA GPUs, Intel CPUs, AMD Instinct GPUs, AWS Trainium and Inferentia, Habana Gaudi)
- Easy-to-use API for model optimization and quantization
- Improved performance and efficiency for transformer models
Cons
- Focused on models from the 🤗 Transformers and 🤗 Diffusers libraries
- May require additional hardware-specific dependencies
- Learning curve for users unfamiliar with hardware optimization techniques
- Some optimizations may not be available for all model architectures
Code Examples
- Loading and optimizing a model for Intel CPUs:
from optimum.intel import IPEXModel
from transformers import AutoTokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = IPEXModel.from_pretrained(model_name)
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
- Quantizing a model for faster inference:
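from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
# The quantizer works on an ONNX model, so export the PyTorch checkpoint first
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
quantizer = ORTQuantizer.from_pretrained(onnx_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# The quantized ONNX model is written to save_dir
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)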
- Using BetterTransformer for faster PyTorch inference:
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
bt_model = BetterTransformer.transform(model)
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = bt_model(**inputs)
Getting Started
To get started with Optimum, follow these steps:
- Install Optimum:
pip install optimum
- Install hardware-specific dependencies (e.g., for Intel CPUs):
pip install optimum[intel]
- Use Optimum in your code:
from optimum.intel import IPEXModel
from transformers import AutoTokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = IPEXModel.from_pretrained(model_name)
# Your code for inference or fine-tuning
For more detailed information and advanced usage, refer to the Optimum documentation on the Hugging Face website.
Competitor Comparisons
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Pros of DeepSpeed
- More comprehensive optimization techniques, including ZeRO-Infinity for extreme model sizes
- Highly scalable for distributed training across multiple GPUs and nodes
- Offers advanced features like pipeline parallelism and 3D parallelism
Cons of DeepSpeed
- Steeper learning curve and more complex setup compared to Optimum
- Less integrated with Hugging Face ecosystem and transformers library
- May require more manual configuration for optimal performance
Code Comparison
DeepSpeed:
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
args=args, model=model, model_parameters=params)
Optimum:
from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
Key Differences
- DeepSpeed focuses on large-scale distributed training and extreme model sizes
- Optimum provides easier integration with Hugging Face models and pipelines
- DeepSpeed offers more fine-grained control over optimization techniques
- Optimum emphasizes simplicity and ease of use for common scenarios
Use Cases
- DeepSpeed: Training massive language models, distributed across multiple nodes
- Optimum: Fine-tuning pre-trained models, optimizing inference for deployment
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Pros of fairseq
- More comprehensive toolkit for sequence modeling tasks
- Supports a wider range of architectures and models
- Offers more advanced features for research and experimentation
Cons of fairseq
- Steeper learning curve and more complex setup
- Less focus on optimization and deployment
- May require more manual configuration for specific tasks
Code Comparison
fairseq:
from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('/path/to/model')
translations = model.translate(['Hello world!'])
optimum:
from optimum.pipelines import pipeline
# The language pair is part of the task name, as with 🤗 Transformers pipelines
translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("Hello world!")
Key Differences
- fairseq provides more low-level control and customization options
- optimum focuses on ease of use and optimization for various hardware
- fairseq is better suited for research and advanced NLP tasks
- optimum integrates seamlessly with Hugging Face's ecosystem
- fairseq offers more flexibility in model architecture design
- optimum provides better out-of-the-box performance optimization
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- More focused on research and experimentation in NLP
- Provides a rich set of tools for building and evaluating complex NLP models
- Offers a configuration-based approach for easy model definition and experimentation
Cons of AllenNLP
- Steeper learning curve compared to Optimum
- Less emphasis on optimization and deployment across different hardware
- Smaller ecosystem and community compared to the Hugging Face ecosystem
Code Comparison
AllenNLP:
from typing import Iterable
from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
class MyDatasetReader(DatasetReader):
    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as f:
            for line in f:
                # text_to_instance must also be implemented by the subclass
                yield self.text_to_instance(line.strip())
Optimum:
from datasets import load_dataset
from optimum.onnxruntime import ORTModelForSequenceClassification
dataset = load_dataset("glue", "mrpc", split="train")
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)
This comparison highlights the different focus areas of AllenNLP and Optimum. AllenNLP provides more flexibility for custom dataset creation and model architecture, while Optimum emphasizes ease of use and optimization for various hardware platforms within the Hugging Face ecosystem.
PyTorch extensions for high performance and large scale training.
Pros of FairScale
- More focused on large-scale distributed training and optimization techniques
- Offers advanced sharding strategies for model and optimizer states
- Provides implementation of cutting-edge techniques like ZeRO and Fully Sharded Data Parallel
Cons of FairScale
- Less integration with popular model architectures and frameworks
- Steeper learning curve for users not familiar with distributed training concepts
- More limited in scope compared to Optimum's broader optimization features
Code Comparison
FairScale example (Fully Sharded Data Parallel):
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
model = FSDP(model)
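Optimum example (loading a model quantized with Intel Neural Compressor):
from optimum.intel import INCModelForSequenceClassification
model = INCModelForSequenceClassification.from_pretrained("Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic")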
Both libraries aim to improve model training and inference efficiency, but they approach it differently. FairScale focuses on distributed training and memory optimization, while Optimum provides a broader set of tools for various optimization techniques across different hardware platforms.
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Pros of Apex
- Highly optimized for NVIDIA GPUs, offering better performance on supported hardware
- Provides more low-level control over mixed precision training
- Includes additional optimization techniques like LAMB optimizer and fused CUDA kernels
Cons of Apex
- Limited to NVIDIA GPUs, reducing portability across different hardware
- Requires manual installation and setup, which can be complex
- Less frequently updated compared to Optimum
Code Comparison
Apex:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
Optimum:
from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
model.half() # Cast weights to FP16 (plain half precision, not Apex-style mixed precision)
loss.backward()
Both libraries aim to improve performance and efficiency in deep learning training, but they approach it differently. Apex focuses on NVIDIA-specific optimizations and mixed precision training, while Optimum provides a more hardware-agnostic solution with a focus on ease of use and integration with Hugging Face's ecosystem. Optimum offers a higher-level API that simplifies the process of applying optimizations, making it more accessible to users who may not need fine-grained control over the optimization process.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Pros of Horovod
- Specializes in distributed deep learning training across multiple GPUs and machines
- Supports multiple deep learning frameworks (TensorFlow, PyTorch, MXNet)
- Highly scalable and efficient for large-scale training tasks
Cons of Horovod
- Steeper learning curve and more complex setup compared to Optimum
- Limited focus on model optimization and deployment
- Less integration with pre-trained models and datasets
Code Comparison
Horovod:
import tensorflow as tf
import horovod.tensorflow as hvd
hvd.init()
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
Optimum:
from optimum.onnxruntime import ORTTrainer
trainer = ORTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
Horovod focuses on distributed training across multiple GPUs, while Optimum emphasizes easy integration with Hugging Face's ecosystem and optimization for various hardware. Horovod provides more flexibility for large-scale distributed training, but Optimum offers a more user-friendly approach for optimizing and deploying models, especially those from the Transformers library.
README
Hugging Face Optimum
🤗 Optimum is an extension of 🤗 Transformers and Diffusers, providing a set of optimization tools enabling maximum efficiency to train and run models on targeted hardware, while keeping things easy to use.
Installation
🤗 Optimum can be installed using pip as follows:
python -m pip install optimum
If you'd like to use the accelerator-specific features of 🤗 Optimum, you can install the required dependencies according to the table below:
Accelerator | Installation |
---|---|
ONNX Runtime | pip install --upgrade --upgrade-strategy eager optimum[onnxruntime] |
Intel Neural Compressor | pip install --upgrade --upgrade-strategy eager optimum[neural-compressor] |
OpenVINO | pip install --upgrade --upgrade-strategy eager optimum[openvino] |
NVIDIA TensorRT-LLM | docker run -it --gpus all --ipc host huggingface/optimum-nvidia |
AMD Instinct GPUs and Ryzen AI NPU | pip install --upgrade --upgrade-strategy eager optimum[amd] |
AWS Trainium & Inferentia | pip install --upgrade --upgrade-strategy eager optimum[neuronx] |
Habana Gaudi Processor (HPU) | pip install --upgrade --upgrade-strategy eager optimum[habana] |
FuriosaAI | pip install --upgrade --upgrade-strategy eager optimum[furiosa] |
The --upgrade --upgrade-strategy eager option is needed to ensure the different packages are upgraded to the latest possible version.
To install from source:
python -m pip install git+https://github.com/huggingface/optimum.git
For the accelerator-specific features, append optimum[accelerator_type] to the above command:
python -m pip install optimum[onnxruntime]@git+https://github.com/huggingface/optimum.git
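The installation table earlier in this section can be captured in a small helper. The following is a minimal sketch, not part of Optimum: the function name `pip_install_command` and the `EXTRAS` dictionary are hypothetical, and the extras names are copied from the table.

```python
import shlex

# Extras names copied from the installation table.
# NVIDIA TensorRT-LLM is listed there as a Docker image, so it has no entry here.
EXTRAS = {
    "ONNX Runtime": "onnxruntime",
    "Intel Neural Compressor": "neural-compressor",
    "OpenVINO": "openvino",
    "AMD": "amd",
    "AWS Neuron": "neuronx",
    "Habana Gaudi": "habana",
    "FuriosaAI": "furiosa",
}


def pip_install_command(accelerator: str) -> str:
    """Return the shell command that installs Optimum with one accelerator extra."""
    extra = EXTRAS[accelerator]  # raises KeyError for an unknown accelerator
    args = [
        "pip", "install",
        "--upgrade", "--upgrade-strategy", "eager",
        f"optimum[{extra}]",
    ]
    # shlex.join quotes the bracketed requirement so that shells which treat
    # square brackets as glob characters (zsh, for example) pass it through intact.
    return shlex.join(args)


print(pip_install_command("OpenVINO"))
```

Quoting the requirement matters mostly in shells that expand square brackets; in others the unquoted form from the table works as written.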
Accelerated Inference
🤗 Optimum provides multiple tools to export and run optimized models on various ecosystems:
- ONNX / ONNX Runtime
- TensorFlow Lite
- OpenVINO
- Habana first-gen Gaudi / Gaudi2, more details here
- AWS Inferentia 2 / Inferentia 1, more details here
- NVIDIA TensorRT-LLM, more details here
The export and optimizations can be done both programmatically and with a command line.
Features summary
Features | ONNX Runtime | Neural Compressor | OpenVINO | TensorFlow Lite |
---|---|---|---|---|
Graph optimization | :heavy_check_mark: | N/A | :heavy_check_mark: | N/A |
Post-training dynamic quantization | :heavy_check_mark: | :heavy_check_mark: | N/A | :heavy_check_mark: |
Post-training static quantization | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
Quantization Aware Training (QAT) | N/A | :heavy_check_mark: | :heavy_check_mark: | N/A |
FP16 (half precision) | :heavy_check_mark: | N/A | :heavy_check_mark: | :heavy_check_mark: |
Pruning | N/A | :heavy_check_mark: | :heavy_check_mark: | N/A |
Knowledge Distillation | N/A | :heavy_check_mark: | :heavy_check_mark: | N/A |
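The table distinguishes dynamic from static post-training quantization. As a rough illustration of the difference, the sketch below quantizes the same activations twice: once with a scale fixed ahead of time from calibration samples (static), and once with a scale computed from the input itself (dynamic). It is plain Python written for this page under simplifying assumptions (symmetric, per-tensor int8) and is not how any of the backends above implement quantization; all function names are hypothetical.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude onto the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    largest = max(abs(v) for v in values)
    return largest / qmax if largest else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clip to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


# Static: the activation scale is fixed before deployment from calibration data.
calibration_samples = [[0.5, -1.0, 0.25], [2.0, -0.5, 1.0]]
static_scale = symmetric_scale([v for sample in calibration_samples for v in sample])

# Dynamic: the activation scale is recomputed for every input at inference time.
activations = [4.0, -1.0, 0.5]
dynamic_scale = symmetric_scale(activations)

static_q = quantize(activations, static_scale)
dynamic_q = quantize(activations, dynamic_scale)

# 4.0 lies outside the calibrated range, so the static path clips it.
print("static :", dequantize(static_q, static_scale))
print("dynamic:", dequantize(dynamic_q, dynamic_scale))
```

The static path avoids computing ranges at run time but depends on calibration data that covers the inputs seen in production; the dynamic path adapts to each input at the cost of extra work per call.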
OpenVINO
Before you begin, make sure you have all the necessary libraries installed:
pip install --upgrade --upgrade-strategy eager optimum[openvino]
It is possible to export 🤗 Transformers and Diffusers models to the OpenVINO format easily:
optimum-cli export openvino --model distilbert-base-uncased-finetuned-sst-2-english distilbert_sst2_ov
If you add --weight-format int8, the weights will be quantized to int8; check out our documentation for more detail. To apply quantization on both weights and activations, you can find more information here.
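To give a sense of what storing weights as int8 buys, here is a small standard-library sketch. It assumes a single symmetric scale for the whole tensor, which is a simplification chosen for illustration and not a description of the OpenVINO export; the weight values are made up.

```python
from array import array

# Hypothetical weights stored as 32-bit floats.
weights = array("f", [0.12, -0.5, 0.33, 0.9])

# One scale for the whole tensor: the largest magnitude maps to 127.
scale = max(abs(w) for w in weights) / 127

# Signed 8-bit storage: one byte per weight instead of four.
quantized = array("b", [round(w / scale) for w in weights])

# Weights are reconstructed by multiplying back with the scale.
restored = [q * scale for q in quantized]
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(len(weights.tobytes()), "bytes ->", len(quantized.tobytes()), "bytes")
print("largest reconstruction error:", max_error)
```

The reconstruction error of each weight is bounded by half a quantization step, so the accuracy cost depends on how widely the weight values are spread.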
To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. To load a PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, you can set export=True when loading your model.
- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
results = classifier("He's a dreadful magician.")
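The class swap shown in the diff follows a naming pattern that this README also uses for other backends (ORTModelForXxx, INCModelForXxx). The sketch below turns that pattern into a lookup. The `BACKENDS` table and both function names are hypothetical helpers written for this page, and not every AutoModelForXxx class is guaranteed to have a counterpart in every backend.

```python
import importlib

# Prefix and import path per backend, taken from the examples in this README.
BACKENDS = {
    "openvino": ("OV", "optimum.intel"),
    "neural-compressor": ("INC", "optimum.intel"),
    "onnxruntime": ("ORT", "optimum.onnxruntime"),
}


def accelerated_class_name(auto_class_name: str, backend: str) -> str:
    """Map e.g. AutoModelForSequenceClassification to OVModelForSequenceClassification."""
    if not auto_class_name.startswith("AutoModelFor"):
        raise ValueError(f"expected an AutoModelForXxx name, got {auto_class_name!r}")
    prefix, _ = BACKENDS[backend]
    return prefix + auto_class_name[len("Auto"):]


def load_accelerated_class(auto_class_name: str, backend: str):
    """Import the backend lazily and return the class, if the backend provides it."""
    _, module_name = BACKENDS[backend]
    module = importlib.import_module(module_name)  # needs the matching extra installed
    return getattr(module, accelerated_class_name(auto_class_name, backend))


print(accelerated_class_name("AutoModelForSequenceClassification", "openvino"))
```

Only `load_accelerated_class` needs Optimum installed; the name mapping itself is plain string handling.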
You can find more examples in the documentation and in the examples.
Neural Compressor
Before you begin, make sure you have all the necessary libraries installed:
pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]
Dynamic quantization can be applied on your model:
optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output ./quantized_distilbert
To load a model quantized with Intel Neural Compressor, hosted locally or on the 🤗 hub, you can do as follows:
from optimum.intel import INCModelForSequenceClassification
model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = INCModelForSequenceClassification.from_pretrained(model_id)
You can find more examples in the documentation and in the examples.
ONNX + ONNX Runtime
Before you begin, make sure you have all the necessary libraries installed:
pip install optimum[exporters,onnxruntime]
It is possible to export 🤗 Transformers and Diffusers models to the ONNX format and perform graph optimization as well as quantization easily:
optimum-cli export onnx -m deepset/roberta-base-squad2 --optimize O2 roberta_base_qa_onnx
The model can then be quantized using onnxruntime:
optimum-cli onnxruntime quantize \
--avx512 \
--onnx_model roberta_base_qa_onnx \
-o quantized_roberta_base_qa_onnx
These commands will export deepset/roberta-base-squad2 and perform O2 graph optimization on the exported model, and finally quantize it with the avx512 configuration.
For more information on the ONNX export, please check the documentation.
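The export and quantization commands above can also be driven from a script. The following sketch only assembles the two command lines shown in this section and prints them; actually running them requires `optimum-cli` to be installed. The function names and the `dry_run` flag are hypothetical.

```python
import shlex
import subprocess


def onnx_export_and_quantize_commands(model_id, export_dir, quantized_dir):
    """Build the two optimum-cli invocations shown above, in order."""
    export_cmd = [
        "optimum-cli", "export", "onnx",
        "-m", model_id,
        "--optimize", "O2",
        export_dir,
    ]
    quantize_cmd = [
        "optimum-cli", "onnxruntime", "quantize",
        "--avx512",
        "--onnx_model", export_dir,
        "-o", quantized_dir,
    ]
    return [export_cmd, quantize_cmd]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


commands = onnx_export_and_quantize_commands(
    "deepset/roberta-base-squad2",
    "roberta_base_qa_onnx",
    "quantized_roberta_base_qa_onnx",
)
run_all(commands)  # dry run: prints the commands without executing them
```

Passing the arguments as a list rather than a single string avoids shell quoting issues when a model ID or directory name contains unusual characters.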
Run the exported model using ONNX Runtime
Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seamless manner using ONNX Runtime in the backend:
- from transformers import AutoModelForQuestionAnswering
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline
model_id = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForQuestionAnswering.from_pretrained(model_id)
+ model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's Optimum?"
context = "Optimum is an awesome library everyone should use!"
results = qa_pipe(question=question, context=context)
More details on how to run ONNX models with ORTModelForXXX classes here.
TensorFlow Lite
Before you begin, make sure you have all the necessary libraries installed:
pip install optimum[exporters-tf]
Just as for ONNX, it is possible to export models to TensorFlow Lite and quantize them:
optimum-cli export tflite \
-m deepset/roberta-base-squad2 \
--sequence_length 384 \
--quantize int8-dynamic roberta_tflite_model
Accelerated training
🤗 Optimum provides wrappers around the original 🤗 Transformers Trainer to enable training on powerful hardware easily. We support many providers:
- Habana's Gaudi processors
- AWS Trainium instances, check here
- ONNX Runtime (optimized for GPUs)
Habana
Before you begin, make sure you have all the necessary libraries installed:
pip install --upgrade --upgrade-strategy eager optimum[habana]
- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForXxx.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
output_dir="path/to/save/folder/",
+ use_habana=True,
+ use_lazy_mode=True,
+ gaudi_config_name="Habana/bert-base-uncased",
...
)
# Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
...
)
# Use Habana Gaudi processor for training!
trainer.train()
You can find more examples in the documentation and in the examples.
ONNX Runtime
- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
output_dir="path/to/save/folder/",
optim="adamw_ort_fused",
...
)
# Create an ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
...
)
# Use ONNX Runtime for training!
trainer.train()
You can find more examples in the documentation and in the examples.
Quanto
Quanto is a PyTorch quantization backend.
You can quantize a model either using the Python API or the optimum-cli.
from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint4
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B')
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint4, exclude='lm_head')
The quantized model can be saved using save_pretrained:
qmodel.save_pretrained('./Llama-3.1-8B-quantized')
It can later be reloaded using from_pretrained:
from optimum.quanto import QuantizedModelForCausalLM
qmodel = QuantizedModelForCausalLM.from_pretrained('Llama-3.1-8B-quantized')
You can see more details and examples in the Quanto repository.