
huggingface/optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools

2,447
432

Top Related Projects

34,658

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

30,129

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

11,728

An open-source NLP research library, built on PyTorch.

PyTorch extensions for high performance and large scale training.

8,290

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

14,147

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Quick Overview

Optimum is an extension of the Hugging Face Transformers library, designed to provide hardware-specific optimizations for training and inference of transformer models. It offers a unified API for various hardware accelerators and optimization techniques, enabling users to easily deploy and optimize their models across different platforms.
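
As a quick illustration, here is a minimal sketch using Optimum's pipelines utility; the accelerator="ort" argument assumes the ONNX Runtime extra (optimum[onnxruntime]) is installed, and the model is exported to ONNX on the fly:

from optimum.pipelines import pipeline

# Run a standard text-classification pipeline on the ONNX Runtime backend
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    accelerator="ort",
)
print(classifier("Optimum makes deployment easier."))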

Pros

  • Seamless integration with Hugging Face Transformers ecosystem
  • Support for multiple hardware accelerators (e.g., NVIDIA GPUs, Intel CPUs, Apple Silicon)
  • Easy-to-use API for model optimization and quantization
  • Improved performance and efficiency for transformer models

Cons

  • Limited to transformer-based models
  • May require additional hardware-specific dependencies
  • Learning curve for users unfamiliar with hardware optimization techniques
  • Some optimizations may not be available for all model architectures

Code Examples

  1. Loading and optimizing a model for Intel CPUs:
from optimum.intel import IPEXModelForSequenceClassification
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = IPEXModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
  2. Quantizing a model for faster inference with ONNX Runtime:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the model to ONNX first, then apply dynamic quantization
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)
quantizer = ORTQuantizer.from_pretrained(onnx_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)
  3. Accelerating a model with BetterTransformer:
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
bt_model = BetterTransformer.transform(model)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = bt_model(**inputs)

Getting Started

To get started with Optimum, follow these steps:

  1. Install Optimum:
pip install optimum
  2. Install hardware-specific dependencies (e.g., for Intel CPUs):
pip install optimum[intel]
  3. Use Optimum in your code:
from optimum.intel import IPEXModelForSequenceClassification
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = IPEXModelForSequenceClassification.from_pretrained(model_name)

# Your code for inference or fine-tuning

For more detailed information and advanced usage, refer to the Optimum documentation on the Hugging Face website.

Competitor Comparisons

34,658

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More comprehensive optimization techniques, including ZeRO-Infinity for extreme model sizes
  • Highly scalable for distributed training across multiple GPUs and nodes
  • Offers advanced features like pipeline parallelism and 3D parallelism

Cons of DeepSpeed

  • Steeper learning curve and more complex setup compared to Optimum
  • Less integrated with Hugging Face ecosystem and transformers library
  • May require more manual configuration for optimal performance

Code Comparison

DeepSpeed:

import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=params)

Optimum:

from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)

Key Differences

  • DeepSpeed focuses on large-scale distributed training and extreme model sizes
  • Optimum provides easier integration with Hugging Face models and pipelines
  • DeepSpeed offers more fine-grained control over optimization techniques
  • Optimum emphasizes simplicity and ease of use for common scenarios

Use Cases

  • DeepSpeed: Training massive language models, distributed across multiple nodes
  • Optimum: Fine-tuning pre-trained models, optimizing inference for deployment

30,129

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • More comprehensive toolkit for sequence modeling tasks
  • Supports a wider range of architectures and models
  • Offers more advanced features for research and experimentation

Cons of fairseq

  • Steeper learning curve and more complex setup
  • Less focus on optimization and deployment
  • May require more manual configuration for specific tasks

Code Comparison

fairseq:

from fairseq.models.transformer import TransformerModel
model = TransformerModel.from_pretrained('/path/to/model')
translations = model.translate(['Hello world!'])

optimum:

from optimum.pipelines import pipeline
# Use the ONNX Runtime accelerator; T5 expects a task-specific translation pipeline
translator = pipeline("translation_en_to_fr", model="t5-small", accelerator="ort")
result = translator("Hello world!")

Key Differences

  • fairseq provides more low-level control and customization options
  • optimum focuses on ease of use and optimization for various hardware
  • fairseq is better suited for research and advanced NLP tasks
  • optimum integrates seamlessly with Hugging Face's ecosystem
  • fairseq offers more flexibility in model architecture design
  • optimum provides better out-of-the-box performance optimization

11,728

An open-source NLP research library, built on PyTorch.

Pros of AllenNLP

  • More focused on research and experimentation in NLP
  • Provides a rich set of tools for building and evaluating complex NLP models
  • Offers a configuration-based approach for easy model definition and experimentation

Cons of AllenNLP

  • Steeper learning curve compared to Optimum
  • Less emphasis on optimization and deployment across different hardware
  • Smaller ecosystem and community compared to the Hugging Face ecosystem

Code Comparison

AllenNLP:

from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer

class MyDatasetReader(DatasetReader):
    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as f:
            for line in f:
                yield self.text_to_instance(line.strip())

Optimum:

from datasets import load_dataset
from optimum.onnxruntime import ORTModelForSequenceClassification

dataset = load_dataset("glue", "mrpc", split="train")
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)

This comparison highlights the different focus areas of AllenNLP and Optimum. AllenNLP provides more flexibility for custom dataset creation and model architecture, while Optimum emphasizes ease of use and optimization for various hardware platforms within the Hugging Face ecosystem.

PyTorch extensions for high performance and large scale training.

Pros of FairScale

  • More focused on large-scale distributed training and optimization techniques
  • Offers advanced sharding strategies for model and optimizer states
  • Provides implementation of cutting-edge techniques like ZeRO and Fully Sharded Data Parallel

Cons of FairScale

  • Less integration with popular model architectures and frameworks
  • Steeper learning curve for users not familiar with distributed training concepts
  • More limited in scope compared to Optimum's broader optimization features

Code Comparison

FairScale example (Fully Sharded Data Parallel):

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = FSDP(model)

Optimum example (ONNX Runtime accelerated training):

from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

# Wrap the 🤗 Transformers Trainer to run training through ONNX Runtime
training_args = ORTTrainingArguments(output_dir="./results")
trainer = ORTTrainer(model=model, args=training_args, train_dataset=train_dataset)

Both libraries aim to improve model training and inference efficiency, but they approach it differently. FairScale focuses on distributed training and memory optimization, while Optimum provides a broader set of tools for various optimization techniques across different hardware platforms.

8,290

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Pros of Apex

  • Highly optimized for NVIDIA GPUs, offering better performance on supported hardware
  • Provides more low-level control over mixed precision training
  • Includes additional optimization techniques like LAMB optimizer and fused CUDA kernels

Cons of Apex

  • Limited to NVIDIA GPUs, reducing portability across different hardware
  • Requires manual installation and setup, which can be complex
  • Less frequently updated compared to Optimum

Code Comparison

Apex:

from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Optimum:

from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
model = model.half()  # Cast the model weights to fp16 (half precision)
loss.backward()

Both libraries aim to improve performance and efficiency in deep learning training, but they approach it differently. Apex focuses on NVIDIA-specific optimizations and mixed precision training, while Optimum provides a more hardware-agnostic solution with a focus on ease of use and integration with Hugging Face's ecosystem. Optimum offers a higher-level API that simplifies the process of applying optimizations, making it more accessible to users who may not need fine-grained control over the optimization process.

14,147

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Pros of Horovod

  • Specializes in distributed deep learning training across multiple GPUs and machines
  • Supports multiple deep learning frameworks (TensorFlow, PyTorch, MXNet)
  • Highly scalable and efficient for large-scale training tasks

Cons of Horovod

  • Steeper learning curve and more complex setup compared to Optimum
  • Limited focus on model optimization and deployment
  • Less integration with pre-trained models and datasets

Code Comparison

Horovod:

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
# Scale the learning rate by the number of workers and wrap the optimizer
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

Optimum:

from optimum.onnxruntime import ORTTrainer
trainer = ORTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

Horovod focuses on distributed training across multiple GPUs, while Optimum emphasizes easy integration with Hugging Face's ecosystem and optimization for various hardware. Horovod provides more flexibility for large-scale distributed training, but Optimum offers a more user-friendly approach for optimizing and deploying models, especially those from the Transformers library.


README


Hugging Face Optimum

🤗 Optimum is an extension of 🤗 Transformers and Diffusers, providing a set of optimization tools enabling maximum efficiency to train and run models on targeted hardware, while keeping things easy to use.

Installation

🤗 Optimum can be installed using pip as follows:

python -m pip install optimum

If you'd like to use the accelerator-specific features of 🤗 Optimum, you can install the required dependencies according to the table below:

Accelerator | Installation
ONNX Runtime | pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]
Intel Neural Compressor | pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]
OpenVINO | pip install --upgrade --upgrade-strategy eager optimum[openvino]
NVIDIA TensorRT-LLM | docker run -it --gpus all --ipc host huggingface/optimum-nvidia
AMD Instinct GPUs and Ryzen AI NPU | pip install --upgrade --upgrade-strategy eager optimum[amd]
AWS Trainium & Inferentia | pip install --upgrade --upgrade-strategy eager optimum[neuronx]
Habana Gaudi Processor (HPU) | pip install --upgrade --upgrade-strategy eager optimum[habana]
FuriosaAI | pip install --upgrade --upgrade-strategy eager optimum[furiosa]

The --upgrade --upgrade-strategy eager option is needed to ensure the different packages are upgraded to the latest possible version.

To install from source:

python -m pip install git+https://github.com/huggingface/optimum.git

For the accelerator-specific features, append optimum[accelerator_type] to the above command:

python -m pip install optimum[onnxruntime]@git+https://github.com/huggingface/optimum.git

Accelerated Inference

🤗 Optimum provides multiple tools to export and run optimized models on various ecosystems:

  • ONNX / ONNX Runtime
  • TensorFlow Lite
  • OpenVINO
  • Habana first-gen Gaudi / Gaudi2, more details here
  • AWS Inferentia 2 / Inferentia 1, more details here
  • NVIDIA TensorRT-LLM, more details here

The export and optimizations can be done both programmatically and with a command line.

Features summary

Features | ONNX Runtime | Neural Compressor | OpenVINO | TensorFlow Lite
Graph optimization | ✔️ | N/A | ✔️ | N/A
Post-training dynamic quantization | ✔️ | ✔️ | N/A | ✔️
Post-training static quantization | ✔️ | ✔️ | ✔️ | ✔️
Quantization Aware Training (QAT) | N/A | ✔️ | ✔️ | N/A
FP16 (half precision) | ✔️ | N/A | ✔️ | ✔️
Pruning | N/A | ✔️ | ✔️ | N/A
Knowledge Distillation | N/A | ✔️ | ✔️ | N/A

OpenVINO

Before you begin, make sure you have all the necessary libraries installed:

pip install --upgrade --upgrade-strategy eager optimum[openvino]

It is possible to export 🤗 Transformers and Diffusers models to the OpenVINO format easily:

optimum-cli export openvino --model distilbert-base-uncased-finetuned-sst-2-english distilbert_sst2_ov

If you add --weight-format int8, the weights will be quantized to int8; check out our documentation for more details. To apply quantization on both weights and activations, you can find more information here.
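
For example, the export command above can be extended with weight-only int8 quantization (a sketch; the output directory name is illustrative):

optimum-cli export openvino --model distilbert-base-uncased-finetuned-sst-2-english --weight-format int8 distilbert_sst2_ov_int8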

To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. To load a PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, you can set export=True when loading your model.

- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)

  classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
  results = classifier("He's a dreadful magician.")

You can find more examples in the documentation and in the examples.

Neural Compressor

Before you begin, make sure you have all the necessary libraries installed:

pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]

Dynamic quantization can be applied to your model:

optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output ./quantized_distilbert
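
The same dynamic quantization can also be applied programmatically; a minimal sketch using optimum-intel's INCQuantizer API (the save directory name is illustrative):

from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from transformers import AutoModelForQuestionAnswering

# Load the model and configure post-training dynamic quantization
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config, save_directory="./quantized_distilbert")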

To load a model quantized with Intel Neural Compressor, hosted locally or on the 🤗 hub, you can do as follows:

from optimum.intel import INCModelForSequenceClassification

model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = INCModelForSequenceClassification.from_pretrained(model_id)

You can find more examples in the documentation and in the examples.

ONNX + ONNX Runtime

Before you begin, make sure you have all the necessary libraries installed:

pip install optimum[exporters,onnxruntime]

It is possible to export 🤗 Transformers and Diffusers models to the ONNX format and perform graph optimization as well as quantization easily:

optimum-cli export onnx -m deepset/roberta-base-squad2 --optimize O2 roberta_base_qa_onnx

The model can then be quantized using onnxruntime:

optimum-cli onnxruntime quantize \
  --avx512 \
  --onnx_model roberta_base_qa_onnx \
  -o quantized_roberta_base_qa_onnx

These commands export deepset/roberta-base-squad2, perform O2 graph optimization on the exported model, and finally quantize it with the avx512 configuration.

For more information on the ONNX export, please check the documentation.

Run the exported model using ONNX Runtime

Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seamless manner, using ONNX Runtime as the backend:

- from transformers import AutoModelForQuestionAnswering
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
  from transformers import AutoTokenizer, pipeline

  model_id = "deepset/roberta-base-squad2"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForQuestionAnswering.from_pretrained(model_id)
+ model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
  qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
  question = "What's Optimum?"
  context = "Optimum is an awesome library everyone should use!"
  results = qa_pipe(question=question, context=context)

More details on how to run ONNX models with ORTModelForXXX classes here.
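
If you also ran the quantization command above, the quantized model can be loaded from its output directory in the same way; a minimal sketch, assuming the quantizer wrote its output with the default model_quantized.onnx file name:

from optimum.onnxruntime import ORTModelForQuestionAnswering

# Load the quantized model produced by `optimum-cli onnxruntime quantize`
quantized_model = ORTModelForQuestionAnswering.from_pretrained(
    "quantized_roberta_base_qa_onnx",
    file_name="model_quantized.onnx",
)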

TensorFlow Lite

Before you begin, make sure you have all the necessary libraries installed:

pip install optimum[exporters-tf]

Just as for ONNX, it is possible to export models to TensorFlow Lite and quantize them:

optimum-cli export tflite \
  -m deepset/roberta-base-squad2 \
  --sequence_length 384  \
  --quantize int8-dynamic roberta_tflite_model
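
Optimum handles the export; the resulting model can then be run with the standard TensorFlow Lite interpreter. A minimal sketch, assuming the exported file is named model.tflite inside the output directory:

import tensorflow as tf

# Load the exported TensorFlow Lite model and prepare it for inference
interpreter = tf.lite.Interpreter(model_path="roberta_tflite_model/model.tflite")
interpreter.allocate_tensors()
print(interpreter.get_input_details())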

Accelerated training

🤗 Optimum provides wrappers around the original 🤗 Transformers Trainer to make training on powerful hardware easy. We support many providers:

  • Habana's Gaudi processors
  • AWS Trainium instances, check here
  • ONNX Runtime (optimized for GPUs)

Habana

Before you begin, make sure you have all the necessary libraries installed:

pip install --upgrade --upgrade-strategy eager optimum[habana]

- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForXxx.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
      output_dir="path/to/save/folder/",
+     use_habana=True,
+     use_lazy_mode=True,
+     gaudi_config_name="Habana/bert-base-uncased",
      ...
  )

  # Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use Habana Gaudi processor for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

ONNX Runtime

- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
      output_dir="path/to/save/folder/",
      optim="adamw_ort_fused",
      ...
  )

  # Create an ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use ONNX Runtime for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

Quanto

Quanto is a PyTorch quantization backend.

You can quantize a model either using the Python API or the optimum-cli.

from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint4

model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B')
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint4, exclude='lm_head')

The quantized model can be saved using save_pretrained:

qmodel.save_pretrained('./Llama-3.1-8B-quantized')

It can later be reloaded using from_pretrained:

from optimum.quanto import QuantizedModelForCausalLM

qmodel = QuantizedModelForCausalLM.from_pretrained('Llama-3.1-8B-quantized')

You can see more details and examples in the Quanto repository.