DALI
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
Top Related Projects
An Open Source Machine Learning Framework for Everyone
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Quick Overview
NVIDIA DALI (Data Loading Library) is a GPU-accelerated library for data loading and preprocessing, primarily designed for deep learning applications. It aims to optimize data pipelines, reducing CPU bottlenecks and improving overall training performance, especially for large-scale datasets and complex preprocessing operations.
Pros
- GPU-accelerated data processing, significantly improving performance
- Seamless integration with popular deep learning frameworks like PyTorch and TensorFlow
- Supports a wide range of data formats and preprocessing operations
- Highly optimized for NVIDIA GPUs, leveraging hardware capabilities
Cons
- Limited support for non-NVIDIA hardware
- Steeper learning curve compared to standard data loading libraries
- May require additional GPU memory for data processing
- Not as widely adopted as some other data loading solutions
Code Examples
- Basic image classification pipeline:
import nvidia.dali as dali
import nvidia.dali.fn as fn

@dali.pipeline_def
def image_pipeline():
    images, labels = fn.readers.file(file_root="./data", random_shuffle=True)
    images = fn.decoders.image(images, device="mixed")
    images = fn.resize(images, size=(224, 224))
    images = fn.crop_mirror_normalize(images,
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels
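The `mean` and `std` values in the pipeline above are the usual ImageNet channel statistics multiplied by 255, because decoded images hold 8-bit values in the 0-255 range. The following pure-Python sketch shows the per-channel arithmetic; the helper `normalize_pixel` is a hypothetical name used only for illustration and is not part of DALI.

```python
# ImageNet channel statistics, as used in the pipeline above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(pixel, mean, std):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]


pixel_uint8 = [128, 64, 255]  # raw 8-bit RGB values (made-up example)

# Option 1: keep pixels in 0-255 and scale the statistics by 255.
scaled = normalize_pixel(
    pixel_uint8,
    [m * 255 for m in MEAN],
    [s * 255 for s in STD],
)

# Option 2: rescale pixels to 0-1 first and use the statistics unchanged.
unit = normalize_pixel([v / 255 for v in pixel_uint8], MEAN, STD)

# Both options give the same normalized values up to rounding error.
assert all(abs(a - b) < 1e-9 for a, b in zip(scaled, unit))
```

Scaling the statistics instead of the pixels avoids a separate division pass over the image data.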
- Data augmentation example:
@dali.pipeline_def
def augmentation_pipeline():
    images, labels = fn.readers.file(file_root="./data", random_shuffle=True)
    images = fn.decoders.image(images, device="mixed")
    images = fn.resize(images, size=(256, 256))
    images = fn.random_resized_crop(images, size=(224, 224))
    images = fn.rotate(images, angle=fn.random.uniform(range=(-10, 10)))
    images = fn.brightness_contrast(images, brightness=fn.random.uniform(range=(0.9, 1.1)))
    return images, labels
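In the augmentation pipeline above, the random operators produce a separate value for every sample in a batch, so each image gets its own rotation angle and brightness factor. A rough pure-Python sketch of that idea follows; `sample_augmentation_params` and the fixed seed are assumptions made for illustration and do not mirror DALI internals.

```python
import random


def sample_augmentation_params(batch_size, seed=0):
    """Draw one (angle, brightness) pair per sample in the batch."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [
        {
            "angle": rng.uniform(-10.0, 10.0),    # degrees, same range as fn.rotate above
            "brightness": rng.uniform(0.9, 1.1),  # multiplier, same range as above
        }
        for _ in range(batch_size)
    ]


params = sample_augmentation_params(batch_size=4)
for i, p in enumerate(params):
    print(f"sample {i}: angle={p['angle']:+.2f} brightness={p['brightness']:.3f}")
```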
- Video processing pipeline:
@dali.pipeline_def
def video_pipeline():
    # fn.readers.video decodes on the GPU and returns frame sequences directly,
    # so no separate decoder step is needed. file_list is a text file with one
    # "path label" entry per line; sequence_length=16 is only an example value.
    frames, labels = fn.readers.video(device="gpu", file_list="./video_list.txt",
                                      sequence_length=16)
    frames = fn.resize(frames, size=(224, 224))
    frames = fn.crop_mirror_normalize(frames,
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return frames, labels
Getting Started
- Install NVIDIA DALI:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110
- Create a simple pipeline:
import nvidia.dali as dali
import nvidia.dali.fn as fn

@dali.pipeline_def
def simple_pipeline():
    images, labels = fn.readers.file(file_root="./data", random_shuffle=True)
    images = fn.decoders.image(images, device="mixed")
    images = fn.resize(images, size=(224, 224))
    return images, labels

pipe = simple_pipeline(batch_size=32, num_threads=4, device_id=0)
pipe.build()
# run() executes one iteration and returns a tuple with one entry per pipeline output
images, labels = pipe.run()
Competitor Comparisons
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Broader ecosystem and community support
- More comprehensive machine learning framework with wider range of applications
- Extensive documentation and learning resources
Cons of TensorFlow
- Can be slower for data loading and preprocessing compared to DALI
- More complex setup and configuration for GPU acceleration
Code Comparison
TensorFlow data loading:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.shuffle(buffer_size).batch(batch_size)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
DALI data loading:
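import nvidia.dali as dali

pipe = dali.pipeline.Pipeline(batch_size=batch_size, num_threads=num_threads, device_id=0)
with pipe:
    images, labels = dali.fn.readers.file(file_root=data_dir, random_shuffle=True)
    images = dali.fn.decoders.image(images, device='mixed')
    # outputs must be registered when the pipeline is defined with a context manager
    pipe.set_outputs(images, labels)
pipe.build()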
DALI focuses on efficient data loading and preprocessing, particularly for GPU-accelerated workflows, and can outperform TensorFlow's native tf.data pipeline when CPU-side preprocessing is the bottleneck. TensorFlow, in turn, is a complete machine learning framework with a larger ecosystem, more extensive documentation, and broader community support. The two are complementary rather than alternatives: a DALI pipeline can replace the data loader that feeds a TensorFlow model.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- Broader ecosystem and community support
- More flexible and general-purpose deep learning framework
- Extensive documentation and tutorials
Cons of PyTorch
- Less optimized for data loading and preprocessing
- May require additional libraries for efficient GPU-accelerated data pipelines
Code Comparison
PyTorch:
import torch
from torchvision import transforms
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
DALI:
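import nvidia.dali as dali

pipeline = dali.pipeline.Pipeline(batch_size=32, num_threads=4, device_id=0)
with pipeline:
    images, labels = dali.fn.readers.file(file_root=data_dir, random_shuffle=True)
    # decode first: crop_mirror_normalize operates on decoded images
    images = dali.fn.decoders.image(images, device="mixed")
    images = dali.fn.crop_mirror_normalize(images, dtype=dali.types.FLOAT)
    pipeline.set_outputs(images, labels)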
DALI is specifically designed for efficient data loading and preprocessing, particularly for GPU-accelerated workflows. It offers optimized performance for large-scale datasets and complex data augmentation pipelines. PyTorch, on the other hand, provides a more comprehensive deep learning framework with a wider range of applications beyond just data loading. While PyTorch's data loading capabilities are sufficient for many use cases, DALI can be a valuable addition for performance-critical scenarios, especially when working with large datasets or complex preprocessing requirements.
Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
Pros of MXNet
- Broader ecosystem support and integration with other Apache projects
- More extensive documentation and community resources
- Supports a wider range of deep learning models and applications
Cons of MXNet
- Generally slower performance for data loading and preprocessing
- Less optimized for GPU acceleration, especially on NVIDIA hardware
- More complex setup and configuration for some use cases
Code Comparison
MXNet data loading:
import mxnet as mx
dataset = mx.gluon.data.vision.MNIST('data/mnist')
dataloader = mx.gluon.data.DataLoader(dataset, batch_size=32, shuffle=True)
DALI data loading:
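from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def
def mnist_pipeline():
    # Assumption: MNIST has been converted to a Caffe2-style LMDB under data/mnist,
    # since DALI reads datasets through format-specific readers
    images, labels = fn.readers.caffe2(path="data/mnist")
    return images, labels

pipe = mnist_pipeline(batch_size=32, num_threads=4, device_id=0)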
DALI offers more streamlined and GPU-accelerated data loading, while MXNet provides a more familiar interface for those coming from other deep learning frameworks. DALI is specifically designed for high-performance data loading and preprocessing, whereas MXNet is a full-featured deep learning framework with broader capabilities but potentially lower performance in data pipeline operations.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Pros of Horovod
- Framework-agnostic: Works with TensorFlow, Keras, PyTorch, and MXNet
- Easier to scale distributed training across multiple GPUs and nodes
- Built around data-parallel training with allreduce-based gradient averaging
Cons of Horovod
- Requires more setup and configuration compared to DALI
- May have higher communication overhead in some scenarios
- Less optimized for NVIDIA-specific hardware acceleration
Code Comparison
Horovod:
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# scale the learning rate by the number of workers
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
DALI:
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def
def image_pipeline():
    # the file reader returns two outputs: encoded files and their labels
    images, labels = fn.readers.file(file_root="path/to/data")
    return images, labels
Key Differences
- Horovod focuses on distributed training across multiple devices and nodes
- DALI specializes in efficient data loading and preprocessing for deep learning
- Horovod is more versatile across frameworks, while DALI is optimized for NVIDIA GPUs
- DALI offers better performance for data loading and augmentation tasks
- Horovod provides more flexibility for scaling training across different hardware setups
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
Pros of XGBoost
- Widely adopted and well-established in machine learning community
- Supports multiple programming languages (Python, R, Java, etc.)
- Excellent performance for structured/tabular data
Cons of XGBoost
- Less optimized for GPU acceleration compared to DALI
- Not specifically designed for data loading and preprocessing
Code Comparison
XGBoost (Python):
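import xgboost as xgb

# CSV input needs an explicit format; label_column=0 assumes the label is in the first column
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
# multi:softprob requires num_class; 3 is a placeholder for the real number of classes
param = {'max_depth': 3, 'eta': 0.3, 'objective': 'multi:softprob', 'num_class': 3}
bst = xgb.train(param, dtrain, num_boost_round=20)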
DALI (Python):
import nvidia.dali as dali

pipe = dali.pipeline.Pipeline(batch_size=32, num_threads=4, device_id=0)
with pipe:
    images, labels = dali.fn.readers.file(file_root='path/to/data')
    images = dali.fn.decoders.image(images, device='mixed')
    pipe.set_outputs(images, labels)
pipe.build()
While XGBoost focuses on gradient boosting for machine learning tasks, DALI specializes in efficient data loading and preprocessing, particularly for deep learning workflows. XGBoost is more versatile across different data types, while DALI excels in handling image and video data with GPU acceleration.
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Broader hardware support, including CPUs and various accelerators
- Extensive ecosystem with wide language and framework compatibility
- More comprehensive model optimization techniques
Cons of ONNX Runtime
- May have higher overhead for simple models or small batch sizes
- Less specialized for GPU-accelerated data processing pipelines
Code Comparison
ONNX Runtime:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
DALI:
import nvidia.dali as dali

pipe = dali.pipeline.Pipeline(batch_size=32, num_threads=4, device_id=0)
with pipe:
    # the file reader returns two outputs: encoded files and labels
    encoded, labels = dali.fn.readers.file(file_root="images/")
    output = dali.fn.decoders.image(encoded, device="mixed")
    pipe.set_outputs(output)
DALI focuses on efficient GPU-accelerated data loading and preprocessing, while ONNX Runtime is a more general-purpose inference engine. DALI excels in high-performance data pipelines, particularly for image and video processing on NVIDIA GPUs. ONNX Runtime offers broader compatibility and optimization across various hardware platforms and model types.
README
|License| |Documentation| |Format|
NVIDIA DALI
.. overview-begin-marker-do-not-remove
The NVIDIA Data Loading Library (DALI) is a GPU-accelerated library for data loading and pre-processing to accelerate deep learning applications. It provides a collection of highly optimized building blocks for loading and processing image, video and audio data. It can be used as a portable drop-in replacement for built-in data loaders and data iterators in popular deep learning frameworks.
Deep learning applications require complex, multi-stage data processing pipelines that include loading, decoding, cropping, resizing, and many other augmentations. These data processing pipelines, which are currently executed on the CPU, have become a bottleneck, limiting the performance and scalability of training and inference.
DALI addresses the problem of the CPU bottleneck by offloading data preprocessing to the GPU. Additionally, DALI relies on its own execution engine, built to maximize the throughput of the input pipeline. Features such as prefetching, parallel execution, and batch processing are handled transparently for the user.
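To make the prefetching idea above concrete, here is a small standard-library sketch in which a background thread prepares batches ahead of the consumer. It illustrates the concept only and is not how DALI's execution engine is implemented; `prefetching_loader`, `make_batch`, and `prefetch_depth` are hypothetical names.

```python
import queue
import threading


def prefetching_loader(make_batch, num_batches, prefetch_depth=2):
    """Yield batches that a background thread prepares ahead of time."""
    buffer = queue.Queue(maxsize=prefetch_depth)  # bounded, so memory use stays limited
    sentinel = object()

    def producer():
        for i in range(num_batches):
            buffer.put(make_batch(i))  # blocks while the buffer is full
        buffer.put(sentinel)           # signals that no more batches follow

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = buffer.get()
        if batch is sentinel:
            return
        yield batch


def make_batch(index, batch_size=4):
    """Stand-in for loading and preprocessing one batch."""
    start = index * batch_size
    return list(range(start, start + batch_size))


batches = list(prefetching_loader(make_batch, num_batches=3))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

While the consumer works on one batch, the producer is already filling the buffer with the next ones, which is the overlap that hides data-loading latency.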
In addition, the deep learning frameworks have multiple data pre-processing implementations, resulting in challenges such as portability of training and inference workflows, and code maintainability. Data processing pipelines implemented using DALI are portable because they can easily be retargeted to TensorFlow, PyTorch, and PaddlePaddle.
.. image:: /dali.png
    :width: 800
    :align: center
    :alt: DALI Diagram
DALI in action:
.. code-block:: python

    from nvidia.dali.pipeline import pipeline_def
    import nvidia.dali.types as types
    import nvidia.dali.fn as fn
    from nvidia.dali.plugin.pytorch import DALIGenericIterator
    import os

    # To run with different data, see documentation of nvidia.dali.fn.readers.file
    # points to https://github.com/NVIDIA/DALI_extra
    data_root_dir = os.environ['DALI_EXTRA_PATH']
    images_dir = os.path.join(data_root_dir, 'db', 'single', 'jpeg')


    def loss_func(pred, y):
        pass


    def model(x):
        pass


    def backward(loss, model):
        pass


    @pipeline_def(num_threads=4, device_id=0)
    def get_dali_pipeline():
        images, labels = fn.readers.file(
            file_root=images_dir, random_shuffle=True, name="Reader")
        # decode data on the GPU
        images = fn.decoders.image_random_crop(
            images, device="mixed", output_type=types.RGB)
        # the rest of processing happens on the GPU as well
        images = fn.resize(images, resize_x=256, resize_y=256)
        images = fn.crop_mirror_normalize(
            images,
            crop_h=224,
            crop_w=224,
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
            mirror=fn.random.coin_flip())
        return images, labels


    train_data = DALIGenericIterator(
        [get_dali_pipeline(batch_size=16)],
        ['data', 'label'],
        reader_name='Reader'
    )


    for i, data in enumerate(train_data):
        x, y = data[0]['data'], data[0]['label']
        pred = model(x)
        loss = loss_func(pred, y)
        backward(loss, model)
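
The last preprocessing step in the pipeline above combines cropping, optional mirroring, and normalization in one operator. The pure-Python sketch below shows roughly what such a step computes on a single image. It assumes a center crop with integer rounding and an HWC input stored as nested lists; the function name `crop_mirror_normalize_sketch` is hypothetical and the code does not reproduce DALI's implementation.

```python
def crop_mirror_normalize_sketch(image, crop_h, crop_w, mean, std, mirror):
    """Center-crop an HWC image, optionally flip it, normalize, return CHW."""
    height, width = len(image), len(image[0])
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    rows = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if mirror:
        rows = [row[::-1] for row in rows]  # horizontal flip
    channels = len(mean)
    # Normalize each channel and move the channel axis to the front (HWC -> CHW).
    return [
        [[(px[c] - mean[c]) / std[c] for px in row] for row in rows]
        for c in range(channels)
    ]


# 2x4 single-channel toy image; each pixel is a list of channel values.
image = [
    [[0], [10], [20], [30]],
    [[40], [50], [60], [70]],
]
out = crop_mirror_normalize_sketch(
    image, crop_h=2, crop_w=2, mean=[10], std=[10], mirror=True)
print(out)  # [[[1.0, 0.0], [5.0, 4.0]]]
```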
Highlights
- Easy-to-use functional style Python API.
- Support for multiple data formats: LMDB, RecordIO, TFRecord, COCO, JPEG, JPEG 2000, WAV, FLAC, OGG, H.264, VP9 and HEVC.
- Portable across popular deep learning frameworks: TensorFlow, PyTorch, PaddlePaddle, JAX.
- Supports CPU and GPU execution.
- Scalable across multiple GPUs.
- Flexible graphs let developers create custom pipelines.
- Extensible for user-specific needs with custom operators.
- Accelerates image classification (ResNet-50) and object detection (SSD) workloads, as well as ASR models (Jasper, RNN-T).
- Allows direct data path between storage and GPU memory with `GPUDirect Storage <https://developer.nvidia.com/gpudirect-storage>`__.
- Easy integration with `NVIDIA Triton Inference Server <https://developer.nvidia.com/nvidia-triton-inference-server>`__
  with `DALI TRITON Backend <https://github.com/triton-inference-server/dali_backend>`__.
- Open source.
.. overview-end-marker-do-not-remove
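
Scaling across multiple GPUs, as listed in the highlights above, usually means that each GPU reads a different part of the dataset. DALI readers accept `shard_id` and `num_shards` arguments for this purpose. The sketch below shows one common way to split a dataset into contiguous shards; the helper `shard_range` and the exact rounding are assumptions for illustration and may differ from the partitioning DALI applies.

```python
def shard_range(dataset_size, shard_id, num_shards):
    """Return the [start, end) sample indices assigned to one shard."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    start = dataset_size * shard_id // num_shards
    end = dataset_size * (shard_id + 1) // num_shards
    return start, end


# Example: 10 samples split across 4 GPUs.
for gpu in range(4):
    print(gpu, shard_range(10, gpu, 4))
```

Every sample lands in exactly one shard, and shard sizes differ by at most one sample.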
DALI success stories:

- `During Kaggle computer vision competitions <https://www.kaggle.com/code/theoviel/rsna-breast-baseline-faster-inference-with-dali>`__:
  `"DALI is one of the best things I have learned in this competition" <https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/391059>`__
- `Lightning Pose - state of the art pose estimation research model <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10168383/>`__
- `To improve the resource utilization in Advanced Computing Infrastructure <https://arcwiki.rs.gsu.edu/en/dali/using_nvidia_dali_loader>`__
- `MLPerf - the industry standard for benchmarking compute and deep learning hardware and software <https://developer.nvidia.com/blog/mlperf-hpc-v1-0-deep-dive-into-optimizations-leading-to-record-setting-nvidia-performance/>`__
- `"we optimized major models inside eBay with the DALI framework" <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62578/>`__
DALI Roadmap
`The following issue represents <https://github.com/NVIDIA/DALI/issues/5320>`__ a high-level overview of our 2024 plan. You should be aware that this
roadmap may change at any time and the order of its items does not reflect any type of priority.

We strongly encourage you to comment on our roadmap and provide us feedback on the mentioned GitHub issue.
Installing DALI
To install the latest DALI release for the latest CUDA version (12.x)::

    pip install nvidia-dali-cuda120
    # or
    pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda120

DALI requires an `NVIDIA driver <https://www.nvidia.com/drivers>`__ supporting the appropriate CUDA version.
DALI builds based on CUDA 12 additionally require the `CUDA Toolkit <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`__
to be installed.
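
The package names in this document encode the CUDA major version (`nvidia-dali-cuda110`, `nvidia-dali-cuda120`). The helper below builds such a name and checks whether DALI is already importable. The naming pattern is inferred only from those two names, so a package is not guaranteed to exist for every CUDA version; both function names are hypothetical.

```python
import importlib.util


def dali_package_name(cuda_major):
    """Build the pip package name for a given CUDA major version (inferred pattern)."""
    return f"nvidia-dali-cuda{cuda_major}0"


def dali_is_installed():
    """Return True if the nvidia.dali module can be located."""
    try:
        return importlib.util.find_spec("nvidia.dali") is not None
    except ModuleNotFoundError:
        # Raised when the parent package "nvidia" is not installed at all.
        return False


print(dali_package_name(12))  # nvidia-dali-cuda120
print("DALI installed:", dali_is_installed())
```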
DALI comes preinstalled in the `TensorFlow <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow>`__,
`PyTorch <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__,
and `PaddlePaddle <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle>`__
containers on `NVIDIA GPU Cloud <https://ngc.nvidia.com>`__.
For other installation paths (TensorFlow plugin, older CUDA version, nightly and weekly builds, etc),
and specific requirements please refer to the `Installation Guide <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html>`__.

To build DALI from source, please refer to the `Compilation Guide <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/compilation.html>`__.
Examples and Tutorials
An introduction to DALI can be found in the `Getting Started <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/getting_started.html>`__ page.

More advanced examples can be found in the `Examples and Tutorials <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/index.html>`__ page.

For an interactive version (Jupyter notebook) of the examples, go to the `docs/examples <https://github.com/NVIDIA/DALI/blob/main/docs/examples>`__
directory.

**Note:** Select the `Latest Release Documentation <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html>`__
or the `Nightly Release Documentation <https://docs.nvidia.com/deeplearning/dali/main-user-guide/docs/index.html>`__, which stays in sync with the main branch,
depending on your version.
Additional Resources
- GPU Technology Conference 2024; Optimizing Inference Model Serving for Highest Performance at eBay; Yiheng Wang:
  `event <https://www.nvidia.com/en-us/on-demand/session/gtc24-s62578/>`__.
- GPU Technology Conference 2023; Developer Breakout: Accelerating Enterprise Workflows With Triton Server and DALI; Brandon Tuttle:
  `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring23-se52140/>`__.
- GPU Technology Conference 2023; GPU-Accelerating End-to-End Geospatial Workflows; Kevin Green:
  `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51796/>`__.
- GPU Technology Conference 2022; Effective NVIDIA DALI: Accelerating Real-life Deep-learning Applications; Rafał Banaś:
  `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41442/>`__.
- GPU Technology Conference 2022; Introduction to NVIDIA DALI: GPU-accelerated Data Preprocessing; Joaquin Anton Guirao:
  `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41443/>`__.
- GPU Technology Conference 2021; NVIDIA DALI: GPU-Powered Data Preprocessing by Krzysztof Łęcki and Michał Szołucha:
  `event <https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31298/>`__.
- GPU Technology Conference 2020; Fast Data Pre-Processing with NVIDIA Data Loading Library (DALI); Albert Wolant, Joaquin Anton Guirao:
  `recording <https://developer.nvidia.com/gtc/2020/video/s21139>`__.
- GPU Technology Conference 2019; Fast AI data pre-processing with DALI; Janusz Lisiecki, Michał Zientkiewicz:
  `slides <https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9925-fast-ai-data-pre-processing-with-nvidia-dali.pdf>`__,
  `recording <https://developer.nvidia.com/gtc/2019/video/S9925/video>`__.
- GPU Technology Conference 2019; Integration of DALI with TensorRT on Xavier; Josh Park and Anurag Dixit:
  `slides <https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9818-integration-of-tensorrt-with-dali-on-xavier.pdf>`__,
  `recording <https://developer.nvidia.com/gtc/2019/video/S9818/video>`__.
- GPU Technology Conference 2018; Fast data pipeline for deep learning training, T. Gale, S. Layton and P. Trędak:
  `slides <http://on-demand.gputechconf.com/gtc/2018/presentation/s8906-fast-data-pipelines-for-deep-learning-training.pdf>`__,
  `recording <https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8906/>`__.
- `Developer Page <https://developer.nvidia.com/DALI>`__.
- `Blog Posts <https://developer.nvidia.com/blog/tag/dali/>`__.
Contributing to DALI
We welcome contributions to DALI. To contribute to DALI and make pull requests,
follow the guidelines outlined in the `Contributing <https://github.com/NVIDIA/DALI/blob/main/CONTRIBUTING.md>`__
document.

If you are looking for a good first task, check the issues tagged with the
`external contribution welcome label <https://github.com/NVIDIA/DALI/labels/external%20contribution%20welcome>`__.
Reporting Problems, Asking Questions
We appreciate feedback, questions or bug reports. When you need help
with the code, follow the process outlined in the `Stack Overflow <https://stackoverflow.com/help/mcve>`__ document. Ensure that the
posted examples are:
- minimal: Use as little code as possible that still produces the same problem.
- complete: Provide all parts needed to reproduce the problem. Check if you can strip external dependency and still show the problem. The less time we spend on reproducing the problems, the more time we can dedicate to the fixes.
- verifiable: Test the code you are about to provide, to make sure that it reproduces the problem. Remove all other problems that are not related to your request.
Acknowledgements
DALI was originally built with major contributions from Trevor Gale, Przemek Tredak, Simon Layton, Andrei Ivanov and Serge Panev.
.. |License| image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg
   :target: https://opensource.org/licenses/Apache-2.0

.. |Documentation| image:: https://img.shields.io/badge/NVIDIA%20DALI-documentation-brightgreen.svg?longCache=true
   :target: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html

.. |Format| image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/psf/black