TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
Top Related Projects
An Open Source Machine Learning Framework for Everyone
Open standard for machine learning interoperability
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Open deep learning compiler stack for cpu, gpu and specialized accelerators
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
mlpack: a fast, header-only C++ machine learning library
Quick Overview
TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. It's designed to work with models trained in all major frameworks and can be deployed on various NVIDIA GPUs in data centers, embedded platforms, and automotive solutions.
Pros
- Significantly accelerates inference performance on NVIDIA GPUs
- Supports a wide range of deep learning frameworks and model types
- Provides optimization techniques like layer fusion, precision calibration, and dynamic tensor memory
- Offers both C++ and Python APIs for integration flexibility
Cons
- Limited to NVIDIA GPUs, not usable on other hardware platforms
- Can have a steep learning curve for beginners
- Optimization process may require manual tuning for best results
- Limited support for certain custom layers or operations
Code Examples
- Building a TensorRT engine from an ONNX model:
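import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as model:
    parser.parse(model.read())
config = builder.create_builder_config()
# max_workspace_size and build_engine were removed in TensorRT 10; use the calls below instead
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
serialized_engine = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(serialized_engine)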
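The workspace limit in the example above is given in bytes, and 1 << 30 is exactly 1 GiB. As a small illustration of that arithmetic (the helper names gib and mib are hypothetical, not part of TensorRT):

```python
def gib(n: int) -> int:
    """Return n GiB expressed in bytes (1 GiB = 2**30 bytes)."""
    return n << 30


def mib(n: int) -> int:
    """Return n MiB expressed in bytes (1 MiB = 2**20 bytes)."""
    return n << 20


print(gib(1))             # 1073741824
print(gib(1) == 1 << 30)  # True
print(mib(256))           # 268435456
```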
- Performing inference with TensorRT:
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
# Assuming 'engine' is a pre-built TensorRT engine
context = engine.create_execution_context()
# Prepare input and output buffers
input_shape = (1, 3, 224, 224) # Example shape
output_shape = (1, 1000) # Example shape
h_input = cuda.pagelocked_empty(input_shape, dtype=np.float32)
h_output = cuda.pagelocked_empty(output_shape, dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
stream = cuda.Stream()
# Copy input data to device
cuda.memcpy_htod_async(d_input, h_input, stream)
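# Run inference. TensorRT 10 removed execute_async_v2, so bind each buffer by tensor name
# (this assumes tensor 0 is the input and tensor 1 is the output)
context.set_tensor_address(engine.get_tensor_name(0), int(d_input))
context.set_tensor_address(engine.get_tensor_name(1), int(d_output))
context.execute_async_v3(stream_handle=stream.handle)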
# Transfer predictions back
cuda.memcpy_dtoh_async(h_output, d_output, stream)
# Synchronize the stream
stream.synchronize()
# h_output now contains the inference results
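The host and device buffers above are sized from the tensor shapes and the element type. A minimal sketch of that arithmetic, using only the standard library (the DTYPE_BYTES table and the buffer_nbytes helper are illustrative names, not TensorRT or PyCUDA APIs):

```python
import math

# Bytes per element for a few common tensor dtypes (illustrative subset)
DTYPE_BYTES = {"float32": 4, "float16": 2, "int8": 1}


def buffer_nbytes(shape, dtype="float32"):
    """Bytes needed for a dense tensor of the given shape and dtype."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


input_shape = (1, 3, 224, 224)
output_shape = (1, 1000)
print(buffer_nbytes(input_shape))             # 602112
print(buffer_nbytes(output_shape))            # 4000
print(buffer_nbytes(input_shape, "float16"))  # 301056
```

These values match what h_input.nbytes and h_output.nbytes would report for float32 arrays of the same shapes.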
- INT8 Calibration for quantization:
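import tensorrt as trt
class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    # IInt8EntropyCalibrator2 is an interface: subclass it and implement the four methods below
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)  # assumed to yield one list of device pointers per batch
        self.cache_file = cache_file
    def get_batch_size(self):
        return 1
    def get_batch(self, names):
        return next(self.batches, None)  # None signals that calibration data is exhausted
    def read_calibration_cache(self):
        return None  # always recalibrate in this sketch
    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(your_calibration_dataloader)
serialized_engine = builder.build_serialized_network(network, config)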
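To make the idea of INT8 quantization concrete, here is a minimal pure-Python sketch of symmetric quantization with a single scale. Note the swap in technique: TensorRT's entropy calibrator chooses ranges with an entropy-based method, while this sketch uses the simpler max-abs rule purely to show what a scale does. All function names are hypothetical.

```python
def max_abs_scale(values):
    """Symmetric INT8 scale: map the largest magnitude to 127."""
    amax = max(abs(v) for v in values)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest integer step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in quantized]


activations = [-2.54, -1.0, 0.0, 0.5, 2.54]
scale = max_abs_scale(activations)
codes = quantize(activations, scale)
print(codes)  # [-127, -50, 0, 25, 127]
print(dequantize(codes, scale))
```

The round trip through dequantize recovers each value to within half a quantization step, which is the precision that calibration trades for speed.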
Getting Started
-
Install TensorRT:
pip install tensorrt
-
Convert your model to ONNX format if not already done.
-
Use the TensorRT Python API to build an engine:
import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("your_model.onnx", "rb") as model:
    parser.parse(model.read())
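parser.parse returns False when the model cannot be parsed, and the parser then exposes its diagnostics through num_errors and get_error. The sketch below wraps that check. parse_onnx and the FakeParser stand-in are hypothetical names, used so the snippet runs without a GPU; in practice a real trt.OnnxParser would be passed in.

```python
def parse_onnx(parser, onnx_bytes):
    """Parse ONNX bytes and raise with the parser's diagnostics on failure."""
    if parser.parse(onnx_bytes):
        return
    errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
    raise RuntimeError("ONNX parse failed: " + "; ".join(errors))


class FakeParser:
    """Stand-in mimicking the three parser members used above (illustration only)."""

    def __init__(self, errors):
        self._errors = errors

    def parse(self, data):
        return not self._errors

    @property
    def num_errors(self):
        return len(self._errors)

    def get_error(self, i):
        return self._errors[i]


parse_onnx(FakeParser([]), b"")  # succeeds silently
try:
    parse_onnx(FakeParser(["unsupported op"]), b"")
except RuntimeError as exc:
    print(exc)  # ONNX parse failed: unsupported op
```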
Competitor Comparisons
An Open Source Machine Learning Framework for Everyone
Pros of TensorFlow
- Broader ecosystem with extensive libraries and tools
- Supports a wider range of hardware platforms
- More flexible and customizable for various ML tasks
Cons of TensorFlow
- Generally slower inference performance than TensorRT
- Steeper learning curve for optimization and deployment
Code Comparison
TensorFlow:
import tensorflow as tf
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
])
TensorRT:
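import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)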
Key Differences
- TensorFlow is a complete ML framework, while TensorRT focuses on optimizing inference
- TensorRT is specifically designed for NVIDIA GPUs, offering better performance on supported hardware
- TensorFlow provides more flexibility in model development, while TensorRT excels in deployment optimization
Use Cases
- TensorFlow: General-purpose ML development, research, and prototyping
- TensorRT: High-performance inference deployment on NVIDIA GPUs, especially in production environments
Open standard for machine learning interoperability
Pros of ONNX
- Platform-independent, supporting a wide range of frameworks and hardware
- Open-source with broad industry support and collaboration
- Extensive ecosystem of tools and libraries for model conversion and optimization
Cons of ONNX
- May require additional steps for deployment and optimization
- Performance can vary depending on the target platform and implementation
Code Comparison
ONNX model loading:
import onnx
model = onnx.load("model.onnx")
TensorRT model loading:
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
parser.parse_from_file("model.onnx")
Key Differences
- ONNX focuses on model interoperability and standardization
- TensorRT specializes in NVIDIA GPU optimization and acceleration
- ONNX provides a common format for various frameworks, while TensorRT is tailored for high-performance inference on NVIDIA hardware
- TensorRT offers advanced optimization techniques specific to NVIDIA GPUs
- ONNX has broader ecosystem support, while TensorRT excels in NVIDIA-specific deployments
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Pros of PyTorch
- More flexible and user-friendly for research and prototyping
- Supports dynamic computational graphs, allowing for easier debugging
- Larger community and ecosystem with more resources and third-party libraries
Cons of PyTorch
- Generally slower inference performance compared to TensorRT
- Less optimized for deployment on NVIDIA hardware
- Requires more manual optimization for production environments
Code Comparison
PyTorch:
import torch
model = torch.nn.Linear(10, 5)
input_tensor = torch.randn(3, 10)
output = model(input_tensor)
TensorRT:
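import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
parser.parse_from_file("model.onnx")
# build_cuda_engine was removed in newer TensorRT releases; build with a builder config instead
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)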
The PyTorch example shows a simple linear model creation and inference, while the TensorRT example demonstrates the process of parsing an ONNX model and building an optimized CUDA engine. TensorRT requires more setup but offers better performance for deployment on NVIDIA hardware.
Open deep learning compiler stack for cpu, gpu and specialized accelerators
Pros of TVM
- Hardware-agnostic: Supports a wide range of hardware platforms, not limited to NVIDIA GPUs
- Open-source community: Larger and more diverse contributor base, potentially leading to faster innovation
- Flexibility: Offers more customization options for optimizing deep learning models
Cons of TVM
- Learning curve: Generally requires more expertise to use effectively compared to TensorRT
- Performance: May not achieve the same level of optimization as TensorRT on NVIDIA hardware
- Maturity: Less mature ecosystem and tooling compared to TensorRT
Code Comparison
TVM example:
import tvm
from tvm import relay
import tvm.relay.testing  # imported explicitly so that relay.testing is available
# Define and compile a model
mod, params = relay.testing.resnet.get_workload()
target = tvm.target.cuda()
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target, params=params)
TensorRT example:
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
model_path = "model.onnx"  # path to the ONNX model to parse
# Create a TensorRT builder and network
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
parser.parse_from_file(model_path)
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Pros of ONNX Runtime
- Cross-platform compatibility: Supports a wide range of hardware and operating systems
- Broader deployment targets: Runs ONNX models exported from various frameworks on many backends, not just NVIDIA GPUs
- Active open-source community: More frequent updates and contributions
Cons of ONNX Runtime
- Generally lower performance on NVIDIA GPUs compared to TensorRT
- Less optimized for NVIDIA-specific hardware features
- May require additional steps for model conversion and optimization
Code Comparison
ONNX Runtime:
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
output = session.run(None, {"input": input_data})
TensorRT:
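import tensorrt as trt
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(engine_bytes)  # engine_bytes: a serialized engine
context = engine.create_execution_context()
# execute_v2 takes device pointers for every input and output and returns a success flag;
# d_input and d_output are device buffers allocated beforehand (for example with PyCUDA)
ok = context.execute_v2([int(d_input), int(d_output)])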
Both ONNX Runtime and TensorRT are powerful inference engines, but they cater to different use cases. ONNX Runtime offers greater flexibility and cross-platform support, making it suitable for a wide range of applications. TensorRT, on the other hand, excels in performance optimization for NVIDIA GPUs, making it the preferred choice for high-performance inference on NVIDIA hardware.
mlpack: a fast, header-only C++ machine learning library
Pros of mlpack
- Open-source and platform-independent, allowing for wider accessibility and community contributions
- Extensive collection of machine learning algorithms and utilities
- Supports multiple programming languages (C++, Python, Julia, Go, R)
Cons of mlpack
- Generally slower performance compared to TensorRT's optimized inference
- Less focus on deep learning and neural network acceleration
- Smaller community and ecosystem compared to NVIDIA-backed projects
Code Comparison
mlpack (C++):
#include <mlpack/core.hpp>
#include <mlpack/methods/neighbor_search/neighbor_search.hpp>
using namespace mlpack;
arma::mat data;
data::Load("dataset.csv", data, true);
NeighborSearch<NearestNeighborSort> nn(data);
TensorRT (C++):
#include "NvInfer.h"
#include "NvOnnxParser.h"
auto builder = nvinfer1::createInferBuilder(logger);
auto network = builder->createNetworkV2(1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
auto parser = nvonnxparser::createParser(*network, logger);
Both repositories offer machine learning capabilities, but mlpack provides a broader range of algorithms and language support, while TensorRT focuses on optimizing deep learning inference for NVIDIA GPUs.
README
TensorRT Open Source Software
This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. It includes the sources for TensorRT plugins and ONNX parser, as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.
- For code contributions to TensorRT-OSS, please see our Contribution Guide and Coding Guidelines.
- For a summary of new additions and updates shipped with TensorRT-OSS releases, please refer to the Changelog.
- For business inquiries, please contact researchinquiries@nvidia.com
- For press and other inquiries, please contact Hector Marinez at hmarinez@nvidia.com
Need enterprise support? NVIDIA global support is available for TensorRT with the NVIDIA AI Enterprise software suite. Check out NVIDIA LaunchPad for free access to a set of hands-on labs with TensorRT hosted on NVIDIA infrastructure.
Join the TensorRT and Triton community and stay current on the latest product updates, bug fixes, content, best practices, and more.
Prebuilt TensorRT Python Package
We provide the TensorRT Python package for an easy installation.
To install:
pip install tensorrt
You can skip the Build section to enjoy TensorRT with Python.
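After installing, it can be handy to confirm which version pip actually resolved. A minimal sketch using only the standard library (installed_version is a hypothetical helper name; it reads package metadata and does not import TensorRT itself):

```python
from importlib import metadata


def installed_version(package: str = "tensorrt"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"tensorrt: {version or 'not installed'}")
```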
Build
Prerequisites
To build the TensorRT-OSS components, you will first need the following software packages.
TensorRT GA build
- TensorRT v10.7.0.23
- Available from direct download links listed below
System Packages
- CUDA
- Recommended versions:
- cuda-12.6.0 + cuDNN-8.9
- cuda-11.8.0 + cuDNN-8.9
- GNU make >= v4.1
- cmake >= v3.13
- python >= v3.8, <= v3.10.x
- pip >= v19.0
- Essential utilities
Optional Packages
- Containerized build
- Docker >= 19.03
- NVIDIA Container Toolkit
- PyPI packages (for demo applications/tests)
- onnx
- onnxruntime
- tensorflow-gpu >= 2.5.1
- Pillow >= 9.0.1
- pycuda < 2021.1
- numpy
- pytest
- Code formatting tools (for contributors)
NOTE: onnx-tensorrt, cub, and protobuf packages are downloaded along with TensorRT OSS, and not required to be installed.
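The Python requirement listed above (>= 3.8 and <= 3.10.x) can be checked before starting a build. A minimal sketch, with hypothetical names MIN_PY, MAX_PY and python_supported:

```python
import sys

MIN_PY = (3, 8)   # lowest supported major.minor from the prerequisites
MAX_PY = (3, 10)  # highest supported major.minor from the prerequisites


def python_supported(version=None):
    """True if the (major, minor) part of the version lies in the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_PY <= major_minor <= MAX_PY


print(python_supported((3, 9, 1)))   # True
print(python_supported((3, 11, 0)))  # False
print(python_supported())            # depends on the running interpreter
```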
Downloading TensorRT Build
-
Download TensorRT OSS
git clone -b main https://github.com/nvidia/TensorRT TensorRT
cd TensorRT
git submodule update --init --recursive
-
(Optional - if not using TensorRT container) Specify the TensorRT GA release build path
If using the TensorRT OSS build container, TensorRT libraries are preinstalled under /usr/lib/x86_64-linux-gnu and you may skip this step.
Otherwise, download and extract the TensorRT GA build from NVIDIA Developer Zone with the direct links below:
- TensorRT 10.7.0.23 for CUDA 11.8, Linux x86_64
- TensorRT 10.7.0.23 for CUDA 12.6, Linux x86_64
- TensorRT 10.7.0.23 for CUDA 11.8, Windows x86_64
- TensorRT 10.7.0.23 for CUDA 12.6, Windows x86_64
Example: Ubuntu 20.04 on x86-64 with cuda-12.6
cd ~/Downloads
tar -xvzf TensorRT-10.7.0.23.Linux.x86_64-gnu.cuda-12.6.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-10.7.0.23
Example: Windows on x86-64 with cuda-12.6
Expand-Archive -Path TensorRT-10.7.0.23.Windows.win10.cuda-12.6.zip
$env:TRT_LIBPATH="$pwd\TensorRT-10.7.0.23\lib"
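The later CMake steps read TRT_LIBPATH, so a quick check that it is set and points at a real directory can catch mistakes early. A minimal sketch (check_trt_libpath is a hypothetical helper, not part of the TensorRT tooling):

```python
import os


def check_trt_libpath(env=None):
    """Return (ok, detail) describing whether TRT_LIBPATH is usable."""
    env = os.environ if env is None else env
    path = env.get("TRT_LIBPATH")
    if not path:
        return False, "TRT_LIBPATH is not set"
    if not os.path.isdir(path):
        return False, f"TRT_LIBPATH does not point to a directory: {path}"
    return True, path


ok, detail = check_trt_libpath()
print(ok, detail)
```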
Setting Up The Build Environment
For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, please install the prerequisite System Packages.
-
Generate the TensorRT-OSS build container.
The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build scripts. The build containers are configured for building TensorRT OSS out-of-the-box.
Example: Ubuntu 20.04 on x86-64 with cuda-12.6 (default)
./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.6
Example: Rockylinux8 on x86-64 with cuda-12.6
./docker/build.sh --file docker/rockylinux8.Dockerfile --tag tensorrt-rockylinux8-cuda12.6
Example: Ubuntu 22.04 cross-compile for Jetson (aarch64) with cuda-12.6 (JetPack SDK)
./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda12.6
Example: Ubuntu 22.04 on aarch64 with cuda-12.6
./docker/build.sh --file docker/ubuntu-22.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu22.04-cuda12.6
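The four examples above differ only in the Dockerfile and tag, so they can be captured in a small lookup table. A sketch under that assumption (the dictionary keys and the build_command helper are hypothetical labels; the Dockerfile paths and tags are copied from the examples):

```python
# Platform label -> (Dockerfile, image tag), taken from the examples above
BUILD_CONTAINERS = {
    "ubuntu20.04-x86_64": ("docker/ubuntu-20.04.Dockerfile", "tensorrt-ubuntu20.04-cuda12.6"),
    "rockylinux8-x86_64": ("docker/rockylinux8.Dockerfile", "tensorrt-rockylinux8-cuda12.6"),
    "jetson-cross-aarch64": ("docker/ubuntu-cross-aarch64.Dockerfile", "tensorrt-jetpack-cuda12.6"),
    "ubuntu22.04-aarch64": ("docker/ubuntu-22.04-aarch64.Dockerfile", "tensorrt-aarch64-ubuntu22.04-cuda12.6"),
}


def build_command(platform):
    """Return the docker/build.sh argument list for a known platform label."""
    dockerfile, tag = BUILD_CONTAINERS[platform]
    return ["./docker/build.sh", "--file", dockerfile, "--tag", tag]


print(" ".join(build_command("ubuntu20.04-x86_64")))
# ./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.6
```

Returning a list rather than a string lets the result be passed directly to subprocess.run without shell quoting.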
-
Launch the TensorRT-OSS build container.
Example: Ubuntu 20.04 build container
./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.6 --gpus all
NOTE:
1. Use the --tag corresponding to the build container generated in Step 1.
2. NVIDIA Container Toolkit is required for GPU access (running TensorRT applications) inside the build container.
3. The sudo password for Ubuntu build containers is 'nvidia'.
4. Specify the port number using --jupyter <port> for launching Jupyter notebooks.
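The launch options mentioned in the notes (--tag, --gpus and --jupyter <port>) can be assembled programmatically. A minimal sketch, where launch_command is a hypothetical helper and only the flags shown above are used:

```python
def launch_command(tag, gpus="all", jupyter_port=None):
    """Build the docker/launch.sh argument list from the documented flags."""
    cmd = ["./docker/launch.sh", "--tag", tag]
    if gpus:
        cmd += ["--gpus", gpus]
    if jupyter_port is not None:
        cmd += ["--jupyter", str(jupyter_port)]
    return cmd


print(" ".join(launch_command("tensorrt-ubuntu20.04-cuda12.6")))
# ./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.6 --gpus all
print(" ".join(launch_command("tensorrt-ubuntu20.04-cuda12.6", jupyter_port=8888)))
# ./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.6 --gpus all --jupyter 8888
```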
Building TensorRT-OSS
-
Generate Makefiles and build.
Example: Linux (x86-64) build with default cuda-12.6
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
make -j$(nproc)
Example: Linux (aarch64) build with default cuda-12.6
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
make -j$(nproc)
Example: Native build on Jetson (aarch64) with cuda-12.6
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=12.6
CC=/usr/bin/gcc make -j$(nproc)
NOTE: C compiler must be explicitly specified via CC= for native aarch64 builds of protobuf.
Example: Ubuntu 22.04 Cross-Compile for Jetson (aarch64) with cuda-12.6 (JetPack)
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=12.6 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-12.6/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-12.6/targets/aarch64-linux/lib/stubs/libcublasLt.so -DTRT_LIB_DIR=/pdk_files/tensorrt/lib
make -j$(nproc)
Example: Native builds on Windows (x86) with cuda-12.6
cd $TRT_OSSPATH
mkdir -p build
cd build
cmake .. -DTRT_LIB_DIR="$env:TRT_LIBPATH" -DCUDNN_ROOT_DIR="$env:CUDNN_PATH" -DTRT_OUT_DIR="$pwd\\out"
msbuild TensorRT.sln /property:Configuration=Release -m:$env:NUMBER_OF_PROCESSORS
NOTE:
1. The default CUDA version used by CMake is 12.4.0. To override this, for example to 11.8, append -DCUDA_VERSION=11.8 to the cmake command.
-
Required CMake build arguments are:
- TRT_LIB_DIR: Path to the TensorRT installation directory containing libraries.
- TRT_OUT_DIR: Output directory where generated build artifacts will be copied.
-
Optional CMake build arguments:
- CMAKE_BUILD_TYPE: Specify if binaries generated are for release or debug (contain debug symbols). Values consist of [Release] | Debug.
- CUDA_VERSION: The version of CUDA to target, for example [11.7.1].
- CUDNN_VERSION: The version of cuDNN to target, for example [8.6].
- PROTOBUF_VERSION: The version of Protobuf to use, for example [3.0.0]. Note: Changing this will not configure CMake to use a system version of Protobuf, it will configure CMake to download and try building that version.
- CMAKE_TOOLCHAIN_FILE: The path to a toolchain file for cross compilation.
- BUILD_PARSERS: Specify if the parsers should be built, for example [ON] | OFF. If turned OFF, CMake will try to find precompiled versions of the parser libraries to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
- BUILD_PLUGINS: Specify if the plugins should be built, for example [ON] | OFF. If turned OFF, CMake will try to find a precompiled version of the plugin library to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
- BUILD_SAMPLES: Specify if the samples should be built, for example [ON] | OFF.
- GPU_ARCHS: GPU (SM) architectures to target. By default we generate CUDA code for all major SMs. Specific SM versions can be specified here as a quoted space-separated list to reduce compilation time and binary size. Table of compute capabilities of NVIDIA GPUs can be found here. Examples:
  - NVIDIA A100: -DGPU_ARCHS="80"
  - Tesla T4, GeForce RTX 2080: -DGPU_ARCHS="75"
  - Titan V, Tesla V100: -DGPU_ARCHS="70"
  - Multiple SMs: -DGPU_ARCHS="80 75"
- TRT_PLATFORM_ID: Bare-metal build (unlike containerized cross-compilation). Currently supported options: x86_64 (default).
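The required and optional arguments above all follow the -DNAME=VALUE pattern, so a configure command can be assembled from plain values. A minimal sketch (cmake_configure_command and the example path /opt/tensorrt/lib are hypothetical; only argument names listed above are used):

```python
def cmake_configure_command(trt_lib_dir, trt_out_dir, gpu_archs=None, **options):
    """Build a cmake argument list from the required and optional build arguments."""
    cmd = ["cmake", "..", f"-DTRT_LIB_DIR={trt_lib_dir}", f"-DTRT_OUT_DIR={trt_out_dir}"]
    if gpu_archs:
        # GPU_ARCHS is a space-separated list; as one list element it needs no shell quoting
        cmd.append("-DGPU_ARCHS=" + " ".join(str(a) for a in gpu_archs))
    for name, value in sorted(options.items()):
        cmd.append(f"-D{name}={value}")
    return cmd


cmd = cmake_configure_command(
    "/opt/tensorrt/lib", "out",
    gpu_archs=[80, 75],
    CUDA_VERSION="11.8",
    BUILD_SAMPLES="OFF",
)
print(cmd)
```

The list form is meant for subprocess.run(cmd); when typing the same command in a shell, the GPU_ARCHS value needs the quotes shown in the examples.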
References
TensorRT Resources
- TensorRT Developer Home
- TensorRT QuickStart Guide
- TensorRT Developer Guide
- TensorRT Sample Support Guide
- TensorRT ONNX Tools
- TensorRT Discussion Forums
- TensorRT Release Notes
Known Issues
- Please refer to TensorRT Release Notes