NVIDIA TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Top Related Projects

  • DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • FasterTransformer: transformer-related optimization, including BERT and GPT.
  • Optimum: 🚀 accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools.
  • ONNX: open standard for machine learning interoperability.
  • ONNX Runtime: cross-platform, high-performance ML inferencing and training accelerator.
  • MLCommons Inference: reference implementations of MLPerf™ inference benchmarks.

Quick Overview

TensorRT-LLM is an open-source library developed by NVIDIA for optimizing and deploying large language models (LLMs) using NVIDIA GPUs. It leverages TensorRT, NVIDIA's high-performance deep learning inference SDK, to accelerate LLM inference and reduce latency for various applications.

Pros

  • Significantly improves inference speed and reduces latency for LLMs
  • Supports popular LLM architectures like GPT, BERT, and T5
  • Provides optimizations for both single-GPU and multi-GPU/multi-node deployments
  • Offers integration with popular frameworks like PyTorch and Hugging Face Transformers

Cons

  • Limited to NVIDIA GPUs, not usable on other hardware platforms
  • Requires expertise in NVIDIA's ecosystem and tools for optimal use
  • May have a steeper learning curve compared to more general-purpose libraries
  • Documentation and community support might be less extensive than for more established frameworks

Code Examples

  1. Building and running a GPT model:
import tensorrt_llm
from tensorrt_llm.models import GPTLMHeadModel

# Initialize the model
model = GPTLMHeadModel.from_pretrained("gpt2")

# Compile the model for TensorRT
engine = tensorrt_llm.build_engine(model)

# Run inference
output = engine.infer("Hello, world!")
  2. Quantizing a model for improved performance:
import tensorrt_llm
from tensorrt_llm.models import GPTLMHeadModel
from tensorrt_llm.quantization import QuantMode

# Initialize the model with INT8 quantization
model = GPTLMHeadModel.from_pretrained("gpt2", quant_mode=QuantMode.INT8)

# Build the engine with quantization
engine = tensorrt_llm.build_engine(model, precision="int8")
  3. Multi-GPU inference:
import tensorrt_llm
from tensorrt_llm.models import GPTLMHeadModel

# Initialize the model for multi-GPU
model = GPTLMHeadModel.from_pretrained("gpt2", device_map="auto")

# Build engines for multiple GPUs
engines = tensorrt_llm.build_engines(model, num_gpus=4)

# Run inference across multiple GPUs
output = tensorrt_llm.infer_multi_gpu(engines, "Hello, world!")
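
For comparison, TensorRT-LLM also exposes a high-level LLM API (described in the Overview below) that handles engine building and generation end to end. A minimal sketch, assuming the LLM and SamplingParams entry points and a Hugging Face model ID as input; exact names may differ between releases:

from tensorrt_llm import LLM, SamplingParams

# Build or load an optimized engine directly from a Hugging Face checkpoint
# (assumed here to accept a hub ID such as "gpt2"; check your installed version)
llm = LLM(model="gpt2")

# Sampling configuration for generation
params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate completions for a batch of prompts
outputs = llm.generate(["Hello, world!"], params)
for out in outputs:
    print(out.outputs[0].text)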

Getting Started

To get started with TensorRT-LLM:

  1. Install TensorRT and CUDA:
pip install nvidia-tensorrt
  2. Install TensorRT-LLM:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install -e .
  3. Use TensorRT-LLM in your Python code:
import tensorrt_llm
from tensorrt_llm.models import GPTLMHeadModel

model = GPTLMHeadModel.from_pretrained("gpt2")
engine = tensorrt_llm.build_engine(model)
output = engine.infer("Hello, TensorRT-LLM!")
print(output)

Competitor Comparisons

DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More flexible and framework-agnostic, supporting PyTorch, TensorFlow, and other deep learning frameworks
  • Offers a wider range of optimization techniques beyond inference, including training optimizations
  • Provides more extensive documentation and community support

Cons of DeepSpeed

  • May have slightly lower performance for specific NVIDIA hardware compared to TensorRT-LLM
  • Requires more manual configuration and tuning to achieve optimal performance
  • Less specialized for LLM inference, which can result in more complex setup for certain use cases

Code Comparison

DeepSpeed:

import deepspeed
model = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16)
output = model(input_ids)

TensorRT-LLM:

import tensorrt_llm
engine = tensorrt_llm.runtime.Engine("path/to/engine")
output = engine.infer(input_ids)

Both libraries aim to optimize large language model inference, but DeepSpeed offers a more general-purpose solution with broader framework support, while TensorRT-LLM focuses on maximizing performance specifically for NVIDIA GPUs. The code snippets demonstrate that DeepSpeed integrates more closely with existing model objects, while TensorRT-LLM uses a separate engine for inference.

FasterTransformer

Transformer-related optimization, including BERT and GPT.

Pros of FasterTransformer

  • More mature and established project with a longer history
  • Supports a wider range of model architectures and use cases
  • Offers more flexibility in terms of customization and fine-tuning

Cons of FasterTransformer

  • Less optimized for specific LLM inference tasks
  • May require more manual configuration and setup
  • Potentially slower inference speed for certain LLM models

Code Comparison

FasterTransformer:

fastertransformer::Allocator<AllocatorType::CUDA> allocator(0);
fastertransformer::GptJ<half> gpt(
    max_batch_size, max_seq_len, head_num, size_per_head,
    inter_size, num_layer, vocab_size, start_id, end_id,
    &allocator, false, stream, cublas_wrapper, false);

TensorRT-LLM:

builder = Builder()
network = builder.create_network()
config = builder.create_builder_config()
model = GPTJForCausalLM(config)
engine = builder.build_engine(network, config)

The code snippets demonstrate the initialization process for each library. FasterTransformer uses C++ and requires more detailed configuration, while TensorRT-LLM offers a higher-level Python API for model creation and engine building.

Optimum

🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools.

Pros of Optimum

  • Broader model support across various frameworks (PyTorch, TensorFlow, JAX)
  • Easier integration with Hugging Face ecosystem and model hub
  • More extensive documentation and community support

Cons of Optimum

  • Less specialized for NVIDIA hardware optimization
  • May not achieve the same level of performance as TensorRT-LLM for NVIDIA GPUs
  • Potentially more complex setup for specific NVIDIA optimizations

Code Comparison

Optimum:

from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)

TensorRT-LLM:

import tensorrt_llm
engine = tensorrt_llm.runtime.Engine("path/to/engine.plan")
model = tensorrt_llm.runtime.GenerationSession(engine)

Both repositories aim to optimize and accelerate machine learning models, but they have different focuses. Optimum provides a more general-purpose optimization toolkit for various frameworks, while TensorRT-LLM specializes in optimizing large language models for NVIDIA hardware. The choice between them depends on specific hardware requirements, model types, and integration needs within existing workflows.

ONNX

Open standard for machine learning interoperability.

Pros of ONNX

  • Broader ecosystem support and compatibility across various frameworks and hardware
  • More extensive model coverage, supporting a wide range of AI and ML models
  • Active community and contributions from multiple major tech companies

Cons of ONNX

  • Less optimized for NVIDIA GPUs compared to TensorRT-LLM
  • May require additional steps or tools for deployment on specific hardware
  • Potentially slower inference performance for certain models on NVIDIA hardware

Code Comparison

TensorRT-LLM:

import tensorrt_llm

builder = tensorrt_llm.Builder()
network = builder.create_network()
# ... model definition ...
config = builder.create_builder_config()
engine = builder.build_engine(network, config)

ONNX:

import onnx
model = onnx.ModelProto()
# ... model definition ...
onnx.save(model, "model.onnx")

Summary

While ONNX offers broader compatibility and ecosystem support, TensorRT-LLM provides more optimized performance for NVIDIA GPUs, especially for large language models. ONNX is more versatile across different frameworks and hardware, but may require additional optimization steps for specific deployments. TensorRT-LLM focuses on high-performance inference for LLMs on NVIDIA GPUs, offering potentially faster execution for these specific use cases.

ONNX Runtime

Cross-platform, high-performance ML inferencing and training accelerator.

Pros of ONNX Runtime

  • Broader hardware support, including CPUs and various accelerators
  • More extensive ecosystem and integration with other ML frameworks
  • Easier deployment across different platforms and devices

Cons of ONNX Runtime

  • May have lower performance for specific NVIDIA GPU optimizations
  • Less specialized for large language models compared to TensorRT-LLM
  • Potentially more complex setup for high-performance LLM inference

Code Comparison

TensorRT-LLM:

import tensorrt_llm
model = tensorrt_llm.models.LLaMAForCausalLM.from_pretrained("path/to/model")
output = model.generate("Hello, how are you?")

ONNX Runtime:

import onnxruntime as ort
session = ort.InferenceSession("path/to/model.onnx")
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: "Hello, how are you?"})

Both repositories aim to optimize inference for machine learning models, but TensorRT-LLM focuses specifically on large language models and NVIDIA GPUs, while ONNX Runtime provides a more general-purpose solution for various hardware platforms.

MLCommons Inference

Reference implementations of MLPerf™ inference benchmarks.

Pros of MLCommons Inference

  • Broader scope: Covers a wide range of ML tasks and models, not limited to LLMs
  • Vendor-neutral: Supports multiple hardware platforms and frameworks
  • Established benchmarking standard: Widely recognized in the industry

Cons of MLCommons Inference

  • Less specialized: May not offer optimizations specific to LLMs
  • Higher complexity: Requires more setup and configuration for specific use cases
  • Slower development cycle: Updates may be less frequent due to broader focus

Code Comparison

TensorRT-LLM example:

import tensorrt_llm
from tensorrt_llm.models import GPTLMHeadModel

model = GPTLMHeadModel.from_pretrained("gpt2")
engine = model.to_engine()

MLCommons Inference example:

import mlperf_loadgen as lg

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.SingleStream
settings.mode = lg.TestMode.PerformanceOnly
# A system under test (SUT) and query sample library (QSL) are then
# constructed and passed to lg.StartTest(...)

Summary

TensorRT-LLM focuses on optimizing LLMs for NVIDIA hardware, offering specialized tools and potentially better performance for specific use cases. MLCommons Inference provides a more comprehensive benchmarking framework across various ML tasks and hardware platforms, but may lack the specialized optimizations for LLMs that TensorRT-LLM offers.

README

TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference

Architecture   |   Performance   |   Examples   |   Documentation   |   Roadmap


Latest News

  • [03/22] TensorRT-LLM is now fully open-source, with developments moved to GitHub!

  • [03/18] 🚀🚀 NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance with TensorRT-LLM ➡️ Link

  • [02/28] 🌟 NAVER Place Optimizes SLM-Based Vertical Services with TensorRT-LLM ➡️ Link

  • [02/25] 🌟 DeepSeek-R1 performance now optimized for Blackwell ➡️ Link

  • [02/20] Explore the complete guide to achieve great accuracy, high throughput, and low latency at the lowest cost for your business here.

  • [02/18] Unlock #LLM inference with auto-scaling on @AWS EKS ✨ ➡️ link

  • [02/12] 🦸⚡ Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling ➡️ link

  • [02/12] 🌟 How Scaling Laws Drive Smarter, More Powerful AI ➡️ link

  • [01/25] Nvidia moves AI focus to inference cost, efficiency ➡️ link

  • [01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions ➡️ link

  • [01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI ➡️ link

  • [01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM ➡️ link

  • [01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM ➡️ link

  • [01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding ➡️ link

Previous News
  • [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview ➡️ link

  • [2024/12/03] 🌟 Boost your AI inference throughput by up to 3.6x. We now support speculative decoding and tripling token throughput with our NVIDIA TensorRT-LLM. Perfect for your generative AI apps. ⚡Learn how in this technical deep dive ➡️ link

  • [2024/12/02] Working on deploying ONNX models for performance-critical applications? Try our NVIDIA Nsight Deep Learning Designer ⚡ A user-friendly GUI and tight integration with NVIDIA TensorRT that offers: ✅ Intuitive visualization of ONNX model graphs ✅ Quick tweaking of model architecture and parameters ✅ Detailed performance profiling with either ORT or TensorRT ✅ Easy building of TensorRT engines ➡️ link

  • [2024/11/26] 📣 Introducing TensorRT-LLM for Jetson AGX Orin, making it even easier to deploy on Jetson AGX Orin with initial support in JetPack 6.1 via the v0.12.0-jetson branch of the TensorRT-LLM repo. ✅ Pre-compiled TensorRT-LLM wheels & containers for easy integration ✅ Comprehensive guides & docs to get you started ➡️ link

  • [2024/11/21] NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200 ➡️ link

  • [2024/11/19] Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs ➡️ link

  • [2024/11/09] 🚀🚀🚀 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot ➡️ link

  • [2024/11/09] ✨ NVIDIA advances the AI ecosystem with the AI model of LG AI Research 🙌 ➡️ link

  • [2024/11/02] 🌟🌟🌟 NVIDIA and LlamaIndex Developer Contest 🙌 Enter for a chance to win prizes including an NVIDIA® GeForce RTX™ 4080 SUPER GPU, DLI credits, and more🙌 ➡️ link

  • [2024/10/28] 🏎️🏎️🏎️ NVIDIA GH200 Superchip Accelerates Inference by 2x in Multiturn Interactions with Llama Models ➡️ link

  • [2024/10/22] New 📝 Step-by-step instructions on how to ✅ Optimize LLMs with NVIDIA TensorRT-LLM, ✅ Deploy the optimized models with Triton Inference Server, ✅ Autoscale LLMs deployment in a Kubernetes environment. 🙌 Technical Deep Dive: ➡️ link

  • [2024/10/07] 🚀🚀🚀Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries ➡️ link

  • [2024/09/29] 🌟 AI at Meta PyTorch + TensorRT v2.4 🌟 ⚡TensorRT 10.1 ⚡PyTorch 2.4 ⚡CUDA 12.4 ⚡Python 3.12 ➡️ link

  • [2024/09/17] ✨ NVIDIA TensorRT-LLM Meetup ➡️ link

  • [2024/09/17] ✨ Accelerating LLM Inference at Databricks with TensorRT-LLM ➡️ link

  • [2024/09/17] ✨ TensorRT-LLM @ Baseten ➡️ link

  • [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML ➡️ link

  • [2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12 ➡️ link

  • [2024/08/13] 🐍 DIY Code Completion with #Mamba ⚡ #TensorRT #LLM for speed 🤖 NIM for ease ☁️ deploy anywhere ➡️ link

  • [2024/08/06] 🗫 Multilingual Challenge Accepted 🗫 🤖 #TensorRT #LLM boosts low-resource languages like Hebrew, Indonesian and Vietnamese ⚡➡️ link

  • [2024/07/30] Introducing🍊 @SliceXAI ELM Turbo 🤖 train ELM once ⚡ #TensorRT #LLM optimize ☁️ deploy anywhere ➡️ link

  • [2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized ⚡ 🦙 400 tok/s - per node 🦙 37 tok/s - per user 🦙 1 node inference ➡️ link

  • [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference: ✅ MultiLingual ✅ NIM ✅ LoRA tuned adaptors➡️ Tech blog

  • [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100. ➡️ Tech blog

  • [2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai’s solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ✨➡️ link

  • [2024/06/18] CYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization➡️ link

  • [2024/06/18] 🧰Deploying ComfyUI with TensorRT? Here’s your setup guide ➡️ link

  • [2024/06/11] ✨#TensorRT Weight-Stripped Engines ✨ Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights → ** GPUs ✅0 performance loss ✅** models…LLM, CNN, etc.➡️ link

  • [2024/06/04] ✨ #TensorRT and GeForce #RTX unlock ComfyUI SD superhero powers 🦸⚡ 🎥 Demo: ➡️ link 📗 DIY notebook: ➡️ link

  • [2024/05/28] ✨#TensorRT weight stripping for ResNet-50 ✨ ✅+99% compression ✅1 set of weights → ** GPUs ✅0 performance loss ✅** models…LLM, CNN, etc 👀 📚 DIY ➡️ link

  • [2024/05/21] ✨@modal_labs has the codes for serverless @AIatMeta Llama 3 on #TensorRT #LLM ✨👀 📚 Marvelous Modal Manual: Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs ➡️ link

  • [2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT ➡️ blog

  • [2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫Meta Llama 3 takes off with #TensorRT #LLM 📚➡️ link

  • [2024/02/06] 🚀 Speed up inference with SOTA quantization techniques in TRT-LLM

  • [2024/01/30] New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

  • [2023/12/04] Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100

  • [2023/11/27] SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version

  • [2023/11/13] H200 achieves nearly 12,000 tok/sec on Llama2-13B

  • [2023/10/22] 🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙

  • [2023/10/19] Getting Started Guide - Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available

  • [2023/10/17] Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows

TensorRT-LLM Overview

TensorRT-LLM is an open-source library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant, ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.

Recently re-architected with a PyTorch backend, TensorRT-LLM now combines peak performance with a more flexible and developer-friendly workflow. The original TensorRT-based backend remains supported and continues to provide an ahead-of-time compilation path for building highly optimized "Engines" for deployment. The PyTorch backend complements this by enabling faster development iteration and rapid experimentation.

TensorRT-LLM provides a flexible LLM API to simplify model setup and inference across both PyTorch and TensorRT backends. It supports a wide range of inference use cases from a single GPU to multiple nodes with multiple GPUs using Tensor Parallelism and/or Pipeline Parallelism. It also includes a backend for integration with the NVIDIA Triton Inference Server.
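
Multi-GPU execution with that LLM API is typically requested when the model is constructed; a minimal sketch, assuming an LLM entry point with a tensor_parallel_size keyword (the parameter name is an assumption and may differ between releases):

from tensorrt_llm import LLM, SamplingParams

# Shard the model across two GPUs with tensor parallelism.
# tensor_parallel_size is an assumed keyword argument; consult the LLM API
# documentation of your installed release for the exact name.
llm = LLM(model="path/to/model", tensor_parallel_size=2)

outputs = llm.generate(
    ["Summarize what tensor parallelism does."],
    SamplingParams(temperature=0.2, top_p=0.9),
)
print(outputs[0].outputs[0].text)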

Several popular models are pre-defined and can be easily customized or extended using native PyTorch code (for the PyTorch backend) or a PyTorch-style Python API (for the TensorRT backend).

Getting Started

To get started with TensorRT-LLM, visit our documentation.

Useful Links

  • Quantized models on Hugging Face: A growing collection of quantized (e.g., FP8, FP4) and optimized LLMs, including DeepSeek FP4, ready for fast inference with TensorRT-LLM.
  • NVIDIA Dynamo: A datacenter-scale distributed inference serving framework that works seamlessly with TensorRT-LLM.
  • AutoDeploy: An experimental backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models.