Deploying large language models in production requires more than just loading weights onto a GPU. To achieve acceptable throughput and latency, you need kernel fusion, attention optimization, memory management, and quantization – all tuned for your specific hardware. NVIDIA’s TensorRT-LLM provides all of this in a single open-source library that extracts maximum performance from NVIDIA GPUs for LLM and visual generation inference.
TensorRT-LLM, hosted at github.com/NVIDIA/TensorRT-LLM, is NVIDIA’s official inference optimization library for large language models and visual generative models. It includes state-of-the-art kernel implementations for attention (FlashAttention, PagedAttention), quantization (FP8, INT8, INT4, INT4-AWQ), and in-flight batching. The library compiles models into optimized engine files that run efficiently across NVIDIA’s GPU lineup, from Turing to Blackwell architectures.
The library has become a standard backend for open-source LLM serving stacks, including the TensorRT-LLM backend for Triton Inference Server and integrations with LangChain. Its popularity stems from consistently delivering some of the best latency and throughput numbers on NVIDIA hardware, often outperforming unoptimized PyTorch inference by 3-5x on the same GPU.
What is TensorRT-LLM?
TensorRT-LLM is NVIDIA’s open-source library for optimizing LLM and visual generative model inference on NVIDIA GPUs. It provides a Python API for model compilation, graph optimization, and runtime execution. The library supports over 30 model architectures and includes specialized kernels that maximize GPU utilization for transformer-based models.
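For orientation, here is a minimal sketch of the high-level Python LLM API in the style of the project’s quick-start; the model ID and prompt are placeholders, and the exact API surface should be checked against the installed release.

```python
# Minimal sketch of the TensorRT-LLM high-level LLM API (quick-start style).
# The Hugging Face model ID and prompt below are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

def main():
    # On first use the checkpoint is compiled into an optimized engine,
    # which then runs through the TensorRT-LLM runtime.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["Explain kernel fusion in one sentence."]
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```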
Which GPUs does TensorRT-LLM support?
TensorRT-LLM supports NVIDIA GPUs with compute capability 7.0 and higher, covering several generations:
| GPU Generation | Compute Capability | Examples |
|---|---|---|
| Turing | SM 7.5 | T4, RTX 2080 |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 3090 |
| Ada Lovelace | SM 8.9 | RTX 4090, L40S |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.x | B100, B200 |
Each newer generation adds support for more quantization types and kernel optimizations. Hopper and Blackwell GPUs support FP8 inference and more advanced attention kernels.
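Before compiling a model, it can help to confirm which generation a machine is on. The sketch below uses PyTorch (not part of TensorRT-LLM itself) to read the compute capability and map it to the table above.

```python
# Map the local GPU's compute capability to the generations listed above.
# Uses PyTorch only to query the device; thresholds mirror the table.
import torch

def describe_gpu(device: int = 0) -> str:
    major, minor = torch.cuda.get_device_capability(device)
    sm = major * 10 + minor
    if sm >= 100:
        generation = "Blackwell"
    elif sm >= 90:
        generation = "Hopper"
    elif sm >= 89:
        generation = "Ada Lovelace"
    elif sm >= 80:
        generation = "Ampere"
    elif sm >= 75:
        generation = "Turing"
    else:
        generation = "older than Turing (check the support matrix)"
    return f"{torch.cuda.get_device_name(device)}: SM {major}.{minor} ({generation})"

if torch.cuda.is_available():
    print(describe_gpu())
```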
What quantization methods does TensorRT-LLM support?
TensorRT-LLM supports a broad range of post-training quantization methods, summarized in the table below.
| Method | Precision | Memory Savings | Preferred Hardware |
|---|---|---|---|
| FP8 | 8-bit float | 2x vs FP16 | Hopper, Blackwell |
| INT8 | 8-bit integer | 2x vs FP16 | All SM 7.0+ |
| INT4 | 4-bit integer | 4x vs FP16 | All SM 7.0+ |
| INT4-AWQ | 4-bit + AWQ | 4x vs FP16 | All SM 7.0+ |
| INT4-GPTQ | 4-bit + GPTQ | 4x vs FP16 | All SM 7.0+ |
| FP4 | 4-bit float | 4x vs FP16 | Blackwell |
| NF4 | 4-bit normal float | 4x vs FP16 | All SM 7.0+ |
Quantization is performed during the model compilation step, with calibration datasets used to determine optimal quantization ranges.
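As one illustration of compile-time quantization, the high-level LLM API accepts a quantization configuration when the engine is built. The QuantConfig/QuantAlgo names below follow the llmapi module, but treat the exact signatures and calibration defaults as assumptions and verify them against the current documentation.

```python
# Sketch: request FP8 quantization at engine-build time through the LLM API.
# QuantConfig/QuantAlgo are taken from tensorrt_llm.llmapi; field names and
# calibration defaults may differ between releases, so verify before use.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)  # FP8 needs Hopper or newer

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model ID
    quant_config=quant_config,
)

outputs = llm.generate(["What does FP8 quantization trade off?"])
print(outputs[0].outputs[0].text)
```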
Which models are supported?
TensorRT-LLM supports 30+ model architectures, including all major open-source LLMs and visual generation models.
| Model | Architecture | Quantization Support |
|---|---|---|
| LLaMA / Llama 2 / Llama 3 | Decoder-only | FP8, INT8, INT4, AWQ |
| Mistral / Mixtral | Decoder-only, MoE | FP8, INT8, INT4 |
| Qwen / Qwen2 | Decoder-only | INT8, INT4, AWQ |
| DeepSeek V2/V3 | MoE, Multi-head Latent Attention | INT8, INT4 |
| Nemotron | Decoder-only | FP8, INT4 |
| Stable Diffusion 3 | Diffusion | FP8, INT8 |
| FLUX | Diffusion | FP8 |
Support for new models is added rapidly, often within weeks of their open-source release.
What is the latest version of TensorRT-LLM?
As of early 2026, TensorRT-LLM’s latest major release is version 0.18.x. This version added support for Blackwell GPUs (B100, B200), improved FP4 quantization kernels, multi-node tensor parallelism for models exceeding single-node capacity, and enhanced support for MoE (Mixture of Experts) architectures like Mixtral and DeepSeek V3. The project maintains a rapid release cadence, shipping approximately one minor version per month.
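For models that exceed a single GPU’s memory, the LLM API also takes a tensor-parallel degree. The sketch below assumes two visible GPUs and an illustrative MoE model ID, and shows single-node parallelism only for simplicity.

```python
# Sketch: shard a large MoE model across two GPUs with tensor parallelism.
# tensor_parallel_size is an LLM API argument; the model ID is illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative MoE checkpoint
    tensor_parallel_size=2,                        # split weights across 2 GPUs
)

outputs = llm.generate(
    ["Summarize mixture-of-experts routing in two sentences."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```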
Frequently Asked Questions
What is TensorRT-LLM?
TensorRT-LLM is NVIDIA’s open-source library for optimizing LLM and visual generation model inference on NVIDIA GPUs. It compiles models into optimized engines using kernel fusion, memory optimization, and quantization.
Which GPUs does TensorRT-LLM support?
All NVIDIA GPUs with compute capability 7.0+ (Turing, Ampere, Ada Lovelace, Hopper, and Blackwell). FP8 inference requires Hopper or newer. FP4 requires Blackwell.
What quantization methods are supported?
FP8 (Hopper+), INT8, INT4, INT4-AWQ, INT4-GPTQ, FP4 (Blackwell), and NF4. Quantization is performed during the model compilation step with calibration.
Which models are supported?
Over 30 architectures including LLaMA 3, Mistral, Mixtral, Qwen 2, DeepSeek V2/V3, Nemotron, and diffusion models like Stable Diffusion 3 and FLUX.
How does TensorRT-LLM compare to other inference backends?
TensorRT-LLM consistently delivers 3-5x better throughput than naive PyTorch inference on the same GPU. It is the standard backend for Triton Inference Server and is widely used in production LLM deployments.
Further Reading
- TensorRT-LLM GitHub Repository
- NVIDIA TensorRT-LLM Documentation
- Triton Inference Server with TensorRT-LLM Backend
- FlashAttention: Fast and Memory-Efficient Exact Attention
- FP8 Formats for Deep Learning
```mermaid
flowchart LR
    A[Model Weights] --> B[TensorRT-LLM Compiler]
    C[Calibration Data] --> B
    B --> D{Optimization Passes}
    D --> E[Kernel Fusion]
    D --> F[Attention Optimization]
    D --> G[Quantization]
    D --> H[Memory Planning]
    E --> I[Optimized Engine]
    F --> I
    G --> I
    H --> I
    I --> J[Runtime Execution]
    J --> K[Inference Results]
```

```mermaid
graph TD
    subgraph Performance Scaling
        A[FP16 Baseline] --> B[1x throughput]
        C[INT8 TensorRT-LLM] --> D[2.5x throughput]
        E[INT4 TensorRT-LLM] --> F[4x throughput]
        G[FP8 Hopper TensorRT-LLM] --> H[3x throughput]
    end
    subgraph GPU Memory
        I[70B model FP16] --> J[140GB needed]
        K[70B model INT8] --> L[70GB needed]
        M[70B model INT4] --> N[35GB needed]
    end
```