TensorRT-LLM: NVIDIA's Open-Source Library for Optimized LLM Inference

Deploying large language models in production requires more than just loading weights onto a GPU. To achieve acceptable throughput and latency, you need kernel fusion, attention optimization, memory management, and quantization – all tuned for your specific hardware. NVIDIA’s TensorRT-LLM provides all of this in a single open-source library that extracts maximum performance from NVIDIA GPUs for LLM and visual generation inference.

TensorRT-LLM, hosted at github.com/NVIDIA/TensorRT-LLM, is NVIDIA’s official inference optimization library for large language models and visual generative models. It includes state-of-the-art kernel implementations for attention (FlashAttention, PagedAttention), quantization (FP8, INT8, INT4, INT4-AWQ), and in-flight batching. The library compiles models into optimized engine files that run efficiently across NVIDIA’s GPU lineup, from Turing to Blackwell architectures.

The library has become the standard backend for many open-source LLM serving stacks, including the TensorRT-LLM backend for Triton Inference Server and integrations with LangChain. Its popularity stems from consistently strong latency and throughput on NVIDIA hardware, often outperforming naive PyTorch implementations by 3-5x on the same GPU.

What is TensorRT-LLM?

TensorRT-LLM is NVIDIA’s open-source library for optimizing LLM and visual generative model inference on NVIDIA GPUs. It provides a Python API for model compilation, graph optimization, and runtime execution. The library supports over 30 model architectures and includes specialized kernels that maximize GPU utilization for transformer-based models.
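To make this concrete, here is a minimal sketch using the high-level Python `LLM` API shipped with recent TensorRT-LLM releases. The model name is illustrative, and exact parameter names may differ slightly between versions.

```python
# Minimal sketch of TensorRT-LLM's high-level Python API.
# Model name is illustrative; parameter names may vary by release.
from tensorrt_llm import LLM, SamplingParams

# Compiles (or loads a cached) optimized engine for the local GPU,
# then serves generation requests through the in-flight batching runtime.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["Explain kernel fusion in one sentence."], params)

for out in outputs:
    print(out.outputs[0].text)
```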

Which GPUs does TensorRT-LLM support?

TensorRT-LLM supports NVIDIA GPUs with compute capability 7.0 and higher, covering several generations:

| GPU Generation | Compute Capability | Examples |
|---|---|---|
| Turing | SM 7.5 | T4, RTX 2080 |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 3090 |
| Ada Lovelace | SM 8.9 | RTX 4090, L40S |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.x | B100, B200 |

Each generation gets progressively better support for quantization types and kernel optimizations. Hopper and Blackwell GPUs support FP8 inference and advanced attention kernels.
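If you need to pick a quantization format programmatically, you can query the GPU's compute capability at runtime. The sketch below uses PyTorch's device query; the mapping from SM version to format simply mirrors the table above and is an illustrative helper, not an official TensorRT-LLM API.

```python
# Hedged sketch: choose a quantization format from the GPU's SM version.
# The thresholds mirror this article's support table.
import torch

def pick_qformat() -> str:
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor
    if sm >= 100:      # Blackwell (SM 10.x): FP4 kernels available
        return "fp4"
    if sm >= 90:       # Hopper (SM 9.0): FP8 tensor cores
        return "fp8"
    if sm >= 70:       # Turing/Ampere/Ada: integer quantization
        return "int4_awq"
    raise RuntimeError(f"Compute capability {major}.{minor} is below SM 7.0")

print(pick_qformat())
```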

What quantization methods does TensorRT-LLM support?

TensorRT-LLM supports one of the widest ranges of quantization methods of any open-source inference library.

| Method | Precision | Memory Savings | Preferred Hardware |
|---|---|---|---|
| FP8 | 8-bit float | 2x vs FP16 | Hopper, Blackwell |
| INT8 | 8-bit integer | 2x vs FP16 | All SM 7.0+ |
| INT4 | 4-bit integer | 4x vs FP16 | All SM 7.0+ |
| INT4-AWQ | 4-bit + AWQ | 4x vs FP16 | All SM 7.0+ |
| INT4-GPTQ | 4-bit + GPTQ | 4x vs FP16 | All SM 7.0+ |
| FP4 | 4-bit float | 4x vs FP16 | Blackwell |
| NF4 | 4-bit normal float | 4x vs FP16 | All SM 7.0+ |

Quantization is performed during the model compilation step, with calibration datasets used to determine optimal quantization ranges.
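As a hedged sketch, quantization can be requested through the high-level API's `QuantConfig`. The class and enum names below follow recent releases, but import paths and names may shift between versions, so treat this as an assumption rather than a stable interface.

```python
# Sketch of build-time quantization via the LLM API; import paths and
# class names are assumptions based on recent releases.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant = QuantConfig(quant_algo=QuantAlgo.FP8)  # FP8 requires Hopper or newer

# Calibration runs during engine compilation to determine quantization ranges.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    quant_config=quant,
)
```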

Which models are supported?

TensorRT-LLM supports 30+ model architectures, including all major open-source LLMs and visual generation models.

| Model | Architecture | Quantization Support |
|---|---|---|
| LLaMA / Llama 2 / Llama 3 | Decoder-only | FP8, INT8, INT4, AWQ |
| Mistral / Mixtral | Decoder-only, MoE | FP8, INT8, INT4 |
| Qwen / Qwen2 | Decoder-only | INT8, INT4, AWQ |
| DeepSeek V2/V3 | MoE, Multi-head Latent Attention | INT8, INT4 |
| Nemotron | Decoder-only | FP8, INT4 |
| Stable Diffusion 3 | Diffusion | FP8, INT8 |
| FLUX | Diffusion | FP8 |

Support for new models is added rapidly, often within weeks of their open-source release.

What is the latest version of TensorRT-LLM?

As of early 2026, TensorRT-LLM’s latest major release is version 0.18.x. This version added support for Blackwell GPUs (B100, B200), improved FP4 quantization kernels, multi-node tensor parallelism for models exceeding single-node capacity, and enhanced support for MoE (Mixture of Experts) architectures like Mixtral and DeepSeek V3. The project maintains a rapid release cadence, shipping approximately one minor version per month.
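For models that need more than one GPU, the high-level API exposes tensor parallelism directly. The sketch below is illustrative: `tensor_parallel_size` shards the weights across GPUs within a node, and multi-node runs additionally require an MPI or Slurm launcher.

```python
# Sketch of multi-GPU tensor parallelism with the LLM API.
# Model name is illustrative; multi-node setups need an external launcher.
from tensorrt_llm import LLM

# Shard a model too large for a single GPU across four GPUs in one node.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=4,
)
```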

Frequently Asked Questions

What is TensorRT-LLM?

TensorRT-LLM is NVIDIA’s open-source library for optimizing LLM and visual generation model inference on NVIDIA GPUs. It compiles models into optimized engines using kernel fusion, memory optimization, and quantization.

Which GPUs does TensorRT-LLM support?

All NVIDIA GPUs with compute capability 7.0+ (Turing, Ampere, Ada Lovelace, Hopper, and Blackwell). FP8 inference requires Hopper or newer. FP4 requires Blackwell.

What quantization methods are supported?

FP8 (Hopper+), INT8, INT4, INT4-AWQ, INT4-GPTQ, FP4 (Blackwell), and NF4. Quantization is performed during the model compilation step with calibration.

Which models are supported?

Over 30 architectures including LLaMA 3, Mistral, Mixtral, Qwen 2, DeepSeek V2/V3, Nemotron, and diffusion models like Stable Diffusion 3 and FLUX.

How does TensorRT-LLM compare to other inference backends?

TensorRT-LLM consistently delivers 3-5x better throughput than naive PyTorch inference on the same GPU. It is the standard backend for Triton Inference Server and is widely used in production LLM deployments.
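If you want to verify such throughput claims on your own hardware, a rough check is straightforward: time a batch of generations and divide generated tokens by elapsed seconds. The sketch below assumes the high-level API described above; attribute names like `token_ids` follow recent releases and may differ, and absolute numbers depend heavily on GPU, batch size, and sequence lengths, so use this only for relative comparisons.

```python
# Rough, hedged throughput check: tokens per second over a batch.
# Only meaningful for relative comparisons on the same hardware.
import time
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
params = SamplingParams(max_tokens=128)
prompts = ["Summarize the history of GPUs."] * 32  # batch exercises in-flight batching

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# `token_ids` on each completion is an assumption based on recent releases.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/sec")
```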
