Deploying large language models in production requires more than just loading weights onto a GPU. To achieve acceptable throughput and latency, you need kernel fusion, attention optimization, memory management, and quantization – all tuned for your specific hardware. NVIDIA’s TensorRT-LLM provides all of this in a single open-source library that extracts maximum performance from NVIDIA GPUs for LLM and visual generation inference.
TensorRT-LLM, hosted at github.com/NVIDIA/TensorRT-LLM, is NVIDIA’s official inference optimization library for large language models and visual generative models. It includes state-of-the-art kernel implementations for attention (FlashAttention, PagedAttention), quantization (FP8, INT8, INT4, INT4-AWQ), and in-flight batching. The library compiles models into optimized engine files that run efficiently across NVIDIA’s GPU lineup, from Turing to Blackwell architectures.
The library has become a standard backend for open-source LLM serving stacks, including the TensorRT-LLM backend for Triton Inference Server and integrations with LangChain. Its popularity stems from consistently delivering some of the best latency and throughput numbers on NVIDIA hardware, often outperforming unoptimized PyTorch inference by 3-5x on the same GPU.
What is TensorRT-LLM?
TensorRT-LLM is NVIDIA’s open-source library for optimizing LLM and visual generative model inference on NVIDIA GPUs. It provides a Python API for model compilation, graph optimization, and runtime execution. The library supports over 30 model architectures and includes specialized kernels that maximize GPU utilization for transformer-based models.
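For orientation, here is a minimal sketch of the high-level Python LLM API in the style of the project’s quick-start; the model ID and prompt are placeholders, and the exact API surface should be checked against the installed release.

```python
# Minimal sketch of the TensorRT-LLM high-level LLM API (quick-start style).
# The Hugging Face model ID and prompt below are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

def main():
    # On first use the checkpoint is compiled into an optimized engine,
    # which then runs through the TensorRT-LLM runtime.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["Explain kernel fusion in one sentence."]
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```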
Which GPUs does TensorRT-LLM support?
TensorRT-LLM supports NVIDIA GPUs with compute capability 7.0 and higher, covering several generations:
| GPU Generation | Compute Capability | Examples |
|---|---|---|
| Turing | SM 7.5 | T4, RTX 2080 |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 3090 |
| Ada Lovelace | SM 8.9 | RTX 4090, L40S |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.x | B100, B200 |
Each newer generation adds support for more quantization types and kernel optimizations. Hopper and Blackwell GPUs support FP8 inference and more advanced attention kernels.
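Before compiling a model, it can help to confirm which generation a machine is on. The sketch below uses PyTorch (not part of TensorRT-LLM itself) to read the compute capability and map it to the table above.

```python
# Map the local GPU's compute capability to the generations listed above.
# Uses PyTorch only to query the device; thresholds mirror the table.
import torch

def describe_gpu(device: int = 0) -> str:
    major, minor = torch.cuda.get_device_capability(device)
    sm = major * 10 + minor
    if sm >= 100:
        generation = "Blackwell"
    elif sm >= 90:
        generation = "Hopper"
    elif sm >= 89:
        generation = "Ada Lovelace"
    elif sm >= 80:
        generation = "Ampere"
    elif sm >= 75:
        generation = "Turing"
    else:
        generation = "older than Turing (check the support matrix)"
    return f"{torch.cuda.get_device_name(device)}: SM {major}.{minor} ({generation})"

if torch.cuda.is_available():
    print(describe_gpu())
```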
What quantization methods does TensorRT-LLM support?
TensorRT-LLM supports a broad range of post-training quantization methods, summarized in the table below.
| Method | Precision | Memory Savings | Preferred Hardware |
|---|---|---|---|
| FP8 | 8-bit float | 2x vs FP16 | Hopper, Blackwell |
| INT8 | 8-bit integer | 2x vs FP16 | All SM 7.0+ |
| INT4 | 4-bit integer | 4x vs FP16 | All SM 7.0+ |
| INT4-AWQ | 4-bit + AWQ | 4x vs FP16 | All SM 7.0+ |
| INT4-GPTQ | 4-bit + GPTQ | 4x vs FP16 | All SM 7.0+ |
| FP4 | 4-bit float | 4x vs FP16 | Blackwell |
| NF4 | 4-bit normal float | 4x vs FP16 | All SM 7.0+ |
Quantization is performed during the model compilation step, with calibration datasets used to determine optimal quantization ranges.
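As one illustration of compile-time quantization, the high-level LLM API accepts a quantization configuration when the engine is built. The QuantConfig/QuantAlgo names below follow the llmapi module, but treat the exact signatures and calibration defaults as assumptions and verify them against the current documentation.

```python
# Sketch: request FP8 quantization at engine-build time through the LLM API.
# QuantConfig/QuantAlgo are taken from tensorrt_llm.llmapi; field names and
# calibration defaults may differ between releases, so verify before use.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)  # FP8 needs Hopper or newer

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model ID
    quant_config=quant_config,
)

outputs = llm.generate(["What does FP8 quantization trade off?"])
print(outputs[0].outputs[0].text)
```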
Which models are supported?
TensorRT-LLM supports 30+ model architectures, including all major open-source LLMs and visual generation models.
| Model | Architecture | Quantization Support |
|---|---|---|
| LLaMA / Llama 2 / Llama 3 | Decoder-only | FP8, INT8, INT4, AWQ |
| Mistral / Mixtral | Decoder-only, MoE | FP8, INT8, INT4 |
| Qwen / Qwen2 | Decoder-only | INT8, INT4, AWQ |
| DeepSeek V2/V3 | MoE, Multi-head Latent Attention | INT8, INT4 |
| Nemotron | Decoder-only | FP8, INT4 |
| Stable Diffusion 3 | Diffusion | FP8, INT8 |
| FLUX | Diffusion | FP8 |
Support for new models is added rapidly, often within weeks of their open-source release.
What is the latest version of TensorRT-LLM?
As of early 2026, TensorRT-LLM’s latest major release is version 0.18.x. This version added support for Blackwell GPUs (B100, B200), improved FP4 quantization kernels, multi-node tensor parallelism for models exceeding single-node capacity, and enhanced support for MoE (Mixture of Experts) architectures like Mixtral and DeepSeek V3. The project maintains a rapid release cadence, shipping approximately one minor version per month.
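For models that exceed a single GPU’s memory, the LLM API also takes a tensor-parallel degree. The sketch below assumes two visible GPUs and an illustrative MoE model ID, and shows single-node parallelism only for simplicity.

```python
# Sketch: shard a large MoE model across two GPUs with tensor parallelism.
# tensor_parallel_size is an LLM API argument; the model ID is illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative MoE checkpoint
    tensor_parallel_size=2,                        # split weights across 2 GPUs
)

outputs = llm.generate(
    ["Summarize mixture-of-experts routing in two sentences."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```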
Frequently Asked Questions
What is TensorRT-LLM?
TensorRT-LLM is NVIDIA’s open-source library for optimizing LLM and visual generation model inference on NVIDIA GPUs. It compiles models into optimized engines using kernel fusion, memory optimization, and quantization.
Which GPUs does TensorRT-LLM support?
All NVIDIA GPUs with compute capability 7.0+ (Turing, Ampere, Ada Lovelace, Hopper, and Blackwell). FP8 inference requires Hopper or newer. FP4 requires Blackwell.
What quantization methods are supported?
FP8 (Hopper+), INT8, INT4, INT4-AWQ, INT4-GPTQ, FP4 (Blackwell), and NF4. Quantization is performed during the model compilation step with calibration.
Which models are supported?
Over 30 architectures including LLaMA 3, Mistral, Mixtral, Qwen 2, DeepSeek V2/V3, Nemotron, and diffusion models like Stable Diffusion 3 and FLUX.
How does TensorRT-LLM compare to other inference backends?
TensorRT-LLM consistently delivers 3-5x better throughput than naive PyTorch inference on the same GPU. It is the standard backend for Triton Inference Server and is widely used in production LLM deployments.
Further Reading
- TensorRT-LLM GitHub Repository
- NVIDIA TensorRT-LLM Documentation
- Triton Inference Server with TensorRT-LLM Backend
- FlashAttention: Fast and Memory-Efficient Exact Attention
- FP8 Formats for Deep Learning
```mermaid
flowchart LR
    A[Model Weights] --> B[TensorRT-LLM Compiler]
    C[Calibration Data] --> B
    B --> D{Optimization Passes}
    D --> E[Kernel Fusion]
    D --> F[Attention Optimization]
    D --> G[Quantization]
    D --> H[Memory Planning]
    E --> I[Optimized Engine]
    F --> I
    G --> I
    H --> I
    I --> J[Runtime Execution]
    J --> K[Inference Results]
```

```mermaid
graph TD
    subgraph Performance Scaling
        A[FP16 Baseline] --> B[1x throughput]
        C[INT8 TensorRT-LLM] --> D[2.5x throughput]
        E[INT4 TensorRT-LLM] --> F[4x throughput]
        G[FP8 Hopper TensorRT-LLM] --> H[3x throughput]
    end
    subgraph GPU Memory
        I[70B model FP16] --> J[140GB needed]
        K[70B model INT8] --> L[70GB needed]
        M[70B model INT4] --> N[35GB needed]
    end
```