ExLlamaV3: High-Performance LLM Inference Engine

ExLlamaV3 is a high-performance inference engine for Llama-family models in the EXL3 quantization format, optimized for maximum throughput on consumer GPUs.


Running large language models on consumer hardware requires efficient inference engines that squeeze every drop of performance from available GPU memory. ExLlamaV3, developed by the turboderp team, is one of the fastest inference engines available for Llama-family models, particularly when using the EXL3 quantization format.

ExLlamaV3 achieves its speed through a combination of optimized CUDA kernels, efficient memory management, and quantization-aware computation. It supports both 4-bit and 8-bit EXL3 quantization, dynamic batching, and speculative decoding. For users running local models on consumer NVIDIA GPUs, it delivers some of the highest tokens-per-second throughput of any open inference engine.
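As a minimal sketch, loading an EXL3 model and generating from it looks roughly like this. The class names and constructors below (Config, Model, Cache, Tokenizer, Generator and their from_directory/from_config helpers) are assumptions patterned on the project's examples, and the model path is hypothetical; check the ExLlamaV3 repository for the current API.

```python
# Minimal loading-and-generation sketch. The class names and constructors
# are assumptions patterned on the ExLlamaV3 examples; verify against the
# GitHub repository before use.
from exllamav3 import Config, Model, Cache, Tokenizer, Generator

# Hypothetical path to a directory holding an EXL3-quantized model
config = Config.from_directory("/models/Llama-3.1-8B-exl3")

model = Model.from_config(config)
cache = Cache(model, max_num_tokens=8192)   # KV cache lives in GPU memory
model.load()

tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

print(generator.generate(
    prompt="Explain EXL3 quantization in one sentence.",
    max_new_tokens=128,
))
```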

Performance Benchmarks

| Model | GPU | Quantization | Speed (tokens/s) | Memory Usage |
|---|---|---|---|---|
| Llama 3.1 8B | RTX 4090 24GB | EXL3 4-bit | 180 | 6 GB |
| Llama 3.1 70B | RTX 4090 24GB | EXL3 4-bit | 30 | 22 GB |
| Mistral 7B | RTX 3060 12GB | EXL3 4-bit | 85 | 5 GB |
| Qwen 2.5 32B | RTX 4090 24GB | EXL3 4-bit | 55 | 18 GB |

Key Features

| Feature | Description | Benefit |
|---|---|---|
| EXL3 quantization | Specialized 4-bit and 8-bit formats | Highest quality per bit |
| CUDA kernel optimization | Fused attention, flash decoding | Maximum throughput |
| Dynamic batching | Process multiple requests concurrently (see the sketch after this table) | Higher utilization |
| Speculative decoding | Draft-then-verify for faster generation | 2x speedup on some tasks |
| LoRA support | Load and swap LoRA adapters at runtime | Flexible fine-tuning |
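Continuing from the loading sketch above, dynamic batching means several requests share one generator and KV cache while a scheduler interleaves them. The Job/enqueue/iterate interface and the result-dictionary keys below are assumptions modeled on ExLlamaV2's dynamic generator; confirm the names against the ExLlamaV3 repository.

```python
# Dynamic batching sketch, continuing from the loading example above.
# The Job/enqueue/iterate interface and result keys are assumptions
# modeled on ExLlamaV2's dynamic generator.
from exllamav3 import Job

prompts = [
    "Summarize the EXL3 format in one line.",
    "What is a KV cache?",
    "Name one benefit of dynamic batching.",
]

for i, p in enumerate(prompts):
    generator.enqueue(Job(
        input_ids=tokenizer.encode(p),
        max_new_tokens=64,
        identifier=i,          # lets us match streamed output to its prompt
    ))

# The scheduler interleaves all queued jobs; output streams back per job.
results = {i: "" for i in range(len(prompts))}
while generator.num_remaining_jobs():
    for r in generator.iterate():
        results[r["identifier"]] += r.get("text", "")

for i, p in enumerate(prompts):
    print(p, "->", results[i])
```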

Inference Pipeline

The pipeline processes tokens through transformer layers with specialized CUDA kernels for attention and feed-forward computation. The KV cache is maintained efficiently in GPU memory, and speculative decoding can accelerate generation by validating multiple tokens at once.
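To make the draft-then-verify idea concrete, here is a self-contained toy sketch of greedy speculative decoding. It illustrates the general technique, not ExLlamaV3's internals; draft_next and target_next are stand-ins for a small draft model and the full target model, each mapping a token sequence to the next token id.

```python
# Toy illustration of greedy speculative decoding (not ExLlamaV3 internals).
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=32):
    seq = list(prompt)
    produced = 0
    while produced < max_new:
        # 1. The cheap draft model proposes k tokens.
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. The target model verifies the proposals; in a real engine all k
        #    positions are scored in a single batched forward pass.
        accepted = 0
        for i, t in enumerate(draft):
            if target_next(seq + draft[:i]) == t:
                accepted += 1
            else:
                break
        seq += draft[:accepted]
        produced += accepted

        # 3. On the first disagreement, emit the target model's own token,
        #    so the output matches plain decoding with the target model.
        if accepted < k:
            seq.append(target_next(seq))
            produced += 1
    return seq
```

Greedy acceptance preserves the target model's output exactly; the speedup comes from replacing many sequential target-model steps with a single verification pass whenever the draft model guesses well.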

Inference Engine Comparison

| Feature | ExLlamaV3 | llama.cpp | vLLM | Transformers |
|---|---|---|---|---|
| GPU support | Full (CUDA) | Partial (CUDA/Metal) | Full (CUDA) | Full (CUDA) |
| Quantization | EXL3 only | GGUF | AWQ/GPTQ | BitsAndBytes |
| Batch inference | Yes | Limited | Yes | Yes |
| Speed (8B) | 180 t/s | 120 t/s | 160 t/s | 40 t/s |
| API server | Built-in | Via llama-server | Built-in | Via TGI |

For more information, visit the ExLlamaV3 GitHub repository and the EXL3 quantization specification.

Frequently Asked Questions

Q: What GPU do I need to run ExLlamaV3? A: Any NVIDIA GPU with CUDA support and at least 6GB VRAM for 7B models.

Q: Can ExLlamaV3 run on AMD GPUs? A: Currently limited to NVIDIA CUDA. AMD ROCm support is in development.

Q: How does EXL3 compare to GGUF quantization? A: EXL3 typically offers higher accuracy at the same bitrate and faster inference on GPU.

Q: Does ExLlamaV3 support multi-GPU inference? A: Yes, it supports tensor parallelism across multiple GPUs for larger models.

Q: Can I use LoRA adapters with ExLlamaV3? A: Yes, LoRA adapters can be loaded and swapped without reloading the base model.
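As a sketch of what that looks like in code: the Lora class, its from_directory loader, and the loras argument below are hypothetical names patterned on ExLlamaV2's LoRA support, so check the ExLlamaV3 repository for the actual interface.

```python
# Hypothetical LoRA-swapping sketch, patterned on ExLlamaV2's LoRA support;
# the class, loader, and `loras` argument names are assumptions.
from exllamav3 import Lora  # assumed module layout

# Hypothetical adapter path; the base model from the loading sketch stays
# resident in GPU memory the whole time.
adapter = Lora.from_directory(model, "/loras/customer-support")

with_adapter = generator.generate(
    prompt="Draft a refund reply.", max_new_tokens=64, loras=[adapter],
)
without_adapter = generator.generate(
    prompt="Draft a refund reply.", max_new_tokens=64,
)
```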
