
llama.cpp: High-Performance LLM Inference on CPU and GPU

llama.cpp is a high-performance C++ implementation for running LLMs locally on CPU and GPU with quantization, supporting hundreds of models.


The dream of running powerful language models entirely on your own hardware, without sending data to cloud APIs, was once considered impractical for anyone outside large tech companies. llama.cpp shattered that assumption. This plain C/C++ implementation has become the most popular tool for running LLMs locally, democratizing access to AI computation across virtually every hardware configuration.

Created by Georgi Gerganov, llama.cpp started as a focused implementation of Meta’s Llama architecture and has since grown into a universal inference engine supporting hundreds of model architectures, multiple backends (CPU, CUDA, Metal, ROCm, Vulkan), and a rich ecosystem of tools and integrations.

The central innovation of llama.cpp is the GGUF format and its quantization system. By representing model weights in reduced precision (down to 2-bit), llama.cpp can run models that would otherwise require enterprise-grade GPUs on ordinary consumer hardware. A 70B parameter model that normally needs 140GB of memory can run in just 35GB at 4-bit quantization.
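As a back-of-the-envelope check, weight memory scales roughly linearly with bits per weight. A minimal sketch of that arithmetic (ignoring the KV cache, activations, and file metadata, so real usage runs somewhat higher):

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate: parameters * bits / 8 bytes, in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 70B parameters: ~140 GB at FP16, ~35 GB at 4-bit quantization
print(estimate_weight_memory_gb(70, 16))  # 140.0
print(estimate_weight_memory_gb(70, 4))   # 35.0
```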


How Does llama.cpp’s Architecture Work?

llama.cpp is structured as a modular inference engine with support for multiple hardware backends.

graph TD
    A[GGUF Model File] --> B[llama.cpp Inference Engine]
    B --> C[CPU Backend\nx86 with AVX2/AVX-512\nARM with NEON]
    B --> D[CUDA Backend\nNVIDIA GPU\nTensor Cores]
    B --> E[Metal Backend\nApple Silicon GPU\nUnified Memory]
    B --> F[Vulkan Backend\nCross-Platform GPU\nAMD/Intel/NVIDIA]
    C --> G[Output Tokens]
    D --> G
    E --> G
    F --> G
    B --> H[Sampling Strategies\nTemperature, Top-K, Top-P\nRepetition Penalty]
    H --> G

The engine automatically selects the best available backend and can split model layers across CPU and different GPUs for maximum throughput.
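The same offloading behavior is reachable programmatically through the llama-cpp-python bindings. A minimal sketch, assuming the package is installed and pointing at a placeholder GGUF path:

```python
from llama_cpp import Llama

# Load a quantized model and offload layers to the GPU backend.
# n_gpu_layers=-1 requests full offload; a smaller value splits layers
# between GPU and CPU when VRAM is limited.
llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=4096,  # context window in tokens
)

result = llm("Explain quantization in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```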


What Quantization Levels Does GGUF Support?

The quantization level directly determines the tradeoff between model quality, memory usage, and inference speed.

| Quantization | Bits per Weight | Memory (7B model) | Quality vs FP16 | Use Case |
|---|---|---|---|---|
| FP16 | 16 | 14 GB | Baseline | Maximum quality, high-end GPUs |
| Q8_0 | 8 | 7 GB | Negligible loss | High quality, balanced memory |
| Q6_K | 6 | 5.3 GB | Minimal loss | Good quality, common choice |
| Q4_K_M | 4 | 4.2 GB | Small loss | Best quality-to-size ratio |
| Q4_0 | 4 | 3.8 GB | Moderate loss | Fits 6GB GPUs |
| Q3_K_S | 3 | 3.1 GB | Noticeable loss | Low-memory scenarios |
| Q2_K | 2 | 2.2 GB | Significant loss | Absolute minimum memory |

Q4_K_M (4-bit medium) is the most popular quantization level, offering a good balance of quality and efficiency for most use cases.


How Do You Use llama.cpp?

llama.cpp provides multiple interfaces to suit different usage patterns.

| Interface | Command / Method | Use Case |
|---|---|---|
| CLI (main) | `./llama-cli -m model.gguf -p "Hello"` | Quick questions, scripting |
| Interactive | `./llama-cli -m model.gguf -i` | Chat sessions, exploration |
| Server (API) | `./llama-server -m model.gguf` | Web apps, OpenAI-compatible API |
| Python bindings | llama-cpp-python | Python integration, automation |
| Embedded | Library mode | Custom applications |

The server mode is particularly powerful – it exposes an OpenAI-compatible REST API, meaning any tool that works with OpenAI can be pointed at a local llama.cpp instance by simply changing the base URL.
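For example, the official openai Python client can talk to a local instance; a minimal sketch, assuming llama-server is running on its default port 8080 (the model name is a placeholder, since the server serves whichever model it loaded):

```python
from openai import OpenAI

# Any OpenAI-compatible client works; only the base URL changes.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="local-model",  # placeholder; llama-server uses the model it was started with
    messages=[{"role": "user", "content": "Say hello from llama.cpp"}],
)
print(response.choices[0].message.content)
```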


What Are the System Requirements for Different Models?

llama.cpp can scale from a Raspberry Pi to a multi-GPU workstation.

| Model Size | Quantization | Minimum RAM | Typical Hardware |
|---|---|---|---|
| 1B-3B | Q4_K_M | 2-4 GB | Phone, Raspberry Pi 5 |
| 7B-8B | Q4_K_M | 6 GB | Laptop, MacBook Air |
| 13B-14B | Q4_K_M | 10 GB | Desktop, MacBook Pro |
| 30B-34B | Q4_K_M | 20 GB | Workstation, Mac Studio |
| 70B-72B | Q4_K_M | 40 GB | Server, multi-GPU setup |
| 120B+ | Q4_K_M | 70+ GB | Multi-node inference |

These modest requirements have made llama.cpp the backbone of the local AI movement, enabling privacy-preserving AI on personal devices.


FAQ

What is llama.cpp? llama.cpp is a high-performance C++ implementation for running large language models locally, created by Georgi Gerganov. It is optimized for both CPU and GPU inference, supports extensive model quantization via the GGUF format, and can run hundreds of open-source models on consumer hardware without Internet connectivity.

What is the GGUF format? GGUF (GPT-Generated Unified Format) is the file format developed for llama.cpp to store quantized language models. It supersedes the earlier GGML format and provides a self-contained model file that includes the model architecture, tokenizer, weights, and metadata in a single file. GGUF supports multiple quantization levels from Q2 (2-bit) to Q8 (8-bit) and various hybrid formats.
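The gguf Python package maintained in the llama.cpp repository can inspect this metadata; a minimal sketch, assuming `pip install gguf` and a placeholder model path:

```python
from gguf import GGUFReader

# Open a GGUF file and list its embedded metadata keys and tensor count.
reader = GGUFReader("./models/model.Q4_K_M.gguf")  # placeholder path

for key in reader.fields:
    print(key)  # e.g. general.architecture, tokenizer and quantization settings
print("tensor count:", len(reader.tensors))
```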

What hardware can run llama.cpp? llama.cpp is designed to run on a wide range of hardware, including CPUs (x86 with AVX extensions, ARM with NEON, Apple Silicon), GPUs (NVIDIA via CUDA, AMD via ROCm, Apple via Metal, and cross-vendor Vulkan), and hybrid setups that split layers between CPU and GPU. A 7B parameter model quantized to 4-bit can run in about 6GB of RAM, while larger models scale with available memory.

What models are compatible with llama.cpp? llama.cpp supports hundreds of model architectures including Llama, Mistral, Mixtral, Falcon, Gemma, Qwen, Phi, DeepSeek, Command R, DBRX, Yi, StarCoder, CodeLlama, and many more. New architectures are regularly added through community contributions. The main requirement is that the model be converted to GGUF format.

Can llama.cpp be used as a server? Yes, llama.cpp includes a built-in HTTP server that provides an OpenAI-compatible API, making it usable as a drop-in replacement for OpenAI’s API. It supports completions, chat completions, embeddings, and includes CORS headers for web application integration. This enables local AI applications with standard API tooling.
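As with chat completions, embeddings can be requested through the same endpoint style; a minimal sketch, assuming llama-server was started with embeddings enabled on the default port:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Request an embedding vector for a piece of text from the local server.
emb = client.embeddings.create(model="local-model", input="llama.cpp runs locally")
print(len(emb.data[0].embedding))  # dimensionality of the returned vector
```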

