
llm.c: Karpathy's Minimal C Implementation of LLM Training

llm.c is Andrej Karpathy's clean, minimal implementation of LLM training in pure C/CUDA, designed for educational understanding of how transformers work.


Most developers and researchers who work with large language models interact with them through high-level frameworks like PyTorch or Hugging Face Transformers. These frameworks hide immense complexity behind elegant APIs, but they also obscure the fundamental mechanics of how these models actually learn. llm.c tears away that abstraction, providing a complete, working implementation of GPT-2 training in pure C.

Created by Andrej Karpathy (formerly Director of AI at Tesla, co-founder of OpenAI), llm.c is first and foremost an educational project. It implements the entire forward pass, backward pass, and training loop for a transformer language model using nothing but standard C libraries, without a single dependency on PyTorch, TensorFlow, or any machine learning framework.

The educational philosophy is radical but effective: when you strip away the framework, every matrix multiplication, every gradient computation, every optimizer update must be written explicitly. There is no autograd hiding the backpropagation. There is no model.compile() abstracting the training loop. Every line of code corresponds directly to a mathematical operation in the transformer architecture.
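To make this concrete, here is a minimal sketch of what "written explicitly" looks like in practice: a forward matrix multiply as plain nested loops, in the spirit of llm.c's matmul_forward. The signature and memory layout here are simplified for illustration and are not copied from train_gpt2.c.

```c
#include <stddef.h>

// Sketch of an explicit forward matrix multiply (simplified, in the spirit of
// llm.c's matmul_forward). For every (batch, time) position:
//   out[b,t,:] = inp[b,t,:] @ weight^T + bias
void matmul_forward(float* out, const float* inp, const float* weight, const float* bias,
                    int B, int T, int C, int OC) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * C;   // input vector at this position
            float* o = out + (b * T + t) * OC;        // output vector at this position
            for (int oc = 0; oc < OC; oc++) {
                float val = (bias != NULL) ? bias[oc] : 0.0f;
                const float* w = weight + oc * C;     // one row of the weight matrix
                for (int c = 0; c < C; c++) {
                    val += x[c] * w[c];
                }
                o[oc] = val;
            }
        }
    }
}
```

Every other layer in the model is written in the same style: raw loops over raw float arrays, with nothing between the code and the math.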


How Does llm.c Implement the Full Training Pipeline?

llm.c implements every component of the transformer training pipeline in explicit C code.

```mermaid
graph LR
    A[Input Text\nTokenized] --> B[Embedding Layer\nToken + Position Embeddings]
    B --> C[Transformer Block x12\nSelf-Attention + FFN]
    C --> D[Layer Norm + Final Projection]
    D --> E[Cross-Entropy Loss]
    E --> F[Backward Pass\nExplicit Gradients]
    F --> G[Parameter Update\nAdam Optimizer in C]
    G --> B
    subgraph Backward
        F --> H[Gradient wrt Embeddings]
        F --> I[Gradient wrt Attention\nQ, K, V, Output]
        F --> J[Gradient wrt FFN\nFC + Projection]
        F --> K[Gradient wrt Layer Norm\nScale + Shift]
    end
```

There is no autograd engine. Every gradient formula is derived and implemented by hand, making the mathematics of backpropagation completely transparent.
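As an illustration of what "derived and implemented by hand" means, the sketch below gives the backward pass for the naive matrix multiply shown earlier, with each gradient written out directly from the chain rule. It is a simplified approximation of what llm.c's matmul_backward does, not the exact code.

```c
#include <stddef.h>

// Hand-derived backward pass for the matmul above (simplified sketch).
// Given dout = dL/dout, accumulate gradients via the chain rule:
//   dinp[b,t,c]   += sum_oc  dout[b,t,oc] * weight[oc,c]
//   dweight[oc,c] += sum_bt  dout[b,t,oc] * inp[b,t,c]
//   dbias[oc]     += sum_bt  dout[b,t,oc]
void matmul_backward(float* dinp, float* dweight, float* dbias,
                     const float* dout, const float* inp, const float* weight,
                     int B, int T, int C, int OC) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* dout_bt = dout + (b * T + t) * OC;
            const float* inp_bt = inp + (b * T + t) * C;
            float* dinp_bt = dinp + (b * T + t) * C;
            for (int oc = 0; oc < OC; oc++) {
                float d = dout_bt[oc];
                const float* wrow = weight + oc * C;
                float* dwrow = dweight + oc * C;
                if (dbias != NULL) { dbias[oc] += d; }
                for (int c = 0; c < C; c++) {
                    dinp_bt[c] += wrow[c] * d;   // gradient w.r.t. the input
                    dwrow[c] += inp_bt[c] * d;   // gradient w.r.t. the weight
                }
            }
        }
    }
}
```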


What Are the Key Implementation Details in llm.c?

The implementation covers every major component of the transformer training stack.

| Component | C Implementation | What You Learn |
| --- | --- | --- |
| Token embeddings | Embedding lookup table | How tokens become vectors |
| Positional encodings | Learned position embeddings | How position information is added |
| Self-attention | QKV projection + softmax + aggregation | How attention weights are computed |
| Multi-head attention | Split/combine heads | Parallel attention computation |
| Feed-forward network | Two-layer MLP with GELU | How the FFN transforms representations |
| Layer normalization | Mean + variance computation | How normalization stabilizes training |
| Residual connections | Skip connections | How gradients flow through the network |
| Adam optimizer | Momentum + adaptive learning rates | How parameters are updated |
| Cross-entropy loss | Softmax + negative log likelihood | How loss measures prediction quality |

Each component is typically implemented in 50-100 lines of C, making the entire architecture comprehensible.
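For a flavor of what one of those 50-100 line components looks like, the sketch below implements the layer normalization row from the table: compute the mean and variance over the channel dimension at each position, normalize, then apply the learned scale and shift. It is written in the spirit of llm.c's layernorm_forward but simplified; the exact code in the repository additionally caches the mean and reciprocal standard deviation for the backward pass.

```c
#include <math.h>

// Sketch of a layer-norm forward pass over the channel dimension (simplified).
void layernorm_forward(float* out, const float* inp, const float* gamma, const float* beta,
                       int B, int T, int C) {
    const float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * C;
            float* o = out + (b * T + t) * C;
            // mean over the C channels
            float mean = 0.0f;
            for (int c = 0; c < C; c++) { mean += x[c]; }
            mean /= C;
            // variance over the C channels
            float var = 0.0f;
            for (int c = 0; c < C; c++) {
                float d = x[c] - mean;
                var += d * d;
            }
            var /= C;
            float rstd = 1.0f / sqrtf(var + eps);
            // normalize, then apply learned scale (gamma) and shift (beta)
            for (int c = 0; c < C; c++) {
                o[c] = gamma[c] * ((x[c] - mean) * rstd) + beta[c];
            }
        }
    }
}
```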


llm.c vs PyTorch Implementation: A Comparison

The contrast between llm.c’s approach and PyTorch’s abstraction is striking.

| Aspect | PyTorch Implementation | llm.c Implementation |
| --- | --- | --- |
| Lines of code (full training) | ~500 (with framework) | ~5000 (no framework) |
| Backward pass | Automatic (autograd) | Manual (every gradient) |
| GPU support | Automatic (CUDA tensors) | Manual (custom CUDA kernels) |
| Dependencies | PyTorch, CUDA, tokenizers | Standard C library only |
| Readability (ML experts) | High (abstracted) | High (explicit) |
| Educational value | Good (high-level understanding) | Excellent (full understanding) |

The 10x increase in code is not bloat – it is the missing explanation that frameworks normally hide.
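As one concrete example of that hidden explanation, the sketch below shows the kind of per-parameter update that a single optimizer.step() call normally abstracts away. It is a simplified, AdamW-style loop written for illustration; the function name and hyperparameter handling are assumptions, not the exact update code in llm.c.

```c
#include <math.h>

// Sketch of the per-parameter Adam update with decoupled weight decay (AdamW-style).
// m and v are the running first and second moment estimates, one float per parameter.
void adam_update(float* params, const float* grads, float* m, float* v,
                 long num_params, int step, float lr,
                 float beta1, float beta2, float eps, float weight_decay) {
    for (long i = 0; i < num_params; i++) {
        float g = grads[i];
        // update biased moment estimates
        m[i] = beta1 * m[i] + (1.0f - beta1) * g;
        v[i] = beta2 * v[i] + (1.0f - beta2) * g * g;
        // bias correction (step counts from 1)
        float m_hat = m[i] / (1.0f - powf(beta1, (float)step));
        float v_hat = v[i] / (1.0f - powf(beta2, (float)step));
        // Adam step plus decoupled weight decay
        params[i] -= lr * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * params[i]);
    }
}
```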


How Does the CUDA Version Accelerate Training?

The CUDA implementation provides meaningful training speeds while maintaining educational clarity.

| Component | CPU Implementation | CUDA Kernel |
| --- | --- | --- |
| Matrix multiplication | Nested loops | Shared memory tiling |
| Softmax | Sequential computation | Warp-level reduction |
| Layer norm | Sequential statistics | Parallel reduction |
| Attention | Full O(n^2) matrix | Memory-efficient kernel |
| Adam update | Loop over parameters | Element-wise parallelism |

The CUDA kernels are written to balance performance with readability, serving as practical examples of GPU programming for ML workloads.
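As a taste of the simplest pattern in the table above, element-wise parallelism, here is a sketch of the Adam update written as a CUDA kernel in which each thread updates exactly one parameter. The kernel name, signature, and launch configuration are illustrative assumptions, not the actual kernel shipped in llm.c.

```cuda
// Sketch of an element-wise AdamW update kernel: one thread per parameter.
// beta1_corr and beta2_corr are the bias-correction terms (1 - beta^step),
// precomputed once per step on the host.
__global__ void adamw_kernel(float* params, const float* grads, float* m, float* v,
                             long num_params, float lr, float beta1, float beta2,
                             float beta1_corr, float beta2_corr,
                             float eps, float weight_decay) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_params) return;            // guard against the last partial block
    float g = grads[i];
    float mi = beta1 * m[i] + (1.0f - beta1) * g;
    float vi = beta2 * v[i] + (1.0f - beta2) * g * g;
    m[i] = mi;
    v[i] = vi;
    float m_hat = mi / beta1_corr;
    float v_hat = vi / beta2_corr;
    params[i] -= lr * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * params[i]);
}

// Host-side launch, one thread per parameter, e.g.:
//   int block = 512;
//   int grid = (int)((num_params + block - 1) / block);
//   adamw_kernel<<<grid, block>>>(params, grads, m, v, num_params, lr,
//                                 0.9f, 0.999f, beta1_corr, beta2_corr, 1e-8f, 0.0f);
```

Reductions such as softmax and layer norm require more care (warp shuffles and shared memory), but they follow the same principle: the CPU loop is reorganized so that independent work items map onto threads.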


FAQ

What is llm.c? llm.c is Andrej Karpathy’s educational implementation of GPT-2 training in pure C/CUDA. It implements the complete forward pass, backward pass, and training loop for a transformer language model using only standard C libraries and CUDA, with no dependencies on PyTorch, TensorFlow, or any ML framework. It is designed to be a clean, readable reference for understanding exactly how LLM training works at every level.

Why is llm.c implemented in pure C? Karpathy implemented llm.c in C to remove the abstraction layers that frameworks like PyTorch introduce. In PyTorch, the backward pass is handled automatically by autograd, making it opaque. In C, every gradient computation must be written explicitly, providing complete visibility into the training mechanics. This makes llm.c an unparalleled educational tool for understanding the mathematics and implementation of transformer training.

What does the CUDA version of llm.c add? The CUDA version of llm.c extends the C implementation with GPU-accelerated kernels for all operations. Each layer of the transformer (self-attention, feed-forward, layer norm, embedding) is implemented as a custom CUDA kernel. This allows the implementation to train at meaningful speeds (training a small GPT-2 on a single GPU in hours) while maintaining educational clarity.

Can I actually train a model with llm.c? Yes, llm.c can train a full GPT-2 model. The C version can train a small model on CPU for educational purposes. The CUDA version can train a 124M parameter GPT-2 on a single GPU, achieving training speeds comparable to PyTorch implementations. The training produces real checkpoints that can generate text, matching the quality of framework-based implementations.

What can you learn from studying llm.c? Studying llm.c provides a deep understanding of the full transformer training stack: how self-attention is computed (Q, K, V projections, softmax, weighted aggregation), how backpropagation flows through each layer, how layer normalization and residual connections work, how the Adam optimizer updates parameters, how tokenization and embeddings work, and how CUDA kernels accelerate training.

