Most developers and researchers who work with large language models interact with them through high-level frameworks like PyTorch or Hugging Face Transformers. These frameworks hide immense complexity behind elegant APIs, but they also obscure the fundamental mechanics of how these models actually learn. llm.c tears away that abstraction, providing a complete, working implementation of GPT-2 training in pure C.
Created by Andrej Karpathy (formerly Director of AI at Tesla, co-founder of OpenAI), llm.c is first and foremost an educational project. It implements the entire forward pass, backward pass, and training loop for a transformer language model using nothing but standard C libraries, without a single dependency on PyTorch, TensorFlow, or any machine learning framework.
The educational philosophy is radical but effective: when you strip away the framework, every matrix multiplication, every gradient computation, every optimizer update must be written explicitly. There is no autograd hiding the backpropagation. There is no model.compile() abstracting the training loop. Every line of code corresponds directly to a mathematical operation in the transformer architecture.
How Does llm.c Implement the Full Training Pipeline?
llm.c implements every component of the transformer training pipeline in explicit C code.
```mermaid
graph LR
    A[Input Text\nTokenized] --> B[Embedding Layer\nToken + Position Embeddings]
    B --> C[Transformer Block x12\nSelf-Attention + FFN]
    C --> D[Layer Norm + Final Projection]
    D --> E[Cross-Entropy Loss]
    E --> F[Backward Pass\nExplicit Gradients]
    F --> G[Parameter Update\nAdam Optimizer in C]
    G --> B
    subgraph Backward
        F --> H[Gradient wrt Embeddings]
        F --> I[Gradient wrt Attention\nQ, K, V, Output]
        F --> J[Gradient wrt FFN\nFC + Projection]
        F --> K[Gradient wrt Layer Norm\nScale + Shift]
    end
```
There is no autograd engine. Every gradient formula is derived and implemented by hand, making the mathematics of backpropagation completely transparent.
What Are the Key Implementation Details in llm.c?
The implementation covers every major component of the transformer training stack.
| Component | C Implementation | What You Learn |
|---|---|---|
| Token embeddings | Embedding lookup table | How tokens become vectors |
| Positional encodings | Learned position embeddings | How position information is added |
| Self-attention | QKV projection + softmax + aggregation | How attention weights are computed |
| Multi-head attention | Split/combine heads | Parallel attention computation |
| Feed-forward network | Two-layer MLP with GELU | How the FFN transforms representations |
| Layer normalization | Mean + variance computation | How normalization stabilizes training |
| Residual connections | Skip connections | How gradients flow through the network |
| Adam optimizer | Momentum + adaptive learning rates | How parameters are updated |
| Cross-entropy loss | Softmax + negative log likelihood | How loss measures prediction quality |
Each component is typically implemented in 50-100 lines of C, making the entire architecture comprehensible.
llm.c vs PyTorch Implementation: A Comparison
The contrast between llm.c’s approach and PyTorch’s abstraction is striking.
| Aspect | PyTorch Implementation | llm.c Implementation |
|---|---|---|
| Lines of code (full training) | ~500 (with framework) | ~5000 (no framework) |
| Backward pass | Automatic (autograd) | Manual (every gradient) |
| GPU support | Automatic (CUDA tensors) | Manual (custom CUDA kernels) |
| Dependencies | PyTorch, CUDA, tokenizers | Standard C library only |
| Readability (ML experts) | High (abstracted) | High (explicit) |
| Educational value | Good (high-level understanding) | Excellent (full understanding) |
The 10x increase in code is not bloat – it is the missing explanation that frameworks normally hide.
How Does the CUDA Version Accelerate Training?
The CUDA implementation provides meaningful training speeds while maintaining educational clarity.
| Component | CPU Implementation | CUDA Kernel |
|---|---|---|
| Matrix multiplication | Nested loops | Shared memory tiling |
| Softmax | Sequential computation | Warp-level reduction |
| Layer norm | Sequential statistics | Parallel reduction |
| Attention | Full O(n^2) matrix | Memory-efficient kernel |
| Adam update | Loop over parameters | Element-wise parallelism |
The CUDA kernels are written to balance performance with readability, serving as practical examples of GPU programming for ML workloads.
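The shared-memory tiling in the first row has a direct CPU analogue: blocking the matmul loops so each small tile of the operands is reused while it is still hot in cache, which is the same reuse a CUDA kernel gets by staging tiles in shared memory. A minimal sketch of the idea (illustrative only; llm.c's actual CUDA kernels are more elaborate):

```c
#define TILE 8

// C = A * B for N x N row-major matrices, computed block by block so
// each TILE-sized chunk of A and B is reused while it stays in cache.
void matmul_tiled(float *Cm, const float *A, const float *B, int N) {
    for (int i = 0; i < N * N; i++) Cm[i] = 0.0f;
    for (int i0 = 0; i0 < N; i0 += TILE)
        for (int k0 = 0; k0 < N; k0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                // Multiply one TILE x TILE block pair
                for (int i = i0; i < i0 + TILE && i < N; i++)
                    for (int k = k0; k < k0 + TILE && k < N; k++) {
                        float a = A[i * N + k];
                        for (int j = j0; j < j0 + TILE && j < N; j++)
                            Cm[i * N + j] += a * B[k * N + j];
                    }
}
```

On a GPU the same blocking is expressed by having each thread block load a tile into shared memory and synchronize before accumulating, which is what the "shared memory tiling" entry refers to.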
FAQ
What is llm.c? llm.c is Andrej Karpathy’s educational implementation of GPT-2 training in pure C/CUDA. It implements the complete forward pass, backward pass, and training loop for a transformer language model using only standard C libraries and CUDA, with no dependencies on PyTorch, TensorFlow, or any ML framework. It is designed to be a clean, readable reference for understanding exactly how LLM training works at every level.
Why is llm.c implemented in pure C? Karpathy implemented llm.c in C to remove the abstraction layers that frameworks like PyTorch introduce. In PyTorch, the backward pass is handled automatically by autograd, making it opaque. In C, every gradient computation must be written explicitly, providing complete visibility into the training mechanics. This makes llm.c an unparalleled educational tool for understanding the mathematics and implementation of transformer training.
What does the CUDA version of llm.c add? The CUDA version of llm.c extends the C implementation with GPU-accelerated kernels for all operations. Each layer of the transformer (self-attention, feed-forward, layer norm, embedding) is implemented as a custom CUDA kernel. This allows the implementation to train at meaningful speeds (training a small GPT-2 on a single GPU in hours) while maintaining educational clarity.
Can I actually train a model with llm.c? Yes, llm.c can train a full GPT-2 model. The C version can train a small model on CPU for educational purposes. The CUDA version can train a 124M parameter GPT-2 on a single GPU, achieving training speeds comparable to PyTorch implementations. The training produces real checkpoints that can generate text, matching the quality of framework-based implementations.
What can you learn from studying llm.c? Studying llm.c provides a deep understanding of the full transformer training stack: how self-attention is computed (Q, K, V projections, softmax, weighted aggregation), how backpropagation flows through each layer, how layer normalization and residual connections work, how the Adam optimizer updates parameters, how tokenization and embeddings work, and how CUDA kernels accelerate training.
Further Reading
- llm.c GitHub Repository – Source code, documentation, and examples by Andrej Karpathy
- Let’s Build GPT from Scratch – Karpathy’s video lecture on building a GPT from scratch
- Attention Is All You Need (ArXiv) – The original transformer paper
- CUDA Programming Guide – Official NVIDIA CUDA documentation