TinyZero: Reproducing DeepSeek R1-Zero's Reasoning with RL for Under $30

TinyZero is a minimal reproduction of DeepSeek R1-Zero using reinforcement learning and the veRL framework, demonstrating emergent reasoning in small language models.


DeepSeek R1-Zero was widely regarded as a breakthrough when it was released in January 2025. The model demonstrated that pure reinforcement learning — without any supervised fine-tuning on human reasoning examples — could produce advanced chain-of-thought reasoning, self-correction, and even surprising “aha moments” where the model independently discovered better reasoning strategies mid-conversation. The catch? The training infrastructure was assumed to require massive compute clusters and budgets in the tens of millions of dollars.

Jiayi Pan’s TinyZero shatters that assumption entirely.

TinyZero is an open-source, minimal reproduction of the DeepSeek R1-Zero methodology that runs on a single GPU for under $30 in cloud compute costs. Using the veRL framework — a versatile reinforcement learning library for language models — TinyZero applies PPO (Proximal Policy Optimization) to small base models like Qwen-2.5-1.5B-Instruct and Qwen-2.5-7B. The training task is deceptively simple: given four numbers, the model must combine them using arithmetic operations (+, -, *, /) to reach a target value. Yet from this humble starting point, the same emergent reasoning behaviors that made DeepSeek R1-Zero famous begin to appear.
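To make the countdown task concrete, the sketch below brute-forces one instance: it searches for a way to combine four numbers with +, -, *, / to hit the target. This is an illustrative helper for understanding the task, not part of the TinyZero codebase, and it only tries left-to-right bracketings.

```python
from itertools import permutations, product

def solve_countdown(numbers, target):
    """Brute-force a countdown instance: use each number exactly once
    with +, -, *, / to reach the target. Illustrative only; covers
    left-to-right bracketings, not every possible parenthesization."""
    ops = ['+', '-', '*', '/']
    for perm in permutations(numbers):
        for op_combo in product(ops, repeat=len(numbers) - 1):
            expr = str(perm[0])
            for num, op in zip(perm[1:], op_combo):
                expr = f"({expr} {op} {num})"  # left-associative nesting
            try:
                if abs(eval(expr) - target) < 1e-9:
                    return expr
            except ZeroDivisionError:
                continue
    return None

print(solve_countdown([3, 5, 7, 2], 24))  # e.g. (((3 * 5) + 7) + 2)
```

The model, of course, gets no such search procedure: it must produce a correct expression in natural-language reasoning and is rewarded only on the final answer.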

The implications are profound. If reasoning abilities can be unlocked in a 1.5B parameter model for pocket change, the barrier to entry for reinforcement learning research in language models drops by several orders of magnitude. This is not just a reproduction — it is a democratization of one of the most important AI techniques of the decade.

What Is TinyZero and Why Should You Care?

TinyZero is first and foremost a research reproduction. Its primary goal is to demonstrate that the key findings of DeepSeek R1-Zero — that reinforcement learning alone can produce sophisticated reasoning in language models — are not dependent on massive model scales or proprietary infrastructure. The project achieves this by distilling the core RL methodology and applying it to much smaller models on a focused mathematical task.

| Aspect | TinyZero | DeepSeek R1-Zero |
|---|---|---|
| Base Model | Qwen-2.5-1.5B / 7B | DeepSeek-V3 (671B) |
| Training Framework | veRL (open source) | Proprietary |
| Training Cost | Under $30 | Millions of dollars |
| GPU Requirement | Single GPU | Large clusters |
| RL Algorithm | PPO | GRPO (Group Relative Policy Optimization) |
| Training Task | Countdown arithmetic | Diverse reasoning tasks |
| Emergent Behaviors | Self-verification, reflection, “aha moments” | Self-verification, reflection, “aha moments” |

How Does TinyZero Reproduce DeepSeek R1-Zero?

The methodology behind TinyZero is elegant in its simplicity. The veRL framework wraps the base language model and applies PPO reinforcement learning on a reward signal derived solely from the correctness of the model’s arithmetic solutions. There are no human-curated reasoning chains, no supervised fine-tuning steps, and no hand-crafted prompts showing the model how to think — the model must discover reasoning strategies entirely through trial and error.

The training pipeline proceeds as follows:

1. Base model (Qwen-2.5) receives a countdown task prompt
2. Model generates a response with chain-of-thought
3. Reward is computed: +1 for correct final answer, otherwise 0
4. PPO updates model parameters based on reward
5. Repeat for ~200-400 training steps
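The reward in step 3 is purely rule-based and verifiable. A minimal sketch of such a function is shown below, assuming the model wraps its final equation in `<answer>...</answer>` tags; the exact tag format and checks in TinyZero's own reward code may differ.

```python
import re

def countdown_reward(response: str, numbers: list, target: int) -> float:
    """Illustrative rule-based reward: 1.0 if the model's final equation
    uses exactly the given numbers and evaluates to the target, else 0.0.
    Assumes an <answer>...</answer> wrapper (hypothetical format)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # no parseable final answer
    equation = match.group(1).strip()
    # Allow only digits, arithmetic operators, parentheses, and spaces
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0
    # Every provided number must be used exactly once
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(eval(equation) - target) < 1e-9 else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0

print(countdown_reward("<answer>(3 * 5) + 7 + 2</answer>", [3, 5, 7, 2], 24))
```

Because correctness is machine-checkable, no learned reward model or human labels are needed anywhere in the loop.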

What Emergent Behaviors Does TinyZero Exhibit?

The most fascinating aspect of TinyZero is what the model learns to do without being explicitly taught. Within a few hundred training steps, the model naturally develops:

  • Self-verification: The model checks its intermediate calculations before committing to a final answer.
  • Backtracking and correction: When the model detects an error in its reasoning, it explicitly marks the mistake and restarts with a corrected approach.
  • Extended reasoning chains: Response length increases from simple one-line answers to multi-step reasoning spanning hundreds of tokens.
  • Reflective reasoning: The model evaluates its own thought process with statements like “Wait, let me check that calculation again.”
  • Strategic exploration: The model tries multiple approaches within a single response, evaluating each before selecting the best path forward.

The table below compares the reasoning behaviors observed across different model sizes:

| Behavior | Qwen-2.5-0.5B | Qwen-2.5-1.5B | Qwen-2.5-7B |
|---|---|---|---|
| Self-verification | Rare | Frequent | Consistent |
| Backtracking | Absent | Occasional | Frequent |
| Extended CoT (>200 tokens) | No | Yes | Yes |
| Multi-strategy exploration | No | Rare | Frequent |
| “Aha moment” reset | No | Occasional | Yes |
| Training cost (A100 80GB) | ~$5 | ~$15 | ~$30 |

How Is TinyZero Built and What Can It Teach Us?

The repository is structured as a minimal fork of the veRL framework with a single task implementation. The entire training setup is contained in a few hundred lines of Python and shell scripts. This minimalism is intentional — reducing the code to its essential elements makes the methodology transparent and reproducible.

Key architectural choices include:

  • veRL Framework: Provides the PPO training loop, rollout generation, and reward computation infrastructure
  • Countdown Task: A well-defined, verifiable task with unambiguous reward signals
  • Small Base Models: Qwen-2.5 series models that are widely available and inexpensive to run
  • Minimal Configuration: Straightforward hyperparameters that require minimal tuning
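To give a feel for what "minimal configuration" means in practice, the dictionary below sketches the handful of knobs such a run turns. The values are ballpark assumptions chosen for illustration, not TinyZero's exact settings; veRL itself is configured via YAML files and command-line overrides rather than a Python dict.

```python
# Illustrative PPO settings for a small-model countdown run.
# Values are assumptions for exposition, not TinyZero's actual config.
ppo_config = {
    "base_model": "Qwen2.5-1.5B",   # small base LM initializes the policy
    "train_batch_size": 256,        # prompts per PPO iteration
    "max_response_length": 1024,    # room for long chains of thought
    "actor_lr": 1e-6,               # small learning rate keeps PPO stable
    "kl_coef": 0.001,               # KL penalty against the reference model
    "total_steps": 300,             # ~200-400 steps suffice per the text
}

# The entire recipe is roughly: one task, one reward rule, one GPU,
# and a few hundred PPO steps over a config of this size.
print(len(ppo_config), "hyperparameters")
```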

How Much Does TinyZero Cost to Run?

The headline figure of “under $30” deserves scrutiny. The actual cost depends on the base model size and training duration:

| Component | 1.5B Model | 7B Model |
|---|---|---|
| Cloud GPU (Lambda Labs A100) | ~$1.10/hr | ~$1.10/hr |
| Training steps | ~200 | ~300 |
| Training time | ~6 hours | ~24 hours |
| Estimated total | ~$7 | ~$27 |

These costs assume spot/preemptible instances or reserved cloud GPU time. On-demand pricing at major cloud providers would be approximately 2-3x higher. Additionally, developers should account for approximately 2-5 hours of setup and debugging time on the first run.
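The arithmetic behind the estimates above is simple enough to check directly; the hourly rate is the article's assumed spot/reserved figure, not current cloud pricing.

```python
# Back-of-envelope check of the cost estimates above.
# The $1.10/hr rate is the article's assumption, not live pricing.
rate_per_hour = 1.10            # A100 spot/reserved rate, $/hr
cost_1p5b = rate_per_hour * 6   # ~6 hours for the 1.5B run
cost_7b = rate_per_hour * 24    # ~24 hours for the 7B run
on_demand_7b = cost_7b * 3      # on-demand can run ~2-3x higher

print(f"1.5B: ${cost_1p5b:.2f}, 7B: ${cost_7b:.2f}")  # $6.60 and $26.40
```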

What Are the Limitations of TinyZero?

While TinyZero is impressive as a proof of concept, it does have important limitations that practitioners should understand:

  1. Task specificity: The reproduction focuses on a single task (countdown arithmetic). Generalizing to the diverse reasoning tasks shown in R1-Zero would require more training data and compute.
  2. Model scale: Emergent behaviors at 1.5B parameters are less reliable and consistent than at 7B or the original 671B scale. Larger models show qualitatively better reasoning.
  3. Training stability: PPO training on small models can be unstable without careful hyperparameter tuning. The veRL framework manages this, but results may vary across random seeds.
  4. Evaluation depth: The countdown task provides a clean reward signal but does not test for reasoning generalization to unseen problem types.

Frequently Asked Questions

What is TinyZero?

TinyZero is an open-source, minimalist reproduction of DeepSeek R1-Zero’s reinforcement learning approach for training language models to reason. Created by researcher Jiayi Pan, the project demonstrates that emergent reasoning behaviors can arise in models as small as 1.5B parameters when trained with RL on countdown tasks, all for under $30 in compute costs.

How does TinyZero reproduce DeepSeek R1-Zero?

TinyZero uses the veRL framework to apply PPO reinforcement learning to Qwen-2.5 base models. The model is trained on a countdown-based mathematical reasoning task. Through RL training, the model naturally discovers advanced reasoning patterns without any supervised fine-tuning or human-curated reasoning data.

How can it cost under $30 to reproduce R1-Zero?

The training runs use small base models trained for approximately 200-400 steps on a single GPU. Using a rented cloud instance with an NVIDIA A100 or RTX 4090, total compute cost ranges from roughly $7 for the 1.5B model to about $27 for the 7B model. This dramatically contrasts with the millions typically associated with RL training for large language models.

What emergent behaviors does TinyZero exhibit?

TinyZero models develop self-verification, backtracking and correction when they detect errors, reflection on intermediate results, extended chain-of-thought reasoning, and behaviors resembling an “aha moment,” where the model suddenly improves its reasoning strategy mid-response.

What hardware is needed to run TinyZero?

For inference, any modern GPU with at least 8GB of VRAM is sufficient. For training the 1.5B model, a single A100 80GB or RTX 4090 is adequate. The full RL training pipeline runs on one GPU, making it accessible to individual researchers and students.
