TinyZero: Reproducing DeepSeek R1-Zero's Reasoning with RL for Under $30

TinyZero is a minimal reproduction of DeepSeek R1-Zero using reinforcement learning and the veRL framework, demonstrating emergent reasoning in small language models.


DeepSeek R1-Zero was widely regarded as a breakthrough when it was released in January 2025. The model demonstrated that pure reinforcement learning — without any supervised fine-tuning on human reasoning examples — could produce advanced chain-of-thought reasoning, self-correction, and even surprising “aha moments” where the model independently discovered better reasoning strategies mid-conversation. The catch? The training infrastructure was assumed to require massive compute clusters and budgets in the tens of millions of dollars.

Jiayi Pan’s TinyZero shatters that assumption entirely.

TinyZero is an open-source, minimal reproduction of the DeepSeek R1-Zero methodology that runs on a single GPU for under $30 in cloud compute costs. Using the veRL framework — a versatile reinforcement learning library for language models — TinyZero applies PPO (Proximal Policy Optimization) to small base models like Qwen-2.5-1.5B-Instruct and Qwen-2.5-7B. The training task is deceptively simple: given four numbers, the model must combine them using arithmetic operations (+, -, *, /) to reach a target value. Yet from this humble starting point, the same emergent reasoning behaviors that made DeepSeek R1-Zero famous begin to appear.
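To make the countdown task concrete, the sketch below brute-forces one instance: it searches for a way to combine four numbers with +, -, *, / to hit the target. This is an illustrative helper for understanding the task, not part of the TinyZero codebase, and it only tries left-to-right bracketings.

```python
from itertools import permutations, product

def solve_countdown(numbers, target):
    """Brute-force a countdown instance: use each number exactly once
    with +, -, *, / to reach the target. Illustrative only; covers
    left-to-right bracketings, not every possible parenthesization."""
    ops = ['+', '-', '*', '/']
    for perm in permutations(numbers):
        for op_combo in product(ops, repeat=len(numbers) - 1):
            expr = str(perm[0])
            for num, op in zip(perm[1:], op_combo):
                expr = f"({expr} {op} {num})"  # left-associative nesting
            try:
                if abs(eval(expr) - target) < 1e-9:
                    return expr
            except ZeroDivisionError:
                continue
    return None

print(solve_countdown([3, 5, 7, 2], 24))  # e.g. (((3 * 5) + 7) + 2)
```

The model, of course, gets no such search procedure: it must produce a correct expression in natural-language reasoning and is rewarded only on the final answer.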

The implications are profound. If reasoning abilities can be unlocked in a 1.5B parameter model for pocket change, the barrier to entry for reinforcement learning research in language models drops by several orders of magnitude. This is not just a reproduction — it is a democratization of one of the most important AI techniques of the decade.

What Is TinyZero and Why Should You Care?

TinyZero is first and foremost a research reproduction. Its primary goal is to demonstrate that the key findings of DeepSeek R1-Zero — that reinforcement learning alone can produce sophisticated reasoning in language models — are not dependent on massive model scales or proprietary infrastructure. The project achieves this by distilling the core RL methodology and applying it to much smaller models on a focused mathematical task.

| Aspect | TinyZero | DeepSeek R1-Zero |
|---|---|---|
| Base Model | Qwen-2.5-1.5B / 7B | DeepSeek-V3 (671B) |
| Training Framework | veRL (open source) | Proprietary |
| Training Cost | Under $30 | Millions of dollars |
| GPU Requirement | Single GPU | Large clusters |
| RL Algorithm | PPO | GRPO (Group Relative Policy Optimization) |
| Training Task | Countdown arithmetic | Diverse reasoning tasks |
| Emergent Behaviors | Self-verification, reflection, “aha moments” | Self-verification, reflection, “aha moments” |

How Does TinyZero Reproduce DeepSeek R1-Zero?

The methodology behind TinyZero is elegant in its simplicity. The veRL framework wraps the base language model and applies PPO reinforcement learning on a reward signal derived solely from the correctness of the model’s arithmetic solutions. There are no human-curated reasoning chains, no supervised fine-tuning steps, and no hand-crafted prompts showing the model how to think — the model must discover reasoning strategies entirely through trial and error.

The training pipeline proceeds as follows:

1. Base model (Qwen-2.5) receives a countdown task prompt
2. Model generates a response with chain-of-thought
3. Reward is computed: +1 for correct final answer, otherwise 0
4. PPO updates model parameters based on reward
5. Repeat for ~200-400 training steps
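The reward in step 3 is purely rule-based and verifiable. A minimal sketch of such a function is shown below, assuming the model wraps its final equation in `<answer>...</answer>` tags; the exact tag format and checks in TinyZero's own reward code may differ.

```python
import re

def countdown_reward(response: str, numbers: list, target: int) -> float:
    """Illustrative rule-based reward: 1.0 if the model's final equation
    uses exactly the given numbers and evaluates to the target, else 0.0.
    Assumes an <answer>...</answer> wrapper (hypothetical format)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # no parseable final answer
    equation = match.group(1).strip()
    # Allow only digits, arithmetic operators, parentheses, and spaces
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0
    # Every provided number must be used exactly once
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(eval(equation) - target) < 1e-9 else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0

print(countdown_reward("<answer>(3 * 5) + 7 + 2</answer>", [3, 5, 7, 2], 24))
```

Because correctness is machine-checkable, no learned reward model or human labels are needed anywhere in the loop.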

What Emergent Behaviors Does TinyZero Exhibit?

The most fascinating aspect of TinyZero is what the model learns to do without being explicitly taught. Within a few hundred training steps, the model naturally develops:

  • Self-verification: The model checks its intermediate calculations before committing to a final answer.
  • Backtracking and correction: When the model detects an error in its reasoning, it explicitly marks the mistake and restarts with a corrected approach.
  • Extended reasoning chains: Response length increases from simple one-line answers to multi-step reasoning spanning hundreds of tokens.
  • Reflective reasoning: The model evaluates its own thought process with statements like “Wait, let me check that calculation again.”
  • Strategic exploration: The model tries multiple approaches within a single response, evaluating each before selecting the best path forward.

The table below compares the reasoning behaviors observed across different model sizes:

| Behavior | Qwen-2.5-0.5B | Qwen-2.5-1.5B | Qwen-2.5-7B |
|---|---|---|---|
| Self-verification | Rare | Frequent | Consistent |
| Backtracking | Absent | Occasional | Frequent |
| Extended CoT (>200 tokens) | No | Yes | Yes |
| Multi-strategy exploration | No | Rare | Frequent |
| “Aha moment” reset | No | Occasional | Yes |
| Training cost (A100 80GB) | ~$5 | ~$15 | ~$30 |

How Is TinyZero Built and What Can It Teach Us?

The repository is structured as a minimal fork of the veRL framework with a single task implementation. The entire training setup is contained in a few hundred lines of Python and shell scripts. This minimalism is intentional — reducing the code to its essential elements makes the methodology transparent and reproducible.

Key architectural choices include:

  • veRL Framework: Provides the PPO training loop, rollout generation, and reward computation infrastructure
  • Countdown Task: A well-defined, verifiable task with unambiguous reward signals
  • Small Base Models: Qwen-2.5 series models that are widely available and inexpensive to run
  • Minimal Configuration: Straightforward hyperparameters that require minimal tuning
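To give a feel for what "minimal configuration" means in practice, the dictionary below sketches the handful of knobs such a run turns. The values are ballpark assumptions chosen for illustration, not TinyZero's exact settings; veRL itself is configured via YAML files and command-line overrides rather than a Python dict.

```python
# Illustrative PPO settings for a small-model countdown run.
# Values are assumptions for exposition, not TinyZero's actual config.
ppo_config = {
    "base_model": "Qwen2.5-1.5B",   # small base LM initializes the policy
    "train_batch_size": 256,        # prompts per PPO iteration
    "max_response_length": 1024,    # room for long chains of thought
    "actor_lr": 1e-6,               # small learning rate keeps PPO stable
    "kl_coef": 0.001,               # KL penalty against the reference model
    "total_steps": 300,             # ~200-400 steps suffice per the text
}

# The entire recipe is roughly: one task, one reward rule, one GPU,
# and a few hundred PPO steps over a config of this size.
print(len(ppo_config), "hyperparameters")
```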

How Much Does TinyZero Cost to Run?

The headline figure of “under $30” deserves scrutiny. The actual cost depends on the base model size and training duration:

| Component | 1.5B Model | 7B Model |
|---|---|---|
| Cloud GPU (Lambda Labs A100) | ~$1.10/hr | ~$1.10/hr |
| Training steps | ~200 | ~300 |
| Training time | ~6 hours | ~24 hours |
| Estimated total | ~$7 | ~$27 |

These costs assume spot/preemptible instances or reserved cloud GPU time. On-demand pricing at major cloud providers would be approximately 2-3x higher. Additionally, developers should account for approximately 2-5 hours of setup and debugging time on the first run.
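The arithmetic behind the estimates above is simple enough to check directly; the hourly rate is the article's assumed spot/reserved figure, not current cloud pricing.

```python
# Back-of-envelope check of the cost estimates above.
# The $1.10/hr rate is the article's assumption, not live pricing.
rate_per_hour = 1.10            # A100 spot/reserved rate, $/hr
cost_1p5b = rate_per_hour * 6   # ~6 hours for the 1.5B run
cost_7b = rate_per_hour * 24    # ~24 hours for the 7B run
on_demand_7b = cost_7b * 3      # on-demand can run ~2-3x higher

print(f"1.5B: ${cost_1p5b:.2f}, 7B: ${cost_7b:.2f}")  # $6.60 and $26.40
```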

What Are the Limitations of TinyZero?

While TinyZero is impressive as a proof of concept, it does have important limitations that practitioners should understand:

  1. Task specificity: The reproduction focuses on a single task (countdown arithmetic). Generalizing to the diverse reasoning tasks shown in R1-Zero would require more training data and compute.
  2. Model scale: Emergent behaviors at 1.5B parameters are less reliable and consistent than at 7B or the original 671B scale. Larger models show qualitatively better reasoning.
  3. Training stability: PPO training on small models can be unstable without careful hyperparameter tuning. The veRL framework manages this, but results may vary across random seeds.
  4. Evaluation depth: The countdown task provides a clean reward signal but does not test for reasoning generalization to unseen problem types.

Frequently Asked Questions

What is TinyZero?

TinyZero is an open-source, minimalist reproduction of DeepSeek R1-Zero’s reinforcement learning approach for training language models to reason. Created by researcher Jiayi Pan, the project demonstrates that emergent reasoning behaviors can arise in models as small as 1.5B parameters when trained with RL on countdown tasks, all for under $30 in compute costs.

How does TinyZero reproduce DeepSeek R1-Zero?

TinyZero uses the veRL framework to apply PPO reinforcement learning to Qwen-2.5 base models. The model is trained on a countdown-based mathematical reasoning task. Through RL training, the model naturally discovers advanced reasoning patterns without any supervised fine-tuning or human-curated reasoning data.

How can it cost under $30 to reproduce R1-Zero?

The training runs use small base models trained for approximately 200-400 steps on a single GPU. Using a rented cloud instance with an NVIDIA A100 or RTX 4090, total compute cost ranges from roughly $7 for the 1.5B model to about $27 for the 7B model. This dramatically contrasts with the millions typically associated with RL training for large language models.

What emergent behaviors does TinyZero exhibit?

TinyZero models develop self-verification, backtracking and correction when they detect errors, reflection on intermediate results, extended chain-of-thought reasoning, and behaviors resembling an “aha moment,” where the model suddenly improves its reasoning strategy mid-response.

What hardware is needed to run TinyZero?

For inference, any modern GPU with at least 8GB of VRAM is sufficient. For training the 1.5B model, a single A100 80GB or RTX 4090 is adequate. The full RL training pipeline runs on one GPU, making it accessible to individual researchers and students.
