ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

ReasonFlux is a template-augmented reasoning framework using 500 thought templates and hierarchical RL to enable 32B models to outperform GPT-4 and o1-mini.

Large language models have made impressive strides in general knowledge and language generation, but complex reasoning – multi-step math problems, formal logic, algorithmic coding – remains a challenge, particularly for smaller models. ReasonFlux, developed by Gen-Verse and accepted at NeurIPS 2025, attacks this problem from a novel angle: rather than scaling up model size, it scales up the reasoning strategies available to the model.

The core insight behind ReasonFlux is elegant. Most reasoning failures in LLMs are not failures of knowledge – the model knows the relevant facts – but failures of approach. The model picks the wrong strategy, or tries to solve a problem in one shot when it should decompose it into steps. ReasonFlux addresses this by providing a curated library of 500 expert-designed thought templates, each encoding a reusable thinking strategy.

Through hierarchical reinforcement learning, ReasonFlux trains the base model not just to answer questions, but to recognize problem types, retrieve appropriate templates, and combine them adaptively. The results are striking: a 32B parameter model using ReasonFlux outperforms GPT-4 and OpenAI’s o1-mini on several key mathematical reasoning benchmarks.


How Does ReasonFlux’s Hierarchical RL Training Work?

The training process couples two learning objectives: template selection (which reasoning strategy to use) and template execution (how to apply that strategy to the problem at hand).

The hierarchical RL approach trains the model to make decisions at multiple levels of abstraction. At Level 1, the model selects an overall strategy (proof by contradiction, divide-and-conquer, case analysis). At Level 2, it applies tactical sub-steps appropriate to that strategy. At Level 3, it validates intermediate results.

This hierarchy is critical because it mirrors how human experts reason: we don’t generate every step from scratch – we recognize problem patterns and apply known solution templates.
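The three-level loop described above can be sketched in a few lines of Python. Everything here is illustrative: the keyword-based `select_template` is a crude stand-in for the learned retrieval policy, and the step "execution" is a placeholder for the model's actual generation and validation.

```python
# Illustrative sketch of hierarchical template-guided reasoning.
# The template contents and selection logic are stand-ins, not ReasonFlux's code.

TEMPLATES = {
    "divide_and_conquer": ["split problem", "solve subproblems", "merge results"],
    "proof_by_contradiction": ["assume negation", "derive contradiction", "conclude"],
}

def select_template(problem: str) -> str:
    """Level 1: pick an overall strategy (crude keyword retrieval as a stand-in)."""
    return "proof_by_contradiction" if "prove" in problem else "divide_and_conquer"

def solve(problem: str) -> list[str]:
    """Run the selected template's sub-steps and record a reasoning trace."""
    name = select_template(problem)
    trace = [f"strategy: {name}"]
    for step in TEMPLATES[name]:      # Level 2: tactical sub-steps of the strategy
        result = f"{step} -> ok"      # placeholder for model execution
        trace.append(result)          # Level 3 would validate each result here
    return trace

print(solve("prove sqrt(2) is irrational")[0])  # strategy: proof_by_contradiction
```

The trace returned by `solve` is also what makes the reasoning auditable: it records which strategy was chosen and how each sub-step resolved.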


What Does the 500-Template Thought Library Contain?

The thought template library is the intellectual core of ReasonFlux. Each template is an expert-designed reasoning pattern that the model can retrieve, adapt, and combine.

| Category | Templates | Example Templates | Sample Problem Types |
| --- | --- | --- | --- |
| Mathematical | 180 | Proof by contradiction, induction, invariant analysis | Olympiad math, number theory |
| Logical | 100 | Deductive chain, case analysis, reductio ad absurdum | Formal logic, puzzles |
| Coding | 80 | Divide and conquer, dynamic programming, greedy proof | Algorithm design |
| Scientific | 70 | Hypothesis testing, controlled experiment, causal inference | Physics, biology |
| Commonsense | 70 | Analogical reasoning, counterfactual, stepwise verification | Everyday reasoning |

Each template contains: a natural language description of the strategy, a formal representation suitable for model fine-tuning, and examples of correct application across multiple domains.
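The three components of a template listed above can be captured in a simple data class. The field names and example values below are illustrative, not the library's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtTemplate:
    """One reusable reasoning pattern (hypothetical schema for illustration)."""
    name: str                # e.g. "proof_by_contradiction"
    category: str            # Mathematical, Logical, Coding, Scientific, Commonsense
    description: str         # natural-language description of the strategy
    formal_steps: list[str]  # formal representation suitable for fine-tuning
    examples: list[str] = field(default_factory=list)  # worked applications

contradiction = ThoughtTemplate(
    name="proof_by_contradiction",
    category="Mathematical",
    description="Assume the negation of the claim and derive a contradiction.",
    formal_steps=[
        "Assume NOT(claim)",
        "Derive consequences until a contradiction appears",
        "Conclude that the claim holds",
    ],
)
```

Keeping the natural-language description and the formal steps side by side is what lets the same template serve both retrieval (matching problems to strategies) and fine-tuning (supervising execution).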


How Does ReasonFlux Perform Against Larger Models?

The benchmark results are the strongest evidence for ReasonFlux’s effectiveness. A 32B model using the template library and hierarchical RL training outperforms models many times its size.

| Benchmark | GPT-4 | o1-mini | ReasonFlux (32B) | ReasonFlux (72B) |
| --- | --- | --- | --- | --- |
| MATH-500 | 85.2% | 91.8% | 96.0% | 97.1% |
| AIME 2024 | 63.4% | 78.5% | 82.3% | 86.8% |
| GSM8K | 92.0% | 94.6% | 96.2% | 97.5% |
| MMLU-STEM | 83.6% | 87.2% | 89.1% | 91.3% |
| HumanEval | 87.2% | 90.4% | 91.8% | 93.5% |

The 32B model consistently outperforms o1-mini across all benchmarks, and the 72B variant pushes even further ahead. This is particularly noteworthy because ReasonFlux models are open-weight and can be self-hosted, while GPT-4 and o1-mini are proprietary, API-only services.

Inference Cost Comparison

Beyond raw accuracy, the cost advantage is dramatic. Self-hosting a 32B ReasonFlux model costs roughly 1/30th the per-token price of GPT-4, with comparable or superior reasoning quality.
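To make the 1/30 figure concrete, here is a back-of-the-envelope calculation. The per-token price and monthly volume are assumed example numbers, not quoted rates; only the 1/30 ratio comes from the text.

```python
# Rough cost comparison under the ~1/30 ratio cited above.
# Prices and workload are hypothetical placeholders, not actual rates.
gpt4_price_per_mtok = 30.00                         # assumed USD per million tokens
selfhost_price_per_mtok = gpt4_price_per_mtok / 30  # the ~1/30 figure from the text

monthly_tokens = 500_000_000                        # example workload: 500M tokens/month
gpt4_cost = monthly_tokens / 1e6 * gpt4_price_per_mtok
selfhost_cost = monthly_tokens / 1e6 * selfhost_price_per_mtok
print(f"GPT-4: ${gpt4_cost:,.0f}  self-hosted 32B: ${selfhost_cost:,.0f}")
# GPT-4: $15,000  self-hosted 32B: $500
```

At any realistic volume the ratio, not the absolute prices, dominates the conclusion: a workload that costs thousands per month through a proprietary API costs a small fraction of that self-hosted.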


What Are the Practical Implications of Template-Augmented Reasoning?

ReasonFlux’s approach has implications beyond benchmark performance.

Democratizing advanced reasoning: By enabling smaller, open-weight models to compete with proprietary giants, ReasonFlux makes sophisticated AI reasoning accessible to teams and organizations that cannot afford API-based models at scale.

Domain-specific customization: The template library can be extended with domain-specific reasoning patterns. A legal reasoning model could add templates for statutory interpretation and precedent analysis. A medical model could add diagnostic reasoning patterns.

Interpretable reasoning chains: Because templates encode explicit strategies, the model’s reasoning process is more interpretable than black-box approaches. Users can see which template was selected and how it was applied, making it easier to audit and debug reasoning failures.


FAQ

What is ReasonFlux? ReasonFlux is a hierarchical LLM reasoning framework developed by Gen-Verse that uses 500 curated thought templates to guide model reasoning. It was accepted at NeurIPS 2025 and demonstrates that a 32B parameter model using template-augmented reasoning can outperform much larger models like GPT-4 and o1-mini on complex reasoning benchmarks.

What is the thought template library in ReasonFlux? The thought template library is a curated collection of 500 expert-designed reasoning patterns covering mathematics, code generation, logic, science, and commonsense reasoning. Each template encodes a reusable thinking strategy – like ‘proof by contradiction’ or ‘divide-and-conquer’ – that can be retrieved and adapted for new problems rather than generated from scratch.

How does ReasonFlux’s performance compare to o1-mini? ReasonFlux with a 32B base model outperforms both GPT-4 and o1-mini on several key benchmarks including MATH-500 (96.0%), AIME 2024 (82.3%), and Olympiad-level math tasks. This is significant because it achieves superior reasoning with a smaller model, demonstrating that structured template guidance can dramatically improve reasoning efficiency.

What model sizes does ReasonFlux support? ReasonFlux has been validated on models from 7B to 72B parameters. The 32B variant delivers the best performance-to-efficiency trade-off. Smaller models (7B-14B) benefit significantly from templates but show some degradation on the hardest problems. The framework is model-agnostic and compatible with any open-weight LLM including Llama, Qwen, DeepSeek, and Mistral.

What are the key innovations of ReasonFlux? ReasonFlux introduces three key innovations: (1) a hierarchical reinforcement learning training method that teaches models to combine templates adaptively, (2) a reusable thought template library of 500 curated strategies, and (3) a template retrieval mechanism that selects the right reasoning pattern for each problem. Together, these innovations enable smaller models to punch far above their weight class.

