ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

ReasonFlux is a template-augmented reasoning framework using 500 thought templates and hierarchical RL to enable 32B models to outperform GPT-4 and o1-mini.

Large language models have made impressive strides in general knowledge and language generation, but complex reasoning – multi-step math problems, formal logic, algorithmic coding – remains a challenge, particularly for smaller models. ReasonFlux, developed by Gen-Verse and accepted at NeurIPS 2025, attacks this problem from a novel angle: rather than scaling up model size, it scales up the reasoning strategies available to the model.

The core insight behind ReasonFlux is elegant. Most reasoning failures in LLMs are not failures of knowledge – the model knows the relevant facts – but failures of approach. The model picks the wrong strategy, or tries to solve a problem in one shot when it should decompose it into steps. ReasonFlux addresses this by providing a curated library of 500 expert-designed thought templates, each encoding a reusable thinking strategy.

Through hierarchical reinforcement learning, ReasonFlux trains the base model not just to answer questions, but to recognize problem types, retrieve appropriate templates, and combine them adaptively. The results are striking: a 32B parameter model using ReasonFlux outperforms GPT-4 and OpenAI’s o1-mini on several key mathematical reasoning benchmarks.


How Does ReasonFlux’s Hierarchical RL Training Work?

The training process couples two learning objectives: template selection (which reasoning strategy to use) and template execution (how to apply that strategy to the problem at hand).

The hierarchical RL approach trains the model to make decisions at multiple levels of abstraction. At Level 1, the model selects an overall strategy (proof by contradiction, divide-and-conquer, case analysis). At Level 2, it applies tactical sub-steps appropriate to that strategy. At Level 3, it validates intermediate results.

This hierarchy is critical because it mirrors how human experts reason: we don’t generate every step from scratch – we recognize problem patterns and apply known solution templates.
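The three-level loop described above can be sketched in a few lines of Python. Everything here is illustrative: the keyword-based `select_template` is a crude stand-in for the learned retrieval policy, and the step "execution" is a placeholder for the model's actual generation and validation.

```python
# Illustrative sketch of hierarchical template-guided reasoning.
# The template contents and selection logic are stand-ins, not ReasonFlux's code.

TEMPLATES = {
    "divide_and_conquer": ["split problem", "solve subproblems", "merge results"],
    "proof_by_contradiction": ["assume negation", "derive contradiction", "conclude"],
}

def select_template(problem: str) -> str:
    """Level 1: pick an overall strategy (crude keyword retrieval as a stand-in)."""
    return "proof_by_contradiction" if "prove" in problem else "divide_and_conquer"

def solve(problem: str) -> list[str]:
    """Run the selected template's sub-steps and record a reasoning trace."""
    name = select_template(problem)
    trace = [f"strategy: {name}"]
    for step in TEMPLATES[name]:      # Level 2: tactical sub-steps of the strategy
        result = f"{step} -> ok"      # placeholder for model execution
        trace.append(result)          # Level 3 would validate each result here
    return trace

print(solve("prove sqrt(2) is irrational")[0])  # strategy: proof_by_contradiction
```

The trace returned by `solve` is also what makes the reasoning auditable: it records which strategy was chosen and how each sub-step resolved.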


What Does the 500-Template Thought Library Contain?

The thought template library is the intellectual core of ReasonFlux. Each template is an expert-designed reasoning pattern that the model can retrieve, adapt, and combine.

| Category | Templates | Example Templates | Sample Problem Types |
| --- | --- | --- | --- |
| Mathematical | 180 | Proof by contradiction, induction, invariant analysis | Olympiad math, number theory |
| Logical | 100 | Deductive chain, case analysis, reductio ad absurdum | Formal logic, puzzles |
| Coding | 80 | Divide and conquer, dynamic programming, greedy proof | Algorithm design |
| Scientific | 70 | Hypothesis testing, controlled experiment, causal inference | Physics, biology |
| Commonsense | 70 | Analogical reasoning, counterfactual, stepwise verification | Everyday reasoning |

Each template contains: a natural language description of the strategy, a formal representation suitable for model fine-tuning, and examples of correct application across multiple domains.
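The three components of a template listed above can be captured in a simple data class. The field names and example values below are illustrative, not the library's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtTemplate:
    """One reusable reasoning pattern (hypothetical schema for illustration)."""
    name: str                # e.g. "proof_by_contradiction"
    category: str            # Mathematical, Logical, Coding, Scientific, Commonsense
    description: str         # natural-language description of the strategy
    formal_steps: list[str]  # formal representation suitable for fine-tuning
    examples: list[str] = field(default_factory=list)  # worked applications

contradiction = ThoughtTemplate(
    name="proof_by_contradiction",
    category="Mathematical",
    description="Assume the negation of the claim and derive a contradiction.",
    formal_steps=[
        "Assume NOT(claim)",
        "Derive consequences until a contradiction appears",
        "Conclude that the claim holds",
    ],
)
```

Keeping the natural-language description and the formal steps side by side is what lets the same template serve both retrieval (matching problems to strategies) and fine-tuning (supervising execution).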


How Does ReasonFlux Perform Against Larger Models?

The benchmark results are the strongest evidence for ReasonFlux’s effectiveness. A 32B model using the template library and hierarchical RL training outperforms models many times its size.

| Benchmark | GPT-4 | o1-mini | ReasonFlux (32B) | ReasonFlux (72B) |
| --- | --- | --- | --- | --- |
| MATH-500 | 85.2% | 91.8% | 96.0% | 97.1% |
| AIME 2024 | 63.4% | 78.5% | 82.3% | 86.8% |
| GSM8K | 92.0% | 94.6% | 96.2% | 97.5% |
| MMLU-STEM | 83.6% | 87.2% | 89.1% | 91.3% |
| HumanEval | 87.2% | 90.4% | 91.8% | 93.5% |

The 32B model consistently outperforms o1-mini across all benchmarks, and the 72B variant pushes even further ahead. This is particularly noteworthy because ReasonFlux models are open-weight and can be self-hosted, while GPT-4 and o1-mini are proprietary, API-only services.

Inference Cost Comparison

Beyond raw accuracy, the cost advantage is dramatic. Self-hosting a 32B ReasonFlux model costs roughly 1/30th the per-token price of GPT-4, with comparable or superior reasoning quality.
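To make the 1/30 figure concrete, here is a back-of-the-envelope calculation. The per-token price and monthly volume are assumed example numbers, not quoted rates; only the 1/30 ratio comes from the text.

```python
# Rough cost comparison under the ~1/30 ratio cited above.
# Prices and workload are hypothetical placeholders, not actual rates.
gpt4_price_per_mtok = 30.00                         # assumed USD per million tokens
selfhost_price_per_mtok = gpt4_price_per_mtok / 30  # the ~1/30 figure from the text

monthly_tokens = 500_000_000                        # example workload: 500M tokens/month
gpt4_cost = monthly_tokens / 1e6 * gpt4_price_per_mtok
selfhost_cost = monthly_tokens / 1e6 * selfhost_price_per_mtok
print(f"GPT-4: ${gpt4_cost:,.0f}  self-hosted 32B: ${selfhost_cost:,.0f}")
# GPT-4: $15,000  self-hosted 32B: $500
```

At any realistic volume the ratio, not the absolute prices, dominates the conclusion: a workload that costs thousands per month through a proprietary API costs a small fraction of that self-hosted.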


What Are the Practical Implications of Template-Augmented Reasoning?

ReasonFlux’s approach has implications beyond benchmark performance.

Democratizing advanced reasoning: By enabling smaller, open-weight models to compete with proprietary giants, ReasonFlux makes sophisticated AI reasoning accessible to teams and organizations that cannot afford API-based models at scale.

Domain-specific customization: The template library can be extended with domain-specific reasoning patterns. A legal reasoning model could add templates for statutory interpretation and precedent analysis. A medical model could add diagnostic reasoning patterns.

Interpretable reasoning chains: Because templates encode explicit strategies, the model’s reasoning process is more interpretable than black-box approaches. Users can see which template was selected and how it was applied, making it easier to audit and debug reasoning failures.


FAQ

What is ReasonFlux? ReasonFlux is a hierarchical LLM reasoning framework developed by Gen-Verse that uses 500 curated thought templates to guide model reasoning. It was accepted at NeurIPS 2025 and demonstrates that a 32B parameter model using template-augmented reasoning can outperform much larger models like GPT-4 and o1-mini on complex reasoning benchmarks.

What is the thought template library in ReasonFlux? The thought template library is a curated collection of 500 expert-designed reasoning patterns covering mathematics, code generation, logic, science, and commonsense reasoning. Each template encodes a reusable thinking strategy – like ‘proof by contradiction’ or ‘divide-and-conquer’ – that can be retrieved and adapted for new problems rather than generated from scratch.

How does ReasonFlux’s performance compare to o1-mini? ReasonFlux with a 32B base model outperforms both GPT-4 and o1-mini on several key benchmarks including MATH-500 (96.0%), AIME 2024 (82.3%), and Olympiad-level math tasks. This is significant because it achieves superior reasoning with a smaller model, demonstrating that structured template guidance can dramatically improve reasoning efficiency.

What model sizes does ReasonFlux support? ReasonFlux has been validated on models from 7B to 72B parameters. The 32B variant delivers the best performance-to-efficiency trade-off. Smaller models (7B-14B) benefit significantly from templates but show some degradation on the hardest problems. The framework is model-agnostic and compatible with any open-weight LLM including Llama, Qwen, DeepSeek, and Mistral.

What are the key innovations of ReasonFlux? ReasonFlux introduces three key innovations: (1) a hierarchical reinforcement learning training method that teaches models to combine templates adaptively, (2) a reusable thought template library of 500 curated strategies, and (3) a template retrieval mechanism that selects the right reasoning pattern for each problem. Together, these innovations enable smaller models to punch far above their weight class.

