OpenManus-RL is an open-source research project at the intersection of reinforcement learning and LLM agent systems, developed collaboratively by Ulab-UIUC (University of Illinois Urbana-Champaign) and MetaGPT. The project provides a comprehensive framework for reinforcement learning tuning of LLM-based agents, with implementations of GRPO (Group Relative Policy Optimization), supervised fine-tuning (SFT), and advanced rollout strategies designed specifically for agentic tasks.
As LLM agents become increasingly capable of complex multi-step reasoning and tool use, the need for targeted reinforcement learning optimization has grown dramatically. OpenManus-RL addresses this by providing a modular, reproducible pipeline for training agents on agent-specific tasks, with built-in support for diverse environments including software engineering (SWE-Bench), web navigation (WebArena), and general tool use.
What is OpenManus-RL and why is it important?
OpenManus-RL is a training framework that applies reinforcement learning algorithms to optimize LLM agents for specific behavioral objectives. Rather than relying solely on supervised fine-tuning from static datasets, OpenManus-RL uses reward signals from environments to iteratively improve agent performance. This approach has proven critical for achieving state-of-the-art results on complex agent benchmarks where simple imitation learning falls short.
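To make that contrast concrete, here is a minimal, framework-agnostic sketch (plain NumPy, not OpenManus-RL code): supervised fine-tuning maximizes the likelihood of fixed demonstration tokens, while a simple policy-gradient update weights the likelihood of the agent's own trajectory by how much its environment reward exceeds a baseline.

```python
# Conceptual sketch (plain NumPy, not OpenManus-RL code) contrasting
# supervised fine-tuning on fixed demonstrations with a reward-driven
# policy-gradient update on the agent's own rollouts.
import numpy as np

def sft_loss(demo_token_logprobs: np.ndarray) -> float:
    """Behavior cloning: maximize the likelihood of demonstration tokens."""
    return float(-demo_token_logprobs.mean())

def reinforce_loss(sampled_token_logprobs: np.ndarray, reward: float, baseline: float) -> float:
    """REINFORCE-style update: weight the trajectory's log-likelihood by how
    much its environment reward exceeds a baseline."""
    advantage = reward - baseline
    return float(-advantage * sampled_token_logprobs.sum())

# Toy numbers: a demonstration the model imitates, and a sampled trajectory
# whose environment reward beats the running baseline.
demo_logprobs = np.log(np.array([0.9, 0.8, 0.95]))
sampled_logprobs = np.log(np.array([0.6, 0.7, 0.5]))
print(sft_loss(demo_logprobs))
print(reinforce_loss(sampled_logprobs, reward=1.0, baseline=0.4))
```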
Training Methods Supported
| Method | Description | Use Case |
|---|---|---|
| GRPO | Group Relative Policy Optimization | Multi-trajectory reward comparison |
| SFT | Supervised Fine-Tuning | Initial behavior cloning from demonstrations |
| PPO | Proximal Policy Optimization | Single-trajectory reward optimization |
| Rejection Sampling | Filter best trajectories for training | Quality filtering |
| Iterative GRPO | Multi-round GRPO with evolving policy | Continuous improvement |
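Of the methods in the table above, rejection sampling is the easiest to sketch. The snippet below is illustrative only (the data structures and threshold are assumptions, not the project's actual schema): it keeps the highest-reward trajectory per task and discards tasks whose best attempt still scores poorly, yielding a filtered SFT dataset.

```python
# Illustrative rejection-sampling filter (data structures and threshold are
# assumptions for this example, not the project's actual schema).
from dataclasses import dataclass

@dataclass
class Trajectory:
    task_id: str
    messages: list   # agent/environment turns
    reward: float    # score from the environment or reward model

def rejection_sample(trajectories: list, min_reward: float = 0.7) -> list:
    """Keep only the best trajectory per task, and only if it clears a threshold."""
    best_per_task = {}
    for traj in trajectories:
        current = best_per_task.get(traj.task_id)
        if current is None or traj.reward > current.reward:
            best_per_task[traj.task_id] = traj
    # Drop tasks where even the best attempt failed or scored poorly.
    return [t for t in best_per_task.values() if t.reward >= min_reward]
```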
How does GRPO work for agent training?
GRPO (Group Relative Policy Optimization) is the core training algorithm in OpenManus-RL. Unlike standard RL methods that require a value function to estimate advantage, GRPO samples multiple trajectories from the policy, evaluates them using the environment’s reward function, and computes advantages relative to the group. This group-relative approach is particularly well-suited for agent tasks where reward signals are sparse but comparative trajectories provide rich learning signals.
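A minimal sketch of the group-relative advantage at the heart of this idea (a simplification of the published GRPO formulation, not OpenManus-RL's implementation): each trajectory's reward is normalized against the mean and standard deviation of its sampling group.

```python
# Simplified group-relative advantage (a sketch of the published GRPO
# formulation, not OpenManus-RL's implementation).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each trajectory's reward against its sampling group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: four trajectories sampled for the same task.
rewards = np.array([0.0, 1.0, 0.0, 0.5])
print(group_relative_advantages(rewards))
# Above-mean trajectories get positive advantages and are reinforced;
# below-mean trajectories are pushed down, with no learned value function.
```

The diagram below summarizes the full rollout-and-update loop, including the path from the best-ranked trajectories into an SFT dataset.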
```mermaid
flowchart TD
A[Base Policy Model] --> B[Sample N Trajectories]
B --> C[Trajectory 1]
B --> D[Trajectory 2]
B --> E[Trajectory N...]
C --> F[Environment Reward]
D --> F
E --> F
F --> G[Compute Group Advantage]
G --> H[Rank Trajectories]
H --> I[Update Policy via GRPO]
I --> B
H --> J[Best Trajectories]
J --> K[SFT Dataset]
K --> L[Supervised Fine-Tune]
L --> A
```

Benchmark Results
OpenManus-RL has demonstrated significant improvements over base models across multiple agent benchmarks.
| Benchmark | Base Model | Base + SFT | Base + SFT + GRPO | Improvement over Base |
|---|---|---|---|---|
| SWE-Bench Lite | 18.5% | 30.2% | 38.7% | +20.2% |
| WebArena | 14.2% | 22.8% | 29.5% | +15.3% |
| AgentBench | 35.1% | 48.3% | 56.2% | +21.1% |
| ToolBench | 52.4% | 63.1% | 71.8% | +19.4% |
What datasets are used for training?
OpenManus-RL provides curated training datasets derived from agent trajectories. The training data pipeline includes trajectory collection from multiple agent environments, reward annotation using both automated metrics and LLM-as-judge evaluations, quality filtering to remove low-quality or failed trajectories, and data augmentation through trajectory perturbation. The project also supports integration with user-provided task datasets for domain-specific tuning.
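As a rough illustration of the annotation-and-filtering stage (the field names, judge interface, and 50/50 weighting are assumptions made for this example, not the project's actual pipeline):

```python
# Rough illustration of reward annotation plus quality filtering. Field names,
# the judge interface, and the 50/50 weighting are assumptions for this
# example, not the project's actual pipeline.
def annotate_and_filter(trajectories, judge, min_score=0.6):
    """Blend an automated success signal with an LLM-as-judge rating, then
    drop low-quality or failed trajectories."""
    kept = []
    for traj in trajectories:
        auto_score = 1.0 if traj["task_completed"] else 0.0
        judge_score = judge.score(traj["messages"])  # assumed to return a 0..1 rating
        traj["reward"] = 0.5 * auto_score + 0.5 * judge_score
        if traj["reward"] >= min_score:
            kept.append(traj)
    return kept
```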
Architecture Overview
The system architecture consists of a training loop that connects an LLM policy with agent environments. The rollout engine manages parallel environment instances for efficient trajectory collection, while the reward model provides feedback signals. The RL trainer implements GRPO and PPO algorithms with support for distributed training across multiple GPUs.
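A highly simplified sketch of how these components might be wired together; the object names and method signatures (rollout_engine.collect, reward_model.score, trainer.step) are illustrative stand-ins rather than OpenManus-RL's actual API.

```python
# Highly simplified wiring of the components described above. The object
# names and method signatures (rollout_engine.collect, reward_model.score,
# trainer.step) are illustrative stand-ins, not OpenManus-RL's actual API.
def train(policy, rollout_engine, reward_model, trainer,
          num_steps: int, group_size: int):
    for step in range(num_steps):
        # Collect a group of trajectories from parallel environment instances.
        trajectories = rollout_engine.collect(policy, num_trajectories=group_size)
        # Score each completed trajectory (trajectory-level, not token-level).
        rewards = [reward_model.score(t) for t in trajectories]
        # Compute group-relative advantages and update the policy weights.
        loss = trainer.step(policy, trajectories, rewards)
        print(f"step={step} loss={loss:.4f}")
```

The sequence diagram below traces these interactions through a single training step.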
```mermaid
sequenceDiagram
participant Policy as LLM Policy
participant Rollout as Rollout Engine
participant Env as Agent Environment
participant Reward as Reward Model
participant Trainer as RL Trainer
loop Training Step
Policy->>Rollout: Generate action distributions
Rollout->>Env: Launch N parallel instances
Env-->>Policy: State observations
Policy->>Env: Actions (code, browse, etc.)
Env-->>Rollout: Task completion signals
Rollout->>Reward: Submit trajectories
Reward-->>Rollout: Reward scores
Rollout-->>Trainer: Batched trajectories + rewards
Trainer->>Trainer: Compute GRPO loss
Trainer->>Policy: Update weights
end
```

How does OpenManus-RL compare to other RL frameworks?
OpenManus-RL distinguishes itself from general RLHF pipelines, which focus on preference tuning over single responses, and from prompt-optimization methods such as EvoPrompt by targeting the unique requirements of LLM agent training. Key differentiators include native support for trajectory-level rewards (rather than token-level rewards), out-of-the-box integration with popular agent environments, and group-relative advantage computation that handles the sparse reward structure common to agent tasks.
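The trajectory-level reward point can be made concrete with a small sketch (illustrative, not the project's exact mechanism): a single environment reward, converted into an advantage, is shared by every token the agent generated, whereas token-level schemes assign each token its own advantage.

```python
# Illustrative contrast between trajectory-level and token-level credit
# assignment (a sketch, not the project's exact mechanism).
import numpy as np

def trajectory_level_loss(token_logprobs: np.ndarray, trajectory_advantage: float) -> float:
    """One environment-derived advantage is shared by every generated token."""
    return float(-trajectory_advantage * token_logprobs.sum())

def token_level_loss(token_logprobs: np.ndarray, token_advantages: np.ndarray) -> float:
    """RLHF-style alternative: each token carries its own advantage."""
    return float(-(token_advantages * token_logprobs).sum())
```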
What is the collaboration behind this project?
OpenManus-RL is a joint effort between Ulab-UIUC, led by Prof. Jiaxuan You at UIUC, and the MetaGPT team. This academic-industry collaboration brings together UIUC's expertise in reinforcement learning and language-agent research with MetaGPT's practical experience building production-grade agent systems. The project has received contributions from researchers across multiple institutions and continues to evolve alongside the rapidly advancing field of agent RL.
Frequently Asked Questions
What is OpenManus-RL? It is an open-source framework for reinforcement learning tuning of LLM agents, using GRPO, SFT, and other methods to optimize agent performance on tasks like software engineering and web navigation.
What training methods does it support? GRPO (Group Relative Policy Optimization), SFT, PPO, rejection sampling, and iterative GRPO for continuous improvement.
What benchmarks has it been tested on? SWE-Bench Lite, WebArena, AgentBench, and ToolBench, with absolute improvements of roughly 15-21 percentage points over the base models.
What dataset is used? Curated trajectories from agent environments with automated and LLM-as-judge reward annotation, plus support for user-provided task datasets.
Who is behind OpenManus-RL? A collaboration between Ulab-UIUC (University of Illinois Urbana-Champaign) and MetaGPT.