Training powerful reasoning models traditionally requires massive, human-curated datasets, which are expensive and hard to scale. R-Zero is a framework that enables LLMs to improve their reasoning abilities autonomously, without needing any pre-existing tasks or labels. It is a truly self-evolving system that starts from zero external data.
At its core, R-Zero sets up a dynamic co-evolutionary loop between two instances of the same base model:

- **Challenger**: probes the Solver for weaknesses and generates challenging problems at the edge of the Solver's current abilities.
- **Solver**: continuously improves by solving the increasingly difficult problems posed by the Challenger.
This process creates a perfectly tailored, adaptive curriculum. The Challenger learns to ask better questions, and the Solver learns to find better answers. The entire cycle is self-contained, using techniques like majority voting for pseudo-labels and relative policy optimization to guide the learning.
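To make the majority-voting step concrete, here is a minimal Python sketch (not the repository's actual code) of how a pseudo-label and a rough consistency score might be derived from several Solver answers sampled for one Challenger question; the sampled answer strings below are assumed for illustration.

```python
from collections import Counter

def majority_vote_pseudo_label(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and its empirical frequency.

    The winning answer acts as the pseudo-label for training the Solver;
    its frequency is a rough self-consistency signal, since no ground-truth
    labels exist anywhere in the loop.
    """
    counts = Counter(a.strip() for a in answers if a.strip())
    if not counts:
        return "", 0.0
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

# Hypothetical usage: k = 5 answers sampled from the Solver for one question.
answers = ["42", "42", "41", "42", "43"]
label, consistency = majority_vote_pseudo_label(answers)
# label == "42", consistency == 0.6
```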
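The "relative policy optimization" mentioned above is commonly realized as a group-relative scheme (GRPO-style), where several completions are sampled per question and each completion's reward is normalized against its group. The snippet below is an illustrative sketch of that normalization under this assumption, not the project's exact training code.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each completion's reward against its group's mean and std.

    With group-relative advantages, a completion is reinforced only insofar
    as it scores better than the other samples for the same question, so no
    separate value/critic model is required.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards: 1.0 if a sampled answer matches the majority-vote
# pseudo-label, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
```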
| Model Name | Overall AVG | MATH AVG | SuperGPQA | MMLU-Pro | BBEH |
|---|---|---|---|---|---|
| **Qwen3-4B-Base** | | | | | |
| Base Model | 27.10 | 42.58 | 20.88 | 37.38 | 7.57 |
| Base Challenger | 30.83 | 44.36 | 24.77 | 47.59 | 6.59 |
| R-Zero (Iter 1) | 34.27 | 48.06 | 27.92 | 51.69 | 9.42 |
| R-Zero (Iter 2) | 34.92 | 48.44 | 27.72 | 53.75 | 9.76 |
| R-Zero (Iter 3) | 34.64 | 49.07 | 27.55 | 51.53 | 10.42 |
| **Qwen3-8B-Base** | | | | | |
| Base Model | 34.49 | 49.18 | 28.33 | 51.80 | 8.63 |
| Base Challenger | 36.43 | 51.87 | 30.12 | 54.14 | 9.60 |
| R-Zero (Iter 1) | 37.93 | 53.39 | 31.26 | 57.17 | 9.91 |
| R-Zero (Iter 2) | 38.45 | 53.84 | 31.58 | 58.20 | 10.20 |
| R-Zero (Iter 3) | 38.73 | 54.69 | 31.38 | 58.23 | 10.60 |
@misc{huang2025rzeroselfevolvingreasoningllm,
title={R-Zero: Self-Evolving Reasoning LLM from Zero Data},
author={Chengsong Huang and Wenhao Yu and Xiaoyang Wang and Hongming Zhang and Zongxia Li and Ruosen Li and Jiaxin Huang and Haitao Mi and Dong Yu},
year={2025},
eprint={2508.05004},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.05004},
}