R-Zero: Self-Evolving Reasoning LLM from Zero Data

¹Tencent AI Seattle Lab, ²Washington University in St. Louis,
³University of Maryland, College Park, ⁴The University of Texas at Dallas

R-Zero teaches Large Language Models to reason and evolve on their own, starting from nothing but a base model. No data required.

Abstract

Training powerful reasoning models traditionally requires massive, human-curated datasets, which are expensive and hard to scale. R-Zero is a novel framework that enables LLMs to improve their reasoning abilities autonomously, without needing any pre-existing tasks or labels. It's a truly self-evolving system that learns from scratch.

At its core, R-Zero sets up a dynamic co-evolutionary loop between two instances of the same base model:

  • The Challenger 🎯: Its job is to probe the Solver for weaknesses and generate challenging problems right at the edge of the Solver's capabilities.
  • The Solver 🧠: Its goal is to continuously improve by solving the increasingly difficult tasks posed by the Challenger.

This process creates an adaptive curriculum tailored to the Solver's current ability. The Challenger learns to ask better questions, and the Solver learns to find better answers. The entire cycle is self-contained: pseudo-labels come from majority voting over the Solver's own answers, and relative policy optimization guides the learning.
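
To illustrate the pseudo-labeling step, here is a minimal Python sketch of majority voting over the Solver's own answers. The callable `solver_sample` and the idea of normalizing answers to strings are hypothetical placeholders, not the released implementation.

```python
from collections import Counter


def majority_vote_label(question, solver_sample, n_samples=10):
    """Pseudo-label one Challenger question by majority vote.

    `solver_sample(question)` is a hypothetical callable that returns one
    candidate final answer (e.g., a normalized numeric string) per call.
    Returns the most common answer and the fraction of samples that agree
    with it, which doubles as a rough difficulty estimate.
    """
    answers = [solver_sample(question) for _ in range(n_samples)]
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / n_samples
```

The agreement rate is useful beyond labeling: questions the Solver answers almost unanimously are too easy, while questions with no stable majority are likely too hard or ill-posed, so this self-consistency signal helps keep the curriculum near the edge of the Solver's ability.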

Method

At its core, the method orchestrates a co-evolution between a "Challenger" and a "Solver" model.

  • The Challenger: This model is rewarded for generating problems that are precisely at the edge of the Solver's current abilities.
  • The Solver: This model attempts to solve the increasingly difficult tasks posed by the Challenger.
  • Data Generation: For each problem the Challenger produces, a reliable pseudo-label is obtained by majority vote over the Solver's own attempts. The resulting pseudo-labeled problems form a new training dataset used to fine-tune the Solver.
Please refer to the paper for more details.
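
To make the loop concrete, below is a minimal, hypothetical sketch of a single Challenger–Solver iteration. All object methods (`generate_questions`, `solve`, `store_reward`, `rl_update`, `train_on`), the uncertainty-style Challenger reward, and the agreement thresholds are illustrative assumptions chosen to match the description above, not the authors' code.

```python
from collections import Counter


def r_zero_iteration(challenger, solver, n_questions=1000, n_samples=10):
    """One co-evolution round: the Challenger proposes, the Solver answers, both update."""
    # 1) The Challenger proposes a batch of candidate problems.
    questions = challenger.generate_questions(n_questions)

    curriculum = []
    for q in questions:
        # 2) The Solver answers each problem several times; the majority answer
        #    becomes the pseudo-label, and the agreement rate estimates difficulty.
        answers = [solver.solve(q) for _ in range(n_samples)]
        label, votes = Counter(answers).most_common(1)[0]
        agreement = votes / n_samples

        # 3) Illustrative uncertainty-style reward for the Challenger: it peaks
        #    when the Solver succeeds about half the time, i.e., when the problem
        #    sits at the edge of the Solver's ability.
        challenger.store_reward(q, 1.0 - 2.0 * abs(agreement - 0.5))

        # 4) Keep only informative problems (thresholds here are arbitrary):
        #    near-unanimous ones are too easy, near-random ones too noisy.
        if 0.3 <= agreement <= 0.8:
            curriculum.append((q, label))

    # 5) Update both models: the Challenger with a relative-policy-optimization
    #    style RL step on its stored rewards, the Solver by fine-tuning on the
    #    pseudo-labeled curriculum.
    challenger.rl_update()
    solver.train_on(curriculum)
    return curriculum
```

Repeating this round is what produces the Iter 1–3 checkpoints reported in the results below.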

Results

Results on general-domain reasoning benchmarks. The table compares the Base Model, a Base Challenger baseline, and our iterative method (R-Zero). The peak performance achieved during each model's training process is highlighted in bold.

| Model Name | Overall AVG | MATH AVG | SuperGPQA | MMLU-Pro | BBEH |
|---|---|---|---|---|---|
| **Qwen3-4B-Base** | | | | | |
| Base Model | 27.10 | 42.58 | 20.88 | 37.38 | 7.57 |
| Base Challenger | 30.83 | 44.36 | 24.77 | 47.59 | 6.59 |
| R-Zero (Iter 1) | 34.27 | 48.06 | **27.92** | 51.69 | 9.42 |
| R-Zero (Iter 2) | **34.92** | 48.44 | 27.72 | **53.75** | 9.76 |
| R-Zero (Iter 3) | 34.64 | **49.07** | 27.55 | 51.53 | **10.42** |
| **Qwen3-8B-Base** | | | | | |
| Base Model | 34.49 | 49.18 | 28.33 | 51.80 | 8.63 |
| Base Challenger | 36.43 | 51.87 | 30.12 | 54.14 | 9.60 |
| R-Zero (Iter 1) | 37.93 | 53.39 | 31.26 | 57.17 | 9.91 |
| R-Zero (Iter 2) | 38.45 | 53.84 | **31.58** | 58.20 | 10.20 |
| R-Zero (Iter 3) | **38.73** | **54.69** | 31.38 | **58.23** | **10.60** |

1. Substantial Improvements in Mathematical Reasoning

  • Consistent & Significant Gains: The iterative training process consistently and substantially improves upon the base models. For instance, on Qwen3-8B-Base, three iterations of R-Zero raised the average math score (MATH AVG) from a baseline of 49.18 to 54.69, a gain of +5.51 points.
  • Progressive Improvement: Performance improves steadily from one iteration to the next. This growth underscores the benefit of the co-evolutionary dynamic, in which the Solver learns from an increasingly challenging curriculum.
  • Critical Role of the RL-Trained Challenger: The necessity of the RL-based Challenger is shown by the immediate performance leap from the Base Challenger baseline to the first R-Zero iteration, confirming that the curriculum generated by the RL-trained Challenger is significantly more effective than problems drawn from an untrained Challenger.

2. Successful Generalization to General Reasoning

  • Effective Skill Transfer: Reasoning abilities developed through math-focused training successfully transfer to general-domain tasks. This generalization effect was observed across all tested models.
  • Quantifiable Generalization: After three iterations, the average general-domain score of Qwen3-8B-Base improved by +3.81 points, and OctoThinker-3B improved by +3.65 points.
  • Enhancement of Core Capabilities: These results confirm that the R-Zero method does not merely teach domain-specific knowledge but enhances the model's underlying capabilities in a way that successfully generalizes across different domains.

BibTeX


@misc{huang2025rzeroselfevolvingreasoningllm,
  title={R-Zero: Self-Evolving Reasoning LLM from Zero Data},
  author={Chengsong Huang and Wenhao Yu and Xiaoyang Wang and Hongming Zhang and Zongxia Li and Ruosen Li and Jiaxin Huang and Haitao Mi and Dong Yu},
  year={2025},
  eprint={2508.05004},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.05004},
}