R-Zero: Self-Evolving Reasoning LLM from Zero Data

¹Tencent AI Seattle Lab, ²Washington University in St. Louis,
³University of Maryland, College Park, ⁴The University of Texas at Dallas

R-Zero teaches Large Language Models to reason and evolve on their own, starting from nothing but a base model. No data required.

Abstract

Training powerful reasoning models traditionally requires massive, human-curated datasets, which are expensive and hard to scale. R-Zero is a novel framework that enables LLMs to improve their reasoning abilities autonomously, without needing any pre-existing tasks or labels. It's a truly self-evolving system that learns from scratch.

At its core, R-Zero sets up a dynamic co-evolutionary loop between two instances of the same base model:

  • The Challenger 🎯: Its job is to probe the Solver for weaknesses and generate challenging problems right at the edge of the Solver's capabilities.
  • The Solver 🧠: Its goal is to continuously improve by solving the increasingly difficult tasks posed by the Challenger.

This process creates an adaptive curriculum tailored to the Solver's current ability. The Challenger learns to ask better questions, and the Solver learns to find better answers. The entire cycle is self-contained: pseudo-labels come from majority voting over the Solver's own answers, and relative policy optimization guides the learning.
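
To illustrate the pseudo-labeling step, here is a minimal Python sketch of majority voting over the Solver's own answers. The callable `solver_sample` and the idea of normalizing answers to strings are hypothetical placeholders, not the released implementation.

```python
from collections import Counter


def majority_vote_label(question, solver_sample, n_samples=10):
    """Pseudo-label one Challenger question by majority vote.

    `solver_sample(question)` is a hypothetical callable that returns one
    candidate final answer (e.g., a normalized numeric string) per call.
    Returns the most common answer and the fraction of samples that agree
    with it, which doubles as a rough difficulty estimate.
    """
    answers = [solver_sample(question) for _ in range(n_samples)]
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / n_samples
```

The agreement rate is useful beyond labeling: questions the Solver answers almost unanimously are too easy, while questions with no stable majority are likely too hard or ill-posed, so this self-consistency signal helps keep the curriculum near the edge of the Solver's ability.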

Method

At its core, the method orchestrates a co-evolution between a "Challenger" and a "Solver" model.

  • The Challenger: This model is rewarded for generating problems that are precisely at the edge of the Solver's current abilities.
  • The Solver: This model attempts to solve the increasingly difficult tasks posed by the Challenger.
  • Data Generation: For each problem the Challenger produces, a reliable pseudo-label is obtained by majority vote over the Solver's own attempts. The resulting pseudo-labeled problems form a new training dataset used to fine-tune the Solver.
Please refer to the paper for more details.
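
To make the loop concrete, below is a minimal, hypothetical sketch of a single Challenger–Solver iteration. All object methods (`generate_questions`, `solve`, `store_reward`, `rl_update`, `train_on`), the uncertainty-style Challenger reward, and the agreement thresholds are illustrative assumptions chosen to match the description above, not the authors' code.

```python
from collections import Counter


def r_zero_iteration(challenger, solver, n_questions=1000, n_samples=10):
    """One co-evolution round: the Challenger proposes, the Solver answers, both update."""
    # 1) The Challenger proposes a batch of candidate problems.
    questions = challenger.generate_questions(n_questions)

    curriculum = []
    for q in questions:
        # 2) The Solver answers each problem several times; the majority answer
        #    becomes the pseudo-label, and the agreement rate estimates difficulty.
        answers = [solver.solve(q) for _ in range(n_samples)]
        label, votes = Counter(answers).most_common(1)[0]
        agreement = votes / n_samples

        # 3) Illustrative uncertainty-style reward for the Challenger: it peaks
        #    when the Solver succeeds about half the time, i.e., when the problem
        #    sits at the edge of the Solver's ability.
        challenger.store_reward(q, 1.0 - 2.0 * abs(agreement - 0.5))

        # 4) Keep only informative problems (thresholds here are arbitrary):
        #    near-unanimous ones are too easy, near-random ones too noisy.
        if 0.3 <= agreement <= 0.8:
            curriculum.append((q, label))

    # 5) Update both models: the Challenger with a relative-policy-optimization
    #    style RL step on its stored rewards, the Solver by fine-tuning on the
    #    pseudo-labeled curriculum.
    challenger.rl_update()
    solver.train_on(curriculum)
    return curriculum
```

Repeating this round is what produces the Iter 1–3 checkpoints reported in the results below.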

Results

Results on general-domain reasoning benchmarks. The table compares the Base Model, a Base Challenger baseline, and our iterative method (R-Zero). The peak performance achieved during each model's training process is highlighted in bold.

| Model Name | Overall AVG | MATH AVG | SuperGPQA | MMLU-Pro | BBEH |
|---|---|---|---|---|---|
| **Qwen3-4B-Base** | | | | | |
| Base Model | 27.10 | 42.58 | 20.88 | 37.38 | 7.57 |
| Base Challenger | 30.83 | 44.36 | 24.77 | 47.59 | 6.59 |
| R-Zero (Iter 1) | 34.27 | 48.06 | **27.92** | 51.69 | 9.42 |
| R-Zero (Iter 2) | **34.92** | 48.44 | 27.72 | **53.75** | 9.76 |
| R-Zero (Iter 3) | 34.64 | **49.07** | 27.55 | 51.53 | **10.42** |
| **Qwen3-8B-Base** | | | | | |
| Base Model | 34.49 | 49.18 | 28.33 | 51.80 | 8.63 |
| Base Challenger | 36.43 | 51.87 | 30.12 | 54.14 | 9.60 |
| R-Zero (Iter 1) | 37.93 | 53.39 | 31.26 | 57.17 | 9.91 |
| R-Zero (Iter 2) | 38.45 | 53.84 | **31.58** | 58.20 | 10.20 |
| R-Zero (Iter 3) | **38.73** | **54.69** | 31.38 | **58.23** | **10.60** |

1. Substantial Improvements in Mathematical Reasoning

  • Consistent & Significant Gains: The iterative training process consistently and substantially improves upon the base models. For instance, on Qwen3-8B-Base, three iterations of R-Zero raised the average math score (MATH AVG) from a baseline of 49.18 to 54.69, a gain of +5.51 points.
  • Progressive Improvement: Performance improves steadily from one iteration to the next. This growth underscores the benefit of the co-evolutionary dynamic, in which the Solver learns from an increasingly challenging curriculum.
  • Critical Role of the RL-Trained Challenger: The necessity of the RL-based Challenger is shown by the immediate performance leap from the Base Challenger baseline to the first R-Zero iteration, confirming that the curriculum generated by the RL-trained Challenger is significantly more effective than problems drawn from an untrained Challenger.

2. Successful Generalization to General Reasoning

  • Effective Skill Transfer: Reasoning abilities developed through math-focused training successfully transfer to general-domain tasks. This generalization effect was observed across all tested models.
  • Quantifiable Generalization: After three iterations, the average general-domain score of Qwen3-8B-Base improved by +3.81 points, and OctoThinker-3B improved by +3.65 points.
  • Enhancement of Core Capabilities: These results confirm that the R-Zero method does not merely teach domain-specific knowledge but enhances the model's underlying capabilities in a way that successfully generalizes across different domains.

BibTeX


@misc{huang2025rzeroselfevolvingreasoningllm,
  title={R-Zero: Self-Evolving Reasoning LLM from Zero Data},
  author={Chengsong Huang and Wenhao Yu and Xiaoyang Wang and Hongming Zhang and Zongxia Li and Ruosen Li and Jiaxin Huang and Haitao Mi and Dong Yu},
  year={2025},
  eprint={2508.05004},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.05004},
}