ChaoBro

SU-01: A 30B Model Achieving Gold-Medal Performance on the IMO and IPhO—What’s the Secret Recipe?

What Does “Gold-Medal-Level AI for Olympiads” Actually Mean?

The International Mathematical Olympiad (IMO) and the International Physics Olympiad (IPhO) are the most prestigious pre-university competitions in their fields. Gold medalists are typically among the strongest students worldwide in their age group.

When an AI model claims to reach “gold-medal level,” the claim needs careful reading. It does not mean the AI officially competed and won a medal. It means that the model's graded performance on real past contest problems meets or exceeds the threshold required for a gold medal.

SU-01 achieves precisely this—on IMO 2025, USAMO 2026, and IPhO 2024/2025.

A “Lean” 30B-Parameter Model

Notably, SU-01’s backbone has only 30 billion parameters, of which about 3 billion are active at inference, as is typical of a Mixture-of-Experts (MoE) architecture. It is not a trillion-parameter behemoth.

This sends an important signal: for reasoning tasks, training methodology and data quality may matter more than raw parameter count.

Training Recipe: Three Steps

The paper’s core contribution is a “simple and unified recipe,” executed in three stages:

Step 1: Reverse-Perplexity SFT Curriculum

Traditional supervised fine-tuning (SFT) trains models to mimic “correct answers.” SU-01 adopts a different strategy—the reverse-perplexity curriculum.

The intuition: for complex proofs, the model should learn backward search behavior—starting from the conclusion and reasoning backward—rather than merely imitating forward deduction. This trains the model in rigorous proof search and self-checking.
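To make the idea of a perplexity-based curriculum concrete, here is a minimal sketch. It assumes, without confirmation from the report, that "reverse" means presenting training trajectories from highest to lowest perplexity under the base model. The `perplexity` and `reverse_perplexity_order` helpers and the sample format are hypothetical, and the paper's actual ordering rule may differ.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one trajectory from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(samples):
    """Sort samples from highest to lowest perplexity.

    ASSUMPTION: this is one plausible reading of "reverse-perplexity";
    it is not taken from the SU-01 report.
    """
    return sorted(samples, key=lambda s: perplexity(s["logprobs"]), reverse=True)

# Toy trajectories with made-up log-probabilities (illustrative only).
samples = [
    {"id": "a", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "b", "logprobs": [-1.5, -2.0, -1.0]},
    {"id": "c", "logprobs": [-0.7, -0.6, -0.8]},
]

curriculum = [s["id"] for s in reverse_perplexity_order(samples)]
print(curriculum)  # ['b', 'c', 'a']
```

In practice the log-probabilities would come from scoring each trajectory with the model being fine-tuned; that scoring step is omitted here.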

Step 2: Two-Stage RL

  • Stage One: Verifiable-Reward RL. Uses objective, verifiable outcomes as reward signals (e.g., whether the final answer to a math problem is correct).
  • Stage Two: Proof-Level RL. A finer-grained reward mechanism that evaluates not only the final answer but also the quality of the proof process.

Progressing from coarse-grained to fine-grained rewards ensures the model receives clear learning signals early on, without being overwhelmed by overly complex reward functions at the outset.
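The difference between the two reward types can be sketched as follows. This is an illustration under stated assumptions, not the report's reward functions: the string normalization, the per-step scores (imagined as coming from some proof grader), and the `answer_weight` blend are all hypothetical.

```python
def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    """Stage one (sketch): 1.0 if the normalized final answers match, else 0.0."""
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "")
    return 1.0 if normalize(predicted_answer) == normalize(reference_answer) else 0.0

def proof_level_reward(step_scores, answer_correct: bool, answer_weight: float = 0.5) -> float:
    """Stage two (sketch): blend answer correctness with mean proof-step quality.

    step_scores: per-step quality scores in [0, 1] from a HYPOTHETICAL grader.
    answer_weight: ASSUMED mixing coefficient, not a value from the report.
    """
    proof_quality = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return answer_weight * float(answer_correct) + (1.0 - answer_weight) * proof_quality

print(verifiable_reward(" 42 ", "42"))        # 1.0
print(proof_level_reward([1.0, 0.5], True))   # 0.875
```

The first function gives a sparse but unambiguous signal. The second rewards a correct answer less when the supporting steps are weak, which is the kind of finer-grained feedback the second stage describes.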

Step 3: Test-Time Scaling

The final step increases the compute budget at inference time (e.g., longer chains of thought, more sampled solutions), which further boosts problem-solving performance.
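One common way to spend extra inference compute is to sample several solutions and take a majority vote over their final answers. The sketch below shows that technique only as an example; the text does not say which test-time scaling method SU-01 uses, and `solve_with_budget` and the canned sampler are hypothetical stand-ins for a real model call.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]

def solve_with_budget(sample_answer, problem, num_samples):
    """Draw num_samples answers from a sampler and aggregate them by majority vote.

    sample_answer: callable taking a problem and returning one final answer
    (a HYPOTHETICAL stand-in for sampling from a model).
    """
    answers = [sample_answer(problem) for _ in range(num_samples)]
    return majority_vote(answers)

# Toy sampler that replays canned answers instead of calling a model.
canned = iter(["42", "41", "42", "43", "42"])
answer = solve_with_budget(lambda problem: next(canned), "toy problem", num_samples=5)
print(answer)  # 42
```

Raising `num_samples` is the "more sampling" knob. Voting only works when answers can be compared directly, so full proofs would need a different aggregation step.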

Training Data Scale

The SFT stage used roughly 340K reasoning trajectories, each under 8K tokens; the RL stage ran for 200 steps. For a 30B model this volume is modest, which suggests that data quality matters more than sheer quantity.
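A quick back-of-the-envelope calculation shows how small the SFT corpus is. It treats "8K" as 8,000 tokens (an assumption; it could also mean 8,192) and yields an upper bound, since every trajectory is shorter than the cap.

```python
num_trajectories = 340_000           # approximate count from the text
max_tokens_per_trajectory = 8_000    # ASSUMPTION: "8K" read as 8,000 tokens

# Upper bound on SFT tokens per pass over the data.
upper_bound_tokens = num_trajectories * max_tokens_per_trajectory
print(f"{upper_bound_tokens:,}")  # 2,720,000,000
```

So one pass over the SFT data covers at most about 2.7 billion tokens.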

Ultra-Long Reasoning Trajectories

SU-01 stably handles reasoning trajectories exceeding 100K tokens. This means that when tackling the hardest Olympiad problems, the model can sustain “thinking”—generating and verifying intermediate steps—without collapsing after just a few hundred tokens.

This capacity for ultra-long trajectories is effectively a prerequisite for Olympiad-level problem solving. A full IMO-level proof may require dozens of reasoning steps and multiple rounds of self-correction.

Generalization Capability

The paper also reports the model’s generalization performance on scientific reasoning tasks beyond mathematics and physics. While specific metrics are not detailed here, the trend is noteworthy: a training methodology validated on math/physics may transfer to other domains demanding rigorous, step-by-step reasoning.

Evaluation

SU-01’s significance lies not in any single technical novelty, but in its integration of a reproducible, end-to-end training pipeline: from SFT to RL to test-time scaling—each stage grounded in explicit design principles and empirical validation.

For teams building reasoning models, this 77-page technical report reads more like a hands-on manual—it tells you how to execute each step, not just what the final result looks like.


Primary Source:

  • arXiv:2605.13301 SU-01
  • Yafu Li, Runzhe Zhan, Haoran Zhang, Shunkai Zhang, Yizhuo Li, Zhilin Wang, Jiacheng Chen, Futing Wang, Xuyang Hu, Yuchen Fan, Bangjie Xu, Yucheng Su, Xinmiao Han, Chenxi Li, Haodi Lei, Yufeng Zhao, Zejin Lin, Qianjia Cheng, Tong Zhu, Xiaoye Qu, Ganqu Cui, Peng Ye, Yun Luo, Zhouchen Lin, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng (28 authors total)
  • Technical Report, 77 pages