GoLongRL: An Open-Source Long-Context RL Training Framework—30B Model Matches DeepSeek-R1-0528 Performance

Long context has long been a weak spot for LLMs. Extending the context window to 128K, 256K, or even 1M tokens is technically straightforward—but enabling models to truly understand information within long texts and perform correct reasoning remains challenging.

GoLongRL takes a different approach: rather than scaling up parameters, it teaches models to handle long contexts through reinforcement learning with diverse reward signals. Crucially, it is fully open-source: the dataset, training code, and full pipeline are all publicly available.

The Problem: Blind Spots in Existing Methods

The paper identifies a common flaw in current long-context RL approaches: equating data construction with “designing increasingly complex retrieval paths.” This leads to homogeneous task coverage and reward formulations that fail to reflect real long-context requirements.

Analogy: Teaching a student to read long articles isn’t about repeatedly drilling keyword lookup—it’s about cultivating diverse long-text processing capabilities, such as summarization, reasoning, comparison, extraction, and localization.

Two Core Contributions

1. Capability-Oriented Data Construction

The team releases an RLVR dataset of 23K samples, covering 9 distinct task types—each paired with natural, task-aligned evaluation metrics.
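To make "task-aligned evaluation metrics" concrete, here is a minimal sketch of how a verifiable reward could be dispatched by task type. The task names (`extraction`, `open_qa`), the two metrics, and the `reward` helper are illustrative assumptions, not the nine task types or metrics actually used in the GoLongRL dataset.

```python
import re
from collections import Counter
from typing import Callable


def _tokens(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized token sequences are identical, else 0.0."""
    return 1.0 if _tokens(prediction) == _tokens(reference) else 0.0


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between prediction and reference."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical registry: each task type gets the metric that fits it best.
METRICS: dict[str, Callable[[str, str], float]] = {
    "extraction": exact_match,  # one short, checkable answer
    "open_qa": token_f1,        # free-form answer, partial credit allowed
}


def reward(task_type: str, prediction: str, reference: str) -> float:
    """Score a model response with the metric registered for its task type."""
    return METRICS[task_type](prediction, reference)
```

The design point is that a single reward function for all tasks would either be too strict for free-form answers or too lenient for exact lookups; a per-task registry keeps each reward verifiable while matching what the task actually asks for.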

Data sources fall into two categories:

Curated open samples drawn from mature corpora
Synthesized QA pairs derived from authentic source documents (books, academic papers, multi-turn dialogues)

Under identical vanilla GRPO settings, this dataset alone outperforms the closed-source QwenLong-L1.5 dataset.
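For readers unfamiliar with GRPO, its core step scores each sampled response relative to the other responses drawn for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative normalization, assuming scalar rewards and population standard deviation (implementations differ on this detail). The function name and the `eps` default are illustrative and not taken from the GoLongRL code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's sampled responses within the group.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get positive advantages and the rest negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt: two judged correct, two judged wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason the mix and difficulty of training tasks matters.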

2. TMN-Reweight: Multi-Task Reward Re-weighting

Since tasks differ in difficulty and importance, GoLongRL proposes a Task–Metric–Network (TMN) re-weighting method, enabling the model to automatically adjust its focus across tasks during training.
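This summary does not give the TMN formula, so the following is not the authors' method. It is a generic sketch of multi-task reward re-weighting, assuming a simple difficulty heuristic: tasks with a lower running mean reward get a larger weight. The names `task_weights`, `reweight`, and `floor`, and the task labels, are hypothetical.

```python
def task_weights(mean_reward_by_task: dict[str, float], floor: float = 0.1) -> dict[str, float]:
    """Give harder tasks (lower mean reward) larger weights.

    Raw weight is (1 - mean reward), clipped below at `floor` so that nearly
    solved tasks still contribute. Weights are rescaled to average 1.0.
    """
    raw = {task: max(1.0 - m, floor) for task, m in mean_reward_by_task.items()}
    scale = len(raw) / sum(raw.values())
    return {task: w * scale for task, w in raw.items()}


def reweight(task_type: str, reward: float, weights: dict[str, float]) -> float:
    """Scale a raw reward by the current weight of its task."""
    return weights[task_type] * reward


# Example: retrieval is nearly solved, summarization is still hard.
weights = task_weights({"retrieval": 0.9, "summarization": 0.5})
scaled = reweight("summarization", 1.0, weights)
```

Rescaling the weights to average 1.0 keeps the overall reward magnitude roughly stable while shifting emphasis between tasks; the actual TMN-Reweight scheme may work quite differently.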

Performance Results

The reported results are striking:

Qwen3-30B-A3B matches the long-context performance of DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507
A 30B-parameter model vs. a 235B-parameter model: a gap of nearly 8× in total parameters
Under identical vanilla GRPO settings, the dataset alone outperforms the closed-source QwenLong-L1.5 dataset

Why It Matters

The reported results suggest that long-context capability is not solely a function of parameter count. With carefully designed data and training methodology, a medium-scale model can match much larger models on long-context tasks.

Just as important is its full openness: not only the model weights but also the entire training pipeline and data are released, so the community can reproduce, improve, and extend the work.

Paper link: arXiv:2605.19577

