Long-context handling has long been a weak spot for LLMs. Extending the context window to 128K, 256K, or even 1M tokens is comparatively straightforward, but getting models to truly understand information buried in long texts and reason correctly over it remains challenging.
GoLongRL takes an intriguing approach: rather than scaling up parameters, it teaches models to handle long contexts through reinforcement learning with diverse reward signals. Crucially, it is fully open-source: the dataset, training code, and full pipeline are all publicly available.
The Problem: Blind Spots in Existing Methods
The paper identifies a common flaw in current long-context RL approaches: equating data construction with “designing increasingly complex retrieval paths.” This leads to homogeneous task coverage and reward formulations that fail to reflect real long-context requirements.
Analogy: Teaching a student to read long articles isn’t about repeatedly drilling keyword lookup—it’s about cultivating diverse long-text processing capabilities, such as summarization, reasoning, comparison, extraction, and localization.
Two Core Contributions
1. Capability-Oriented Data Construction
The team releases an RLVR (reinforcement learning with verifiable rewards) dataset of 23K samples covering 9 distinct task types, each paired with a natural, task-aligned evaluation metric.
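To make "each task type gets its own metric" concrete, here is a minimal sketch of a reward function that dispatches on task type. The task names, the choice of exact match and token-level F1, and every function name here are assumptions for illustration; the 9 task types and their metrics are not spelled out here, so this is not the GoLongRL implementation.

```python
from collections import Counter
from typing import Callable, Dict


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the case- and whitespace-normalized strings match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over whitespace-split, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical task names and metric assignments, chosen only for this sketch.
METRICS: Dict[str, Callable[[str, str], float]] = {
    "extraction": exact_match,
    "summarization": token_f1,
}


def reward(task: str, prediction: str, reference: str) -> float:
    """Score one rollout with the metric registered for its task type."""
    if task not in METRICS:
        raise KeyError(f"no metric registered for task {task!r}")
    return METRICS[task](prediction, reference)
```

The point of the registry is that adding a new task type means adding one entry, while the training loop keeps calling a single `reward` function.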
Data sources fall into two categories:
- Curated open samples drawn from mature corpora
- Synthesized QA pairs derived from authentic source documents (books, academic papers, multi-turn dialogues)
Under identical vanilla GRPO settings, training on this dataset alone yields better results than training on the closed-source QwenLong-L1.5 dataset.
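For readers unfamiliar with GRPO: its core step scores each sampled response relative to the other responses drawn for the same prompt, instead of relying on a learned value model. A minimal sketch of that group-relative advantage follows. The function name and the `eps` value are illustrative choices, and implementations differ on details such as population versus sample standard deviation, so treat this as an approximation rather than the GoLongRL training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each rollout's reward against its own sampling group.

    `rewards` holds the scalar rewards of all responses sampled for one prompt.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a group whose rollouts all receive the same reward yields all-zero advantages, which is one reason the choice of reward metric per task matters: a metric that rarely separates good from bad responses gives the policy little to learn from.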
2. TMN-Reweight: Multi-Task Reward Re-weighting
Since tasks differ in difficulty and importance, GoLongRL proposes a Task–Metric–Network (TMN) re-weighting method, enabling the model to automatically adjust its focus across tasks during training.
Performance Results
The reported results are striking:
- After GoLongRL training, Qwen3-30B-A3B matches the long-context performance of DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507
- That is a 30B-parameter model against a 235B-parameter one, a nearly 8× gap in total parameter count
- Under vanilla GRPO, the dataset alone surpasses the closed-source QwenLong-L1.5 dataset
Why It Matters
This work points to a critical insight: long-context capability is not solely a function of parameter count. With carefully designed data and training methodology, medium-scale models can match far larger ones on long-context tasks.
Even more importantly, its full openness—not just model weights, but the entire training pipeline and data—enables the community to reproduce, improve, and extend the work.
Paper link: arXiv:2605.19577