Key Takeaway
The latest GDPval-AA benchmark results for real-world agentic workloads are out, and Xiaomi MiMo-V2.5-Pro takes first place with a score of 1578, ending DeepSeek’s streak in this evaluation. The gap among China’s top five open-source models has narrowed to within 94 points, shifting the competitive landscape from “one dominant player” to “many rising contenders.”
| Model | GDPval-AA Score | Rank | Release Date |
|---|---|---|---|
| Xiaomi MiMo-V2.5-Pro | 1578 | 1 | 2026.05 |
| DeepSeek V4 Pro | 1554 | 2 | 2026.04 |
| GLM 5.1 | 1535 | 3 | 2026.04 |
| MiniMax M2.7 | 1514 | 4 | 2026.04 |
| Kimi K2.6 | 1484 | 5 | 2026.04 |
What Happened
GDPval-AA is a benchmark focused on real-world agentic capabilities. Unlike traditional knowledge quizzes or multiple-choice tests, it evaluates a model’s planning, tool-calling, and multi-step reasoning abilities in practical tasks.
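GDPval-AA's actual harness isn't described here; as a hedged illustration only, the plan → act → observe loop that agentic benchmarks exercise looks roughly like the sketch below. Every name in it (`run_agent`, `scripted_model`, the `add` tool) is illustrative, not part of GDPval-AA.

```python
# Illustrative sketch of the multi-step loop an agentic benchmark
# exercises: the model plans, calls a tool, observes the result,
# and repeats until it decides it is done.

def run_agent(task, tools, model_step, max_steps=8):
    """Drive a plan -> act -> observe loop until the model finishes."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = model_step(history)          # model decides the next step
        if action["type"] == "finish":
            return action["answer"]
        tool = tools[action["tool"]]          # look up the requested tool
        observation = tool(**action["args"])  # execute the tool call
        history.append(("observation", observation))
    return None  # ran out of steps

# Toy example: one tool and a scripted stand-in for the model.
tools = {"add": lambda a, b: a + b}

def scripted_model(history):
    if history[-1][0] == "task":
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": history[-1][1]}

print(run_agent("add 2 and 3", tools, scripted_model))  # -> 5
```

Benchmarks in this family score how reliably a model drives such a loop on practical tasks, not how well it answers isolated questions.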
MiMo-V2.5-Pro’s rise to the top sends several key signals:
First, smartphone manufacturers are entering the foundation model battlefield. Xiaomi’s AI presence has historically been concentrated in end-user applications (phone AI assistants, IoT devices), with the MiMo series serving primarily as a supporting model for its own ecosystem. V2.5-Pro breaking into the top tier of open-source benchmarks signals that phone manufacturers are moving from the “AI application layer” into the “foundation model layer.”
Second, the five-way gap is only 94 points. The difference between the top score of 1578 and fifth place at 1484 is just 6%, meaning that on this evaluation dimension, China’s top open-source models have entered a “no absolute king” competitive phase. User choice is no longer determined by benchmark scores alone — API pricing, context window size, and inference speed all factor in.
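The spread cited above checks out arithmetically; using the table's scores:

```python
# Verify the spread between first and fifth place: 94 points,
# which is about 6% of the leader's score.
scores = {
    "MiMo-V2.5-Pro": 1578,
    "DeepSeek V4 Pro": 1554,
    "GLM 5.1": 1535,
    "MiniMax M2.7": 1514,
    "Kimi K2.6": 1484,
}
spread = max(scores.values()) - min(scores.values())
spread_pct = spread / max(scores.values()) * 100
print(spread, f"{spread_pct:.1f}%")  # -> 94 6.0%
```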
Cross-Benchmark Comparison: Different Dimensions, Different Winners
GDPval-AA is just one piece of the evaluation puzzle. Across multiple independent benchmarks, the top five models each have their strengths:
| Model | GDPval-AA | SWE-bench | Coding | Chinese | Best Use Case |
|---|---|---|---|---|---|
| MiMo-V2.5-Pro | 1578 | Medium | Above Average | Average | Agentic Workflows |
| DeepSeek V4 Pro | 1554 | High | High | High | All-Around Balanced |
| GLM 5.1 | 1535 | High | High | High | Tool Calling + Chinese |
| MiniMax M2.7 | 1514 | Medium | Medium | Medium | Multimodal |
| Kimi K2.6 | 1484 | Very High | Very High | High | Code Generation |
Kimi K2.6 ranks last on GDPval-AA but excels on SWE-bench (software engineering benchmark) — this demonstrates that different benchmarks reflect different capability dimensions, and model selection must be scenario-specific rather than score-driven.
Landscape Assessment
The April–May 2026 window is a "super release" season for China's open-source models: four of the five models above shipped in April, Xiaomi's in May, and MiniMax M3 is also on the way. This timing isn't coincidental; every lab is racing to position its product before Google I/O (mid-May) and Anthropic's developer conference (May 6).
For developers and enterprise users, this is both a “choice overload” period and the best window to evaluate:
- If you need the strongest agentic workflow capability → MiMo-V2.5-Pro is the current pick
- If you need balanced coding + Chinese + tool capabilities → DeepSeek V4 Pro or GLM 5.1
- If you focus on software engineering → Kimi K2.6 remains strongest on SWE-bench
- If you need multimodal capabilities → MiniMax M2.7 deserves testing
Action Items
- Don't rely on a single benchmark: GDPval-AA focuses on agentic capability, SWE-bench on software engineering, LMArena on human preference in head-to-head votes. Reference the benchmark that matches your actual use case.
- Run your own benchmarks: Each model may have uncovered advantages in specific domains. A/B test with your own task set.
- Watch the API price war: As model capabilities converge, price becomes the main differentiator. DeepSeek has already initiated API price cuts — others are expected to follow.