Key Takeaway
The latest GDPval-AA benchmark results for real-world agentic workloads are out, and Xiaomi MiMo-V2.5-Pro takes first place with a score of 1578, ending DeepSeek’s streak in this evaluation. The gap among China’s top five open-source models has narrowed to within 94 points, shifting the competitive landscape from “one dominant player” to “many rising contenders.”
| Model | GDPval-AA Score | Rank | Release Date |
|---|---|---|---|
| Xiaomi MiMo-V2.5-Pro | 1578 | 1 | 2026.05 |
| DeepSeek V4 Pro | 1554 | 2 | 2026.04 |
| GLM 5.1 | 1535 | 3 | 2026.04 |
| MiniMax M2.7 | 1514 | 4 | 2026.04 |
| Kimi K2.6 | 1484 | 5 | 2026.04 |
What Happened
GDPval-AA is a benchmark focused on real-world agentic capabilities. Unlike traditional knowledge quizzes or multiple-choice tests, it evaluates a model’s planning, tool-calling, and multi-step reasoning abilities in practical tasks.
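GDPval-AA's actual harness isn't described here; as a hedged illustration only, the plan → act → observe loop that agentic benchmarks exercise looks roughly like the sketch below. Every name in it (`run_agent`, `scripted_model`, the `add` tool) is illustrative, not part of GDPval-AA.

```python
# Illustrative sketch of the multi-step loop an agentic benchmark
# exercises: the model plans, calls a tool, observes the result,
# and repeats until it decides it is done.

def run_agent(task, tools, model_step, max_steps=8):
    """Drive a plan -> act -> observe loop until the model finishes."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = model_step(history)          # model decides the next step
        if action["type"] == "finish":
            return action["answer"]
        tool = tools[action["tool"]]          # look up the requested tool
        observation = tool(**action["args"])  # execute the tool call
        history.append(("observation", observation))
    return None  # ran out of steps

# Toy example: one tool and a scripted stand-in for the model.
tools = {"add": lambda a, b: a + b}

def scripted_model(history):
    if history[-1][0] == "task":
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": history[-1][1]}

print(run_agent("add 2 and 3", tools, scripted_model))  # -> 5
```

Benchmarks in this family score how reliably a model drives such a loop on practical tasks, not how well it answers isolated questions.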
MiMo-V2.5-Pro’s rise to the top sends several key signals:
First, smartphone manufacturers are entering the foundation model battlefield. Xiaomi’s AI presence has historically been concentrated in end-user applications (phone AI assistants, IoT devices), with the MiMo series serving primarily as a supporting model for its own ecosystem. V2.5-Pro breaking into the top tier of open-source benchmarks signals that phone manufacturers are moving from the “AI application layer” into the “foundation model layer.”
Second, the five-way gap is only 94 points. The difference between the top score of 1578 and fifth place at 1484 is just 6%, meaning that on this evaluation dimension, China’s top open-source models have entered a “no absolute king” competitive phase. User choice is no longer determined by benchmark scores alone — API pricing, context window size, and inference speed all factor in.
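The spread cited above checks out arithmetically; using the table's scores:

```python
# Verify the spread between first and fifth place: 94 points,
# which is about 6% of the leader's score.
scores = {
    "MiMo-V2.5-Pro": 1578,
    "DeepSeek V4 Pro": 1554,
    "GLM 5.1": 1535,
    "MiniMax M2.7": 1514,
    "Kimi K2.6": 1484,
}
spread = max(scores.values()) - min(scores.values())
spread_pct = spread / max(scores.values()) * 100
print(spread, f"{spread_pct:.1f}%")  # -> 94 6.0%
```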
Cross-Benchmark Comparison: Different Dimensions, Different Winners
GDPval-AA is just one piece of the evaluation puzzle. Across multiple independent benchmarks, the top five models each have their strengths:
| Model | GDPval-AA | SWE-bench | Coding | Chinese | Best Use Case |
|---|---|---|---|---|---|
| MiMo-V2.5-Pro | 1578 | Medium | Above Average | Average | Agentic Workflows |
| DeepSeek V4 Pro | 1554 | High | High | High | All-Around Balanced |
| GLM 5.1 | 1535 | High | High | High | Tool Calling + Chinese |
| MiniMax M2.7 | 1514 | Medium | Medium | Medium | Multimodal |
| Kimi K2.6 | 1484 | Very High | Very High | High | Code Generation |
Kimi K2.6 ranks last on GDPval-AA but excels on SWE-bench (software engineering benchmark) — this demonstrates that different benchmarks reflect different capability dimensions, and model selection must be scenario-specific rather than score-driven.
Landscape Assessment
The April–May 2026 window is a "super release" season for China's open-source models: four of the five models above shipped in April, Xiaomi's in May, and MiniMax M3 is also on the way. This timing isn't coincidental; every lab is racing to position its product before Google I/O (mid-May) and Anthropic's developer conference (May 6).
For developers and enterprise users, this is both a “choice overload” period and the best window to evaluate:
- If you need the strongest agentic workflow capability → MiMo-V2.5-Pro is the current pick
- If you need balanced coding + Chinese + tool capabilities → DeepSeek V4 Pro or GLM 5.1
- If you focus on software engineering → Kimi K2.6 remains strongest on SWE-bench
- If you need multimodal capabilities → MiniMax M2.7 deserves testing
Action Items
- Don't rely on a single benchmark: GDPval-AA focuses on agentic capability, SWE-bench on software engineering, LMArena on human preference in head-to-head votes. Reference the benchmark that matches your actual use case.
- Run your own benchmarks: Each model may have uncovered advantages in specific domains. A/B test with your own task set.
- Watch the API price war: As model capabilities converge, price becomes the main differentiator. DeepSeek has already initiated API price cuts — others are expected to follow.