Bottom Line
GENERAL365, released April 27, 2026, is a new reasoning benchmark that tests LLMs’ ability to solve difficult reasoning puzzles using only K-12 knowledge. All 365 questions are human-curated and cover complex constraints, nested logic, and semantic interference. The best current models score under 10%, meaning frontier LLMs’ pure reasoning ability (independent of external knowledge) remains far below human level.
Benchmark Design
GENERAL365 differs from traditional reasoning benchmarks in three key ways:
| Feature | MMLU / GSM8K | AIME / FrontierMath | GENERAL365 |
|---|---|---|---|
| Knowledge required | Domain expertise | Math competition level | K-12 basics |
| Question source | Auto-curated | Competition problems | 365 human-designed |
| Tests | Knowledge mastery | Math depth | General logic |
| Distractors | None | None | Semantic interference |
Three test dimensions (illustrated by the sketch after this list):
- Complex constraints: Multiple mutually constraining conditions to track simultaneously
- Nested logic: Multi-layer nested conditions (“if A then B, unless C, but when D is true…”)
- Semantic interference: Misleading but plausible-sounding information that tests the model’s ability to allocate attention and filter out irrelevant detail
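To make these dimensions concrete, here is a minimal Python sketch (a toy illustration of the style, not an actual GENERAL365 item) that encodes a nested rule of the “if A then B, unless C, but when D is true…” form and brute-forces which truth assignments satisfy it:

```python
from itertools import product

# Toy encoding of a nested rule: "if A then B, unless C holds --
# but when D is true, the C exception is void."
def rule(a: bool, b: bool, c: bool, d: bool) -> bool:
    exception_active = c and not d        # C suspends the rule, unless D cancels C
    return exception_active or (not a) or b  # "if A then B" applies when no exception

# A distractor predicate that sounds relevant but never constrains anything,
# mimicking the benchmark's semantic-interference dimension.
def distractor(a: bool, b: bool, c: bool, d: bool) -> bool:
    return True  # e.g. "B was announced on a Tuesday" -- irrelevant to the logic

solutions = [bits for bits in product([False, True], repeat=4)
             if rule(*bits) and distractor(*bits)]
print(f"{len(solutions)} of 16 assignments satisfy the nested rule")  # prints 13
```

The distractor predicate is the point: it reads like a constraint but never eliminates an assignment, which is exactly the filtering skill the semantic-interference dimension probes.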
Current Performance
The best models score under 10%: since 10% of 365 is 36.5, even the most advanced LLMs answer at most 36 of the 365 questions correctly.
This aligns with another finding: the LongCoT (long chain-of-thought) benchmark also shows the best models under 10%. Both benchmarks point to the same conclusion: long-range reasoning and complex logic processing remain LLMs’ biggest weakness.
Why This Matters
- Controls the knowledge variable: By limiting required knowledge to K-12 basics, it isolates reasoning ability from recall
- Human-curated, not auto-collected: Avoids the “dataset overfitting” common to auto-curated benchmarks
- Semantic interference mirrors real work: Real problems contain noise — GENERAL365 tests this directly
- Code and benchmark are public: Reproducible, extensible, and trackable over time (a hypothetical scoring sketch follows this list)
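Since the code and benchmark are public, reproducing the headline number should amount to a loop like the one below. This is a hedged sketch only: the file name `general365.jsonl`, the record fields, and `model_answer()` are illustrative assumptions, not the benchmark’s published interface.

```python
import json

def model_answer(question: str) -> str:
    """Stand-in for a call to whatever LLM you are evaluating."""
    return "..."  # replace with a real API call

def score(path: str = "general365.jsonl") -> float:
    # Assumed format: one JSON object per line with "question" and "answer" fields.
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total += 1
            if model_answer(item["question"]).strip() == item["answer"].strip():
                correct += 1
    return correct / total  # e.g. 36/365 ≈ 0.099, i.e. under 10%

# print(f"accuracy: {score():.1%}")
```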
Selection Guide
| Role | How to use |
|---|---|
| Model vendors | Include GENERAL365 in internal evaluation, track reasoning improvement |
| Researchers | Analyze failure patterns to locate specific reasoning weaknesses |
| Developers | If your app involves complex logic (legal, audit, scheduling), add human review layers (see the sketch after this table) |
| Enterprise buyers | Use GENERAL365 scores for model selection — under 5% means unsuitable for high-logic-density scenarios |
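One way to operationalize the last two rows is a simple gating policy. The sketch below is an illustrative assumption, not an official recommendation from the benchmark: the domain list, function name, and routing logic are hypothetical, with only the 5% floor taken from the table above.

```python
# Route high-logic-density tasks to human review when a candidate model's
# GENERAL365 score falls below a threshold (the table above suggests 5% as a floor).
HIGH_LOGIC_DOMAINS = {"legal", "audit", "scheduling"}

def needs_human_review(domain: str, general365_score: float,
                       threshold: float = 0.05) -> bool:
    """Require review for high-logic-density work when the score is under threshold."""
    return domain in HIGH_LOGIC_DOMAINS and general365_score < threshold

# Example: a model scoring 4% on GENERAL365 handling an audit task
assert needs_human_review("audit", 0.04) is True
assert needs_human_review("marketing", 0.04) is False
```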