ChaoBro

GENERAL365 Benchmark Released: A New Ruler for General Reasoning

Bottom Line

GENERAL365, released April 27, 2026, is a new reasoning benchmark that tests LLMs' ability to solve difficult reasoning puzzles using only K-12 knowledge. Its 365 questions are entirely human-curated and cover complex constraints, nested logic, and semantic interference. The current best models score under 10%, which suggests that frontier LLMs' pure reasoning ability (without relying on external knowledge) is still far from human level.

Benchmark Design

GENERAL365 differs from traditional reasoning benchmarks in four key ways:

Feature            | MMLU / GSM8K      | AIME / FrontierMath    | GENERAL365
Knowledge required | Domain expertise  | Math competition level | K-12 basics
Question source    | Auto-curated      | Competition problems   | 365 human-designed
Tests              | Knowledge mastery | Math depth             | General logic
Distractors        | None              | None                   | Semantic interference

Three test dimensions:

  1. Complex constraints: Multiple mutually constraining conditions to track simultaneously
  2. Nested logic: Multi-layer nested conditions ("if A then B, unless C, but when D is true...")
  3. Semantic interference: Misleading information tests attention allocation and filtering

Current Performance

The best models score under 10%: even the most advanced LLMs answer fewer than 37 of the 365 questions correctly.
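The arithmetic behind the "fewer than 37" figure can be checked with a short sketch. The function name `max_correct_below` is hypothetical and only illustrates the calculation; the 365-question total and the 10% figure come from the article.

```python
TOTAL_QUESTIONS = 365  # benchmark size stated in the article


def max_correct_below(percent: int, total: int = TOTAL_QUESTIONS) -> int:
    """Largest number of correct answers whose accuracy is strictly below `percent`.

    Integer arithmetic is used so the boundary case is exact:
    we need correct * 100 < percent * total.
    """
    return (percent * total - 1) // 100


if __name__ == "__main__":
    # 10% of 365 is 36.5, so a score under 10% means at most 36 correct answers.
    print(max_correct_below(10))
```

A model with 36 correct answers sits at roughly 9.9%, while 37 correct answers would already cross the 10% line.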

This aligns with another finding: the LongCoT (long chain-of-thought) benchmark also shows the best models scoring under 10%. Both benchmarks point to the same conclusion: long-range reasoning and complex logic processing are still LLMs' biggest weakness.

Why This Matters

  1. Controls the knowledge variable: By limiting questions to K-12 knowledge, it isolates reasoning ability from domain expertise
  2. Human-curated, not auto-collected: Avoids the "dataset overfitting" common in auto-curated benchmarks
  3. Semantic interference mirrors real work: Real problems contain noise, and GENERAL365 tests this directly
  4. Code and benchmark are public: Results are reproducible, extensible, and trackable

Selection Guide

Role              | How to use
Model vendors     | Include GENERAL365 in internal evaluation, track reasoning improvement
Researchers       | Analyze failure patterns to locate specific reasoning weaknesses
Developers        | If your app involves complex logic (legal, audit, scheduling), add human review layers
Enterprise buyers | Use GENERAL365 scores for model selection; under 5% means unsuitable for high-logic-density scenarios
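As a rough illustration of the enterprise-buyer rule of thumb, here is a minimal sketch that turns a raw count of correct answers into a screening decision. The 5% cutoff and the 365-question total come from the article; the function names and the idea of screening on a raw count are assumptions made for this example, not part of the benchmark's tooling.

```python
TOTAL_QUESTIONS = 365            # benchmark size stated in the article
UNSUITABLE_BELOW_PERCENT = 5.0   # cutoff quoted in the guide above


def accuracy_percent(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Convert a count of correct answers into a percentage score."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return 100.0 * correct / total


def passes_logic_screen(correct: int, total: int = TOTAL_QUESTIONS) -> bool:
    """Hypothetical screen: True if the score reaches the 5% cutoff."""
    return accuracy_percent(correct, total) >= UNSUITABLE_BELOW_PERCENT


if __name__ == "__main__":
    for correct in (18, 19, 36):
        score = accuracy_percent(correct)
        print(f"{correct}/365 = {score:.2f}% -> passes screen: {passes_logic_screen(correct)}")
```

With 365 questions the cutoff falls between 18 and 19 correct answers, so a single question can flip the decision; the score is better treated as one input alongside human review than as a hard gate.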

Sources