GENERAL365 Benchmark Released: A New Ruler for General Reasoning

Bottom Line

GENERAL365, released April 27, 2026, is a new reasoning benchmark that tests LLMs’ ability to solve difficult reasoning puzzles using only K-12 knowledge. All 365 questions are human-curated and cover complex constraints, nested logic, and semantic interference. The best current models score under 10%, which means frontier LLMs’ pure reasoning ability (independent of external knowledge) remains far below human level.

Benchmark Design

GENERAL365 differs from traditional reasoning benchmarks in three key ways:

| Feature | MMLU / GSM8K | AIME / FrontierMath | GENERAL365 |
| --- | --- | --- | --- |
| Knowledge required | Domain expertise | Math competition level | K-12 basics |
| Question source | Auto-curated | Competition problems | 365 human-designed |
| Tests | Knowledge mastery | Math depth | General logic |
| Distractors | None | None | Semantic interference |

Three test dimensions (a toy example in code follows the list):

  1. Complex constraints: Multiple mutually constraining conditions to track simultaneously
  2. Nested logic: Multi-layer nested conditions (“if A then B, unless C, but when D is true…”)
  3. Semantic interference: Misleading information that tests attention allocation and filtering
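
To make these dimensions concrete, here is a toy puzzle in that style, invented for this article rather than taken from the benchmark, expressed as a brute-force check in Python. It packs interlocking constraints, a nested “if A then B, unless C” rule, and a distractor fact into one question that needs only K-12 knowledge.

```python
# Toy, GENERAL365-style puzzle (invented for illustration; not an actual
# benchmark item). Only K-12 knowledge is needed; the difficulty is logical.
#
# Ana, Ben, and Cal each own exactly one of: cat, dog, fish.
#   Given:       Ben owns the dog.
#   Constraint:  Ana does not own the cat.
#   Nested rule: if Ben owns the dog, Cal owns the fish -- unless Ana owns
#                the fish, in which case Cal owns the cat instead.
#   Distractor:  Ben's favorite animal is the cat. (Favorite, not owned;
#                this fact constrains nothing -- semantic interference.)
# Question: who owns the fish?
from itertools import permutations

def consistent(owner: dict) -> bool:
    if owner["Ben"] != "dog":          # given
        return False
    if owner["Ana"] == "cat":          # flat constraint
        return False
    if owner["Ben"] == "dog":          # nested rule: "if A then B, unless C"
        if owner["Ana"] == "fish":
            return owner["Cal"] == "cat"   # the "unless" branch
        return owner["Cal"] == "fish"
    return True

solutions = [
    dict(zip(("Ana", "Ben", "Cal"), pets))
    for pets in permutations(("cat", "dog", "fish"))
    if consistent(dict(zip(("Ana", "Ben", "Cal"), pets)))
]
assert len(solutions) == 1
print(solutions[0])  # {'Ana': 'fish', 'Ben': 'dog', 'Cal': 'cat'}
```

The script finds the unique answer (Ana owns the fish) by exhaustive search; a model under test has to do the same bookkeeping in natural language while the distractor competes for attention.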

Current Performance

The best models score under 10%: even the most advanced LLMs answer fewer than 37 of the 365 questions correctly.

This aligns with another finding: the LongCoT (long chain-of-thought) benchmark also shows the best models under 10%. Both benchmarks point to the same conclusion: LLMs’ long-range reasoning and complex logic processing remain their biggest weakness.

Why This Matters

  1. Controls the knowledge variable: By capping required knowledge at K-12, it tests reasoning ability in isolation
  2. Human-curated, not auto-collected: Avoids the “dataset overfitting” common in auto-curated benchmarks
  3. Semantic interference mirrors real work: Real problems contain noise; GENERAL365 tests this directly
  4. Code and benchmark are public: Reproducible, extensible, trackable (a sketch of a possible evaluation loop follows this list)
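
Since the release states that the code and benchmark are public, results should be reproducible with a short harness. The sketch below is a hypothetical evaluation loop: the JSONL layout, the `question`/`answer` field names, the `general365.jsonl` path, and the `query_model` callback are all assumptions for illustration, not the repository’s actual interface.

```python
# Minimal, hypothetical evaluation loop for a GENERAL365-style benchmark.
# Assumptions (not taken from the release): questions ship as JSONL with
# "question" and "answer" fields, answers are short strings compared after
# normalization, and query_model() wraps whatever LLM you are testing.
import json

def normalize(text: str) -> str:
    # Case- and whitespace-insensitive exact match; a simplification.
    return " ".join(text.strip().lower().split())

def evaluate(path: str, query_model) -> float:
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            prediction = query_model(item["question"])
            correct += normalize(prediction) == normalize(item["answer"])
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Stub model: replace the lambda with a real API call.
    score = evaluate("general365.jsonl", lambda q: "unknown")
    print(f"accuracy: {score:.1%}")  # under 10% = fewer than 37/365 correct
```

Exact-match scoring is the simplest possible choice here; the real harness may grade differently.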

Selection Guide

| Role | How to use |
| --- | --- |
| Model vendors | Include GENERAL365 in internal evaluation; track reasoning improvement over time |
| Researchers | Analyze failure patterns to locate specific reasoning weaknesses |
| Developers | If your app involves complex logic (legal, audit, scheduling), add human review layers (see the sketch below) |
| Enterprise buyers | Use GENERAL365 scores for model selection; under 5% means the model is unsuitable for high-logic-density scenarios |
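
To illustrate the developer row above, a minimal human-review gate might look like the following; the domain list, function names, and routing logic are assumptions for illustration, not part of GENERAL365.

```python
# Hypothetical human-review gate illustrating the "add human review layers"
# advice above. The domain set and both callbacks are invented for this
# sketch; wire in your own model client and review queue.
HIGH_LOGIC_DOMAINS = {"legal", "audit", "scheduling"}

def answer_with_review(domain, question, query_model, request_review):
    """Route logic-heavy tasks through a human before the answer ships."""
    draft = query_model(question)
    if domain in HIGH_LOGIC_DOMAINS:
        # Frontier models score under 10% on GENERAL365, so logic-dense
        # output should not ship without a human in the loop.
        return request_review(draft)
    return draft

if __name__ == "__main__":
    # Stub callbacks: replace with a real model call and a review workflow.
    reply = answer_with_review(
        "audit",
        "Which of these invoices violate the stated policy?",
        query_model=lambda q: "draft answer",
        request_review=lambda d: f"[human-approved] {d}",
    )
    print(reply)
```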
