Bottom Line
GENERAL365, released April 27, 2026, is a new reasoning benchmark that tests LLMs’ ability to solve difficult reasoning puzzles using only K-12 knowledge. All 365 questions are human-curated and cover complex constraints, nested logic, and semantic interference. The best current models score under 10%, meaning frontier LLMs’ pure reasoning ability (independent of external knowledge) remains far below human level.
Benchmark Design
GENERAL365 differs from traditional reasoning benchmarks in three key ways:
| Feature | MMLU / GSM8K | AIME / FrontierMath | GENERAL365 |
|---|---|---|---|
| Knowledge required | Domain expertise | Math competition level | K-12 basics |
| Question source | Auto-curated | Competition problems | 365 human-designed |
| Tests | Knowledge mastery | Math depth | General logic |
| Distractors | None | None | Semantic interference |
Three test dimensions (illustrated by the sketch after this list):
- Complex constraints: Multiple mutually constraining conditions to track simultaneously
- Nested logic: Multi-layer nested conditions (“if A then B, unless C, but when D is true…”)
- Semantic interference: Misleading but plausible-sounding information that tests the model’s ability to allocate attention and filter out irrelevant detail
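To make these dimensions concrete, here is a minimal Python sketch (a toy illustration of the style, not an actual GENERAL365 item) that encodes a nested rule of the “if A then B, unless C, but when D is true…” form and brute-forces which truth assignments satisfy it:

```python
from itertools import product

# Toy encoding of a nested rule: "if A then B, unless C holds --
# but when D is true, the C exception is void."
def rule(a: bool, b: bool, c: bool, d: bool) -> bool:
    exception_active = c and not d        # C suspends the rule, unless D cancels C
    return exception_active or (not a) or b  # "if A then B" applies when no exception

# A distractor predicate that sounds relevant but never constrains anything,
# mimicking the benchmark's semantic-interference dimension.
def distractor(a: bool, b: bool, c: bool, d: bool) -> bool:
    return True  # e.g. "B was announced on a Tuesday" -- irrelevant to the logic

solutions = [bits for bits in product([False, True], repeat=4)
             if rule(*bits) and distractor(*bits)]
print(f"{len(solutions)} of 16 assignments satisfy the nested rule")  # prints 13
```

The distractor predicate is the point: it reads like a constraint but never eliminates an assignment, which is exactly the filtering skill the semantic-interference dimension probes.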
Current Performance
The best models score under 10%: since 10% of 365 is 36.5, even the most advanced LLMs answer at most 36 of the 365 questions correctly.
This aligns with another finding: the LongCoT (long chain-of-thought) benchmark also shows the best models under 10%. Both benchmarks point to the same conclusion: long-range reasoning and complex logic processing remain LLMs’ biggest weakness.
Why This Matters
- Controls the knowledge variable: By limiting required knowledge to K-12 basics, it isolates reasoning ability from recall
- Human-curated, not auto-collected: Avoids the “dataset overfitting” common to auto-curated benchmarks
- Semantic interference mirrors real work: Real problems contain noise — GENERAL365 tests this directly
- Code and benchmark are public: Reproducible, extensible, and trackable over time (a hypothetical scoring sketch follows this list)
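Since the code and benchmark are public, reproducing the headline number should amount to a loop like the one below. This is a hedged sketch only: the file name `general365.jsonl`, the record fields, and `model_answer()` are illustrative assumptions, not the benchmark’s published interface.

```python
import json

def model_answer(question: str) -> str:
    """Stand-in for a call to whatever LLM you are evaluating."""
    return "..."  # replace with a real API call

def score(path: str = "general365.jsonl") -> float:
    # Assumed format: one JSON object per line with "question" and "answer" fields.
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total += 1
            if model_answer(item["question"]).strip() == item["answer"].strip():
                correct += 1
    return correct / total  # e.g. 36/365 ≈ 0.099, i.e. under 10%

# print(f"accuracy: {score():.1%}")
```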
Selection Guide
| Role | How to use |
|---|---|
| Model vendors | Include GENERAL365 in internal evaluation, track reasoning improvement |
| Researchers | Analyze failure patterns to locate specific reasoning weaknesses |
| Developers | If your app involves complex logic (legal, audit, scheduling), add human review layers (see the sketch after this table) |
| Enterprise buyers | Use GENERAL365 scores for model selection — under 5% means unsuitable for high-logic-density scenarios |
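One way to operationalize the last two rows is a simple gating policy. The sketch below is an illustrative assumption, not an official recommendation from the benchmark: the domain list, function name, and routing logic are hypothetical, with only the 5% floor taken from the table above.

```python
# Route high-logic-density tasks to human review when a candidate model's
# GENERAL365 score falls below a threshold (the table above suggests 5% as a floor).
HIGH_LOGIC_DOMAINS = {"legal", "audit", "scheduling"}

def needs_human_review(domain: str, general365_score: float,
                       threshold: float = 0.05) -> bool:
    """Require review for high-logic-density work when the score is under threshold."""
    return domain in HIGH_LOGIC_DOMAINS and general365_score < threshold

# Example: a model scoring 4% on GENERAL365 handling an audit task
assert needs_human_review("audit", 0.04) is True
assert needs_human_review("marketing", 0.04) is False
```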