What Happened
Qwen3.5/3.6 series models support thinking mode, but in practice they often overthink severely: the model generates masses of redundant reasoning steps inside <think> tags, token consumption explodes, and responses slow down, without a corresponding gain in accuracy.
On April 28, a post on X that received 317 likes and 514 bookmarks offered a solution: a Grammar-based constraint rule that can reduce think token consumption by up to 22x for Qwen series models, while maintaining accuracy.
How It Works
The core idea is to use Grammar rules to force the model into a structured reasoning format during the thinking phase, instead of letting it ramble open-endedly.
The implementation uses an EBNF-style root rule:

```gbnf
root ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: "
```
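The post shows only the opening productions. A plausible completion, in llama.cpp's GBNF syntax, might look like the sketch below; the `line` production, the closing `</think>` tag, and the permissive `code` catch-all are assumptions, not shown in the original post:

```gbnf
root  ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: " line "</think>\n"
# assumed productions: one non-empty line per step, then an unconstrained answer
line  ::= [^\n]+ "\n"
code  ::= [^\x00]*
```

Each step is capped at a single line, so the grammar itself bounds how much the model can write before it must move on.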
This rule forces the model to do only three things within the <think> block:
| Step | Content | Purpose |
|---|---|---|
| GOAL | Define the objective | Prevent going off-topic |
| APPROACH | Outline the method | Constrain the reasoning path |
| EDGE | Specify boundary conditions | Prevent over-elaboration |
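To sanity-check that constrained outputs actually follow the three-step structure, a small validator can be written with a regex. This is a hypothetical helper, not part of the original post, and it assumes the think block closes with `</think>` after the EDGE line (a completion the post does not show):

```python
import re

# One GOAL, APPROACH, and EDGE line each, then a closing tag.
THINK_PATTERN = re.compile(
    r"<think>\n"
    r"GOAL: [^\n]+\n"
    r"APPROACH: [^\n]+\n"
    r"EDGE: [^\n]+\n"
    r"</think>\n"
)

def is_constrained_think(output: str) -> bool:
    """Return True if the output begins with a well-formed think block."""
    return THINK_PATTERN.match(output) is not None

sample = (
    "<think>\n"
    "GOAL: Sort the list in O(n log n)\n"
    "APPROACH: Use merge sort\n"
    "EDGE: Empty list, single element\n"
    "</think>\n"
    "def merge_sort(xs): ..."
)
print(is_constrained_think(sample))  # True
```

A check like this is useful when comparing engines, since not all of them enforce grammars identically.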
Once the model follows this structure, it won’t endlessly “talk to itself” — the think phase token count drops from thousands to hundreds.
Why It Matters
Token Economics Perspective
For API users, think tokens directly equal cost. Overthinking not only slows down response times but also multiplies the cost of each call. Reducing think tokens by 22x means:
- Direct cost reduction: Significantly lower per-call API costs
- Speed improvement: a shorter reasoning chain means the first answer token arrives sooner
- Better UX: Users no longer wait for the model to “monologue”
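The savings can be sketched with back-of-the-envelope arithmetic. The per-token price and call volume below are placeholders, not real Qwen API rates; the token counts come from the ranges cited in this article:

```python
# Hypothetical pricing and traffic -- illustrative only.
PRICE_PER_MTOK = 2.00          # placeholder $ per 1M output tokens
CALLS_PER_DAY = 100_000

think_tokens_before = 3_000    # within the ~2000-5000 range cited below
think_tokens_after = think_tokens_before / 22   # ~22x reduction

def daily_think_cost(tokens_per_call: float) -> float:
    """Daily spend attributable to think tokens alone."""
    return tokens_per_call * CALLS_PER_DAY * PRICE_PER_MTOK / 1_000_000

before = daily_think_cost(think_tokens_before)
after = daily_think_cost(think_tokens_after)
print(f"before: ${before:,.2f}/day, after: ${after:,.2f}/day")
```

At these placeholder numbers, the daily think-token bill drops from $600 to about $27.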
Significance for Qwen Ecosystem
Qwen3.5/3.6’s thinking mode is a double-edged sword: enabling it significantly boosts reasoning capability, but the token consumption deters many users. This solution essentially “unlocks” the practicality of thinking mode without modifying model weights — just by constraining output at inference time.
Performance Comparison
| Metric | Thinking Mode (Original) | + Grammar Constraint |
|---|---|---|
| Think Token Count | ~2000-5000 | ~100-250 |
| Accuracy | Baseline | Essentially unchanged |
| Response Time | Long (waiting for many think tokens) | Short |
| API Cost | High | Dramatically reduced |
How to Get Started
- Use an inference engine that supports Grammar constraints, such as llama.cpp (GBNF grammars), vLLM (guided decoding), or Ollama
- Inject the Grammar rule into your requests
- Compare token consumption and accuracy before and after enabling
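As a concrete sketch of step two, llama.cpp's HTTP server accepts a `grammar` field on its `/completion` endpoint. The prompt is made up for illustration, and the grammar's closing productions are assumptions (the original post shows only the first two rules):

```python
import json

# GBNF rule from the post; `line` and `code` productions are assumed.
GRAMMAR = r"""
root  ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: " line "</think>\n"
line  ::= [^\n]+ "\n"
code  ::= [^\x00]*
"""

# Payload for llama.cpp's /completion endpoint.
payload = {
    "prompt": "Write a function that merges two sorted lists.",
    "n_predict": 512,
    "grammar": GRAMMAR,
}

# POST this with any HTTP client, e.g.:
#   requests.post("http://localhost:8080/completion", json=payload)
print(json.dumps(payload)[:60])
```

Because the constraint lives entirely in the request, before/after comparisons (step three) only require toggling the `grammar` field.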
For teams already deploying Qwen3.5/3.6 in production, this solution can be implemented at near-zero cost — no model retraining needed, just a change in inference configuration.
Landscape Assessment
This reflects a broader trend: inference-time optimization is becoming as important as model training. Rather than spending months retraining a model that “doesn’t overthink,” a few dozen lines of rules can constrain output at inference time.
Similar approaches may expand to more scenarios in the future: controlling output length, constraining reasoning style, guiding structured responses, and more. The Qwen ecosystem is leading the way in this area.