ChaoBro

Qwen3.6 Heretic 35B: Community Fine-Tune Cuts Refusals, Runs on RTX 4090


Bottom Line

Qwen3.6 Heretic 35B is the hottest community fine-tune right now. Based on Alibaba’s Qwen3.6-35B, it significantly reduces safety refusal rates while maintaining the original model’s intelligence level. Quantized versions run on consumer-grade RTX 3090/4090 GPUs with 260K context for Agent tasks.

What Happened

In late April, the community released Qwen3.6 Heretic 35B, a targeted fine-tune of the Qwen3.6-35B base model. Key specs:

| Dimension | Qwen3.6-35B Original | Qwen3.6 Heretic 35B |
|---|---|---|
| Intelligence | Baseline | Maintained |
| Safety Refusal Rate | High | Significantly reduced |
| Max Context | 260K tokens | 260K tokens |
| Hardware | Multi-GPU / A100 | RTX 3090/4090 (quantized) |
| Agent Tool Use | Supported | Smoother |
| License | Open | Open |

On the DGX-Spark leaderboard, quantized builds of Qwen3.6-35B hit inference speeds of 95, 92, and 73 tokens/s, outperforming gpt-oss-120B and gemma4-26B.

Why “Fewer Refusals” Matters

For developers, the original Qwen3.6 triggers excessive safety refusals on edge cases, which is fatal in Agent workflows:

  • Code Generation: System-level or network request code gets refused
  • Data Processing: Data cleaning tasks with sensitive field names get blocked
  • Agent Tool Calling: Certain MCP tool parameter combinations trigger safety filters

Heretic dramatically reduces these “false positives” through community fine-tuning, without degrading core capabilities:

  1. More stable Agent workflows: Fewer task interruptions from refusals
  2. Better debugging: No need to rewrite prompts to bypass safety filters
  3. Local deployment friendly: Consumer GPUs suffice, no cloud API needed
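The first point can be illustrated with a toy agent loop. This is a sketch built on assumptions, not Heretic's internals: the refusal pattern and step replies are made up, but it shows why a model's refusal rate maps directly onto its task-interruption rate.

```python
# Toy sketch (assumptions, not Heretic internals): an agent loop where any
# refusal-looking reply aborts the run, so the refusal rate translates
# directly into the task-interruption rate.
import re

REFUSAL_RE = re.compile(r"\bI can'?t (help|assist)\b", re.I)

def run_workflow(step_replies):
    """Execute steps in order; stop at the first refusal-looking reply."""
    completed = []
    for reply in step_replies:
        if REFUSAL_RE.search(reply):
            return completed, "interrupted"  # one refusal kills the whole run
        completed.append(reply)
    return completed, "done"

# The same 3-step task under two hypothetical models: one refuses step 2.
base = ["plan ok", "I can't assist with that request.", "unused step"]
tuned = ["plan ok", "requests.get(...) stubbed", "report written"]

print(run_workflow(base)[1])   # interrupted
print(run_workflow(tuned)[1])  # done
```

In a real agent framework the "steps" would be model turns and tool calls, but the failure mode is the same: one false-positive refusal mid-chain aborts or derails the whole task.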

Deployment Guide

Quantization Options

| Format | VRAM | Speed | Precision Loss |
|---|---|---|---|
| Q4_K_M | ~20GB | 95 tps | Minimal |
| Q5_K_M | ~22GB | 92 tps | Negligible |
| Q6_K | ~26GB | 73 tps | Almost none |
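
The VRAM column can be sanity-checked with back-of-the-envelope bits-per-weight arithmetic. The bits-per-weight values below are rough community estimates for llama.cpp K-quants (assumptions, not measured figures):

```python
# Back-of-the-envelope weight memory for a 35B model at common GGUF quant
# levels. Bits-per-weight values are rough community estimates (assumptions);
# real VRAM use adds KV cache, activations, and runtime overhead.
PARAMS = 35e9
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q6_K": 6.6}  # approx bits per weight

for fmt, bpw in BPW.items():
    weight_gb = PARAMS * bpw / 8 / 1e9  # weights only, decimal GB
    print(f"{fmt}: ~{weight_gb:.1f} GB of weights")
```

The estimates land within a few GB of the table; the gap comes from per-layer quant mixes, file metadata, and GB-vs-GiB accounting, plus whatever the runtime allocates for KV cache on top of the weights.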

On a 24GB card (RTX 4090 or RTX 3090), use Q4_K_M or Q5_K_M; Q6_K's ~26GB footprint exceeds their VRAM. Runtime options:

  • LM Studio: Auto-discovers models, zero-config loading
  • Ollama: one command: `ollama run qwen3.6-heretic-35b`
  • vLLM: Production deployment, high concurrency

Landscape Assessment

Qwen3.6 Heretic reflects two trends:

  1. Community fine-tune ecosystem maturing: The last mile from “usable” to “great” is filled by the community
  2. Consumer GPU inference going mainstream: 35B-class models now run smoothly on single consumer GPUs

Compared to peers:

  • Kimi K2.6 (1T MoE, 32B active) focuses on Agent swarm capabilities
  • DeepSeek-V4-Pro wins on API cost-effectiveness
  • Qwen3.6 Heretic differentiates on local deployment + low refusal rate

Action Items

  • RTX 3090/4090 owners: Deploy now, replace your existing Qwen3.6 base
  • Agent developers: Heretic is more stable in tool-calling scenarios
  • Enterprise users: Note Heretic is a community fine-tune with adjusted safety policies — assess compliance risk
  • A/B test: Compare with original Qwen3.6-35B in your specific use cases
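
A minimal A/B harness for the last item might count refusal-looking replies over a shared prompt set. The refusal heuristic and the canned replies below are illustrative placeholders (assumptions); in practice, `replies_a` and `replies_b` would come from running the same prompts through the original model and the Heretic fine-tune on your local endpoint.

```python
# Hypothetical A/B sketch: compare refusal rates of two models on the same
# prompt set. The refusal heuristic and sample replies are illustrative
# assumptions; wire replies_a/replies_b to your own inference endpoint.
import re

REFUSAL_RE = re.compile(
    r"\b(i can'?t|i cannot|i won'?t)\b.*\b(help|assist|provide)\b", re.I
)

def refusal_rate(replies):
    """Fraction of replies that match the (assumed) refusal pattern."""
    hits = sum(1 for r in replies if REFUSAL_RE.search(r))
    return hits / len(replies)

replies_a = [  # e.g. collected from the original Qwen3.6-35B
    "I can't help with generating raw socket code.",
    "Here is the cleaned dataframe.",
    "I cannot assist with that request.",
    "def scan(host): ...",
]
replies_b = [  # e.g. collected from the Heretic fine-tune
    "import socket\n...",
    "Here is the cleaned dataframe.",
    "Here is the MCP tool call.",
    "def scan(host): ...",
]

print(f"A: {refusal_rate(replies_a):.0%}  B: {refusal_rate(replies_b):.0%}")
```

A keyword heuristic like this undercounts soft refusals ("Instead, consider..."), so for a serious comparison you would want a judge model or manual review on a sample, but it is enough to spot a large gap between the two models.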