Core Finding
In a recursive self-optimization experiment, a user running Qwen3.6:27b improved inference speed from 2.3 tok/s to 84.3 tok/s over 26 hours, a speedup of more than 36x. The hardware was not a GPU cluster but a standard home server.
Experiment Environment
| Component | Configuration |
|---|---|
| CPU | 24 threads |
| Memory | 93 GiB RAM |
| GPU | AMD Radeon RX 9060 XT, 16 GB VRAM |
| Model | Qwen3.6:27b |
| Optimization Method | Recursive self-optimization loop |
| Total Time | 26 hours |
Key Details
At the start, the model ran at 2.3 tok/s, a typical speed for CPU-bound inference at this model size. The user then had Qwen3.6 “optimize itself” in a recursive loop: the model analyzed its runtime environment, detected that no NVIDIA GPU was present (only CPU, RAM, and an AMD graphics card), and then proposed and applied targeted optimizations.
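The write-up does not include the user's actual loop, but the pattern it describes (probe the environment, ask the model for a tuning suggestion, apply it, re-benchmark, keep what helps) is straightforward to sketch. Here is a minimal sketch in Python, assuming a local Ollama-style server; the prompts, helper names, and iteration count are illustrative, not taken from the experiment:

```python
# Minimal sketch of a recursive self-optimization loop against a local
# Ollama server. Prompts and iteration count are illustrative.
import json
import time
import urllib.request

API = "http://localhost:11434/api/generate"  # Ollama's generate endpoint
MODEL = "qwen3.6:27b"

def ask(prompt: str) -> str:
    """Send a prompt to the local model and return the full response text."""
    data = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(API, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def benchmark() -> float:
    """Rough throughput estimate: whitespace-split words per second."""
    start = time.time()
    reply = ask("Summarize the rules of chess in one paragraph.")
    return len(reply.split()) / (time.time() - start)

env = "24 CPU threads, 93 GiB RAM, AMD GPU with 16 GB VRAM, no NVIDIA GPU"
best = benchmark()
for step in range(10):  # the original experiment iterated for ~26 hours
    advice = ask(
        f"You are running locally on: {env}. Current speed: {best:.1f} tok/s. "
        "Propose one concrete runtime optimization to try next."
    )
    print(f"[iter {step}] suggestion: {advice[:100]}")
    # Apply the suggestion out of band (quantization, GPU offload, thread
    # count, ...), restart the server, then re-measure and keep improvements.
    best = max(best, benchmark())
print(f"Best observed: {best:.1f} tok/s")
```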
Significance of This Experiment
For Open-Source Model Ecosystem
Qwen3.6 27B is already a powerful open-source model (Intelligence Index score of 46, #1 among open-source models under 150B parameters), but this experiment reveals another dimension of potential: models can not only reason, but also optimize their own reasoning process.
This marks a shift from “passive usage” to “active adaptation” for open-source models. The model is not just deployed into an environment and run; it can perceive that environment and tune itself.
Implications for Local Deployment
Many users hit performance bottlenecks when deploying large models locally, and the first reaction is usually “I need a better GPU.” This experiment suggests that, with the right optimization strategy, existing consumer-grade hardware can reach usable inference speeds.
84.3 tok/s approaches the response speed of many cloud APIs, meaning that, for individual users, local deployment is no longer a “usable but slow” compromise.
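The write-up does not say which specific changes the loop made. On a machine like this, the usual levers are quantization level, the number of model layers offloaded to the GPU, and CPU thread count. Below is a sketch of passing such settings through Ollama's generate API; the parameter names (num_thread, num_gpu, num_ctx) are Ollama's documented runtime options, but the values are guesses, not the experiment's actual configuration:

```python
# Illustrative tuning via Ollama's runtime options; the values below are
# guesses for this hardware, not the settings from the experiment.
import json
import urllib.request

data = json.dumps({
    "model": "qwen3.6:27b",
    "prompt": "Hello",
    "stream": False,
    "options": {
        "num_thread": 24,  # match the server's 24 CPU threads
        "num_gpu": 32,     # layers offloaded to the 16 GB AMD card
        "num_ctx": 4096,   # modest context window to cut memory pressure
    },
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate", data=data,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```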
Cost Comparison
| Solution | Hardware Cost | Inference Speed | Ongoing Cost |
|---|---|---|---|
| Cloud API (Qwen3.6 Max) | $0 | Very high | Per-token billing |
| Cloud API (Claude Opus 4.7) | $0 | Very high | $25 / 1M output tokens |
| Local (before optimization) | ~$2,500 (server) | 2.3 tok/s | Electricity |
| Local (after optimization) | ~$2,500 (server) | 84.3 tok/s | Electricity |
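A back-of-envelope break-even follows directly from the table's own numbers (ignoring electricity): at $25 per million output tokens, the ~$2,500 server pays for itself after roughly 100 million output tokens, which the optimized setup could generate in about two weeks of continuous use.

```python
# Break-even calculation using only the table's numbers; ignores
# electricity and assumes the $25/1M-output-token rate quoted above.
hardware_cost = 2500              # USD, local server
price_per_token = 25 / 1_000_000  # USD per output token (Claude Opus 4.7 row)

breakeven_tokens = hardware_cost / price_per_token
print(f"Break-even at {breakeven_tokens:,.0f} output tokens")  # 100,000,000

days = breakeven_tokens / 84.3 / 86_400  # at the post-optimization speed
print(f"~{days:.1f} days of continuous generation")            # ~13.7 days
```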
Action Recommendations
- Users with AMD GPUs: This experiment demonstrates that large models can run well on AMD hardware; mainstream runtimes such as llama.cpp and Ollama support AMD cards via ROCm or Vulkan backends. If you have an AMD card with 16 GB+ of VRAM, it is worth trying.
- Qwen3.6 users: Try having the model self-diagnose and optimize its own deployment; you may get unexpected performance improvements (see the sketch after this list).
- Watch the recursive-optimization trend: This is an important direction for the open-source model ecosystem, with models that can not only reason but also optimize their own execution. Expect more automation tooling to emerge here.
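As a concrete starting point for the self-diagnosis suggestion above, here is a sketch that feeds the model a basic report about its own host and asks for tuning advice. The environment probe uses only Python's standard library, and the prompt wording is illustrative, not a known-good recipe:

```python
# Sketch of the "self-diagnose after deployment" recommendation above.
# Gathers basic host facts with the stdlib and asks the local model for
# tuning advice; the prompt wording is illustrative.
import json
import os
import platform
import urllib.request

env_report = (
    f"OS: {platform.system()} {platform.release()}, "
    f"CPU threads: {os.cpu_count()}, arch: {platform.machine()}"
)
data = json.dumps({
    "model": "qwen3.6:27b",
    "prompt": (
        f"You are served locally via Ollama on this host: {env_report}. "
        "Diagnose likely inference bottlenecks and recommend runtime settings."
    ),
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate", data=data,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```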