Key Conclusion
The focus of AI infrastructure competition is fundamentally shifting from GPU compute cores to HBM (High Bandwidth Memory) capacity and bandwidth. This is based on two key signals:
- Wuhan 260B RMB storage expansion: YMTC Phase 3 plus the Wuhan Xinxin expansion, targeting 3D NAND and DRAM, with mass production expected by the end of 2026
- Token economics first principles: GPU architecture evolution points to per-GPU HBM demand growing exponentially, with no sign of stopping
Why HBM Is the New Bottleneck
In AI inference and training, GPU compute is no longer the limiting factor. The real bottleneck is the speed of data movement from memory to compute units.
First principles derivation: in memory-bound decoding, every generated token must stream the model weights through HBM, so
Token throughput ≈ HBM bandwidth / (parameter count × bytes per parameter)
HBM capacity does not raise this ceiling directly, but it bounds the model size, batch size, and context length that can be held at all, and batch size is what converts raw bandwidth into aggregate throughput.
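To make the ceiling concrete, here is a minimal back-of-envelope sketch in Python. All numbers (an 8B-parameter model in FP16, roughly HBM3-class bandwidth) are illustrative assumptions, not vendor specifications:

```python
# Back-of-envelope decode throughput for a memory-bound model.
# All numbers below are illustrative assumptions, not vendor specs.

hbm_bandwidth_gb_s = 3_350   # assumed HBM3-class bandwidth, GB/s
params_billion = 8           # hypothetical 8B-parameter model
bytes_per_param = 2          # FP16/BF16 weights

# Weight footprint in GB (billions of params times bytes each):
weight_gb = params_billion * bytes_per_param

# At batch size 1, every generated token streams all weights through HBM once,
# so bandwidth divided by weight bytes bounds the token rate:
tokens_per_s = hbm_bandwidth_gb_s / weight_gb

print(f"Weights: {weight_gb} GB")
print(f"Decode ceiling: ~{tokens_per_s:.0f} tokens/s per GPU at batch 1")
```

Larger batches amortize the weight streaming across more tokens per pass, which is exactly why extra HBM capacity (room for more concurrent sequences and their KV caches) translates into higher aggregate throughput.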
Why HBM Demand Won’t Stop
| Driver | Explanation | Impact |
|---|---|---|
| Model size growth | Frontier model parameters continue growing | Single GPU needs more HBM capacity |
| Context length expansion | 1M-token context becoming standard | KV cache consumes large amounts of HBM (see the sketch after this table) |
| Multimodal input | Images/video/audio processed simultaneously | Intermediate activations explode |
| Agent workflows | Multi-round tool calls maintain state | HBM usage accumulates during inference |
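To put a number on the KV cache row above, here is a rough sizing sketch. The layer count, KV-head count, and head dimension are assumed values for a generic grouped-query-attention model, not any specific architecture:

```python
# Rough KV cache footprint for long-context decoding.
# Layer count, KV-head count, and head dim are assumed, not a real model's config.

n_layers = 32
n_kv_heads = 8            # grouped-query attention (assumed)
head_dim = 128
bytes_per_value = 2       # FP16
context_tokens = 1_000_000

# Per token, each layer stores one K and one V vector per KV head:
kv_bytes_per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_value
kv_cache_gb = kv_bytes_per_token * context_tokens / 1e9

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache at 1M tokens: ~{kv_cache_gb:.0f} GB per sequence")
```

Under these assumptions a single 1M-token sequence needs on the order of 131 GB of HBM for its KV cache alone, more than an entire 80 GB accelerator, before counting the model weights.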
Investment & Action Recommendations
For the chip industry
- The HBM supply chain is a more certain growth track than GPU chips: every GPU vendor needs HBM, yet production capacity is concentrated in three companies (SK hynix, Samsung, and Micron)
For AI application developers
- Consider HBM requirements when choosing models: bigger isn't always better if the model overflows HBM and weights must be paged in from host memory
- True cost of 1M context: long context doesn't just consume more tokens; it also demands far more HBM for the KV cache, as the sketch below illustrates
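A hypothetical fit check, reusing the per-token KV figure from the earlier sketch. The hbm_footprint_gb helper and all of its inputs are illustrative assumptions, not a published sizing rule:

```python
# Hypothetical fit check: do weights + KV cache fit in available HBM?
# hbm_footprint_gb and every input are illustrative assumptions.

def hbm_footprint_gb(params_billion, bytes_per_param, kv_bytes_per_token,
                     context_tokens, batch_size, overhead_frac=0.1):
    """Estimate HBM needed in GB: weights + KV cache, plus an overhead margin."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    kv_gb = kv_bytes_per_token * context_tokens * batch_size / 1e9
    return (weights_gb + kv_gb) * (1 + overhead_frac)

needed_gb = hbm_footprint_gb(params_billion=8, bytes_per_param=2,
                             kv_bytes_per_token=131_072,  # from the earlier sketch
                             context_tokens=1_000_000, batch_size=1)
available_gb = 80  # e.g. one 80 GB accelerator (assumed)

print(f"Needed: ~{needed_gb:.0f} GB, available: {available_gb} GB, "
      f"fits: {needed_gb <= available_gb}")
```

Even a small 8B model blows past a single 80 GB device once the full 1M-token KV cache is counted, which is the sense in which long context is paid for in HBM, not just in tokens.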
For investors
- Storage semiconductor expansion is the "second wave" of AI infrastructure investment: the first wave funded GPUs; this one funds HBM and storage