ChaoBro

omlx: Shoving LLM Inference Into the macOS Menu Bar — SSD Caching + Continuous Batching Redefine Local AI Experience

Running an LLM inference server from the menu bar — the idea itself is very "Mac."

omlx gained 1,362 new stars this week, bringing the total to 14.3k. Its 1,204 forks suggest people are not just watching but actually trying it.

Three Core Technical Points

SSD Caching: Apple Silicon's unified memory architecture has an inherent bottleneck: when large model weights don't fit in RAM, performance drops off a cliff. omlx uses the SSD as a second-level cache, which sharply reduces the cost of loading and unloading models. Simply put: a 16GB MacBook can now run 30B-parameter models, with model-switching times in an acceptable range.
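
To make the "second-level cache" idea concrete, here is a minimal sketch of a two-tier cache in Python. It is not omlx's implementation: the class name TwoTierCache, the LRU eviction policy, and the use of pickle files in a temporary directory as the "SSD" tier are all assumptions chosen for illustration. The point is only that evicted entries are spilled to disk rather than discarded, so bringing them back is a disk read instead of a full reload.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TwoTierCache:
    """Hypothetical RAM + disk cache: hot entries stay in memory, cold ones spill to disk."""

    def __init__(self, ram_capacity, spill_dir):
        self.ram_capacity = ram_capacity   # max number of entries kept in RAM
        self.spill_dir = spill_dir         # directory standing in for the SSD tier
        self.ram = OrderedDict()           # key -> value, ordered from oldest to newest use

    def _path(self, key):
        # Assumes keys are safe to use as file names.
        return os.path.join(self.spill_dir, f"{key}.bin")

    def put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)          # mark as most recently used
        while len(self.ram) > self.ram_capacity:
            old_key, old_value = self.ram.popitem(last=False)  # least recently used
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_value, f)  # spill to disk instead of discarding

    def get(self, key):
        if key in self.ram:                # tier 1 hit
            self.ram.move_to_end(key)
            return self.ram[key], "ram"
        path = self._path(key)
        if os.path.exists(path):           # tier 2 hit: reload and promote to RAM
            with open(path, "rb") as f:
                value = pickle.load(f)
            os.remove(path)
            self.put(key, value)
            return value, "ssd"
        return None, "miss"


def demo():
    with tempfile.TemporaryDirectory() as d:
        cache = TwoTierCache(ram_capacity=2, spill_dir=d)
        for name in ["model_a", "model_b", "model_c"]:
            cache.put(name, f"weights of {name}")
        # model_a was evicted to disk when model_c arrived.
        return [cache.get(k)[1] for k in ["model_c", "model_a", "model_z"]]


if __name__ == "__main__":
    print(demo())
```

In this toy version the cost of a tier-2 hit is one file read, which is why the speed of the SSD matters so much for how the real thing feels.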

Continuous Batching: Not a new concept; vLLM popularized it on the server side. But omlx may be the first to bring continuous batching to a menu-bar tool on the Mac. Multiple concurrent requests are scheduled together instead of queuing serially.
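
The difference from ordinary batching is easiest to see in a toy simulation. The sketch below is not omlx's scheduler; the function names and the one-token-per-step model are assumptions for illustration. It counts decode steps for a list of requests, where each number is how many tokens that request still has to generate.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Steps needed when each batch must fully finish before the next one starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # A batch runs for as long as its longest request.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching(lengths, batch_size):
    """Steps needed when freed slots are refilled from the queue at every decode step."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Admit new requests into any free slots before the next step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decode step produces one token per active request
        # Requests with one token left have just finished; drop them.
        active = [r - 1 for r in active if r > 1]
    return steps


if __name__ == "__main__":
    requests = [8, 2, 2, 8]  # tokens to generate per request (each must be >= 1)
    print("static:", static_batching(requests, batch_size=2))
    print("continuous:", continuous_batching(requests, batch_size=2))
```

With these inputs the static schedule needs 16 steps and the continuous one needs 12, because short requests free their slot early and the next request starts right away instead of waiting for the longest one in the batch.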

Menu Bar Management: This seems minor but is actually the key UX differentiator. Instead of opening a terminal to run commands, you control model loading, switching, and monitoring from the menu bar. Whether local inference tools get adopted by non-technical users depends more on this interaction detail than on benchmark numbers in papers.

Comparison with Ollama / LM Studio

Three mainstream choices for Mac users running LLMs locally:

Tool      | Core Positioning         | Interface      | SSD Cache | Continuous Batching
Ollama    | General inference server | CLI + API      |           | ✅ (partial)
LM Studio | Desktop GUI              | GUI            |           |
omlx      | Menu bar server          | Menu bar + API | ✅        | ✅

omlx's differentiation is clear: of the three, it is the only tool combining SSD caching and continuous batching on Mac.

But Ollama's ecosystem advantage is massive: model support, community docs, tool integration. What omlx needs to catch up on is not technology but ecosystem.

Inferred Practical Experience

I have not run omlx on an M-series chip myself, so here's what the architecture suggests:

  • 16GB M2/M3: SSD caching is critical. Without it, models above 7B barely run. With it, 13B-30B are possible, speed depending on SSD read/write.
  • 32GB+ M2/M3 Max: RAM is sufficient, so the SSD cache's marginal benefit drops, but continuous batching is still valuable for concurrent API requests.
  • Ultra-class chips: RAM is plentiful, so omlx's value lies more in menu bar interaction and the API service.
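
The tiers above follow from simple arithmetic on weight size. The sketch below estimates the weights-only footprint as parameters times bits per weight divided by 8. The function names and the 4 GB reserved for macOS and other apps are assumptions for illustration, and the estimate ignores the KV cache, activations, and quantization overhead, so real usage is higher.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_ram(params_billion, bits_per_weight, ram_gb, reserved_gb=4):
    """True if the weights fit in RAM after reserving some for the OS (assumed 4 GB)."""
    return weight_footprint_gb(params_billion, bits_per_weight) <= ram_gb - reserved_gb


if __name__ == "__main__":
    for params in (7, 13, 30):
        size = weight_footprint_gb(params, bits_per_weight=4)
        print(f"{params}B at 4-bit: {size} GB weights, "
              f"fits in 16 GB: {fits_in_ram(params, 4, 16)}, "
              f"fits in 32 GB: {fits_in_ram(params, 4, 32)}")
```

Under these assumptions a 4-bit 30B model needs about 15 GB for weights alone, which is why it cannot sit fully in RAM on a 16GB machine and why a disk tier becomes the deciding factor there, while a 32GB machine has room to spare.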

Why It's Worth Following

omlx doesn't solve "can it run"; Ollama already proved that running LLMs on a Mac is feasible. It solves "can it be used comfortably."

Menu bar interaction lowers the usage barrier, SSD caching raises the upper limit on runnable model size, and continuous batching improves throughput in service scenarios. Together, these three features point to a clear product direction: making the Mac a first-class citizen for local AI inference.

14.3k stars shows people are buying into this direction. But how many of the 1,204 forks are actually using it in production remains to be seen.

If Apple announces more Neural Engine support for LLM inference at the next WWDC, tools like omlx will see their value amplified further.