I built an open-source LLM inference server optimized for Apple Silicon. The main motivation was coding agents - tools like Claude Code send requests where the context prefix keeps shifting, invalidating KV cache. A few turns later the agent circles back, and your Mac has to re-prefill the entire context from scratch.
oMLX solves this with paged SSD caching. Every KV cache block is persisted to disk. When a previous prefix returns, it's restored instantly instead of being recomputed. This makes long coding sessions significantly faster.
It also supports continuous batching for concurrent requests, multi-model serving (LLM + embedding + reranker) with LRU eviction, block-level KV cache with prefix sharing and copy-on-write, OpenAI and Anthropic compatible APIs, and tool calling.
Ships as a signed macOS menubar app with a web dashboard.
GitHub: https://github.com/jundot/omlx