Show HN: oMLX – Native Mac inference server that persists KV cache to SSD

1 point

4 months ago

I built an open-source LLM inference server optimized for Apple Silicon. The main motivation was coding agents - tools like Claude Code send requests where the context prefix keeps shifting, invalidating KV cache. A few turns later the agent circles back, and your Mac has to re-prefill the entire context from scratch.

oMLX solves this with paged SSD caching. Every KV cache block is persisted to disk. When a previous prefix returns, it's restored instantly instead of being recomputed. This makes long coding sessions significantly faster.

It also supports continuous batching for concurrent requests, multi-model serving (LLM + embedding + reranker) with LRU eviction, block-level KV cache with prefix sharing and copy-on-write, OpenAI and Anthropic compatible APIs, and tool calling.

Ships as a signed macOS menubar app with a web dashboard.

GitHub: https://github.com/jundot/omlx

No comments

No comments