oMLX: Tiered KV Caching for Local LLMs: Memory and SSD Persistence for Agent Context

Local LLM inference servers typically force a choice: either keep everything in memory and hit capacity limits, or evict context and re-compute it on every request. oMLX (15,948 stars, trending #13 on GitHub Python) introduces a two-tier KV cache that splits the difference. Hot keys stay in memory, cold keys move to SSD, and the entire cache persists across sessions. The result is continuous batching without full re-computation when you switch tasks, swap models, or resume a conversation hours later.

This matters for agentic workflows where context changes mid-conversation. A coding agent might start with a file diff, call a tool to fetch logs, then return to the original file. Without persistent caching, each context switch triggers a full re-encode. With tiered caching, the server reuses cached keys from SSD, paying only the I/O cost instead of the compute cost.

Architecture: Hot Memory, Cold SSD, and Eviction Policy

oMLX runs as a Python inference server on Apple Silicon, using the MLX backend for Metal acceleration. The KV cache is split into two tiers:

Hot tier: In-memory cache for recently accessed keys. Latency is sub-millisecond.
Cold tier: SSD-backed cache for evicted keys. Latency is 5-20ms depending on disk speed.

When memory pressure rises, the server evicts least-recently-used keys from the hot tier to the cold tier. The eviction policy is simple LRU with a configurable memory threshold. When a request needs a cold key, the server fetches it from SSD and promotes it back to hot.

The cache persists across server restarts. On startup, the server scans the cold tier directory and rebuilds the index. This means you can shut down the server, reboot your machine, and resume a conversation without re-computing the entire context.

Continuous Batching vs. Request Queuing

Continuous batching is not the same as request queuing. In a queued system, each request waits for the previous one to finish. In a continuous batching system, the server interleaves token generation across multiple requests, filling GPU cycles that would otherwise idle.

oMLX implements continuous batching by maintaining a shared KV cache across all active requests. When a new request arrives, the server checks if any prefix of the prompt already exists in the cache. If it does, the server reuses those keys and only computes the delta. This works because the cache is keyed by token sequence, not by request ID.

The benefit is higher throughput when multiple requests share context. For example, if three users are all working on the same codebase, the server caches the file contents once and reuses them across all three conversations.

Model Swapping and Auto-Load

The macOS menu-bar app lets you pin models in memory or configure them to auto-swap on demand. Pinned models stay loaded until you explicitly unload them. Auto-swap models load when a request arrives and unload after a configurable idle timeout.

Model swapping does not evict the KV cache. When you swap from Llama 3.1 to Qwen 2.5, the cache persists. If you swap back to Llama 3.1, the server reuses the cached keys. This works because the cache is model-agnostic: it stores token IDs and attention states, not model-specific embeddings.

The trade-off is cache invalidation. If you change the tokenizer or switch to a model with a different vocabulary, the cache becomes stale. oMLX does not automatically detect this. You must manually clear the cache or accept the risk of garbage output.

The macOS app is a Swift wrapper around the Python inference server. Communication happens over a local REST API on http://127.0.0.1:8000. The app sends OpenAI-compatible requests to the server and displays the responses in a native UI.

State management is split:

Server state: Active models, KV cache size, memory usage, request queue depth.
App state: User preferences, model pinning rules, auto-update status.

The app polls the server every 2 seconds for metrics and updates the menu bar icon with a visual indicator of memory pressure. If memory usage exceeds 80%, the icon turns yellow. If it exceeds 95%, the icon turns red and the app suggests evicting a model.

The CLI shim at ~/.omlx/bin/omlx is a thin wrapper that forwards commands to the running server. If the server is not running, the shim starts it in the background and waits for the REST API to become available.

Latency Trade-Offs: Memory vs. SSD

The following table shows the latency and capacity trade-offs between hot and cold tiers.

Tier	Latency	Capacity	Persistence
Hot (memory)	<1ms	8-64 GB	Volatile
Cold (SSD)	5-20ms	500GB-4TB	Persistent

The latency gap between hot and cold is significant. A 20ms fetch from SSD adds noticeable delay to the first token of a response. However, this is still faster than re-computing the entire context, which can take seconds for long prompts.

The optimal split depends on your workload. If you frequently switch between tasks, a larger cold tier makes sense. If you stay in a single conversation for hours, a larger hot tier reduces SSD I/O.

oMLX does not expose fine-grained control over eviction policy. You can set the memory threshold, but you cannot pin specific keys or prioritize certain requests. This keeps the implementation simple but limits optimization for advanced use cases.

Failure Modes and Cache Corruption

The most common failure mode is cache corruption after a crash. If the server dies mid-write, the cold tier may contain partial keys. On restart, the server attempts to validate each cached entry by checking a checksum. Invalid entries are discarded.

Another failure mode is cache bloat. If you never clear the cold tier, it will grow indefinitely. oMLX does not implement automatic garbage collection. You must manually delete old cache files or set a cron job to prune entries older than N days.

A third failure mode is tokenizer mismatch. If you update a model to a new version with a different tokenizer, the cache becomes invalid. oMLX does not detect this automatically. You must clear the cache manually or accept the risk of incorrect output.

Code Example: Querying the Cache Metrics API

import requests

try:
    response = requests.get("http://127.0.0.1:8000/v1/cache/metrics", timeout=5)
    response.raise_for_status()
    metrics = response.json()

    print(f"Hot tier size: {metrics['hot_size_mb']} MB")
    print(f"Cold tier size: {metrics['cold_size_mb']} MB")
    print(f"Cache hit rate: {metrics['hit_rate']:.2%}")
    print(f"Eviction count: {metrics['evictions']}")
except requests.exceptions.RequestException as e:
    print(f"Failed to fetch cache metrics: {e}")

The /v1/cache/metrics endpoint returns real-time statistics about the KV cache. You can use this to monitor memory pressure and decide when to evict models or clear the cache.

Deployment Shape

oMLX is designed for single-user, single-machine deployments. It does not support distributed caching or multi-node inference. If you need to scale beyond one machine, you must run multiple independent servers and implement your own load balancing.

The macOS app is the primary interface, but the Python server can run standalone on Linux or in a Docker container. The CLI shim works on any platform, but the menu-bar UI is macOS-only.

Technical Verdict

Use oMLX if:

Your agent context exceeds available RAM by 2-10x but fits on SSD. The tiered cache lets you persist 100GB+ of context without keeping it all in memory.
You need to resume multi-turn conversations without re-computing embeddings. The cold tier survives restarts and model swaps.
You run local LLMs on Apple Silicon and value a native macOS UI over CLI-only tools.
Your workload fits on a single machine and does not require distributed inference.

Avoid oMLX if:

You require sub-5ms latency on all cache accesses. The cold tier adds 5-20ms per fetch, which compounds on long prompts.
You need automatic cache invalidation when tokenizers or model architectures change. oMLX does not detect stale cache entries.
You run inference across multiple machines or need distributed caching. oMLX is single-node only.
You run on non-Apple hardware and need GPU acceleration beyond Metal. The MLX backend is Apple Silicon-specific.

The tiered caching strategy is the key innovation. It makes local LLMs practical for workflows where context changes frequently, without forcing you to choose between memory limits and re-computation overhead. The macOS app adds convenience, but the real value is in the persistence layer.

Source Links

Primary repository: github.com/jundot/omlx
Benchmarks: omlx.ai/benchmarks