What is MemKV?
High-performance distributed inference context memory store — bridging GPU HBM and NVMe for long-context LLM inference.
As inference models scale to longer contexts and higher concurrency, context memory becomes the bottleneck — HBM fills, prefill dominates, throughput collapses. MemKV is a distributed shared context-memory store at the G3.5 layer that bridges GPU memory and NVMe, scaling linearly as servers are added.
For workstation- and laptop-class deployments (Mac mini, Mac Studio, single-node servers running llama.cpp), MemKV ships as a single kv_store_v1 plugin that the inference server loads via dlopen. No RDMA, no NIXL, no transfer-library glue. The same MemKV cluster also serves NVIDIA-stack deployments over NIXL, so one chunk pool is shareable across both worlds.
Key Capabilities
- Zero-copy RDMA — Direct transfer between client host memory and NVMe via DC transport (RC fallback for non-Mellanox NICs)
- RDMA-native control plane — When an HCA is present, control messages (Allocate, Lookup, Commit, Delete, Exists, Read, Write, BatchRead, BatchWrite) ride RC SEND/RECV on the per-connection QP. The bootstrap (Connect) goes over a long-lived TCP connection because the RC QP doesn't exist yet; once the QP is up, control switches to RDMA.
- First-class TCP transport — RDMA isn't always reachable end to end (routed or multi-hop fabrics, cloud, mixed NICs). For those, MemKV runs the full data path over TCP rather than degrading to a one-request-at-a-time fallback: the client opens a pool of connections per server and pipelines batched reads and writes across them, so a single server's traffic spreads over many flows and sustains high throughput.
- macOS support — Runs co-located with an inference engine on a Mac, using file-mode storage and TCP because RDMA, JBOF, and hugepages support aren't compiled in there; useful for laptop and Mac mini dev work
- HMAC-SHA256 authentication — Every wire message is signed with a shared key; there is no unauthenticated mode
- Native plugin for inference engines — A vendor-neutral kv_store_v1 plugin loads directly into llama.cpp via dlopen; larger NVIDIA-stack engines that use NIXL load MemKV through their existing transfer abstraction
- Extent-based block store — Parallelized I/O for large context blocks
- Linear Scalability — Shared-nothing architecture; add servers to scale throughput
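To make the HMAC-SHA256 capability above concrete, here is a minimal sketch of signing and verifying a message with a shared key. It is an illustration only: this page does not describe MemKV's wire format, so the framing (payload followed by a 32-byte tag) and the names `sign` and `verify` are assumptions, not the actual protocol.

```python
import hashlib
import hmac

# SHA-256 digests are 32 bytes long.
TAG_LEN = hashlib.sha256().digest_size


def sign(key: bytes, payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the payload (hypothetical framing)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def verify(key: bytes, message: bytes) -> bytes:
    """Return the payload if the trailing tag matches, else raise ValueError."""
    if len(message) < TAG_LEN:
        raise ValueError("message shorter than an HMAC-SHA256 tag")
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading bytes of the tag were correct.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch")
    return payload
```

A receiver that has no unauthenticated mode would call something like `verify` on every incoming message and drop any message that raises.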
Why MemKV?
| Available Today | Runs on existing infrastructure — no new hardware required |
| Commodity Hardware | Standard NVMe drives and RDMA NICs |
| Open Integration | A vendor-neutral kv_store_v1 plugin for llama.cpp; matching support for NVIDIA-stack engines |
| Cost Effective | Leverage existing NVMe and RDMA investments |
| Linear Scalability | Add servers to scale throughput; no coordination overhead |
Explore the docs
Quick Start
Install MemKV, initialize drives, and start the server in under five minutes.
llama.cpp + MemKV
Persistent KV store for llama-server. Multi-turn chats and agent loops resume from disk instead of re-prefilling tokens.
kv_store_v1 ABI
Vendor-neutral C ABI llama.cpp adopts via dlopen. Eight function pointers (chunk put/get, manifest put/get/delete, plus prefetch in v2), no RDMA assumed, loadable on macOS and Linux.
CLI Reference
Every memkv subcommand and flag — setup, start, doc, admin, drive management.
Configuration
Complete reference for /etc/memkv/config.yaml, the client MEMKV_CONFIG, and every MEMKV_* env var.
Monitoring
Health probes, Prometheus metrics, and Kubernetes integration.
Architecture
Three deployment shapes — distributed, co-located, file-mode — and the shared-nothing model behind linear scaling.
Benchmarks
Measured 96.7 GiB/s read peak on 2 servers — ~97% of 2× 400GbE line rate, with per-block-size numbers.
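The chunk and manifest operations named in the kv_store_v1 ABI entry can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the real ABI: the class and method names, the content-addressed chunk keys, and the session-keyed manifests are all hypothetical and chosen only to show how a saved context could be resumed instead of re-prefilled.

```python
import hashlib
from typing import Dict, List, Optional


class InMemoryKVStore:
    """Toy stand-in for a chunk/manifest store (illustrative names only)."""

    def __init__(self) -> None:
        self._chunks: Dict[str, bytes] = {}
        self._manifests: Dict[str, List[str]] = {}

    def chunk_put(self, data: bytes) -> str:
        # Assumption: chunks are keyed by the SHA-256 of their content.
        key = hashlib.sha256(data).hexdigest()
        self._chunks[key] = data
        return key

    def chunk_get(self, key: str) -> Optional[bytes]:
        return self._chunks.get(key)

    def manifest_put(self, session: str, chunk_keys: List[str]) -> None:
        self._manifests[session] = list(chunk_keys)

    def manifest_get(self, session: str) -> Optional[List[str]]:
        keys = self._manifests.get(session)
        return list(keys) if keys is not None else None

    def manifest_delete(self, session: str) -> bool:
        return self._manifests.pop(session, None) is not None


def save_session(store: InMemoryKVStore, session: str,
                 kv_bytes: bytes, chunk_size: int = 4) -> None:
    """Split serialized context into chunks and record their order."""
    keys = [store.chunk_put(kv_bytes[i:i + chunk_size])
            for i in range(0, len(kv_bytes), chunk_size)]
    store.manifest_put(session, keys)


def restore_session(store: InMemoryKVStore, session: str) -> Optional[bytes]:
    """Reassemble a saved context, or return None if anything is missing."""
    keys = store.manifest_get(session)
    if keys is None:
        return None
    parts = [store.chunk_get(k) for k in keys]
    if any(p is None for p in parts):
        return None
    return b"".join(parts)
```

In this model a `None` from `restore_session` is the signal to fall back to a normal prefill; whether the real plugin reports a miss that way is not stated here.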
References
- llama.cpp — upstream inference server; v2 chunked slot save fork lives at minio/llama.cpp
- NIXL — NVIDIA Inference Xfer Library
- Dynamo — NVIDIA Inference Framework