MemKV

What is MemKV?

High-performance distributed inference context memory store — bridging GPU HBM and NVMe for long-context LLM inference.

As inference models scale to longer contexts and higher concurrency, context memory becomes the bottleneck — HBM fills, prefill dominates, throughput collapses. MemKV is a distributed shared context-memory store at the G3.5 layer that bridges GPU memory and NVMe, scaling linearly as servers are added.

For workstation- and laptop-class deployments — Mac mini, Mac Studio, single-node servers running llama.cpp — MemKV ships as a single kv_store_v1 plugin the inference server loads via dlopen. No RDMA, no NIXL, no transfer-library glue. The same MemKV cluster also serves NVIDIA-stack deployments over NIXL, so one chunk pool is shareable across both worlds.
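As a rough illustration of the dlopen-style loading described above, the sketch below uses Python's ctypes, whose CDLL constructor wraps dlopen on POSIX systems. The plugin path and the entry-point symbol name are placeholders invented for this example; this page names neither, and an inference server would perform the equivalent step in native code.

```python
import ctypes


def load_plugin(path, entry_symbol="kv_store_v1_entry"):
    """Try to load a shared-library plugin and resolve one entry symbol.

    Both `path` and `entry_symbol` are hypothetical placeholders; the real
    plugin file name and exported symbols are not given in this document.
    Returns (handle, entry) on success, or None if either step fails.
    """
    try:
        # ctypes.CDLL calls dlopen() on POSIX platforms.
        handle = ctypes.CDLL(path)
    except OSError:
        # Library missing or not loadable: fall back to running without it.
        return None
    try:
        # Attribute lookup on the handle resolves the symbol (dlsym on POSIX).
        entry = getattr(handle, entry_symbol)
    except AttributeError:
        # Library loaded, but it does not export the expected symbol.
        return None
    return handle, entry


if __name__ == "__main__":
    plugin = load_plugin("./libmemkv_plugin.so")  # placeholder path
    print("plugin loaded" if plugin else "plugin not available")
```

Returning None instead of raising lets the host process continue without the plugin, which is one reasonable policy for an optional component.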

Key Capabilities

  • Zero-copy RDMA — Direct transfer between client host memory and NVMe via Dynamically Connected (DC) transport, with a Reliable Connection (RC) fallback for non-Mellanox NICs
  • RDMA-native control plane — When a host channel adapter (HCA) is present, control messages (Allocate, Lookup, Commit, Delete, Exists, Read, Write, BatchRead, BatchWrite) ride RC SEND/RECV on the per-connection queue pair (QP). The bootstrap (Connect) goes over a long-lived TCP connection because the RC QP doesn't exist yet; once the QP is up, control switches to RDMA.
  • First-class TCP transport — RDMA isn't always reachable end to end (routed or multi-hop fabrics, cloud, mixed NICs). For those cases, MemKV runs the full data path over TCP rather than degrading to a one-request-at-a-time fallback: the client opens a pool of connections per server and pipelines batched reads and writes across them, so a single server's traffic spreads over many flows and sustains high throughput.
  • macOS support — Runs co-located with an inference engine on a Mac using file-mode storage and TCP (RDMA, JBOF, and hugepages aren't compiled in there); useful for laptop and Mac mini development work
  • HMAC-SHA256 authentication — Every wire message is signed with a shared key; there is no unauthenticated mode
  • Native plugin for inference engines — A vendor-neutral kv_store_v1 plugin loads directly into llama.cpp via dlopen; larger NVIDIA-stack engines that use NIXL load MemKV through their existing transfer abstraction
  • Extent-based block store — Parallelized I/O for large context blocks
  • Linear scalability — Shared-nothing architecture; add servers to scale throughput
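The authentication capability listed above can be illustrated with a minimal sketch built on Python's standard hmac module. The framing used here (a 32-byte HMAC-SHA256 tag appended to the payload) and the helper names sign_message and verify_message are assumptions for illustration only; MemKV's actual wire format is not described on this page.

```python
import hashlib
import hmac

# HMAC-SHA256 always yields a 32-byte tag.
TAG_LEN = hashlib.sha256().digest_size


def sign_message(key: bytes, payload: bytes) -> bytes:
    """Return payload followed by its HMAC-SHA256 tag (assumed framing)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def verify_message(key: bytes, message: bytes) -> bytes:
    """Check the trailing tag and return the payload, or raise ValueError."""
    if len(message) < TAG_LEN:
        raise ValueError("message shorter than HMAC tag")
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest runs in constant time to avoid leaking tag bytes.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload


if __name__ == "__main__":
    shared_key = b"example-shared-key"  # placeholder; use a real secret
    wire = sign_message(shared_key, b"Lookup chunk-42")
    print(verify_message(shared_key, wire))
```

Because both sides hold the same key, a receiver can reject any message whose tag does not match before acting on it, which is the property the "no unauthenticated mode" statement relies on.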

Why MemKV?

Available Today: Runs on existing infrastructure; no new hardware required
Commodity Hardware: Standard NVMe drives and RDMA NICs
Open Integration: A vendor-neutral kv_store_v1 plugin for llama.cpp; matching support for NVIDIA-stack engines
Cost Effective: Leverage existing NVMe and RDMA investments
Linear Scalability: Add servers to scale throughput; no coordination overhead
