What is MemKV?

High-performance distributed inference context memory store — bridging GPU HBM and NVMe for long-context LLM inference.

As models scale to longer contexts and higher concurrency, context memory becomes the bottleneck: HBM fills up, prefill comes to dominate request time, and throughput collapses. MemKV is a distributed, shared context-memory store that bridges GPU memory and NVMe and scales linearly as servers are added.

For workstation- and laptop-class deployments — Mac mini, Mac Studio, single-node servers running llama.cpp — MemKV ships as a single kv_store_v1 plugin the inference server loads via dlopen. No RDMA, no NIXL, no transfer-library glue. The same MemKV cluster also serves NVIDIA-stack deployments over NIXL, so one chunk pool is shareable across both worlds.
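
To make the dlopen loading step concrete, here is a minimal Python sketch of the general mechanism: open a shared object at runtime, resolve an exported symbol by name, and call it. It is not MemKV's loader and does not show the kv_store_v1 ABI. The helper `load_symbol` is a hypothetical name, and the demo resolves `strlen` from the running process (path `None`, which maps to `dlopen(NULL)` on POSIX) only so the sketch runs without a plugin file. A real host would pass the plugin's path and look up whatever entry points the ABI defines.

```python
import ctypes


def load_symbol(library_path, symbol_name):
    """Open a shared library with dlopen() and resolve one exported symbol.

    Hypothetical helper for illustration only. On POSIX, ctypes.CDLL calls
    dlopen(); a path of None yields a handle for the running process itself.
    """
    handle = ctypes.CDLL(library_path)
    # Attribute access performs the symbol lookup and raises AttributeError
    # if the library does not export the requested name.
    return getattr(handle, symbol_name)


# Demo: resolve strlen from the process image instead of a real plugin.
strlen = load_symbol(None, "strlen")
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

length = strlen(b"kv_store_v1")
print(length)
```

The point of the sketch is that the host needs only a library path and a set of agreed symbol names, which is why this deployment mode needs no transfer-library glue.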

Key Capabilities

Zero-copy RDMA: direct transfer between client host memory and NVMe over DC (Dynamically Connected) transport, with an RC (Reliable Connection) fallback for non-Mellanox NICs
RDMA-native control plane: when an HCA (host channel adapter) is present, control messages (Allocate, Lookup, Commit, Delete, Exists, Read, Write, BatchRead, BatchWrite) travel as RC SEND/RECV operations on the per-connection queue pair (QP). The bootstrap message (Connect) goes over a long-lived TCP connection because the RC QP does not exist yet; once the QP is up, control traffic switches to RDMA.
First-class TCP transport: RDMA isn't always reachable end to end (routed or multi-hop fabrics, cloud, mixed NICs). In those environments MemKV runs the full data path over TCP rather than degrading to a one-request-at-a-time fallback: the client opens a pool of connections per server and pipelines batched reads and writes across them, so a single server's traffic spreads over many flows and sustains high throughput.
macOS support: runs co-located with an inference engine on a Mac, using file-mode storage and TCP because RDMA, JBOF, and hugepage support are not compiled into macOS builds. Useful for development on a laptop or Mac mini.
HMAC-SHA256 authentication: every wire message is signed with a shared key; there is no unauthenticated mode
Native plugin for inference engines: a vendor-neutral kv_store_v1 plugin loads directly into llama.cpp via dlopen; larger NVIDIA-stack engines that use NIXL load MemKV through their existing transfer abstraction
Extent-based block store: parallelized I/O for large context blocks
Linear scalability: shared-nothing architecture; add servers to scale throughput
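
The HMAC-SHA256 authentication item above can be illustrated with a short sketch of signing and verifying a message under a shared key. The framing used here (a 32-byte tag appended to the payload) and the function names `sign_message` and `verify_message` are assumptions made for illustration; MemKV's actual wire format is not described on this page.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign_message(key: bytes, payload: bytes) -> bytes:
    """Return the payload with an HMAC-SHA256 tag appended (assumed framing)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def verify_message(key: bytes, frame: bytes) -> bytes:
    """Check the trailing tag and return the payload, or raise ValueError."""
    if len(frame) < TAG_LEN:
        raise ValueError("frame shorter than HMAC tag")
    payload, tag = frame[:-TAG_LEN], frame[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest runs in constant time to avoid leaking tag bytes
    # through timing differences.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload


shared_key = b"k" * 32            # placeholder key, for the example only
message = b"Lookup chunk-42"      # hypothetical control message body
frame = sign_message(shared_key, message)
recovered = verify_message(shared_key, frame)
```

Because there is no unauthenticated mode, a receiver following this pattern rejects any frame whose tag does not match and does so before interpreting the payload.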

Why MemKV?


Available today: runs on existing infrastructure; no new hardware required
Commodity hardware: standard NVMe drives and RDMA NICs
Open integration: a vendor-neutral `kv_store_v1` plugin for llama.cpp; matching support for NVIDIA-stack engines
Cost effective: leverages existing NVMe and RDMA investments
Linear scalability: add servers to scale throughput; the shared-nothing design adds no coordination overhead

llama.cpp: upstream inference server; the fork with v2 chunked slot save lives at minio/llama.cpp
NIXL: NVIDIA Inference Xfer Library
Dynamo: NVIDIA inference framework
