Architecture
How MemKV is shaped — the three deployment topologies, the shared-nothing model, and where the data lives.
MemKV integrates with inference frameworks via NVIDIA NIXL and ships in three deployment shapes that share
the same wire protocol, auth model, and memkv-client engine. Only storage backend and transport differ.
FIG. 01 — Distributed (multi-server, RDMA + JBOF)
The performance path. Multi-server fleet, RDMA over the DC (Dynamically Connected) transport, NVMe via io_uring + O_DIRECT,
hugepage-backed bounce buffers. This is the shape benchmarked at 96.7 GiB/s.
FIG. 01·DR — Co-located with drives (single GPU node)
One MemKV server on the same host as the inference engine, with a few local NVMe drives.
Same internals as the distributed shape — RDMA + DC + JBOF + io_uring — but the data
plane stays inside one chassis, with the local NIC moving bytes between memory regions
without traversing the IP stack.
FIG. 01·FL — Co-located, file mode (no raw drives)
For hosts without raw drives to dedicate: developer laptops, Apple Silicon Macs, edge AI nodes, kind / CI clusters. File-mode storage on a regular filesystem, TCP-only transport, no RDMA, no JBOF, no hugepages. Same auth, same license, same on-the-wire format. Bandwidth is bounded by the host filesystem and NIC.
For the distributed shape, the server runs on commodity x86 / ARM Linux hosts with NVMe drives and an mlx5 RDMA NIC. macOS and Apple Silicon are supported for the file-mode co-located shape only.
Shared-Nothing Architecture
No gossip. No quorum. No rebalance. Every MemKV server is a self-contained unit. Aggregation is a client-side concern — the NIXL plugin shards across servers and talks to each one independently.
MemKV servers do not talk to each other: there is no replication, leader election, or internal routing layer. Each server owns its local NVMe drives and exposes them independently, so cross-server aggregation lives entirely in the client (the NIXL plugin or the calling application).
This is the mechanism behind linear scaling:
- Client-side routing — clients shard keys across servers and RDMA directly to the owner.
- Zero cross-server traffic — adding a server adds capacity, not coordination overhead; nothing competes with the RDMA data path.
- Independent failure domains — a server failure is local. The client reroutes or falls back to baseline (prefill-from-scratch) for blocks the failed node owned.
- Operational simplicity — no quorum, no split-brain, no cluster-wide config to keep in sync.
N servers deliver ~N× the aggregate throughput of one because there is nothing between them to become the bottleneck.
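The client-side routing and failure handling described above can be sketched in a few lines. This is a minimal illustration, not MemKV's implementation: the page does not say how keys are sharded, so rendezvous (highest-random-weight) hashing is swapped in as an assumption, and the server names, the `ShardedClient` class, and the in-memory dicts standing in for servers are all hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Set


def _score(server: str, key: str) -> int:
    # Deterministic per-(server, key) weight; any stable hash would do.
    digest = hashlib.sha256(f"{server}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(key: str, servers: List[str]) -> str:
    # Rendezvous hashing (an assumption, not MemKV's documented scheme):
    # the client alone decides which server owns a key.
    return max(servers, key=lambda s: _score(s, key))


class ShardedClient:
    """Hypothetical client: one independent store per server, no shared state."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)
        # Each dict stands in for one self-contained server.
        self.stores: Dict[str, Dict[str, bytes]] = {s: {} for s in servers}
        self.down: Set[str] = set()

    def put(self, key: str, block: bytes) -> bool:
        s = owner(key, self.servers)
        if s in self.down:
            return False  # owner unavailable; nothing else is affected
        self.stores[s][key] = block
        return True

    def get(self, key: str) -> Optional[bytes]:
        s = owner(key, self.servers)
        if s in self.down:
            return None  # caller falls back to prefill-from-scratch
        return self.stores[s].get(key)
```

A `None` from `get` plays the role of the baseline fallback: the caller recomputes the block instead of fetching it. Because each key touches exactly one server, marking one server as down leaves every key owned by the others readable, which is the independent-failure-domain property in miniature.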
See also
- Transport — the RDMA and TCP carriers, authentication, and the offload flow.
- Storage — extent layout, on-drive versioning, TRIM.
- Benchmarks — measured throughput and linear-scaling numbers.