
Architecture

How MemKV is shaped — the three deployment topologies, the shared-nothing model, and where the data lives.

MemKV integrates with inference frameworks via NVIDIA NIXL and ships in three deployment shapes. All three share the same wire protocol, auth model, and memkv-client engine; only the storage backend and the transport differ.

FIG. 01 — Distributed (multi-server, RDMA + JBOF)

The performance path. Multi-server fleet, RDMA over DC (Dynamically Connected) transport, NVMe via io_uring + O_DIRECT, hugepage-backed bounce buffers. This is the shape benchmarked at 96.7 GiB/s.

MemKV distributed architecture — multi-server with RDMA and JBOF

FIG. 01·DR — Co-located with drives (single GPU node)

One MemKV server on the same host as the inference engine, with a few local NVMe drives. Same internals as the distributed shape — RDMA + DC + JBOF + io_uring — but the data plane stays inside one chassis, with the local NIC moving bytes between memory regions without traversing the IP stack.

MemKV co-located with drives — single GPU node with local NVMe and RDMA

FIG. 01·FL — Co-located, file mode (no raw drives)

For hosts without raw drives to dedicate: developer laptops, Apple Silicon Macs, edge AI nodes, kind / CI clusters. File-mode storage on a regular filesystem, TCP-only transport, no RDMA, no JBOF, no hugepages. Same auth, same license, same on-the-wire format. Bandwidth is bounded by the host filesystem and NIC.

MemKV co-located, file mode — single host with file-mode storage and TCP transport

For the distributed shape, the server runs on commodity x86 or ARM Linux hosts with NVMe drives and an mlx5 RDMA NIC. macOS and Apple Silicon are supported for the file-mode co-located shape only.
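The choice between the three shapes can be sketched as a small decision function. This is an illustrative sketch only: the `Host` fields, the `Shape` enum, and `pick_shape` are hypothetical names invented here, not part of MemKV's configuration or API. The sketch also assumes that both drive-backed shapes need an RDMA NIC and raw NVMe drives, as the descriptions above suggest.

```python
from dataclasses import dataclass
from enum import Enum


class Shape(Enum):
    DISTRIBUTED = "distributed (multi-server, RDMA + JBOF)"
    COLOCATED_DRIVES = "co-located with drives"
    COLOCATED_FILE = "co-located, file mode"


@dataclass(frozen=True)
class Host:
    # Hypothetical description of a host; field names are assumptions.
    os: str              # "linux" or "macos"
    has_rdma_nic: bool   # e.g. an mlx5 NIC
    has_raw_nvme: bool   # raw drives that can be dedicated to MemKV
    multi_server: bool   # part of a multi-server fleet


def pick_shape(host: Host) -> Shape:
    """Map a host description to one of the three deployment shapes."""
    # macOS / Apple Silicon is supported for file mode only.
    if host.os != "linux":
        return Shape.COLOCATED_FILE
    # Without raw drives or an RDMA NIC, fall back to file mode over TCP.
    if not (host.has_rdma_nic and host.has_raw_nvme):
        return Shape.COLOCATED_FILE
    # Same internals for both drive-backed shapes; only the fleet size differs.
    return Shape.DISTRIBUTED if host.multi_server else Shape.COLOCATED_DRIVES
```

For example, `pick_shape(Host("macos", False, False, False))` yields the file-mode shape, while a Linux host with an RDMA NIC and raw NVMe drives yields one of the two drive-backed shapes depending on `multi_server`.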

Shared-Nothing Architecture

No gossip. No quorum. No rebalance. Every MemKV server is a self-contained unit. Aggregation is a client-side concern — the NIXL plugin shards across servers and talks to each one independently.

MemKV servers do not talk to each other: there is no gossip, replication, leader election, or internal routing layer. Each server owns its local NVMe drives and exposes them independently, so cross-server aggregation lives entirely in the client (the NIXL plugin or the calling application).

This is the mechanism behind linear scaling:

  • Client-side routing — clients shard keys across servers and RDMA directly to the owner.
  • Zero cross-server traffic — adding a server adds capacity, not coordination overhead; nothing competes with the RDMA data path.
  • Independent failure domains — a server failure is local. The client reroutes or falls back to baseline (prefill-from-scratch) for blocks the failed node owned.
  • Operational simplicity — no quorum, no split-brain, no cluster-wide config to keep in sync.

N servers deliver ~N× the aggregate throughput of one because there is nothing between them to become the bottleneck.
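A minimal sketch of client-side routing follows, to make the shared-nothing model concrete. The source does not say which sharding scheme the NIXL plugin uses; rendezvous (highest-random-weight) hashing is swapped in here as an assumption because it needs no coordination between servers. All names (`owner`, `route_read`, `route_write`, the server strings) are hypothetical.

```python
import hashlib
from typing import Optional, Sequence, Set


def _score(server: str, key: bytes) -> int:
    """Deterministic pseudo-random weight for a (server, key) pair."""
    digest = hashlib.blake2b(server.encode() + b"\x00" + key, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def owner(key: bytes, servers: Sequence[str]) -> str:
    """Rendezvous hashing: the server with the highest score owns the key."""
    if not servers:
        raise ValueError("no servers configured")
    return max(servers, key=lambda s: _score(s, key))


def route_read(key: bytes, servers: Sequence[str], healthy: Set[str]) -> Optional[str]:
    """Return the owning server, or None if it is down.

    None means the caller falls back to the baseline path
    (prefill from scratch) for this block.
    """
    target = owner(key, servers)
    return target if target in healthy else None


def route_write(key: bytes, servers: Sequence[str], healthy: Set[str]) -> Optional[str]:
    """Reroute writes to the best-scoring healthy server, if any."""
    candidates = [s for s in servers if s in healthy]
    return owner(key, candidates) if candidates else None
```

Every client computes the same owner from the same server list, so no server needs to know about any other. With this scheme, removing a server only remaps the keys that server owned, which matches the "independent failure domains" property described above.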

See also

  • Transport — the RDMA and TCP carriers, authentication, and the offload flow.
  • Storage — extent layout, on-drive versioning, TRIM.
  • Benchmarks — measured throughput and linear-scaling numbers.