Transport & Auth

How MemKV moves bytes — RDMA DC, RC fallback, the TCP wire format, HMAC-SHA256 authentication, and the context-block offload flow.

MemKV speaks the same authenticated wire format over two transports, RDMA and TCP. The Linux full server always exposes both; each client chooses which one it uses.

RDMA (DC + RC)

Data operations use DC (Dynamically Connected) transport on Mellanox mlx5 NICs with DC support. DC removes the per-peer QP setup cost by addressing any DCT (DC Target) endpoint from a shared pool of DCI (DC Initiator) QPs: the number of QPs scales O(N) with the cluster instead of O(N²) in classic RC.
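The scaling difference can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a full mesh in which every node talks to every other node, one RC QP per peer, and a fixed per-node DC footprint. The function names and the value of 8 DCIs per node are hypothetical and are not MemKV defaults.

```python
def rc_qps(nodes: int) -> int:
    """Cluster-wide QP count for full-mesh RC: one QP per (node, peer) pair."""
    return nodes * (nodes - 1)


def dc_qps(nodes: int, dcis_per_node: int, dcts_per_node: int = 1) -> int:
    """Cluster-wide QP count for DC: a fixed DCI pool plus DCT(s) on each node."""
    return nodes * (dcis_per_node + dcts_per_node)


if __name__ == "__main__":
    # dcis_per_node=8 is an arbitrary illustrative value.
    for n in (4, 16, 64, 256):
        print(f"nodes={n:4d}  rc={rc_qps(n):6d}  dc={dc_qps(n, dcis_per_node=8):6d}")
```

Under these assumptions the RC count grows quadratically while the DC count grows linearly, so DC only pays off once the cluster is larger than the per-node pool size.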

On hardware without DC support, the system falls back to RC (Reliable Connection) QPs transparently. The DCI pool is fixed-size by default (see rdma.num_dcis in Configuration); under heavy fan-in the same DCI can be reacquired round-robin by different workers.
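As a rough illustration of round-robin reacquisition from a fixed-size pool, the following sketch hands out DCI slot indices in rotation. It is a simplified model, not MemKV's implementation: the DciPool class is hypothetical, slots are plain integers rather than real QPs, and nothing here blocks or tracks in-flight work.

```python
import threading


class DciPool:
    """Hypothetical fixed-size pool that hands out DCI slots round-robin."""

    def __init__(self, num_dcis: int) -> None:
        if num_dcis <= 0:
            raise ValueError("num_dcis must be positive")
        self._num_dcis = num_dcis
        self._next = 0
        self._lock = threading.Lock()

    def acquire(self) -> int:
        """Return the next slot index; wraps around, so slots are shared."""
        with self._lock:
            slot = self._next
            self._next = (self._next + 1) % self._num_dcis
            return slot


if __name__ == "__main__":
    pool = DciPool(num_dcis=3)
    # With more workers than DCIs, later workers reuse earlier slots.
    print([pool.acquire() for _ in range(7)])
```

With three slots and seven acquisitions, the fourth worker lands on the same slot as the first, which is the fan-in sharing described above.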

Wire carriers

The server listens on TCP at the configured network.address (default port 9900). The RDMA bootstrap (Connect=0x06) rides that TCP connection because the RC QP does not exist yet. Once Connect transitions the RC QP to RTS, the steady-state control messages (Allocate, Lookup, Commit, Delete, Read, Write, BatchRead, BatchWrite, Exists) ride RC SEND/RECV on the per-connection QP. Bulk payloads ride RDMA WRITE / RDMA READ on the DCI/DCT pool (or on RC when DC isn't available).

Hosts without an RDMA NIC keep the TCP connection for everything: no QP is ever bootstrapped, and the inline-bulk codes (TcpPut=0x20, TcpGet=0x21, TcpDelete=0x22) plus Exists carry block payloads and queries inline in the signed length-prefixed frame.

The client picks transport via MEMKV_TRANSPORT (or transport: in MEMKV_CONFIG):

  • auto — try RDMA first; on the first RDMA failure (boot-time, or in-flight) latch into TCP-only mode for the life of the engine. Logged at WARN. This is the Linux default.
  • rdma — strict. The client errors at startup if rdma_devices is empty or the build is non-Linux; in-flight RDMA failures propagate. Use in production where RDMA must be the actual data path.
  • tcp — skip RDMA entirely. The configured server address is the TCP target directly. This is the macOS default.
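The three modes above can be summarized as a small state machine. This is a minimal sketch under stated assumptions, not the client's real code: TransportSelector, RdmaError, and the rdma_op / tcp_op callables are hypothetical names, and retrying the failed operation over TCP in auto mode is an assumption the text does not confirm.

```python
import logging

log = logging.getLogger("memkv.sketch")


class RdmaError(Exception):
    """Hypothetical stand-in for any RDMA-path failure."""


class TransportSelector:
    MODES = ("auto", "rdma", "tcp")

    def __init__(self, mode, rdma_devices, is_linux=True):
        if mode not in self.MODES:
            raise ValueError(f"unknown transport mode: {mode!r}")
        rdma_usable = is_linux and bool(rdma_devices)
        # Strict mode refuses to start without a usable RDMA path.
        if mode == "rdma" and not rdma_usable:
            raise RuntimeError("transport=rdma needs Linux and rdma_devices")
        self.mode = mode
        self.tcp_only = mode == "tcp"
        # Auto mode treats a boot-time failure like any other RDMA failure.
        if mode == "auto" and not rdma_usable:
            self._latch("no usable RDMA device at startup")

    def _latch(self, reason):
        log.warning("RDMA failed (%s); TCP-only for the life of the engine", reason)
        self.tcp_only = True

    def run(self, rdma_op, tcp_op):
        """Run one operation over the currently selected transport."""
        if self.tcp_only:
            return tcp_op()
        try:
            return rdma_op()
        except RdmaError as exc:
            if self.mode == "rdma":
                raise  # strict: in-flight failures propagate
            self._latch(str(exc))
            return tcp_op()  # assumption: the failed op is retried over TCP
```

The key property is that the latch is one-way: once auto mode falls back, it never probes RDMA again, which matches the "for the life of the engine" wording.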

Transport choice is a client-side decision; the server has no knob to disable either listener.

Topology vs storage mode

Two orthogonal choices:

  • Topology: distributed (multi-server, NVMe per server; the shape behind the 96.7 GiB/s numbers in Benchmarks) or co-located (one MemKV server on the inference host, common on GPU boxes with NVMe next to the NIC).
  • Storage mode: direct is the performance path on raw block devices, in either topology. file is mmap-backed regular files for developers, kind / CI, Macs, and hosts without raw drives to dedicate; bandwidth is bounded by the host filesystem.

macOS is always file-mode + TCP because RDMA / JBOF / hugepages aren't compiled in there. The same shape is reachable on Linux via storage.mode: file + transport: tcp, but a Linux host with raw drives should run jbof for real performance.

Authentication

Every signed wire message carries a 40-byte trailer: an 8-byte nanosecond timestamp (ts_ns) followed by a 32-byte HMAC-SHA256 over (header || ts_ns). The shared key is configured at both ends (network.auth_key server-side, auth_key: client-side, or the MEMKV_AUTH_KEY environment variable on either). The allowed clock drift is ±60 s; messages outside that window or with a bad MAC are silently dropped. There is no unauthenticated mode: the engine refuses to construct without a key.
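The trailer layout can be sketched with the Python standard library. This is an illustration under assumptions, not the actual wire codec: the sign_trailer and verify_trailer names are hypothetical, and the little-endian byte order and Unix-epoch clock for the timestamp are guesses that the text does not specify.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional

TRAILER_LEN = 8 + 32               # timestamp + HMAC-SHA256 digest
DRIFT_NS = 60 * 1_000_000_000      # ±60 s drift window, in nanoseconds


def sign_trailer(header: bytes, key: bytes, ts_ns: Optional[int] = None) -> bytes:
    """Build the 40-byte trailer: ts_ns followed by HMAC(header || ts_ns)."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    ts = struct.pack("<Q", ts_ns)  # assumption: little-endian u64
    mac = hmac.new(key, header + ts, hashlib.sha256).digest()
    return ts + mac


def verify_trailer(header: bytes, trailer: bytes, key: bytes,
                   now_ns: Optional[int] = None) -> bool:
    """Return False (caller drops the message) on drift or MAC mismatch."""
    if len(trailer) != TRAILER_LEN:
        return False
    ts, mac = trailer[:8], trailer[8:]
    (ts_ns,) = struct.unpack("<Q", ts)
    if now_ns is None:
        now_ns = time.time_ns()
    if abs(now_ns - ts_ns) > DRIFT_NS:
        return False
    expected = hmac.new(key, header + ts, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many MAC bytes matched.
    return hmac.compare_digest(mac, expected)
```

A receiver would call verify_trailer on every frame and drop the frame without replying when it returns False, mirroring the silent-drop behavior described above.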

Context Block Offload Flow

  1. Dynamo/KVBM identifies context blocks for offload.
  2. NIXL calls the MemKV plugin with client buffer address and rkey.
  3. Plugin sends write request to server via RDMA messaging (RC QP).
  4. Server acquires a DCI from the pool and does RDMA READ from client memory (DC transport).
  5. Server persists to NVMe via io_uring with O_DIRECT.
  6. On a lookup hit, server reads from NVMe and RDMA WRITEs back to client via DCI.