Storage Internals
How MemKV stores blocks on NVMe — extent layout, forward-compatible versioning, and batched TRIM.
Extent-Based Storage
Large context blocks are split across multiple NVMe drives to enable parallel I/O. Two dials govern the split:
- storage.block_size: the on-drive extent size. Default 4 MiB. Allocations larger than one extent are striped across drives; smaller allocations live in a single extent. Tune per workload (smaller for many small objects, larger for sustained bulk transfers).
- memory.block_size: the bounce-buffer chunk size used by the RDMA staging pool. Default 2 MiB. This is the granularity of a single RDMA READ / WRITE; multi-extent transfers pipeline through many chunks.
The two are independent: at the default settings, a 4 MiB extent is transferred as two 2 MiB bounce-buffer chunks.
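The arithmetic behind the two settings can be sketched as follows. This is a minimal illustration, not MemKV's allocator: the round-robin drive placement, the `plan_allocation` helper, and the dictionary layout are assumptions made for the example; only the 4 MiB and 2 MiB defaults come from the text above.

```python
MIB = 1024 * 1024
STORAGE_BLOCK_SIZE = 4 * MIB  # storage.block_size default (from the text)
MEMORY_BLOCK_SIZE = 2 * MIB   # memory.block_size default (from the text)


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def plan_allocation(size, num_drives,
                    extent=STORAGE_BLOCK_SIZE, chunk=MEMORY_BLOCK_SIZE):
    """Split `size` bytes into extents and count the RDMA chunks needed.

    Assumption: extents are placed round-robin across drives. The real
    placement policy is not described in this document.
    """
    extents = []
    offset = 0
    index = 0
    while offset < size:
        length = min(extent, size - offset)
        extents.append({"drive": index % num_drives,
                        "offset": offset,
                        "length": length})
        offset += length
        index += 1
    # Each extent is moved through the staging pool in chunk-sized pieces.
    chunks = sum(ceil_div(e["length"], chunk) for e in extents)
    return extents, chunks
```

Under these assumptions, a 10 MiB allocation on four drives yields three extents (4 MiB, 4 MiB, 2 MiB) on drives 0, 1, and 2, moved as five 2 MiB chunks, while a 1 MiB allocation stays in a single extent and a single chunk.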
Versioning and Forward Compatibility
MemKV releases are date-stamped (RELEASE.<commit-date>). There is no
semver line because the on-drive and wire formats are forward-compatible
by contract: any future MemKV build accepts records — superblocks,
journal headers, B+Tree pages, the embedded failure registry, and
TCP message headers — whose version is less than or equal to
that build's current constant. A strictly higher version is refused.
For operators this means:
- Roll-forward is automatic. Upgrading the binary keeps the NVMe state usable with no migration step. The new server simply reads the older format.
- Don't roll back across format bumps. An older MemKV binary refuses NVMe state written by a newer one rather than risk misinterpreting unknown bytes. Plan upgrades as one-way.
- Mixed-version fleets work in one direction. A newer server
accepts requests from older clients (NIXL plugin or admin client);
an older server rejects requests from a strictly newer client with
UnsupportedVersion. Upgrade servers first, then clients.
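The ordering rule for mixed-version fleets follows directly from the less-than-or-equal comparison. The sketch below is a hedged illustration: `server_accepts`, `fleet_ok`, and the version numbers 1 and 2 are hypothetical and exist only to show why servers must be upgraded before clients.

```python
def server_accepts(server_version: int, client_version: int) -> bool:
    """A server accepts a request whose version is <= its own constant."""
    return client_version <= server_version


def fleet_ok(server_versions, client_versions) -> bool:
    """True if every client can talk to every server in the fleet."""
    return all(server_accepts(s, c)
               for s in server_versions
               for c in client_versions)


OLD, NEW = 1, 2  # hypothetical format versions

# Servers first: every intermediate state of the rollout stays compatible.
servers_first = [
    fleet_ok([NEW, OLD], [OLD, OLD]),  # one server upgraded
    fleet_ok([NEW, NEW], [OLD, OLD]),  # all servers upgraded
    fleet_ok([NEW, NEW], [NEW, OLD]),  # clients start upgrading
]

# Clients first: the first upgraded client is rejected by the old servers.
clients_first = fleet_ok([OLD, OLD], [NEW, OLD])
```

Here every entry in `servers_first` is `True`, while `clients_first` is `False`, which corresponds to the older server answering with UnsupportedVersion.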
The contract is enforced at every header parse site: the wire-protocol
header, the device superblock, the embedded failure registry, the
journal header, and the B+Tree index header all carry a 1-byte (or
2-byte, for the B+Tree) version field that decoders compare with
<= against the build's current constant.
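A decoder-side check might look like the sketch below. The field widths (1 byte, or 2 bytes for the B+Tree) come from the text; everything else is an assumption for illustration, including the version field sitting at offset 0, little-endian encoding, the `HEADER_KINDS` table, and the current-version constants of 1.

```python
import struct


class UnsupportedVersion(Exception):
    """Raised when a header carries a version newer than this build knows."""


# kind -> (struct format of the version field, this build's current constant)
# Hypothetical layout: version at offset 0, little-endian, constants all 1.
HEADER_KINDS = {
    "wire": ("<B", 1),
    "superblock": ("<B", 1),
    "failure_registry": ("<B", 1),
    "journal": ("<B", 1),
    "btree": ("<H", 1),  # the B+Tree index header uses a 2-byte field
}


def decode_version(kind: str, header: bytes) -> int:
    """Read the version field and enforce `version <= current`."""
    fmt, current = HEADER_KINDS[kind]
    (version,) = struct.unpack_from(fmt, header, 0)
    if not version <= current:
        raise UnsupportedVersion(
            f"{kind} header version {version} exceeds supported {current}")
    return version
```

Keeping the comparison in one helper means every parse site applies the same rule, so a header written by a newer build is refused before any of its remaining bytes are interpreted.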
Block Deletion and TRIM
DELETE operations mark blocks as free in the in-memory index and queue TRIM requests to a background worker. The
trim worker batches and coalesces adjacent ranges, then issues NVMe TRIM (BLKDISCARD) commands every 5 seconds or
when the batch reaches 1,024 extents, whichever comes first. Batching keeps synchronous TRIM latency off the DELETE
path while still telling the SSD controller which ranges are free, so its garbage collection can sustain write performance.
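The batching and coalescing behavior can be sketched as below. This is an illustrative model rather than MemKV's worker: the `TrimBatcher` class, the `discard` callback (standing in for the actual BLKDISCARD call), the injectable clock, and the single-device scope are assumptions. The 5-second interval and the 1,024-extent threshold come from the text.

```python
import time

FLUSH_INTERVAL_S = 5.0    # from the text
MAX_BATCH_EXTENTS = 1024  # from the text


def coalesce(ranges):
    """Merge adjacent or overlapping (offset, length) ranges."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, offset + length - last_off))
        else:
            merged.append((offset, length))
    return merged


class TrimBatcher:
    """Queues freed ranges for one device and discards them in batches."""

    def __init__(self, discard, clock=time.monotonic):
        self.discard = discard  # hypothetical callable(offset, length)
        self.clock = clock
        self.pending = []
        self.last_flush = clock()

    def queue(self, offset, length):
        """Called from the DELETE path; returns without touching the drive
        unless the size threshold is reached."""
        self.pending.append((offset, length))
        if len(self.pending) >= MAX_BATCH_EXTENTS:
            self.flush()

    def tick(self):
        """Called periodically by the background worker."""
        if self.pending and self.clock() - self.last_flush >= FLUSH_INTERVAL_S:
            self.flush()

    def flush(self):
        for offset, length in coalesce(self.pending):
            self.discard(offset, length)
        self.pending.clear()
        self.last_flush = self.clock()
```

Coalescing before issuing means 1,024 contiguous 4-byte ranges collapse into a single discard of 4,096 bytes in this model, so the number of commands sent to the drive depends on fragmentation rather than on the number of DELETE calls.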
Transport & Auth
How MemKV moves bytes — RDMA DC, RC fallback, the TCP wire format, HMAC-SHA256 authentication, and the context-block offload flow.
Benchmarks
Measured MemKV throughput — 96.7 GiB/s peak read, 95.8 GiB/s peak write on a 2-server fleet at ~97% of 2× 400GbE line rate. The NIXL plugin's request batch optimizer fetches shared read ranges once and scatters locally, so effective read bandwidth exceeds raw NIC line rate.