Storage Internals

How MemKV stores blocks on NVMe — extent layout, forward-compatible versioning, and batched TRIM.

Extent-Based Storage

Large context blocks are split across multiple NVMe drives to enable parallel I/O. Two dials govern the split:

  • storage.block_size — the on-drive extent size. Default 4 MiB. Allocations larger than one extent are striped across drives; smaller allocations live in a single extent. Tune per workload (smaller for many small objects, larger for sustained bulk transfers).
  • memory.block_size — the bounce-buffer chunk size used by the RDMA staging pool. Default 2 MiB. This is the granularity of a single RDMA READ / WRITE; multi-extent transfers pipeline through many chunks.

The two settings are independent. At the defaults, a 4 MiB extent moves through the staging pool as two 2 MiB bounce-buffer chunks.
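The arithmetic behind the two dials can be sketched as follows. This is an illustration only: `plan_transfer` is a hypothetical helper, not a MemKV API, and it assumes that extent and chunk counts are plain ceiling divisions of the allocation size, ignoring any alignment or metadata overhead the real allocator may add.

```python
MIB = 1024 * 1024


def plan_transfer(size_bytes, storage_block_size=4 * MIB, memory_block_size=2 * MIB):
    """Return (extents, chunks, striped) for one allocation.

    Hypothetical helper: the defaults mirror storage.block_size and
    memory.block_size; the ceiling-division model is an assumption.
    """
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    # Ceiling division without floats: -(-a // b) == ceil(a / b).
    extents = -(-size_bytes // storage_block_size)
    chunks = -(-size_bytes // memory_block_size)
    # Only allocations larger than one extent are striped across drives.
    striped = size_bytes > storage_block_size
    return extents, chunks, striped


if __name__ == "__main__":
    for size in (1 * MIB, 4 * MIB, 10 * MIB):
        print(size // MIB, "MiB ->", plan_transfer(size))
```

Under these assumptions, a 10 MiB allocation occupies three extents (striped) and pipelines through five bounce-buffer chunks, while a 4 MiB allocation fits in one extent and uses two chunks.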

Versioning and Forward Compatibility

MemKV releases are date-stamped (RELEASE.<commit-date>). There is no semver line because the on-drive and wire formats are forward-compatible by contract: any future MemKV build accepts records — superblocks, journal headers, B+Tree pages, the embedded failure registry, and TCP message headers — whose version is less than or equal to that build's current constant. A strictly higher version is refused.

For operators this means:

  • Roll-forward is automatic. Upgrading the binary keeps the NVMe state usable with no migration step. The new server simply reads the older format.
  • Don't roll back across format bumps. An older MemKV binary refuses NVMe state written by a newer one rather than risk misinterpreting unknown bytes. Plan upgrades as one-way.
  • Mixed-version fleets work in one direction. A newer server accepts requests from older clients (NIXL plugin or admin client); an older server rejects requests from a strictly newer client with UnsupportedVersion. Upgrade servers first, then clients.

The contract is enforced at every header parse site: the wire-protocol header, the device superblock, the embedded failure registry, the journal header, and the B+Tree index header all carry a 1-byte (or 2-byte, for the B+Tree) version field that decoders compare with <= against the build's current constant.
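A minimal sketch of such a version gate is shown below. The names, the field offset (byte 0), the little-endian encoding of the 2-byte field, and the version numbers are all assumptions made for illustration; the actual header layouts are not described here.

```python
import struct


class UnsupportedVersion(Exception):
    """Raised when a record's version is newer than this build understands."""


def parse_header_version(buf, current, width=1):
    """Read a version field and apply the `<=` rule.

    Assumptions (not from the MemKV source): the version field sits at
    offset 0, and the 2-byte variant is little-endian.
    """
    if width not in (1, 2):
        raise ValueError("width must be 1 or 2 bytes")
    fmt = "<B" if width == 1 else "<H"
    (found,) = struct.unpack_from(fmt, buf, 0)
    if found > current:
        # A strictly higher version is refused rather than misread.
        raise UnsupportedVersion(f"record version {found} > supported {current}")
    return found  # older or equal: accepted


if __name__ == "__main__":
    CURRENT = 3  # hypothetical build constant
    print(parse_header_version(bytes([2, 0xAA]), CURRENT))  # older record: accepted
    try:
        parse_header_version(bytes([4, 0xAA]), CURRENT)
    except UnsupportedVersion as exc:
        print("refused:", exc)
```

The same comparison covers both directions of the operator guidance: a newer build reading older state passes the check, and an older build reading newer state fails it.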

Block Deletion and TRIM

DELETE operations mark blocks as free in the in-memory index and queue TRIM requests to a background worker. The trim worker batches requests, coalesces adjacent ranges, and issues NVMe TRIM (BLKDISCARD) commands every 5 seconds or when the batch reaches 1,024 extents, whichever comes first. Batching keeps synchronous TRIM latency off the DELETE path while still telling the SSD controller which blocks are free, so it can garbage-collect effectively and sustain write performance.
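The batching and coalescing behavior can be sketched as below. This is a single-threaded illustration under stated assumptions, not MemKV's worker: `TrimBatcher`, `coalesce`, and the `discard` callback are hypothetical names, ranges are expressed as (start extent, extent count), and a real worker would run in the background and issue the discard to the device instead of calling a Python function.

```python
import time


def coalesce(ranges):
    """Merge adjacent or overlapping (start, count) extent ranges."""
    merged = []
    for start, count in sorted(ranges):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


class TrimBatcher:
    """Flush when max_extents are pending or interval_s has elapsed."""

    def __init__(self, discard, max_extents=1024, interval_s=5.0, clock=time.monotonic):
        self._discard = discard        # hypothetical callable(start, count)
        self._max = max_extents
        self._interval = interval_s
        self._clock = clock            # injectable so tests need not sleep
        self._pending = []
        self._pending_extents = 0
        self._last_flush = clock()

    def queue(self, start, count=1):
        """Called on the DELETE path: record the range, flush only if full."""
        self._pending.append((start, count))
        self._pending_extents += count
        if self._pending_extents >= self._max:
            self.flush()

    def tick(self):
        """Called periodically: flush if the interval has elapsed."""
        if self._pending and self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        for start, count in coalesce(self._pending):
            self._discard(start, count)
        self._pending.clear()
        self._pending_extents = 0
        self._last_flush = self._clock()
```

Coalescing before issuing means a burst of deletes over neighboring extents turns into a few large discards instead of many small ones, which is the point of deferring the work to a batch.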