Benchmarks

Measured MemKV throughput: 97.4 GiB/s peak read and 97.4 GiB/s peak write on a 2-server fleet, about 97% of 2× 400GbE line rate. The NIXL plugin's request batch optimizer fetches shared read ranges once and scatters them locally, so effective read bandwidth on shared-range workloads can exceed raw NIC line rate.

Throughput by Block Size (2-Server Aggregate)

Test harness. 1 GPU node (2× ConnectX-7 400GbE) driving 2 storage servers (12 NVMe each) through the memkv bench binary at 64 threads per server (128 total on the dual-server sweep). RDMA and TCP were swept separately. Last verified against main on 2026-07-06.

Block Size	RDMA Write	RDMA Read	TCP Write	TCP Read
4 KB	0.98 GiB/s	1.41 GiB/s	1.03 GiB/s	1.06 GiB/s
8 KB	2.03 GiB/s	2.86 GiB/s	2.19 GiB/s	2.10 GiB/s
16 KB	4.09 GiB/s	5.24 GiB/s	3.77 GiB/s	3.38 GiB/s
32 KB	7.91 GiB/s	10.87 GiB/s	7.66 GiB/s	6.32 GiB/s
64 KB	16.21 GiB/s	21.19 GiB/s	10.39 GiB/s	11.55 GiB/s
128 KB	30.53 GiB/s	38.81 GiB/s	12.16 GiB/s	13.88 GiB/s
256 KB	53.92 GiB/s	70.15 GiB/s	14.61 GiB/s	15.99 GiB/s
512 KB	61.59 GiB/s	91.35 GiB/s	15.34 GiB/s	19.71 GiB/s
1 MB	89.77 GiB/s	96.11 GiB/s	16.10 GiB/s	20.45 GiB/s
2 MB	93.14 GiB/s	96.98 GiB/s	16.29 GiB/s	21.69 GiB/s
4 MB	97.09 GiB/s	97.25 GiB/s	16.34 GiB/s	22.26 GiB/s
8 MB	97.30 GiB/s	97.41 GiB/s	10.59 GiB/s	16.33 GiB/s
16 MB	97.36 GiB/s	97.45 GiB/s	10.59 GiB/s	16.54 GiB/s

RDMA peak: 97.4 GiB/s read, 97.4 GiB/s write, about 97% of 2× 400GbE line rate. TCP peak: 22.3 GiB/s read, 16.3 GiB/s write. RDMA approaches line rate once blocks are large enough: reads exceed 96 GiB/s from 1 MB and writes exceed 97 GiB/s from 4 MB, while smaller blocks fall well below it. TCP remains a viable transport wherever RDMA can't reach end to end (routed/multi-hop fabrics, cloud, mixed NICs), sustaining tens of GiB/s via multiplexed pipelining. TCP figures are a separate sweep over the inline-bulk path with 64 client connections (one per thread).
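To locate where the RDMA curve flattens, one option is to find the smallest block size that reaches a chosen fraction of the column's peak. The sketch below does this over the RDMA columns of the table above; the 95% threshold and the helper name `knee_kb` are illustrative choices, not part of the memkv bench tooling.

```python
# (block size in KB, RDMA write GiB/s, RDMA read GiB/s), copied from the table above.
RDMA = [
    (4, 0.98, 1.41), (8, 2.03, 2.86), (16, 4.09, 5.24), (32, 7.91, 10.87),
    (64, 16.21, 21.19), (128, 30.53, 38.81), (256, 53.92, 70.15),
    (512, 61.59, 91.35), (1024, 89.77, 96.11), (2048, 93.14, 96.98),
    (4096, 97.09, 97.25), (8192, 97.30, 97.41), (16384, 97.36, 97.45),
]

def knee_kb(column: int, fraction: float) -> int:
    """Smallest block size (KB) whose throughput is >= fraction of the column peak."""
    peak = max(row[column] for row in RDMA)
    return next(row[0] for row in RDMA if row[column] >= fraction * peak)

write_knee = knee_kb(1, 0.95)  # column 1 = RDMA write
read_knee = knee_kb(2, 0.95)   # column 2 = RDMA read
print(f"write reaches 95% of peak at {write_knee} KB, read at {read_knee} KB")
```

With the table's figures, reads cross the threshold one block-size step earlier than writes.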

These benchmarks used PCIe Gen4 QLC drives. Gen5 TLC/SLC drives improve drive-level latency and sustained write throughput, but the network is already the ceiling here, so faster drives primarily help tail latency under load.

Linear Scaling

Configuration	Servers	Peak Write	Peak Read
Single Server	1	48.7 GiB/s	48.7 GiB/s
Dual Server	2	97.4 GiB/s	97.4 GiB/s

Each server has 12 NVMe drives attached to the same PCIe domain as the NIC. No coordination or drive sharing between servers — add servers to scale throughput linearly.
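As a quick check on the scaling claim, scaling efficiency can be taken as the measured aggregate divided by the single-server figure times the server count. The helper below is a hypothetical illustration using the two rows of the table above; with only two data points it confirms the 1-to-2 step, not behavior at larger fleet sizes.

```python
def scaling_efficiency(single_server: float, servers: int, measured: float) -> float:
    """Measured aggregate divided by the ideal (servers x single-server) figure."""
    return measured / (single_server * servers)

# Peak figures from the table above, in GiB/s.
write_eff = scaling_efficiency(48.7, 2, 97.4)
read_eff = scaling_efficiency(48.7, 2, 97.4)
print(f"write {write_eff:.1%}, read {read_eff:.1%}")
```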

NIXL Plugin (nixlbench)

NVIDIA's nixlbench (from nixl v1.2.0) driving the MemKV NIXL plugin against the same dual-server fleet, DRAM destinations, with --recreate_xfer (a fresh transfer every iteration — no cached-handle fast path), swept to nixlbench's default 64 MiB max block size. The plugin coalesces each post_xfer descriptor list inside the call: same-key contiguous-offset descriptors fold into one wide BatchRead, and the server's 2 MiB bounce-buffer chunking handles the actual transfer.
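The folding of same-key, contiguous-offset descriptors can be pictured with a minimal sketch. It is not the plugin's code: the `Desc` type, its field names, and `fold_contiguous` are assumptions made for illustration, and the real optimizer runs inside post_xfer against NIXL descriptor lists.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Desc:
    key: str     # hypothetical stand-in for the object key a descriptor resolves to
    offset: int  # byte offset within the object
    length: int  # byte length of the range

def fold_contiguous(descs: list[Desc]) -> list[Desc]:
    """Merge same-key descriptors whose ranges touch end to start."""
    folded: list[Desc] = []
    for d in sorted(descs, key=lambda d: (d.key, d.offset)):
        last = folded[-1] if folded else None
        if last is not None and last.key == d.key and last.offset + last.length == d.offset:
            # Extend the previous range instead of emitting a new wire op.
            folded[-1] = Desc(last.key, last.offset, last.length + d.length)
        else:
            folded.append(d)
    return folded

MIB = 1 << 20
batch = [Desc("kv0", i * MIB, MIB) for i in range(4)] + [Desc("kv1", 0, MIB)]
wide = fold_contiguous(batch)
print(wide)  # four contiguous kv0 ranges become one 4 MiB range; kv1 is untouched
```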

Batched write throughput (GiB/s)

nixlbench WRITE, batch_size=16, --recreate_xfer, 64 threads, swept to the default 64 MiB max block size:

Block	Single rail	Dual rail
1 MB	2.7 GiB/s	10.7 GiB/s
4 MB	6.4 GiB/s	25.2 GiB/s
8 MB	12.5 GiB/s	33.5 GiB/s
16 MB	24.0 GiB/s	48.1 GiB/s
32 MB	43.0 GiB/s	69.9 GiB/s
64 MB	44.3 GiB/s	71.9 GiB/s

A batch's writes persist concurrently across the NVMe drives, so batched WRITE scales with block size — at 64 MB the single rail reaches ~89% of its line rate. These are nixlbench-driven figures; the native memkv bench aggregate (97.4 GiB/s) is at the top of this page.

Aggregate read throughput scales across rails exactly like writes — the top-of-page figures (48.7 GiB/s single server, 97.4 GiB/s dual) are RDMA reads. The plugin path adds one thing on top: shared-range coalescing.

Read coalescing — fetch once, scatter locally

The plugin coalesces a read descriptor list before it touches the wire. When several descriptors in one post_xfer resolve to the same KV block with overlapping or contiguous ranges, the optimizer picks one destination that spans the full range as a cover, issues a single RDMA read into it, then fills every other destination with a local copy from that cover — std::memcpy for host (DRAM) buffers, cudaMemcpy for GPU (VRAM) buffers via the dynamically-loaded CUDA runtime. The shared block crosses the NIC once and fans out to N buffers at local memory bandwidth (DRAM/HBM), not network bandwidth.

This is the prefix-sharing win: when many concurrent sequences read the same prompt prefix, it is pulled across the fabric once and scattered to every consumer in memory. (When destinations are GPU memory and no CUDA copy path is available, or no single destination covers the range, the optimizer splits the chunk back into independent per-destination wire ops — correctness first.)
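The cover-and-scatter flow described above can be sketched in a few lines. This is a simplified host-memory model, not the plugin's implementation: `ReadDesc`, `coalesced_read`, and the `block` argument standing in for the remote KV block are all hypothetical, and a slice assignment plays the role of both the RDMA read and the local std::memcpy.

```python
from dataclasses import dataclass

@dataclass
class ReadDesc:
    offset: int      # start of the requested range within the KV block
    dest: bytearray  # destination buffer; its length is the range length

def coalesced_read(block: bytes, descs: list[ReadDesc]) -> int:
    """Fill every destination and return the number of simulated wire reads."""
    lo = min(d.offset for d in descs)
    hi = max(d.offset + len(d.dest) for d in descs)
    # A cover is a destination whose own range spans the full [lo, hi) range.
    cover = next((d for d in descs
                  if d.offset == lo and d.offset + len(d.dest) == hi), None)
    if cover is None:
        # No single destination covers the range: independent per-destination reads.
        for d in descs:
            d.dest[:] = block[d.offset:d.offset + len(d.dest)]
        return len(descs)
    cover.dest[:] = block[lo:hi]          # the one wire read
    for d in descs:
        if d is not cover:
            start = d.offset - lo
            d.dest[:] = cover.dest[start:start + len(d.dest)]  # local copy
    return 1

block = bytes(range(256))
shared = [ReadDesc(0, bytearray(64)), ReadDesc(16, bytearray(32)), ReadDesc(0, bytearray(64))]
wire_ops = coalesced_read(block, shared)
print(wire_ops)
```

Three descriptors over overlapping ranges are served by one simulated wire read; a list with no covering destination falls back to one read per descriptor.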

Writes never coalesce this way — overwriting an object range from two source buffers is undefined — so each write persists independently and write throughput reflects true wire/NVMe bandwidth (above).

Why read throughput can exceed line rate

If you benchmark the plugin with nixlbench you will see read throughput reported well above NIC line rate — e.g. ~150 GiB/s on a single 400GbE rail, roughly 3× its ~50 GiB/s wire ceiling, at batch_size=16. This is expected: it is the coalescing optimizer showing through the benchmark's byte accounting, not a measurement error.

nixlbench credits every descriptor in a batch with a full block transfer — it counts logical bytes delivered to the caller.
The plugin keys each descriptor by its NIXL dev_id, and nixlbench's synthetic workload uses very few distinct keys (one when num_initiator_dev=1). A batch's descriptors therefore resolve to the same KV block, and the optimizer fetches that block once over the wire and scatters it to every destination locally.
Reported throughput is logical bytes ÷ time. The shared bytes cross the NIC once but are counted N times, so the effective rate runs above the wire ceiling.

The bytes are real — every destination receives correct data (--check_consistency=1 passes), and the win is genuine for prefix-sharing reads. But it is an effective rate, bounded by local memory bandwidth for the shared portion, not raw network bandwidth. A workload of all-distinct keys sees no coalescing and reads track the wire-limited rate below.
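The accounting can be made concrete with a toy model. It assumes the wire fetch and the local scatter run back to back with no overlap, and the `copy_gibs=200.0` local-copy bandwidth is a made-up placeholder rather than a measured value, so the output illustrates the shape of the effect and does not reproduce the ~150 GiB/s figure.

```python
GIB = 1 << 30

def reported_gibs(block_bytes: int, batch: int, wire_gibs: float, copy_gibs: float) -> float:
    """Logical bytes divided by elapsed time for one fully shared batch (toy model)."""
    wire_s = block_bytes / (wire_gibs * GIB)                # shared block crosses the NIC once
    copy_s = (batch - 1) * block_bytes / (copy_gibs * GIB)  # local scatter to the other buffers
    logical_bytes = batch * block_bytes                     # what the benchmark credits
    return logical_bytes / (wire_s + copy_s) / GIB

block = 16 * (1 << 20)
for batch in (1, 4, 16):
    # wire_gibs uses the page's ~50 GiB/s single-rail ceiling; copy_gibs is hypothetical.
    rate = reported_gibs(block, batch, wire_gibs=50.0, copy_gibs=200.0)
    print(f"batch_size={batch}: {rate:.1f} GiB/s reported")
```

At batch_size=1 the model returns the wire rate; as the batch grows, the reported rate rises above it and approaches the local copy bandwidth.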

Per-op read wire rate (`batch_size=1`)

A lone read descriptor uses one rail, so single- and dual-rail results track together. This is the per-op wire ceiling, not the aggregate (that's the native figure above). The figures below are single rail, 16 threads; throughput climbs toward the single-rail line rate (~50 GiB/s) as the block grows and per-op overhead amortizes:

Block	Read (b=1)
1 MB	4.8 GiB/s
4 MB	6.7 GiB/s
16 MB	25.6 GiB/s
32 MB	40.0 GiB/s
64 MB	40.8 GiB/s

Read latency — batching amortizes the round-trip

A batch of read descriptors that share a block collapses into one coalesced wire op, so the whole batch completes in roughly the wall-clock of a single un-batched read. Per descriptor, latency therefore drops by close to the batch factor — measured at about 16× from batch_size=1 to batch_size=16 at the smaller block sizes — because all 16 descriptors are served by that one fetch plus a local scatter, not 16 independent round-trips. Absolute per-op latency depends on concurrency (threads contending for the rail), so the figure that travels is the ratio: batching turns N descriptors into one wire round-trip.
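The ratio follows from simple arithmetic: one round-trip plus N-1 local copies, divided across N descriptors. The sketch below uses invented timings (`round_trip_us=100.0`, `scatter_us=0.5`) purely to show why the ratio sits close to the batch factor when the scatter is cheap relative to the round-trip; neither number is a measurement from this page.

```python
def per_descriptor_latency_us(round_trip_us: float, scatter_us: float, batch: int) -> float:
    """One wire round-trip plus (batch - 1) local copies, averaged per descriptor."""
    total_us = round_trip_us + (batch - 1) * scatter_us
    return total_us / batch

# Hypothetical timings, chosen only for illustration.
unbatched = per_descriptor_latency_us(round_trip_us=100.0, scatter_us=0.5, batch=1)
batched = per_descriptor_latency_us(round_trip_us=100.0, scatter_us=0.5, batch=16)
ratio = unbatched / batched
print(f"per-descriptor latency improves by {ratio:.1f}x at batch_size=16")
```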

Data verification (`--check_consistency=1`)

Throughput is meaningless if the bytes are wrong, so we verify data integrity with --check_consistency=1 — actual end-to-end verification, not a checksum stand-in:

Writes are read back off the drives. The initiator buffer is filled with a known sentinel (0xaa) and written to MemKV. The harness then poisons that local buffer to 0x00, issues an RDMA read-back from MemKV, and byte-compares the returned data against the sentinel — a single wrong byte fails the run. This walks the whole path: client memory → server → NVMe → server → client memory.
Reads verify every delivered byte in the client buffer against the expected sentinel.

Across the full block-size sweep (4 KB → 64 MB), every verified transfer passed — zero byte mismatches. What lands on the NVMe drives is exactly what reads back to the client.
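The write check described above (fill with sentinel, write, poison, read back, byte-compare) can be sketched as follows. `FakeStore` is a hypothetical in-memory stand-in for MemKV and `verify_write` is an illustrative helper; neither reflects nixlbench internals or the MemKV API.

```python
SENTINEL = 0xAA
POISON = 0x00

class FakeStore:
    """In-memory stand-in for MemKV (hypothetical, for illustration only)."""
    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def write(self, key: str, data: bytearray) -> None:
        self._objects[key] = bytes(data)

    def read_into(self, key: str, buf: bytearray) -> None:
        buf[:] = self._objects[key]

def verify_write(store, key: str, size: int) -> int:
    """Return the number of mismatched bytes after a write and read-back."""
    buf = bytearray([SENTINEL]) * size   # fill the initiator buffer with the sentinel
    store.write(key, buf)
    buf[:] = bytes([POISON]) * size      # poison the local buffer so stale data cannot pass
    store.read_into(key, buf)            # read back from the store
    return sum(1 for b in buf if b != SENTINEL)

mismatches = verify_write(FakeStore(), "block-0", 4096)
print("PASS" if mismatches == 0 else f"FAIL: {mismatches} bad bytes")
```

Poisoning the buffer before the read-back is the step that makes the check meaningful: without it, a read that silently delivered nothing would still compare equal to the sentinel.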
