Benchmarks
Measured MemKV throughput: 96.7 GiB/s peak read and 95.8 GiB/s peak write on a 2-server fleet, ~97% of 2× 400GbE line rate. The NIXL plugin's request batch optimizer fetches shared read ranges once and scatters them locally, so effective read bandwidth through the plugin can exceed raw NIC line rate.
Throughput by Block Size (2-Server Aggregate)
Test harness. 1 GPU node (2× ConnectX-7 400GbE, 64 threads) driving 2 storage servers (12 NVMe each)
through the memkv-bench binary, running the identical block-size sweep over both transports. Last verified
against main on 2026-06-03.
| Block Size | RDMA Write | RDMA Read | TCP Write | TCP Read |
|---|---|---|---|---|
| 4 KB | 0.50 GiB/s | 0.68 GiB/s | 1.03 GiB/s | 1.10 GiB/s |
| 8 KB | 0.94 GiB/s | 1.36 GiB/s | 2.19 GiB/s | 2.11 GiB/s |
| 16 KB | 1.98 GiB/s | 2.71 GiB/s | 3.77 GiB/s | 3.95 GiB/s |
| 32 KB | 3.89 GiB/s | 5.38 GiB/s | 7.66 GiB/s | 7.77 GiB/s |
| 64 KB | 7.75 GiB/s | 10.83 GiB/s | 10.39 GiB/s | 13.56 GiB/s |
| 128 KB | 15.92 GiB/s | 19.83 GiB/s | 12.16 GiB/s | 21.00 GiB/s |
| 256 KB | 29.52 GiB/s | 35.76 GiB/s | 14.61 GiB/s | 24.50 GiB/s |
| 512 KB | 50.70 GiB/s | 57.61 GiB/s | 15.34 GiB/s | 26.72 GiB/s |
| 1 MB | 78.32 GiB/s | 80.36 GiB/s | 16.10 GiB/s | 27.14 GiB/s |
| 2 MB | 93.25 GiB/s | 92.53 GiB/s | 16.29 GiB/s | 27.34 GiB/s |
| 4 MB | 93.86 GiB/s | 94.99 GiB/s | 16.34 GiB/s | 28.59 GiB/s |
| 8 MB | 95.91 GiB/s | 96.01 GiB/s | 10.59 GiB/s | 22.16 GiB/s |
| 16 MB | 96.61 GiB/s | 96.47 GiB/s | 10.59 GiB/s | 21.84 GiB/s |
RDMA peak: 96.5 GiB/s read, 96.6 GiB/s write, ~97% of 2× 400GbE line rate. TCP peak: 28.6 GiB/s read, 16.3 GiB/s write. RDMA dominates large transfers; TCP leads on small blocks (through 64 KB for writes and through 128 KB for reads), where its multiplexed pipelining beats per-WR RDMA overhead, and it remains a viable transport wherever RDMA can't reach end to end (routed/multi-hop fabrics, cloud, mixed NICs). TCP figures use the inline-bulk path with 64 client connections (one per thread).
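As a quick check on the ~97% figure, the sketch below divides the measured RDMA peaks by the nominal ceiling. It assumes the approximate ~50 GiB/s per 400GbE rail that this page uses elsewhere; the constant and function names are illustrative and not part of memkv-bench.

```python
# Assumption: ~50 GiB/s per 400GbE rail, the approximate figure this page uses.
RAIL_CEILING_GIBS = 50.0


def line_rate_fraction(measured_gibs: float, rails: int) -> float:
    """Return measured throughput as a fraction of the nominal ceiling."""
    return measured_gibs / (RAIL_CEILING_GIBS * rails)


# Peak rows (16 MB block) from the table above, two rails in aggregate.
rdma_read = line_rate_fraction(96.47, rails=2)
rdma_write = line_rate_fraction(96.61, rails=2)
print(f"read {rdma_read:.1%}, write {rdma_write:.1%}")
```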
These benchmarks used PCIe Gen4 QLC drives. Gen5 TLC/SLC drives improve both latency and sustained write throughput, but the network is already the ceiling here, so faster drives primarily help tail latency under load.
Linear Scaling
| Configuration | Servers | Peak Write | Peak Read |
|---|---|---|---|
| Single Server | 1 | 47.9 GiB/s | 48.4 GiB/s |
| Dual Server | 2 | 95.8 GiB/s | 96.7 GiB/s |
Each server has 12 NVMe drives attached to the same PCIe domain as the NIC. Servers do not coordinate or share drives, so adding servers scales throughput linearly (measured here from one server to two).
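One way to read the table is as a scaling efficiency: the dual-server figure divided by twice the single-server figure. The sketch below only restates the numbers above; the helper name is made up for illustration.

```python
# Peak figures (GiB/s) copied from the scaling table above.
single = {"write": 47.9, "read": 48.4}
dual = {"write": 95.8, "read": 96.7}


def scaling_efficiency(one_server: float, aggregate: float, servers: int) -> float:
    """Measured aggregate divided by `servers` times the single-server figure."""
    return aggregate / (servers * one_server)


for op in ("write", "read"):
    eff = scaling_efficiency(single[op], dual[op], servers=2)
    print(f"{op}: {eff:.3f}")
```

A value of 1.000 means perfectly linear scaling for that operation.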
NIXL Plugin (nixlbench)
NVIDIA's nixlbench (from nixl v1.2.0) driving the MemKV NIXL plugin against the same dual-server fleet, DRAM destinations, with --recreate_xfer (a fresh transfer every iteration — no cached-handle fast path), swept to nixlbench's default 64 MiB max block size. The plugin coalesces each post_xfer descriptor list inside the call: same-key contiguous-offset descriptors fold into one wide BatchRead, and the server's 2 MiB bounce-buffer chunking handles the actual transfer.
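The folding step can be illustrated with a minimal sketch. This is not the plugin's code: `Desc`, its fields, and `fold_contiguous` are hypothetical names, destination buffers are omitted, and the only rule modeled is "same key, next offset starts where the previous range ends".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Desc:
    key: str     # object key (hypothetical field name)
    offset: int  # byte offset within the object
    length: int  # bytes to transfer


def fold_contiguous(descs: list[Desc]) -> list[Desc]:
    """Merge same-key descriptors whose ranges touch end to start."""
    folded: list[Desc] = []
    for d in sorted(descs, key=lambda d: (d.key, d.offset)):
        prev = folded[-1] if folded else None
        if (prev is not None and prev.key == d.key
                and prev.offset + prev.length == d.offset):
            # Extend the previous wide range instead of adding a new op.
            folded[-1] = Desc(prev.key, prev.offset, prev.length + d.length)
        else:
            folded.append(d)
    return folded


MIB = 1 << 20
batch = [Desc("kv/0", i * MIB, MIB) for i in range(4)] + [Desc("kv/1", 0, MIB)]
wide = fold_contiguous(batch)
print(wide)
```

In this example the four contiguous 1 MiB descriptors on one key become a single 4 MiB range, while the descriptor on the other key stays separate.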
Batched write throughput (GiB/s)
nixlbench WRITE, batch_size=16, --recreate_xfer, 64 threads, swept to the
default 64 MiB max block size:
| Block | Single rail | Dual rail |
|---|---|---|
| 1 MB | 2.7 GiB/s | 10.7 GiB/s |
| 4 MB | 6.4 GiB/s | 25.2 GiB/s |
| 8 MB | 12.5 GiB/s | 33.5 GiB/s |
| 16 MB | 24.0 GiB/s | 48.1 GiB/s |
| 32 MB | 43.0 GiB/s | 69.9 GiB/s |
| 64 MB | 44.3 GiB/s | 71.9 GiB/s |
A batch's writes persist concurrently across the NVMe drives, so batched WRITE
scales with block size — at 64 MB the single rail reaches ~89% of its line rate.
These are nixlbench-driven figures; the native memkv-bench aggregate
(96.7 GiB/s) is at the top of this page.
Aggregate read throughput scales across rails exactly like writes — the top-of-page figures (48.4 GiB/s single server, 96.7 GiB/s dual) are RDMA reads. The plugin path adds one thing on top: shared-range coalescing.
Read coalescing — fetch once, scatter locally
The plugin coalesces a read descriptor list before it touches the wire. When
several descriptors in one post_xfer resolve to the same KV block with
overlapping or contiguous ranges, the optimizer picks one destination that
spans the full range as a cover, issues a single RDMA read into it, then
fills every other destination with a local copy from that cover —
std::memcpy for host (DRAM) buffers, cudaMemcpy for GPU (VRAM) buffers via
the dynamically-loaded CUDA runtime. The shared block crosses the NIC once
and fans out to N buffers at local memory bandwidth (DRAM/HBM), not network
bandwidth.
This is the prefix-sharing win: when many concurrent sequences read the same prompt prefix, it is pulled across the fabric once and scattered to every consumer in memory. (When destinations are GPU memory and no CUDA copy path is available, or no single destination covers the range, the optimizer splits the chunk back into independent per-destination wire ops — correctness first.)
Writes never coalesce this way — overwriting an object range from two source buffers is undefined — so each write persists independently and write throughput reflects true wire/NVMe bandwidth (above).
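The read-coalescing steps above (pick a cover, fetch once, scatter locally, fall back when no cover exists) can be sketched for host buffers as follows. This is a simplified model under stated assumptions: `ReadDesc` and `coalesced_read` are hypothetical names, the "wire read" is a slice of an in-memory `bytes` object, the descriptors are assumed to already target the same KV block with overlapping or contiguous ranges, and the GPU copy path is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ReadDesc:
    offset: int     # start of the requested range within the KV block
    buf: bytearray  # host destination; len(buf) is the range length


def coalesced_read(block: bytes, descs: list[ReadDesc]) -> int:
    """Fill every destination; return the number of simulated wire reads."""
    lo = min(d.offset for d in descs)
    hi = max(d.offset + len(d.buf) for d in descs)
    # A cover is a destination whose own range spans the union [lo, hi).
    cover = next((d for d in descs
                  if d.offset == lo and d.offset + len(d.buf) == hi), None)
    if cover is None:
        # Fallback: no cover, so issue independent per-destination reads.
        for d in descs:
            d.buf[:] = block[d.offset:d.offset + len(d.buf)]
        return len(descs)
    cover.buf[:] = block[lo:hi]  # the single simulated wire read
    for d in descs:
        if d is not cover:
            start = d.offset - lo
            # Local copy out of the cover (the memcpy step).
            d.buf[:] = cover.buf[start:start + len(d.buf)]
    return 1


block = bytes(range(256))
descs = [ReadDesc(0, bytearray(128)), ReadDesc(0, bytearray(64)),
         ReadDesc(64, bytearray(64))]
wire_reads = coalesced_read(block, descs)
print(wire_reads)
```

Here the first descriptor covers bytes 0 to 128, so the other two are filled from it without a second wire read.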
Why read throughput can exceed line rate
If you benchmark the plugin with nixlbench you will see read throughput
reported well above NIC line rate — e.g. ~150 GiB/s on a single 400GbE rail,
roughly 3× its ~50 GiB/s wire ceiling, at batch_size=16. This is expected: it
is the coalescing optimizer showing through the benchmark's byte accounting, not
a measurement error.
- nixlbench credits every descriptor in a batch with a full block transfer: it counts logical bytes delivered to the caller.
- The plugin keys each descriptor by its NIXL dev_id, and nixlbench's synthetic workload uses very few distinct keys (one when num_initiator_dev=1). A batch's descriptors therefore resolve to the same KV block, and the optimizer fetches that block once over the wire and scatters it to every destination locally.
- Reported throughput is logical bytes ÷ time. The shared bytes cross the NIC once but are counted N times, so the effective rate runs above the wire ceiling.
The bytes are real — every destination receives correct data
(--check_consistency=1 passes), and the win is genuine for prefix-sharing
reads. But it is an effective rate, bounded by local memory bandwidth for
the shared portion, not raw network bandwidth. A workload of all-distinct keys
sees no coalescing and reads track the wire-limited rate below.
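The byte accounting can be made concrete with a small sketch. It assumes every descriptor requests one full block and that each distinct key crosses the wire exactly once; `byte_accounting` is an illustrative helper, not a nixlbench or plugin function.

```python
def byte_accounting(block_bytes: int, keys: list[str]) -> tuple[int, int, float]:
    """Return (logical bytes credited, bytes on the wire, logical/wire ratio)."""
    logical = block_bytes * len(keys)    # every descriptor is credited in full
    wire = block_bytes * len(set(keys))  # each distinct block is fetched once
    return logical, wire, logical / wire


MIB = 1 << 20
shared = byte_accounting(64 * MIB, ["dev0"] * 16)                     # one key
distinct = byte_accounting(64 * MIB, [f"dev{i}" for i in range(16)])  # all unique
print(shared[2], distinct[2])
```

The ratio is a byte count, not a throughput prediction: the reported rate also depends on how long the local scatter takes, which this sketch does not model.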
Per-op read wire rate (batch_size=1)
A lone read descriptor uses one rail, so single- and dual-rail track together — this is the per-op wire ceiling, not the aggregate (that's the native figure above). Single rail, 16 threads; it climbs toward the single-rail line rate (~50 GiB/s) as the block grows and per-op overhead amortizes:
| Block | Read (b=1) |
|---|---|
| 1 MB | 4.8 GiB/s |
| 4 MB | 6.7 GiB/s |
| 16 MB | 25.6 GiB/s |
| 32 MB | 40.0 GiB/s |
| 64 MB | 40.8 GiB/s |
Read latency — batching amortizes the round-trip
A batch of read descriptors that share a block collapses into one coalesced
wire op, so the whole batch completes in roughly the wall-clock of a single
un-batched read. Per descriptor, latency therefore drops by close to the
batch factor — measured at about 16× from batch_size=1 to batch_size=16
at the smaller block sizes — because all 16 descriptors are served by that one
fetch plus a local scatter, not 16 independent round-trips. Absolute per-op
latency depends on concurrency (threads contending for the rail), so the figure
that travels is the ratio: batching turns N descriptors into one wire
round-trip.
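A toy model shows why the ratio approaches the batch factor. It assumes one wire fetch followed by serial local copies for the remaining descriptors; the timing values are arbitrary units chosen for illustration, not measurements, and `batching_speedup` is a made-up name.

```python
def batching_speedup(fetch: float, copy: float, batch: int) -> float:
    """Per-descriptor latency at batch_size=1 divided by the batched value."""
    unbatched = fetch                               # one round-trip per descriptor
    batched = (fetch + (batch - 1) * copy) / batch  # one fetch plus local copies
    return unbatched / batched


# Arbitrary units: a local copy costing 1% of a wire round-trip.
print(round(batching_speedup(fetch=1.0, copy=0.01, batch=16), 2))
# Limit case: free local copies give exactly the batch factor.
print(batching_speedup(fetch=1.0, copy=0.0, batch=16))
```

The closer the local copy cost is to zero relative to the round-trip, the closer the speedup gets to the batch size.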
Data verification (--check_consistency=1)
Throughput is meaningless if the bytes are wrong, so we verify data integrity
with --check_consistency=1 — actual end-to-end verification, not a checksum
stand-in:
- Writes are read back off the drives. The initiator buffer is filled with a known sentinel (0xaa) and written to MemKV. The harness then poisons that local buffer to 0x00, issues an RDMA read-back from MemKV, and byte-compares the returned data against the sentinel; a single wrong byte fails the run. This walks the whole path: client memory → server → NVMe → server → client memory.
- Reads verify every delivered byte in the client buffer against the expected sentinel.
Across the full block-size sweep (4 KB → 64 MB), every verified transfer passed — zero byte mismatches. What lands on the NVMe drives is exactly what reads back to the client.
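The write verification sequence (fill, write, poison, read back, compare) can be sketched as below. A plain `dict` stands in for MemKV and for both transfers, so this only illustrates the order of operations and why the poison step matters; it is not the nixlbench implementation, and the function name is hypothetical.

```python
SENTINEL, POISON = 0xAA, 0x00


def verify_write_readback(store: dict, key: str, size: int) -> bool:
    """Return True only if the read-back matches the sentinel byte for byte."""
    buf = bytearray([SENTINEL]) * size  # fill the initiator buffer
    store[key] = bytes(buf)             # stand-in for the write to MemKV
    buf[:] = bytes([POISON]) * size     # poison so stale local data cannot pass
    buf[:] = store[key]                 # stand-in for the RDMA read-back
    return buf == bytes([SENTINEL]) * size


ok = verify_write_readback({}, "block-0", 4096)
print(ok)
```

Without the poison step, a read that silently returned nothing would leave the sentinel in place and the comparison would pass for the wrong reason.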