Dynamo + MemKV
Run NVIDIA Dynamo with MemKV as the remote KV-cache tier behind KVBM. Drop the NIXL plugin in, set a handful of env vars, and let KVBM offload evicted blocks into a MemKV cluster over RDMA.
NVIDIA Dynamo runs vLLM behind its own DynamoConnector. The KV-block
manager (KVBM) inside Dynamo offloads evicted blocks through NIXL to a
configurable remote backend. MemKV ships a NIXL plugin
(libplugin_MEMKV.so) that registers as that backend. Once the plugin
is mounted in the Dynamo container and DYN_KVBM_NIXL_BACKEND=MEMKV
is set, KVBM transparently routes offload traffic to a MemKV cluster
over RDMA.
This is the production path behind MemKV's headline benchmark numbers.
What you need
- A running MemKV cluster (one or more nodes).
- A MemKV license file (minio.license).
- The MemKV auth key (a 32-byte HMAC key, hex-encoded to 64 characters).
- A Dynamo runtime image with KVBM: the nvcr.io/nvidia/ai-dynamo/vllm-runtime image, using tags built off the miniohq/memkv-rename lineage (see INTEGRATIONS.md in the repo for the fork lineage and patch series).
- libplugin_MEMKV.so for your platform.
- One or more RDMA NICs visible on the GPU host; MEMKV_RDMA_DEVICES=mlx5_0,mlx5_1 binds them.
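Before moving on, it can help to confirm the host-side prerequisites. The sketch below is a hypothetical preflight helper and is not part of MemKV. It assumes the RDMA NICs appear under /sys/class/infiniband, which is the usual Linux location when the RDMA stack is loaded. SYSFS_IB is an invented override that exists only so the check can be exercised without real hardware.

```bash
# Hypothetical preflight helpers (not shipped with MemKV).
# Assumption: RDMA devices are listed under /sys/class/infiniband; SYSFS_IB is
# an invented override used only for testing.

# check_rdma_devices LIST: succeed only if every device in the comma-separated
# LIST (same format as MEMKV_RDMA_DEVICES) exists under the IB sysfs directory.
check_rdma_devices() {
  local ib_root="${SYSFS_IB:-/sys/class/infiniband}" missing=0 dev
  if [ -z "$1" ]; then
    echo "no RDMA devices listed" >&2
    return 1
  fi
  # Split on commas, then check each device directory.
  for dev in $(printf '%s' "$1" | tr ',' ' '); do
    if [ ! -e "$ib_root/$dev" ]; then
      echo "missing RDMA device: $dev" >&2
      missing=1
    fi
  done
  return "$missing"
}

# check_license PATH: the license file must exist and be readable.
check_license() {
  if [ ! -r "$1" ]; then
    echo "license not readable: $1" >&2
    return 1
  fi
}
```

Once the Step 3 variables are exported, run check_rdma_devices "$MEMKV_RDMA_DEVICES" && check_license "$MEMKV_LICENSE" on the GPU host before you launch the container.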
Dynamo upstream has no published plugin/extension point for adding a remote
backend yet, so MemKV currently ships against a forked Dynamo build that adds
the MEMKV backend. The fork is captured in this repo's INTEGRATIONS.md; when
Dynamo exposes a stable extension point, this page collapses into "drop the
plugin in and set the env var."
Step 1: bring up MemKV
Use the standard MemKV deployment flow. Once the cluster is up, each
node listens by default on TCP port 9900 for the wire protocol and on
HTTP port 9901 for admin (the admin port is data_port + 1).
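You can run a quick reachability check by deriving each node's admin address from its data address. This sketch rests on two assumptions: the admin port follows the data_port + 1 rule above, and the admin API answers GET /v1/status (MemKV's admin status endpoint) with a 2xx status when the node is healthy. memkv_admin_url and probe_memkv are hypothetical helper names.

```bash
# memkv_admin_url HOST:PORT: print the admin status URL for a data endpoint.
# Assumes the default layout (admin port = data port + 1) and the /v1/status
# admin endpoint. IPv6 literals are not handled.
memkv_admin_url() {
  local host="${1%:*}" port="${1##*:}"
  printf 'http://%s:%s/v1/status\n' "$host" "$((port + 1))"
}

# probe_memkv "host-a:9900,host-b:9900": hit each node's admin endpoint.
# Assumption: a healthy node returns a 2xx status (curl -f fails otherwise).
probe_memkv() {
  local ep
  for ep in $(printf '%s' "$1" | tr ',' ' '); do
    if curl -fsS --max-time 5 "$(memkv_admin_url "$ep")" >/dev/null; then
      echo "ok   $ep"
    else
      echo "FAIL $ep"
    fi
  done
}
```

For example, probe_memkv "host-a:9900,host-b:9900" prints one ok or FAIL line per node.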
Step 2: install the NIXL plugin
Download the prebuilt NIXL plugin to the GPU host:
```bash
sudo mkdir -p /opt/memkv-plugins
# -f: fail on HTTP errors instead of saving an error page as the .so
sudo curl -fL \
  https://dl.minio.io/aistor/memkv/release/linux-amd64/libplugin_MEMKV.so \
  -o /opt/memkv-plugins/libplugin_MEMKV.so
```

In Step 4 you will bind-mount this file into the Dynamo container at the path
NIXL scans (/opt/nvidia/nvda_nixl/lib/plugins/libplugin_MEMKV.so).
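Because curl saves whatever the server returns, a redirected or proxied download can leave an HTML page where the plugin should be. A minimal sanity check is to look for the ELF magic bytes. It assumes only that the plugin is a standard ELF shared object, and is_elf is a hypothetical helper.

```bash
# is_elf FILE: succeed if FILE starts with the ELF magic bytes 0x7f 'E' 'L' 'F'.
is_elf() {
  # Dump the first 4 bytes as hex and strip od's spacing/newline.
  [ "$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}
```

For example, is_elf /opt/memkv-plugins/libplugin_MEMKV.so || echo "not an ELF file; re-download". file(1) gives a more detailed answer and should report an ELF 64-bit shared object for the linux-amd64 build.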
Step 3: configure the MemKV connection
The plugin reads the standard MemKV config chain. MEMKV_* environment
variables are the simplest path:

```bash
export MEMKV_SERVERS="host-a:9900,host-b:9900"   # comma-separated host:data_port list
export MEMKV_AUTH_KEY="<64-hex-char auth key>"   # cluster-wide HMAC key
export MEMKV_TRANSPORT=auto                      # auto | rdma | tcp
export MEMKV_RDMA_DEVICES="mlx5_0,mlx5_1"        # RDMA NICs to bind
export MEMKV_LICENSE=/path/to/minio.license
export MEMKV_STAGING_SIZE_MB=256
export MEMKV_STAGING_SLOT_MB=16
export MEMKV_NUM_CONNECTIONS=4
```

For the RDMA fast path inside Docker, the container needs
--device=/dev/infiniband plus --cap-add=IPC_LOCK --ulimit memlock=-1 so
it can pin memory for RDMA registration.
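Misformatted values are easier to catch on the host than inside a container that fails to bring the backend up. The function below is a hypothetical validator for the variables above. It checks only what this page states: a 64-hex-character key, a transport of auto, rdma or tcp, and host:port server entries. It does not know the plugin's actual parsing rules.

```bash
# validate_memkv_env: sanity-check the MEMKV_* variables (hypothetical helper).
validate_memkv_env() {
  local rc=0 ep port
  # 32-byte key, hex-encoded: exactly 64 hex characters.
  if ! printf '%s' "${MEMKV_AUTH_KEY:-}" | grep -Eq '^[0-9a-fA-F]{64}$'; then
    echo "MEMKV_AUTH_KEY must be 64 hex characters" >&2
    rc=1
  fi
  # Transport, if set, must be one of the documented values.
  if [ -n "${MEMKV_TRANSPORT:-}" ]; then
    case "$MEMKV_TRANSPORT" in
      auto|rdma|tcp) ;;
      *) echo "MEMKV_TRANSPORT must be auto, rdma, or tcp" >&2; rc=1 ;;
    esac
  fi
  if [ -z "${MEMKV_SERVERS:-}" ]; then
    echo "MEMKV_SERVERS is empty" >&2
    rc=1
  fi
  # Every server entry must end in :PORT with a numeric port.
  for ep in $(printf '%s' "${MEMKV_SERVERS:-}" | tr ',' ' '); do
    case "$ep" in
      *:*) port="${ep##*:}" ;;
      *) port="" ;;
    esac
    case "$port" in
      ''|*[!0-9]*) echo "bad MEMKV_SERVERS entry: $ep" >&2; rc=1 ;;
    esac
  done
  return "$rc"
}
```

Run validate_memkv_env right after the exports. It prints one line per problem and returns non-zero if any check fails.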
Step 4: launch Dynamo with KVBM pointed at MemKV
The Dynamo container needs three things on top of an ordinary vLLM launch:
- The NIXL plugin bind-mounted into NIXL's plugin directory.
- DYN_KVBM_NIXL_BACKEND=MEMKV and DYN_KVBM_REMOTE_STORAGE_TYPE=memkv, so KVBM selects the MEMKV path.
- libucx0 installed in the image. The prebuilt NIXL stack the plugin links against needs the UCX runtime, which the upstream vllm-openai image does not ship.
```bash
docker run -d --name dynamo-memkv-vllm \
  --runtime=nvidia --net=host --shm-size=64g --ipc=host \
  --device=/dev/infiniband --cap-add=IPC_LOCK --ulimit memlock=-1 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e DYN_KVBM_NIXL_BACKEND=MEMKV \
  -e DYN_KVBM_REMOTE_STORAGE_TYPE=memkv \
  -e DYN_KVBM_CPU_CACHE_GB=60 \
  -e MEMKV_SERVERS="host-a:9900,host-b:9900" \
  -e MEMKV_AUTH_KEY="$MEMKV_AUTH_KEY" \
  -e MEMKV_TRANSPORT=auto \
  -e MEMKV_RDMA_DEVICES="mlx5_0,mlx5_1" \
  -e MEMKV_NUM_CONNECTIONS=4 \
  -e MEMKV_STAGING_SIZE_MB=256 \
  -e MEMKV_STAGING_SLOT_MB=16 \
  -e MEMKV_LICENSE=/minio.license \
  -v /path/to/models:/inference-models:ro \
  -v /path/to/minio.license:/minio.license:ro \
  -v /opt/memkv-plugins/libplugin_MEMKV.so:/opt/nvidia/nvda_nixl/lib/plugins/libplugin_MEMKV.so:ro \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:<tag> bash -lc '
    apt-get update && apt-get install -y --no-install-recommends libucx0 &&
    dynamo serve /inference-models/<model-dir> \
      --host 0.0.0.0 --port 8810 \
      --tensor-parallel-size <N> \
      --enable-prefix-caching \
      --max-model-len <max_seq_len> \
      --gpu-memory-utilization 0.9
  '
```

The reference launcher used to produce MemKV's published Dynamo
numbers is scripts/launch-dynamo-memkv.sh in this repo. Copy it as
a starting point.
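The command above places the auth key directly in the docker run arguments. One alternative relies on Docker's documented behavior that -e NAME, written without =value, copies NAME from the calling environment. This sketch assumes you exported the Step 3 variables on the host. memkv_env_flags is a hypothetical helper. It skips MEMKV_LICENSE because the container sees the license at /minio.license, not at the host path.

```bash
# memkv_env_flags: print "-e NAME" for each exported MEMKV_* and DYN_KVBM_*
# variable, except MEMKV_LICENSE (its in-container path differs from the host's).
# With "-e NAME", docker copies the value from the caller's environment, so the
# auth key does not appear on the docker command line.
memkv_env_flags() {
  env | sed -n \
    -e '/^MEMKV_LICENSE=/d' \
    -e 's/^\(MEMKV_[A-Z0-9_]*\)=.*/-e \1/p' \
    -e 's/^\(DYN_KVBM_[A-Z0-9_]*\)=.*/-e \1/p' | sort
}
```

In the launch command, replace the individual MEMKV_* -e entries with $(memkv_env_flags) -e MEMKV_LICENSE=/minio.license. Leave $(memkv_env_flags) unquoted on purpose: word splitting turns each output line into separate -e and NAME arguments.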
Overriding the KVBM build
When tracking a custom Dynamo fork (the common case today), point the container at the rebuilt KVBM artifacts via two env vars consumed by the launcher:
- KVBM_CORE_SO=/host/path/to/lib_core.so: bind-mounts the rebuilt KVBM cdylib over the image's default.
- KVBM_PY_SRC=/host/path/to/kvbm-python-overlay: bind-mounts a Python overlay so the in-image wrapper stays in sync with the rebuilt cdylib.
Both are optional; default Dynamo images use the bundled KVBM.
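A stale or mistyped override path is easy to miss, so a small guard can run before the launcher. This sketch assumes KVBM_CORE_SO names a single .so file and KVBM_PY_SRC names a directory, which the overlay path above suggests. check_kvbm_overrides is hypothetical and only inspects the host paths, not the bind-mount targets inside the image.

```bash
# check_kvbm_overrides: validate optional KVBM override paths (hypothetical).
# Unset variables mean "use the bundled KVBM" and are accepted.
check_kvbm_overrides() {
  local rc=0
  if [ -n "${KVBM_CORE_SO:-}" ] && [ ! -f "$KVBM_CORE_SO" ]; then
    echo "KVBM_CORE_SO is not a file: $KVBM_CORE_SO" >&2
    rc=1
  fi
  # Assumption: the Python overlay is a directory tree.
  if [ -n "${KVBM_PY_SRC:-}" ] && [ ! -d "$KVBM_PY_SRC" ]; then
    echo "KVBM_PY_SRC is not a directory: $KVBM_PY_SRC" >&2
    rc=1
  fi
  # Overriding the cdylib alone risks a wrapper/cdylib mismatch; warn only.
  if [ -n "${KVBM_CORE_SO:-}" ] && [ -z "${KVBM_PY_SRC:-}" ]; then
    echo "warning: KVBM_CORE_SO set without KVBM_PY_SRC" >&2
  fi
  return "$rc"
}
```

Run it in the same environment you pass to scripts/launch-dynamo-memkv.sh. The warning mirrors the note above: if you override the cdylib without the matching Python overlay, the in-image wrapper can drift out of sync.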
What this integration buys
- Capacity beyond per-replica HBM. KVBM evicts blocks from GPU HBM into MemKV instead of dropping them; the aggregate KV cache becomes HBM + CPU pool + MemKV cluster.
- Cross-replica sharing. Multiple Dynamo workers pointed at the same MemKV cluster share one block pool, keyed by NIXL descriptor fields.
- RDMA wire speed. With MEMKV_TRANSPORT=auto and the right NICs, block transfers ride DC RDMA at HCA line rate. The benchmark page shows 96.7 GB/s on 2 servers, ~97% of the 100 GB/s aggregate line rate of 2× 400GbE.
- Durability. On-drive shards survive Dynamo restarts.
- HMAC auth. Every op is HMAC-authenticated with the cluster-wide shared key.
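The line-rate figure is plain arithmetic: 2× 400GbE is 800 Gb/s, or 100 GB/s in decimal units. The helpers below, with hypothetical names, redo that division. They also convert GiB/s to GB/s, which matters because NIC ratings are decimal. Read as GiB/s, 96.7 would be about 103.8 GB/s, which is above the 100 GB/s line rate. The ~97% figure therefore corresponds to 96.7 GB/s.

```bash
# line_rate_pct MEASURED_GBPS NICS GBIT_PER_NIC: measured throughput (decimal
# GB/s) as a percentage of aggregate NIC line rate (8 Gb = 1 GB).
line_rate_pct() {
  awk -v m="$1" -v n="$2" -v g="$3" 'BEGIN { printf "%.1f\n", 100 * m / (n * g / 8) }'
}

# gib_to_gb X: convert binary GiB/s to decimal GB/s (1 GiB = 2^30 bytes).
gib_to_gb() {
  awk -v x="$1" 'BEGIN { printf "%.1f\n", x * 1073741824 / 1e9 }'
}
```

line_rate_pct 96.7 2 400 prints 96.7, and gib_to_gb 96.7 prints 103.8.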
Operational notes
- /v1/cache/status lives in Dynamo, not MemKV. The Dynamo patch series exposes a get_pool_status management surface; MemKV's /v1/status admin endpoint is unrelated. ttft-sweep drains and readiness checks talk to Dynamo's status endpoint, not MemKV's.
- Per-rank, per-worker connections. With TP=N, you will see N sets of QPs / TCP sessions per MemKV server for each Dynamo worker.
- License is mandatory. The MemKV plugin verifies the license during NIXL load; without one, Dynamo fails to bring the backend up.
- The NIXL plugin is OBJ_SEG-shaped. KVBM hands the plugin an RDMA descriptor list; the plugin posts RDMA WRITE/READ ops into the MemKV cluster, with dev_id as the routing key.
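To check the per-rank connection note on a MemKV server, compare the number of sessions you observe with what the deployment implies. In this sketch, expected_sessions multiplies workers by TP ranks. The optional per-rank factor is an assumption, for example if each rank opens MEMKV_NUM_CONNECTIONS sessions, because this page does not say how that variable interacts with TP. count_tcp_sessions uses ss from iproute2 and only applies to the TCP transport. Both names are hypothetical.

```bash
# expected_sessions WORKERS TP [PER_RANK]: connection sets one MemKV server
# should see, i.e. one set per TP rank per worker. PER_RANK (default 1) is an
# assumed extra multiplier, not documented behavior.
expected_sessions() {
  echo $(( $1 * $2 * ${3:-1} ))
}

# count_tcp_sessions [PORT]: established TCP sessions on the MemKV data port
# (default 9900). RDMA queue pairs are not sockets, so ss does not show them.
count_tcp_sessions() {
  ss -Htn state established "( sport = :${1:-9900} )" | wc -l
}
```

With 2 workers at TP=8, expected_sessions 2 8 prints 16. For the RDMA transport, iproute2's rdma resource show qp lists queue pairs instead.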
Roadmap
- Upstream extension point. When Dynamo exposes a stable plugin
ABI, the patch series captured in INTEGRATIONS.md goes away and the
integration becomes plugin-only.
- Pre-registered descriptor pools, to avoid per-call MR registration on the KVBM side.
References
- INTEGRATIONS.md: fork lineage, patch series, and developer-side build steps.
- NIXL: NVIDIA Inference Xfer Library
- Dynamo: NVIDIA Inference Framework