vLLM + MemKV
Run vLLM with MemKV as the durable, shareable storage tier behind LMCache. Set up the plugin, point LMCache at it, and let vLLM serve.
vLLM offloads its KV-cache transport through LMCacheConnectorV1,
which hands per-chunk store/retrieve to LMCache. LMCache loads
storage backends through a dynamic plugin loader, and the MemKV
plugin slots in there. vLLM never sees MemKV directly — it talks
to LMCache, LMCache talks to the plugin, the plugin talks to a
MemKV cluster.
The rest of this page is the wire-up: get a vLLM serve invocation landing KV chunks in MemKV and reading them back through LMCache.
What you need
- A running MemKV cluster (one or more nodes).
- A MemKV license file (minio.license).
- The MemKV auth key (32-byte HMAC, hex-encoded).
- The memkv_lmcache wheel for your target platform.
- An installation of LMCache and vLLM (the official vllm/vllm-openai image already has vLLM; LMCache is a pip-install away).
- One or more RDMA NICs visible on the GPU host if you want the fast path. Set MEMKV_RDMA_DEVICES=mlx5_0,mlx5_1 to bind them.
Step 1: bring up MemKV
Use the standard MemKV deployment flow. Once the cluster is up,
each node listens on TCP :9900 for the wire protocol and HTTP
:9901 for admin (by default; data_port + 1).
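As a quick reachability check for the cluster, the sketch below parses a MEMKV_SERVERS-style list, derives each node's admin port as data_port + 1, and attempts a plain TCP connect. The helper names (parse_servers, is_reachable, check_cluster) are illustrative and not part of MemKV. It only confirms that the ports accept connections; it does not speak the wire protocol or authenticate.

```python
import socket


def parse_servers(spec: str) -> list[tuple[str, int, int]]:
    """Split 'host-a:9900,host-b:9900' into (host, data_port, admin_port)."""
    nodes = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected host:port, got {entry!r}")
        data_port = int(port)
        # Admin port defaults to data_port + 1, as described on this page.
        nodes.append((host, data_port, data_port + 1))
    return nodes


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_cluster(spec: str) -> dict[str, dict[str, bool]]:
    """Probe the data and admin port of every node in the list."""
    return {
        host: {
            "data": is_reachable(host, data_port),
            "admin": is_reachable(host, admin_port),
        }
        for host, data_port, admin_port in parse_servers(spec)
    }
```

Calling check_cluster("host-a:9900,host-b:9900") from the GPU host returns one entry per node. The sketch assumes default port numbering; a cluster with a custom admin port needs the mapping adjusted.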
Step 2: install the plugin wheel
The memkv_lmcache wheel is a Python package with a native
extension. Download the build that matches your platform and
install it alongside lmcache into the same Python environment
vLLM and LMCache run in:
curl -LO https://dl.minio.io/aistor/memkv/release/linux-amd64/memkv_lmcache-latest-cp39-abi3-linux_x86_64.whl
pip install lmcache ./memkv_lmcache-latest-cp39-abi3-linux_x86_64.whl

Inside a container, mount the wheel directory and pip-install at
startup (see the docker run example in Step 5).
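To confirm the wheel landed in the right environment, a small import check can be run with the same Python interpreter that vLLM uses. The sketch below resolves a module path and class name the way a dynamic loader typically would (importlib plus an attribute lookup). That resolution strategy is an assumption about LMCache's loader, and plugin_class_available is a hypothetical helper. The module and class names are the ones used in the LMCache config on this page.

```python
import importlib
import importlib.util


def plugin_class_available(module_path: str, class_name: str) -> bool:
    """Return True if module_path imports cleanly and exposes class_name."""
    try:
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        if importlib.util.find_spec(module_path) is None:
            return False
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    return hasattr(module, class_name)


# Names taken from the LMCache yaml on this page.
ok = plugin_class_available("memkv_lmcache.backend", "MemKVStorageBackend")
print("memkv plugin importable:", ok)
```

A False result usually means the wheel was installed into a different environment or built for a different platform.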
Step 3: write the LMCache config
Create an LMCache yaml that selects the MemKV backend through
storage_plugins. Two settings are mandatory:
- local_cpu: True and max_local_cpu_size > 0. LMCache passes its CPU pool down to the plugin so it has somewhere to allocate retrieved tensors.
- storage_plugins: memkv. This names the dynamically loaded backend. The matching extra_config block tells LMCache which Python module to import.
chunk_size: 256
local_cpu: True
max_local_cpu_size: 60
storage_plugins: memkv
extra_config:
  storage_plugin.memkv.module_path: memkv_lmcache.backend
  storage_plugin.memkv.class_name: MemKVStorageBackend

max_local_cpu_size is in GiB. Size it large enough that LMCache has
working room for its host pool: small values (single-digit GiB) starve
the local tier and force every read through MemKV; huge values can
contend with CUDA on systems where pinned host memory backs NVLink
transfers.
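The mandatory settings and the extra_config keys described above can be checked before launch. The sketch below validates an already-parsed config dictionary. check_lmcache_config is a hypothetical helper, not an LMCache API, and it encodes only the rules stated on this page.

```python
def check_lmcache_config(cfg: dict, plugin: str = "memkv") -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if cfg.get("local_cpu") is not True:
        problems.append("local_cpu must be True")
    size = cfg.get("max_local_cpu_size", 0)
    if not isinstance(size, (int, float)) or size <= 0:
        problems.append("max_local_cpu_size must be > 0 (GiB)")
    plugins = cfg.get("storage_plugins", [])
    if isinstance(plugins, str):  # accept a bare name as well as a list
        plugins = [plugins]
    if plugin not in plugins:
        problems.append(f"storage_plugins must include {plugin!r}")
    extra = cfg.get("extra_config") or {}
    for suffix in ("module_path", "class_name"):
        key = f"storage_plugin.{plugin}.{suffix}"
        if not extra.get(key):
            problems.append(f"extra_config is missing {key}")
    return problems


# The config from this step, expressed as a Python dict.
example = {
    "chunk_size": 256,
    "local_cpu": True,
    "max_local_cpu_size": 60,
    "storage_plugins": "memkv",
    "extra_config": {
        "storage_plugin.memkv.module_path": "memkv_lmcache.backend",
        "storage_plugin.memkv.class_name": "MemKVStorageBackend",
    },
}
print(check_lmcache_config(example))
```

If PyYAML is installed, the dictionary can come from yaml.safe_load on the file that LMCACHE_CONFIG_FILE points at.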
chunk_size is the LMCache write granularity in tokens. Only
prompts longer than chunk_size produce backend traffic — each
completed chunk becomes one batched store call.
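To make the granularity concrete, the following sketch estimates how many chunks a prompt produces. It assumes that only completely filled chunks are stored and that the partial tail generates no backend traffic. Behavior at the exact chunk_size boundary is not specified on this page, and full_chunks is an illustrative helper.

```python
def full_chunks(prompt_tokens: int, chunk_size: int = 256) -> tuple[int, int]:
    """Return (number of full chunks, tokens left in the partial tail)."""
    if prompt_tokens < 0 or chunk_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and chunk_size > 0")
    return divmod(prompt_tokens, chunk_size)


chunks, tail = full_chunks(1000)  # 3 full chunks (768 tokens), 232-token tail
short_chunks, short_tail = full_chunks(200)  # 0 chunks: nothing reaches MemKV
```

Under that assumption, a 1000-token prompt with chunk_size 256 yields three stored chunks, and a 200-token prompt yields none.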
Step 4: configure the MemKV connection
The plugin reads the standard MemKV config chain — MEMKV_CONFIG
yaml first, then MEMKV_* env vars. For most operators the env
vars are enough:
export MEMKV_SERVERS="host-a:9900,host-b:9900"
export MEMKV_AUTH_KEY="<64-hex-char auth key>"
export MEMKV_TRANSPORT=tcp # or auto / rdma
export MEMKV_LICENSE=/path/to/minio.license
export MEMKV_RDMA_DEVICES="mlx5_0,mlx5_1" # required for the RDMA fast path
export MEMKV_STAGING_SIZE_MB=1024 # per-rank staging pool
export MEMKV_STAGING_SLOT_MB=16 # per-op staging slot

MEMKV_TRANSPORT=tcp is the right default inside containers
without /dev/infiniband exposed. For the RDMA fast path inside
Docker, add --device=/dev/infiniband and
--cap-add=IPC_LOCK --ulimit memlock=-1 to the run command and
set MEMKV_TRANSPORT=auto (or rdma to fail loudly if no HCA is
visible). MEMKV_RDMA_DEVICES is required when RDMA is selected;
without it the plugin has no HCA to bind and falls back to TCP.
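A preflight check on the environment can catch the common mistakes before vLLM starts. The sketch below applies only the rules stated on this page: a 64-hex-character key, a transport of tcp, auto, or rdma, and devices listed when rdma is forced. It assumes configuration comes from env vars alone, so a setup that relies on MEMKV_CONFIG would need a different check. check_memkv_env is a hypothetical helper.

```python
import string

VALID_TRANSPORTS = {"tcp", "auto", "rdma"}


def check_memkv_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in MEMKV_* settings."""
    problems = []
    if not env.get("MEMKV_SERVERS"):
        problems.append("MEMKV_SERVERS is not set")
    key = env.get("MEMKV_AUTH_KEY", "")
    if len(key) != 64 or any(c not in string.hexdigits for c in key):
        problems.append("MEMKV_AUTH_KEY must be 64 hex characters (32 bytes)")
    transport = env.get("MEMKV_TRANSPORT")
    if transport is not None and transport not in VALID_TRANSPORTS:
        problems.append(f"MEMKV_TRANSPORT {transport!r} is not tcp, auto, or rdma")
    if transport == "rdma" and not env.get("MEMKV_RDMA_DEVICES"):
        problems.append("MEMKV_RDMA_DEVICES is required when MEMKV_TRANSPORT=rdma")
    if not env.get("MEMKV_LICENSE"):
        problems.append("MEMKV_LICENSE is not set")
    return problems
```

Running check_memkv_env(dict(os.environ)) in the launch script and aborting on a non-empty result surfaces these errors earlier than the plugin's own startup would.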
The staging knobs control how much pinned host memory each TP
worker reserves. With TP=8, the per-host commitment is 8 ×
MEMKV_STAGING_SIZE_MB.
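The arithmetic is simple enough to sketch. The example below computes the per-host pinned-memory commitment for the values shown earlier in this step. It assumes all TP workers run on one host. The slots_per_rank figure assumes the pool is divided into equal slots of MEMKV_STAGING_SLOT_MB, which this page does not state explicitly. staging_footprint is an illustrative helper.

```python
def staging_footprint(tp_size: int, staging_size_mb: int = 1024,
                      staging_slot_mb: int = 16) -> dict[str, float]:
    """Estimate pinned host memory reserved for staging on one host."""
    per_host_mb = tp_size * staging_size_mb
    return {
        "per_host_mb": per_host_mb,
        "per_host_gib": per_host_mb / 1024,
        # ASSUMPTION: the pool is carved into equal, fixed-size slots.
        "slots_per_rank": staging_size_mb // staging_slot_mb,
    }


footprint = staging_footprint(tp_size=8)
# 8 workers x 1024 MB = 8192 MB, i.e. 8 GiB of pinned host memory.
```

This commitment is separate from max_local_cpu_size, so both need to fit within the host's available RAM.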
Step 5: launch vLLM with the LMCache connector
Tell vLLM to use LMCacheConnectorV1 via --kv-transfer-config,
and point LMCache at the yaml from step 3 via the
LMCACHE_CONFIG_FILE env var:
docker run -d --name vllm-memkv \
  --runtime=nvidia --net=host --shm-size=64g --ipc=host \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e MEMKV_SERVERS="host-a:9900,host-b:9900" \
  -e MEMKV_AUTH_KEY="$AUTH_KEY" \
  -e MEMKV_TRANSPORT=tcp \
  -e MEMKV_LICENSE=/minio.license \
  -e LMCACHE_CONFIG_FILE=/lmcache.yaml \
  -v /path/to/models:/inference-models:ro \
  -v /path/to/minio.license:/minio.license:ro \
  -v /path/to/lmcache.yaml:/lmcache.yaml:ro \
  -v /path/to/wheels:/plugins:ro \
  --entrypoint bash \
  vllm/vllm-openai:<tag> -lc '
    pip install lmcache /plugins/memkv_lmcache-*.whl &&
    vllm serve /inference-models/<model-dir> \
      --host 0.0.0.0 --port 8810 \
      --tensor-parallel-size <N> \
      --trust-remote-code \
      --max-model-len <max_seq_len> \
      --gpu-memory-utilization 0.85 \
      --enable-prefix-caching \
      --kv-transfer-config "{\"kv_connector\":\"LMCacheConnectorV1\",\"kv_role\":\"kv_both\"}"
  '

A clean startup logs the following lines (per TP worker, in order):
- Successfully installed memkv-lmcache-<version> - pip ran fine.
- LMCache INFO: Creating LMCacheEngine with config: {... storage_plugins: ['memkv'] ...} - your yaml was picked up.
- LMCache INFO: Created dynamic backend: memkv - LMCache loaded the plugin module.
- INFO Creating memkv engine config=PluginConfig {...} - the plugin constructed its Engine.
- INFO memkv-client license verified plan=... - the license is valid.
- INFO Server mapped to rail server="..." - each configured MemKV server is registered with the router.
- INFO MemKVStorageBackend ready (servers=[...], rdma=..., dst_device=...) - the plugin is fully online.
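The startup sequence above lends itself to an automated check. The sketch below scans captured container logs for a fixed substring of each expected line, in order, and reports the first one that is missing. The substrings are copied from the lines listed on this page. The exact log wording may differ between versions, and first_missing_marker is a hypothetical helper.

```python
from typing import Optional

# Stable fragments of the expected startup lines, in order.
STARTUP_MARKERS = [
    "Successfully installed memkv-lmcache",
    "Creating LMCacheEngine with config",
    "Created dynamic backend: memkv",
    "Creating memkv engine",
    "memkv-client license verified",
    "Server mapped to rail",
    "MemKVStorageBackend ready",
]


def first_missing_marker(log_text: str) -> Optional[str]:
    """Return the first marker not found in order, or None if all appear."""
    pos = 0
    for marker in STARTUP_MARKERS:
        idx = log_text.find(marker, pos)
        if idx == -1:
            return marker
        pos = idx + len(marker)
    return None
```

Feeding it the output of docker logs vllm-memkv points at the stage where startup stalled. For example, a result of "memkv-client license verified" suggests the license mount or MEMKV_LICENSE is the place to look.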
When the first prompt larger than chunk_size arrives, LMCache
will log lines like:
LMCache INFO: [req_id=...] Stored ... tokens. ...
... offload_time: ... put_time: ...

put_time is exactly the time the plugin spent inside MemKV
put calls.
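To trigger that log line on demand, a smoke-test request with a deliberately long prompt is enough. The sketch below builds a prompt of 4 × chunk_size words, on the assumption that each word maps to at least one token, and posts it to the OpenAI-compatible completions endpoint on the port used in the launch command. The model name is assumed to be the path passed to vllm serve, and the helper names are illustrative.

```python
import json
import urllib.request


def build_long_prompt(chunk_size: int = 256, margin: int = 4) -> str:
    """Build a prompt with margin x chunk_size words to exceed one chunk."""
    return " ".join(["cache"] * (chunk_size * margin))


def build_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Prepare a POST to the OpenAI-compatible /v1/completions endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 1})
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Placeholder model path, as in the launch command on this page.
req = build_request("http://localhost:8810",
                    "/inference-models/<model-dir>",
                    build_long_prompt())
# result = send(req)  # uncomment once the server is up
```

After the request completes, the container logs should contain the Stored line for that request.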
What this integration buys
- Capacity beyond per-replica RAM. Once LMCache's local CPU pool fills, evicted chunks flow to MemKV instead of being dropped. The aggregate prefix cache becomes (host RAM × replicas) + MemKV cluster.
- Cross-replica sharing. Multiple vLLM processes pointed at the same MemKV cluster share one chunk pool — a system prompt used in many replicas is stored once.
- Durability. On-drive shards survive vLLM and LMCache restarts. (See Operational notes for the cold-restart caveat.)
- HMAC auth. Every op is HMAC-authenticated with the cluster-wide shared key.
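For readers unfamiliar with shared-key HMAC, the sketch below shows the general technique using Python's standard library. It is not MemKV's wire format. The hash function (SHA-256 here), the message framing, and any replay protection are assumptions, since this page does not describe them. The key is hex-decoded from the same 64-character form used for MEMKV_AUTH_KEY.

```python
import hashlib
import hmac


def sign(key_hex: str, message: bytes) -> bytes:
    """Compute an HMAC tag over message with a hex-encoded shared key."""
    key = bytes.fromhex(key_hex)
    # ASSUMPTION: SHA-256; MemKV's actual digest is not documented here.
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key_hex: str, message: bytes, tag: bytes) -> bool:
    """Check a tag in constant time to avoid timing side channels."""
    return hmac.compare_digest(sign(key_hex, message), tag)


shared_key = "00" * 32  # placeholder only; never use an all-zero key
tag = sign(shared_key, b"put chunk-1")
assert verify(shared_key, b"put chunk-1", tag)
assert not verify(shared_key, b"put chunk-2", tag)
```

Because the key is cluster-wide, every vLLM replica and every MemKV node must hold the same value.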
Operational notes
- Per-rank, per-LMCacheEngine connections. With TP=N you will see N sets of TCP sessions per MemKV server. LMCache instantiates one CacheEngine per worker, and each gets its own MemKV client.
- Cross-restart cold start is MVP-restricted. The plugin tracks per-key shape/dtype in an in-process dict that get_blocking needs to size the receive buffer. The bytes survive in MemKV across restarts; the dict does not. A fresh process re-prefills until traffic rebuilds the dict.
- Long keys collapse to a digest. MemKV's wire protocol caps keys at 512 bytes; LMCache cache keys longer than 480 bytes are replaced by a deterministic memkv-h2:<digest> form. The 32-byte headroom leaves room for the memkv-h2: prefix and digest.
- License is mandatory. The plugin verifies the license at construction; without one, LMCache fails to bring the backend up and vLLM aborts startup. Mount the license file and set MEMKV_LICENSE (or configure via MEMKV_CONFIG).
- pin/unpin are local-only. MemKV has no per-client retention; the methods are presence checks against the local meta dict. Server-side eviction is owned by the MemKV cluster.
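The long-key rule can be illustrated with a short sketch. Keys of up to 480 bytes pass through unchanged, and longer keys are replaced by a prefixed digest. The digest algorithm is an assumption (SHA-256 hex is used here only to show the shape). The plugin's actual digest and encoding are not documented on this page, and wire_key is a hypothetical helper.

```python
import hashlib

MAX_RAW_KEY_BYTES = 480  # threshold stated on this page
DIGEST_PREFIX = "memkv-h2:"


def wire_key(cache_key: str) -> str:
    """Return the key unchanged if short enough, else a prefixed digest."""
    raw = cache_key.encode("utf-8")
    if len(raw) <= MAX_RAW_KEY_BYTES:
        return cache_key
    # ASSUMPTION: SHA-256 hex digest; the real plugin may differ.
    return DIGEST_PREFIX + hashlib.sha256(raw).hexdigest()


short = wire_key("model@1@0@abc123")  # passes through untouched
long_form = wire_key("k" * 481)       # collapses to memkv-h2:<digest>
```

The mapping is deterministic, so every replica derives the same wire key for the same LMCache key, which is what keeps cross-replica sharing working for long keys.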
Roadmap
- Wire-stored shape/dtype so a fresh vLLM/LMCache process can rehydrate from MemKV without losing the prior session's chunks.
- RDMA fast path inside Docker by exposing /dev/infiniband and the matching capabilities; the plugin flips onto an RDMA READ direct-into-buffer path for reads.
- Async put so the per-chunk wire latency does not show up on the request hot path.