vLLM + MemKV

Run vLLM with MemKV as the durable, shareable storage tier behind LMCache. Set up the plugin, point LMCache at it, and let vLLM serve.

vLLM offloads its KV-cache transport through LMCacheConnectorV1, which hands per-chunk store/retrieve to LMCache. LMCache loads storage backends through a dynamic plugin loader, and the MemKV plugin slots in there. vLLM never sees MemKV directly — it talks to LMCache, LMCache talks to the plugin, the plugin talks to a MemKV cluster.

The rest of this page is the wire-up: get a vLLM serve invocation landing KV chunks in MemKV and reading them back through LMCache.

What you need

  • A running MemKV cluster (one or more nodes).
  • A MemKV license file (minio.license).
  • The MemKV auth key (32-byte HMAC, hex-encoded).
  • The memkv_lmcache wheel for your target platform.
  • An installation of LMCache and vLLM (the official vllm/vllm-openai image already has vLLM; LMCache is a pip-install away).
  • One or more RDMA NICs visible on the GPU host if you want the fast path — set MEMKV_RDMA_DEVICES=mlx5_0,mlx5_1 to bind them.

Step 1: bring up MemKV

Use the standard MemKV deployment flow. Once the cluster is up, each node listens on TCP :9900 for the wire protocol and, by default, on HTTP :9901 for admin (the admin port is data_port + 1).

Step 2: install the plugin wheel

The memkv_lmcache wheel is a Python package with a native extension. Download the build that matches your platform and install it alongside lmcache into the same Python environment vLLM and LMCache run in:

curl -LO https://dl.minio.io/aistor/memkv/release/linux-amd64/memkv_lmcache-latest-cp39-abi3-linux_x86_64.whl
pip install lmcache ./memkv_lmcache-latest-cp39-abi3-linux_x86_64.whl

Inside a container, mount the wheel directory and pip-install at startup (see the docker run example in Step 5).
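Before moving on, it can help to confirm that all three packages resolve in the same interpreter. The following is a minimal sketch, not part of the plugin: the helper names are hypothetical, and the top-level module names vllm and lmcache are assumed from the pip package names (memkv_lmcache comes from the module_path used in Step 3).

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Some broken or partially installed packages raise instead of returning None.
        return False


def check_environment(modules=("vllm", "lmcache", "memkv_lmcache")) -> dict:
    """Map each module name to whether it resolves in this interpreter."""
    return {name: module_available(name) for name in modules}


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it with the same Python binary that launches vllm serve; a MISSING entry usually means the wheel went into a different environment.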

Step 3: write the LMCache config

Create an LMCache YAML file that selects the MemKV backend through storage_plugins. Two settings are mandatory:

  • local_cpu: True and max_local_cpu_size > 0. LMCache passes its CPU pool down to the plugin so it has somewhere to allocate retrieved tensors.
  • storage_plugins: memkv. This names the dynamic-loaded backend. The matching extra_config block tells LMCache which Python module to import.

chunk_size: 256
local_cpu: True
max_local_cpu_size: 60
storage_plugins: memkv
extra_config:
  storage_plugin.memkv.module_path: memkv_lmcache.backend
  storage_plugin.memkv.class_name: MemKVStorageBackend
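
The mandatory settings above can be checked mechanically before vLLM is launched. This is a rough sketch that operates on an already-parsed config dictionary (for example the result of loading the YAML with a parser of your choice); validate_lmcache_config is a hypothetical helper, not an LMCache API.

```python
def validate_lmcache_config(cfg: dict) -> list:
    """Return a list of problems with the MemKV-related LMCache settings."""
    problems = []

    if cfg.get("local_cpu") is not True:
        problems.append("local_cpu must be True")

    size = cfg.get("max_local_cpu_size", 0)
    if not isinstance(size, (int, float)) or size <= 0:
        problems.append("max_local_cpu_size must be > 0 (GiB)")

    # Accept either a comma-separated string or a list of plugin names.
    plugins = cfg.get("storage_plugins", [])
    if isinstance(plugins, str):
        plugins = [p.strip() for p in plugins.split(",")]
    if "memkv" not in plugins:
        problems.append("storage_plugins must include memkv")

    extra = cfg.get("extra_config") or {}
    for key in ("storage_plugin.memkv.module_path",
                "storage_plugin.memkv.class_name"):
        if not extra.get(key):
            problems.append(f"extra_config is missing {key}")

    return problems


if __name__ == "__main__":
    example = {
        "chunk_size": 256,
        "local_cpu": True,
        "max_local_cpu_size": 60,
        "storage_plugins": "memkv",
        "extra_config": {
            "storage_plugin.memkv.module_path": "memkv_lmcache.backend",
            "storage_plugin.memkv.class_name": "MemKVStorageBackend",
        },
    }
    print(validate_lmcache_config(example) or "config looks complete")
```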

max_local_cpu_size is in GiB. Size it so that LMCache has working room for its host pool: small values (single-digit GiB) starve the local tier and force every read through MemKV, while very large values can contend with CUDA on systems where pinned host memory backs NVLink transfers.

chunk_size is the LMCache write granularity in tokens. Only prompts longer than chunk_size produce backend traffic — each completed chunk becomes one batched store call.
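
As a quick illustration of the chunking arithmetic, the sketch below estimates how many store-worthy chunks a prompt yields. It assumes that only completed chunks are written and that a trailing partial chunk stays out of the backend; how a prompt of exactly chunk_size tokens is treated is not specified here, so do not rely on the sketch for that boundary. completed_chunks is a hypothetical helper.

```python
def completed_chunks(prompt_tokens: int, chunk_size: int = 256) -> int:
    """Number of full chunks in a prompt (assumes partial chunks are not stored)."""
    if prompt_tokens < 0 or chunk_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and chunk_size > 0")
    return prompt_tokens // chunk_size


if __name__ == "__main__":
    for tokens in (200, 1000, 4096):
        print(tokens, "tokens ->", completed_chunks(tokens), "chunk(s)")
```

With the default chunk_size of 256, a 200-token prompt yields no chunks, a 1000-token prompt yields 3, and a 4096-token prompt yields 16.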

Step 4: configure the MemKV connection

The plugin reads the standard MemKV config chain — MEMKV_CONFIG yaml first, then MEMKV_* env vars. For most operators the env vars are enough:

export MEMKV_SERVERS="host-a:9900,host-b:9900"
export MEMKV_AUTH_KEY="<64-hex-char auth key>"
export MEMKV_TRANSPORT=tcp           # or auto / rdma
export MEMKV_LICENSE=/path/to/minio.license
export MEMKV_RDMA_DEVICES="mlx5_0,mlx5_1"     # required for the RDMA fast path
export MEMKV_STAGING_SIZE_MB=1024             # per-rank staging pool
export MEMKV_STAGING_SLOT_MB=16               # per-op staging slot

MEMKV_TRANSPORT=tcp is the right default inside containers without /dev/infiniband exposed. For the RDMA fast path inside Docker, add --device=/dev/infiniband and --cap-add=IPC_LOCK --ulimit memlock=-1 to the run command and set MEMKV_TRANSPORT=auto (or rdma to fail loudly if no HCA is visible). MEMKV_RDMA_DEVICES is required when RDMA is selected; without it the plugin has no HCA to bind and falls back to TCP.
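
A small preflight check can catch malformed values before the plugin starts. The sketch below only encodes rules stated on this page (host:port server entries, a 64-hex-character auth key, the three transport values, RDMA devices when RDMA is in play, and a license source); check_memkv_env is a hypothetical helper, not something the plugin ships.

```python
import os
import re

HEX64 = re.compile(r"[0-9a-fA-F]{64}")


def check_memkv_env(env=None) -> list:
    """Return a list of problems found in the MEMKV_* settings."""
    env = os.environ if env is None else env
    problems = []

    servers = [s.strip() for s in env.get("MEMKV_SERVERS", "").split(",") if s.strip()]
    if not servers:
        problems.append("MEMKV_SERVERS is empty")
    for entry in servers:
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            problems.append(f"bad server entry: {entry!r} (expected host:port)")

    if not HEX64.fullmatch(env.get("MEMKV_AUTH_KEY", "")):
        problems.append("MEMKV_AUTH_KEY must be 64 hex characters")

    transport = env.get("MEMKV_TRANSPORT")
    if transport is not None and transport not in ("tcp", "auto", "rdma"):
        problems.append(f"unknown MEMKV_TRANSPORT: {transport!r}")
    if transport in ("auto", "rdma") and not env.get("MEMKV_RDMA_DEVICES"):
        problems.append("MEMKV_RDMA_DEVICES is not set, so no HCA can be bound")

    # The license may come from MEMKV_LICENSE or from the MEMKV_CONFIG yaml.
    if not env.get("MEMKV_LICENSE") and not env.get("MEMKV_CONFIG"):
        problems.append("neither MEMKV_LICENSE nor MEMKV_CONFIG is set")

    return problems


if __name__ == "__main__":
    for problem in check_memkv_env():
        print("WARN:", problem)
```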

The staging knobs control how much pinned host memory each TP worker reserves. With TP=8, the per-host commitment is 8 × MEMKV_STAGING_SIZE_MB.
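
To make the sizing concrete, here is the same multiplication as a tiny helper. It is a sketch under the assumption that every TP worker on the host reserves one full staging pool; staging_commitment_mb is a hypothetical name.

```python
def staging_commitment_mb(tp_size: int, staging_size_mb: int = 1024) -> int:
    """Pinned host memory reserved per host, in the same unit as MEMKV_STAGING_SIZE_MB."""
    return tp_size * staging_size_mb


if __name__ == "__main__":
    total = staging_commitment_mb(tp_size=8, staging_size_mb=1024)
    print(f"TP=8 with 1024 MB pools reserves {total} MB of pinned host memory")
```

With the values from the example above (TP=8, MEMKV_STAGING_SIZE_MB=1024) the host commits 8192 MB of pinned memory, which is worth budgeting alongside max_local_cpu_size.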

Step 5: launch vLLM with the LMCache connector

Tell vLLM to use LMCacheConnectorV1 via --kv-transfer-config, and point LMCache at the YAML file from Step 3 via the LMCACHE_CONFIG_FILE env var:

docker run -d --name vllm-memkv \
    --runtime=nvidia --net=host --shm-size=64g --ipc=host \
    -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    -e MEMKV_SERVERS="host-a:9900,host-b:9900" \
    -e MEMKV_AUTH_KEY="$AUTH_KEY" \
    -e MEMKV_TRANSPORT=tcp \
    -e MEMKV_LICENSE=/minio.license \
    -e LMCACHE_CONFIG_FILE=/lmcache.yaml \
    -v /path/to/models:/inference-models:ro \
    -v /path/to/minio.license:/minio.license:ro \
    -v /path/to/lmcache.yaml:/lmcache.yaml:ro \
    -v /path/to/wheels:/plugins:ro \
    --entrypoint bash \
    vllm/vllm-openai:<tag> -lc '
      pip install lmcache /plugins/memkv_lmcache-*.whl &&
      vllm serve /inference-models/<model-dir> \
        --host 0.0.0.0 --port 8810 \
        --tensor-parallel-size <N> \
        --trust-remote-code \
        --max-model-len <max_seq_len> \
        --gpu-memory-utilization 0.85 \
        --enable-prefix-caching \
        --kv-transfer-config "{\"kv_connector\":\"LMCacheConnectorV1\",\"kv_role\":\"kv_both\"}"
    '
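
The hand-escaped JSON passed to --kv-transfer-config is easy to get wrong. One option, sketched below, is to generate the argument with Python's json and shlex modules and paste the result into a launch script. The dictionary contents come from the command above; everything else is illustrative.

```python
import json
import shlex

# Same payload as the --kv-transfer-config value in the docker run example.
kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_both",
}

# Compact separators keep the JSON free of spaces, which simplifies quoting.
arg = json.dumps(kv_transfer_config, separators=(",", ":"))

if __name__ == "__main__":
    # shlex.quote wraps the JSON so a POSIX shell passes it through unchanged.
    print("--kv-transfer-config", shlex.quote(arg))
```

Note that shlex.quote produces a single-quoted string, so it suits a standalone script. Inside the single-quoted bash -lc '...' body used above, the backslash-escaped double-quote form remains the safer choice.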

A clean startup logs the following lines, per TP worker and in order:

  • Successfully installed memkv-lmcache-<version> — pip ran fine.
  • LMCache INFO: Creating LMCacheEngine with config: {... storage_plugins: ['memkv'] ...} — your yaml was picked up.
  • LMCache INFO: Created dynamic backend: memkv — LMCache loaded the plugin module.
  • INFO Creating memkv engine config=PluginConfig {...} — the plugin constructed its Engine.
  • INFO memkv-client license verified plan=... — the license is valid.
  • INFO Server mapped to rail server="..." — each configured MemKV server is registered with the router.
  • INFO MemKVStorageBackend ready (servers=[...], rdma=..., dst_device=...) — the plugin is fully online.
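
The startup sequence above can be checked against captured container logs. The sketch below looks for the fixed part of each listed line, in order, and reports the first one that is missing. The marker substrings are taken from the list above; first_missing_marker is a hypothetical helper, and it confirms that at least one complete sequence appears rather than checking every TP worker individually.

```python
STARTUP_MARKERS = [
    "Successfully installed memkv-lmcache-",
    "Creating LMCacheEngine with config",
    "Created dynamic backend: memkv",
    "Creating memkv engine",
    "memkv-client license verified",
    "Server mapped to rail",
    "MemKVStorageBackend ready",
]


def first_missing_marker(log_text: str):
    """Return the first expected marker not found in order, or None if all appear."""
    pos = 0
    for marker in STARTUP_MARKERS:
        idx = log_text.find(marker, pos)
        if idx == -1:
            return marker
        pos = idx + len(marker)
    return None


if __name__ == "__main__":
    import sys

    # Example: docker logs vllm-memkv 2>&1 | python check_startup.py
    missing = first_missing_marker(sys.stdin.read()) if not sys.stdin.isatty() else None
    print("startup incomplete, missing:" if missing else "no missing marker detected", missing or "")
```

The first missing marker points at the stage that failed: for example, a missing license line suggests checking MEMKV_LICENSE and the mounted file.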

When the first prompt larger than chunk_size arrives, LMCache will log lines like:

LMCache INFO: [req_id=...] Stored ... tokens. ...
                        ... offload_time: ... put_time: ...

put_time is exactly the time the plugin spent inside MemKV put calls.

What this integration buys

  • Capacity beyond per-replica RAM. Once LMCache's local CPU pool fills, evicted chunks flow to MemKV instead of being dropped. The aggregate prefix cache becomes (host RAM × replicas) + MemKV cluster.
  • Cross-replica sharing. Multiple vLLM processes pointed at the same MemKV cluster share one chunk pool — a system prompt used in many replicas is stored once.
  • Durability. On-drive shards survive vLLM and LMCache restarts. (See Operational notes for the cold-restart caveat.)
  • HMAC auth. Every op is HMAC-authenticated with the cluster-wide shared key.

Operational notes

  • Per-rank, per-LMCacheEngine connections. With TP=N you will see N sets of TCP sessions per MemKV server. LMCache instantiates one CacheEngine per worker, and each gets its own MemKV client.
  • Cross-restart cold start is MVP-restricted. The plugin tracks per-key shape/dtype in an in-process dict that get_blocking needs to size the receive buffer. The bytes survive in MemKV across restarts; the dict does not. A fresh process re-prefills until traffic rebuilds the dict.
  • Long keys collapse to a digest. MemKV's wire protocol caps keys at 512 bytes; LMCache cache keys longer than 480 bytes are replaced by a deterministic memkv-h2:<digest> form. The 32-byte headroom leaves room for the memkv-h2: prefix and digest.
  • License is mandatory. The plugin verifies the license at construction; without one, LMCache fails to bring the backend up and vLLM aborts startup. Mount the license file and set MEMKV_LICENSE (or configure via MEMKV_CONFIG).
  • pin / unpin are local-only. MemKV has no per-client retention; the methods are presence checks against the local meta dict. Server-side eviction is owned by the MemKV cluster.
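
The long-key rule above can be pictured with a short sketch. The 512-byte cap, the 480-byte threshold, and the memkv-h2: prefix come from the note; the hash function (SHA-256, hex-encoded) is an assumption made for illustration, since the digest the plugin actually uses is not stated here, and wire_key is a hypothetical name.

```python
import hashlib

MAX_KEY_BYTES = 512      # wire-protocol cap on key length
DIGEST_THRESHOLD = 480   # keys longer than this are replaced by a digest
PREFIX = "memkv-h2:"


def wire_key(cache_key: str) -> bytes:
    """Return the key as sent on the wire, collapsing long keys to a digest form."""
    raw = cache_key.encode("utf-8")
    if len(raw) <= DIGEST_THRESHOLD:
        return raw
    # ASSUMPTION: SHA-256 hex digest; the plugin's real digest may differ.
    digest = hashlib.sha256(raw).hexdigest()
    return (PREFIX + digest).encode("ascii")


if __name__ == "__main__":
    short = "model@1@0@abc123"
    long_key = "x" * 600
    print(len(wire_key(short)), len(wire_key(long_key)))
```

Because the mapping is deterministic, every replica derives the same wire key for the same LMCache key, which is what lets collapsed keys still be shared across replicas.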

Roadmap

  • Wire-stored shape/dtype so a fresh vLLM/LMCache process can rehydrate from MemKV without losing the prior session's chunks.
  • RDMA fast path inside Docker by exposing /dev/infiniband and the matching capabilities; the plugin flips onto an RDMA READ direct-into-buffer path for reads.
  • Async put so the per-chunk wire latency does not show up on the request hot path.

References