sglang + MemKV
Run sglang with MemKV as the durable, shareable storage tier behind HiCache. Set up the plugin, point sglang at it, and let HiCache flow KV pages into MemKV.
sglang's HiCache is a tiered prefix cache — a device cache on
the GPU plus a host cache in pinned CPU RAM — backed by a
pluggable storage tier through StorageBackendFactory. The
MemKV plugin slots in alongside the built-in file, nixl,
mooncake, hf3fs, eic, aibrix, and simm backends.
sglang never sees MemKV directly: it talks to the HiCache
controller, the controller talks to the plugin, the plugin
talks to a MemKV cluster.
The rest of this page is the wire-up: HiCache lands evicted host-cache pages in MemKV and reads them back on prefix-cache hits.
What you need
- A running MemKV cluster (one or more nodes).
- A MemKV license file (minio.license).
- The MemKV auth key (32-byte HMAC, hex-encoded).
- The memkv_sglang wheel for your target platform.
- An installation of sglang (the official lmsysorg/sglang image is the simplest path).
- One or more RDMA NICs visible on the GPU host if you want the fast path; set MEMKV_RDMA_DEVICES=mlx5_0,mlx5_1 to bind them.
Step 1: bring up MemKV
Use the standard MemKV deployment flow. Once the cluster is up,
each node listens on TCP :9900 for the wire protocol and on HTTP
:9901 for admin (the admin port defaults to data_port + 1).
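Before moving on, it can help to confirm that each node's ports are reachable from the GPU host. The sketch below parses a server list in the host:port form used for MEMKV_SERVERS and derives each admin endpoint under the default data_port + 1 rule. The helper names are hypothetical, and the check is a plain TCP connect, not a MemKV protocol handshake.

```python
import socket


def parse_servers(servers: str) -> list[tuple[str, int]]:
    """Split "host-a:9900,host-b:9900" into (host, data_port) pairs."""
    pairs = []
    for entry in servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        pairs.append((host, int(port)))
    return pairs


def admin_endpoint(host: str, data_port: int) -> tuple[str, int]:
    """Assumes the default layout, where the admin port is data_port + 1."""
    return host, data_port + 1


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection can be opened; says nothing about auth or licensing."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling tcp_reachable(host, port) for every pair returned by parse_servers, and again for admin_endpoint(host, port), gives a quick view of which nodes the host can see.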
Step 2: install the plugin wheel
The memkv_sglang wheel is a Python package with a native
extension. Download the build that matches your platform and
install it into the same Python environment sglang runs in:
curl -LO https://dl.minio.io/aistor/memkv/release/linux-amd64/memkv_sglang-latest-cp39-abi3-linux_x86_64.whl
pip install ./memkv_sglang-latest-cp39-abi3-linux_x86_64.whl
Inside a container, mount the wheel directory and pip-install at
startup (see the docker run example in Step 4).
Step 3: configure the MemKV connection
The plugin reads the standard MemKV config chain — MEMKV_CONFIG
yaml first, then MEMKV_* env vars. For most operators the env
vars are enough:
export MEMKV_SERVERS="host-a:9900,host-b:9900"
export MEMKV_AUTH_KEY="<64-hex-char auth key>"
export MEMKV_TRANSPORT=tcp # or auto / rdma
export MEMKV_LICENSE=/path/to/minio.license
export MEMKV_RDMA_DEVICES="mlx5_0,mlx5_1" # required for the RDMA fast path
export MEMKV_STAGING_SIZE_MB=1024 # per-rank staging pool
export MEMKV_STAGING_SLOT_MB=16 # per-op staging slot
MEMKV_TRANSPORT=tcp is the right default inside containers
without /dev/infiniband exposed. For the RDMA fast path inside
Docker, add --device=/dev/infiniband and
--cap-add=IPC_LOCK --ulimit memlock=-1 to the run command and
set MEMKV_TRANSPORT=auto (or rdma to fail loudly if no HCA is
visible). MEMKV_RDMA_DEVICES is required when RDMA is selected:
without it the plugin has no HCA to bind and falls back to TCP.
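A small preflight check can catch the most common mistakes in these variables before sglang starts. The sketch below is hypothetical and not part of the plugin. It covers only the env-var path (not MEMKV_CONFIG) and encodes just the rules stated above: a 32-byte key is 64 hex characters, the transport is tcp, auto, or rdma, and RDMA needs MEMKV_RDMA_DEVICES.

```python
import os
import re

VALID_TRANSPORTS = {"tcp", "auto", "rdma"}


def check_memkv_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not env.get("MEMKV_SERVERS"):
        problems.append("MEMKV_SERVERS is not set")
    # A 32-byte key, hex-encoded, is exactly 64 hex characters.
    if not re.fullmatch(r"[0-9a-fA-F]{64}", env.get("MEMKV_AUTH_KEY", "")):
        problems.append("MEMKV_AUTH_KEY is not 64 hex characters")
    transport = env.get("MEMKV_TRANSPORT")
    if transport is not None and transport not in VALID_TRANSPORTS:
        problems.append(f"MEMKV_TRANSPORT={transport!r} is not tcp, auto, or rdma")
    # Without a device list the plugin has no HCA to bind.
    if transport in {"auto", "rdma"} and not env.get("MEMKV_RDMA_DEVICES"):
        problems.append("MEMKV_RDMA_DEVICES is not set but RDMA was requested")
    return problems


if __name__ == "__main__":
    for problem in check_memkv_env(dict(os.environ)):
        print("memkv preflight:", problem)
```

Running it in the same shell or container entrypoint that launches sglang checks exactly the environment the plugin will see.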
The staging knobs control how much pinned host memory each rank
reserves for RDMA staging buffers. With TP=8, the per-host
commitment is 8 × MEMKV_STAGING_SIZE_MB.
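The arithmetic for the example values can be sketched as below. The per-host figure follows directly from the rule above. The slot count is an assumption that the pool is carved into equal slots of MEMKV_STAGING_SLOT_MB, which this page does not confirm.

```python
def staging_commitment_mb(tp_size: int, staging_size_mb: int) -> int:
    """Pinned host memory reserved across all ranks on one host, in MB."""
    return tp_size * staging_size_mb


def slots_per_rank(staging_size_mb: int, staging_slot_mb: int) -> int:
    """Assumption: the pool is divided into equal fixed-size slots."""
    return staging_size_mb // staging_slot_mb


# TP=8 with the example values from the env block above.
total_mb = staging_commitment_mb(tp_size=8, staging_size_mb=1024)
slots = slots_per_rank(staging_size_mb=1024, staging_slot_mb=16)
print(f"per-host pinned staging: {total_mb} MB, slots per rank: {slots}")
```

With TP=8 and a 1024 MB pool, the host needs 8192 MB of lockable memory for staging alone, on top of the HiCache host cache.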
Step 4: launch sglang with HiCache pointed at MemKV
Tell sglang to enable HiCache and select the dynamic storage backend pointing at the plugin's class:
docker run -d --name sglang-memkv \
--runtime=nvidia --net=host --shm-size=64g \
-e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e MEMKV_SERVERS="host-a:9900,host-b:9900" \
-e MEMKV_AUTH_KEY="$AUTH_KEY" \
-e MEMKV_TRANSPORT=tcp \
-e MEMKV_LICENSE=/minio.license \
-v /path/to/models:/inference-models:ro \
-v /path/to/minio.license:/minio.license:ro \
-v /path/to/wheels:/plugins:ro \
--entrypoint bash \
lmsysorg/sglang:<tag> -lc '
pip install /plugins/memkv_sglang-*.whl &&
python -m sglang.launch_server \
--model-path /inference-models/<model-dir> \
--host 0.0.0.0 --port 8810 \
--tp <N> \
--trust-remote-code \
--enable-hierarchical-cache \
--hicache-storage-backend dynamic \
--hicache-storage-backend-extra-config "{\"backend_name\":\"memkv\",\"module_path\":\"memkv_sglang.backend\",\"class_name\":\"MemKVHiCacheStorage\"}"
'
A clean startup logs (per TP rank, in order):
- Successfully installed memkv-sglang-<version>: pip ran fine.
- Creating dynamic storage backend 'memkv' (memkv_sglang.backend.MemKVHiCacheStorage): sglang loaded the plugin module.
- INFO Creating memkv engine config=PluginConfig {...}: the plugin constructed its Engine.
- INFO memkv-client license verified plan=...: the license is valid.
- INFO Server mapped to rail server="...": each configured MemKV server is registered with the router.
- INFO MemKVHiCacheStorage ready (servers=[...], rdma=..., suffix=...): the plugin is fully online.
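The escaped JSON passed to --hicache-storage-backend-extra-config in the launch command is easy to mistype by hand. One option is to generate the argument from a dictionary and let the standard library handle quoting. The sketch below reuses the three values from the launch command; the helper name is hypothetical.

```python
import json
import shlex

# Values copied from the launch command above.
extra_config = {
    "backend_name": "memkv",
    "module_path": "memkv_sglang.backend",
    "class_name": "MemKVHiCacheStorage",
}


def hicache_args(config: dict) -> list[str]:
    """Build the HiCache-related argv entries for sglang.launch_server."""
    return [
        "--enable-hierarchical-cache",
        "--hicache-storage-backend", "dynamic",
        "--hicache-storage-backend-extra-config",
        json.dumps(config, separators=(",", ":")),
    ]


# shlex.join quotes the JSON so it can be pasted into a shell command line.
print(shlex.join(hicache_args(extra_config)))
```

Passing the list directly to subprocess avoids shell quoting entirely; the printed form is for copying into a script.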
The suffix carries the model name plus, for non-MLA models,
the TP rank/size; for MLA models the TP dimensions are dropped
on purpose (MLA already produces TP-shape-independent keys).
When pipeline parallelism is used (pp_size > 1), the PP
coordinates are appended too. Net effect: multiple sglang
processes pointed at the same MemKV cluster do not collide on
keys.
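The suffix rules above can be illustrated with a small function. The string layout here is invented for illustration, since this page does not document the plugin's actual suffix format. Only the rules are taken from the text: the model name is always present, TP coordinates are present for non-MLA models only, and PP coordinates are present when pp_size > 1.

```python
def key_suffix(
    model: str,
    tp_rank: int,
    tp_size: int,
    is_mla: bool,
    pp_rank: int = 0,
    pp_size: int = 1,
) -> str:
    """Hypothetical suffix layout; the plugin's real format may differ."""
    parts = [model]
    if not is_mla:
        # Non-MLA pages are TP-shape dependent, so each rank gets its own keyspace.
        parts.append(f"tp{tp_rank}of{tp_size}")
    if pp_size > 1:
        parts.append(f"pp{pp_rank}of{pp_size}")
    return "_".join(parts)


print(key_suffix("model-a", tp_rank=0, tp_size=8, is_mla=False))
print(key_suffix("model-b", tp_rank=3, tp_size=8, is_mla=True))
```

Under these rules, two ranks of a non-MLA model never share a suffix, while every rank of an MLA model does.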
How HiCache uses MemKV
- Writes (host → MemKV): when HiCache decides to back a populated host-cache page up to storage (controlled by the --hicache-write-policy flag), the plugin's batched store call fires. Each page becomes one MemKV put keyed by the page hash plus the (model, tp_rank, tp_size) suffix.
- Reads (MemKV → host): when the radix tree finds a prefix node whose host pages have been evicted, HiCache asks the plugin to refill them. The plugin fetches bytes from MemKV and writes them into the host-cache buffer before prefill resumes.
- Existence checks: before triggering a multi-page restore, the controller asks the backend which pages are present. The plugin groups keys by primary server and issues one batched exists call per server.
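The existence-check batching can be sketched as follows. The placement rule (hash of the key modulo the server count) is an assumption made for illustration, since the plugin's real primary-server selection is not described here. exists_on is a hypothetical callback standing in for the batched exists call.

```python
import hashlib
from collections import defaultdict
from typing import Callable


def primary_server(key: str, servers: list[str]) -> str:
    """Assumed placement rule: stable hash of the key modulo the server count."""
    digest = hashlib.sha256(key.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def group_by_primary(keys: list[str], servers: list[str]) -> dict[str, list[str]]:
    """Bucket keys by the server that is assumed to own them."""
    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        groups[primary_server(key, servers)].append(key)
    return dict(groups)


def batch_exists(
    keys: list[str],
    servers: list[str],
    exists_on: Callable[[str, list[str]], list[bool]],
) -> dict[str, bool]:
    """Issue one exists_on call per server instead of one call per key."""
    result: dict[str, bool] = {}
    for server, group in group_by_primary(keys, servers).items():
        for key, present in zip(group, exists_on(server, group)):
            result[key] = present
    return result
```

The number of round trips is bounded by the number of servers rather than the number of pages, which matters when a restore spans many pages.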
To force traffic through the storage backend during testing,
shrink the host cache below the working set (--hicache-size
must remain larger than the device cache, per sglang's protocol
requirement, but a tighter ratio makes eviction routine).
What this integration buys
- Capacity beyond per-replica RAM. Once the host cache fills, evicted pages flow to MemKV instead of being dropped. The aggregate prefix cache becomes (host RAM × replicas) + MemKV cluster.
- Cross-replica sharing. sglang processes pointed at the same MemKV cluster share one chunk pool, keyed by (model, tp_rank, tp_size). Two replicas serving the same workload can read each other's evicted pages.
- Durability. On-drive shards survive sglang restarts. (See Operational notes for the cold-restart caveat.)
- HMAC auth. Every op is HMAC-authenticated with the cluster-wide shared key.
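For readers unfamiliar with HMAC, the sketch below shows the general idea using Python's standard library: a 32-byte shared key, hex-encoded to 64 characters, produces a tag that only holders of the same key can reproduce. SHA-256 and the message layout are assumptions; MemKV's actual digest and the fields it signs are not described on this page.

```python
import hashlib
import hmac
import secrets


def new_auth_key() -> str:
    """Generate a random 32-byte key, hex-encoded (64 characters)."""
    return secrets.token_hex(32)


def sign(key_hex: str, message: bytes) -> str:
    """Assumption: HMAC-SHA256; the real wire format is MemKV's own."""
    return hmac.new(bytes.fromhex(key_hex), message, hashlib.sha256).hexdigest()


def verify(key_hex: str, message: bytes, tag_hex: str) -> bool:
    """Constant-time comparison avoids leaking how many characters matched."""
    return hmac.compare_digest(sign(key_hex, message), tag_hex)


key = new_auth_key()
tag = sign(key, b"put page-0")
print(len(key), verify(key, b"put page-0", tag), verify(key, b"put page-1", tag))
```

Because the key is shared cluster-wide, every sglang process and every MemKV node must be configured with the same value.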
Operational notes
- No cold-restart auto-warming. HiCache builds its radix tree lazily from request traffic, so a fresh process never asks the backend whether prior keys exist for the same (model, tp_rank, tp_size); it just re-prefills. The bytes are still in MemKV; the engine doesn't ask. This is sglang behavior, not the plugin's. Workaround: warm the radix with synthetic traffic on startup, or wait for sglang to add a discover-on-startup flow.
- Per-rank connections. With TP=N you will see N TCP sessions per MemKV server. The plugin does not share the client across ranks because each rank lives in its own scheduler subprocess.
- License is mandatory. The plugin verifies the license at construction; without one it raises at load time and sglang aborts startup. Mount the license file and set MEMKV_LICENSE (or configure via MEMKV_CONFIG).
- Hybrid models (Mamba + MLA). sglang's HiCache + Mamba + MLA combination is bleeding-edge in upstream and trips its own internal kernels today. Pure-attention models (Llama, Qwen2.5, Mistral, Gemma) work cleanly. Mamba support in HiCache is on sglang's side to land.
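The synthetic-traffic workaround for cold restarts could look like the sketch below. It sends each known prefix (for example, a shared system prompt) to the server once with a one-token generation, so the radix tree is rebuilt from traffic. It assumes sglang's native /generate endpoint with text and sampling_params fields, and the port from the launch command above. The function names and the injectable post argument are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterable


def build_payload(prefix: str) -> dict:
    """One generated token is enough; the goal is the prefill, not the output."""
    return {"text": prefix, "sampling_params": {"max_new_tokens": 1, "temperature": 0}}


def http_post(url: str, payload: dict, timeout: float = 60.0) -> int:
    """POST JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def warm(
    base_url: str,
    prefixes: Iterable[str],
    post: Callable[[str, dict], int] = http_post,
) -> int:
    """Replay each prefix once; returns the number of requests sent."""
    sent = 0
    for prefix in prefixes:
        post(f"{base_url}/generate", build_payload(prefix))
        sent += 1
    return sent
```

A startup hook would call warm("http://localhost:8810", prefixes) after the server reports ready and before it is added to the load balancer.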
Roadmap
- Cross-restart cold start. Either a sglang-side discover-on-startup pass, or an out-of-band warming script that walks expected prefix keys and asks the plugin to load them.
- RDMA fast path inside Docker by exposing /dev/infiniband and the matching capabilities; the plugin flips onto an RDMA READ direct-into-buffer path for reads.
- Pre-registered host pool MR so RDMA reads do not pay a per-call buffer registration cost.