sglang + MemKV

Run sglang with MemKV as the durable, shareable storage tier behind HiCache. Set up the plugin, point sglang at it, and let HiCache flow KV pages into MemKV.

sglang's HiCache is a tiered prefix cache — a device cache on the GPU plus a host cache in pinned CPU RAM — backed by a pluggable storage tier through StorageBackendFactory. The MemKV plugin slots in alongside the built-in file, nixl, mooncake, hf3fs, eic, aibrix, and simm backends. sglang never sees MemKV directly: it talks to the HiCache controller, the controller talks to the plugin, the plugin talks to a MemKV cluster.

The rest of this page is the wire-up: HiCache lands evicted host-cache pages in MemKV and reads them back on prefix-cache hits.

What you need

  • A running MemKV cluster (one or more nodes).
  • A MemKV license file (minio.license).
  • The MemKV auth key (32-byte HMAC, hex-encoded).
  • The memkv_sglang wheel for your target platform.
  • An installation of sglang (the official lmsysorg/sglang image is the simplest path).
  • One or more RDMA NICs visible on the GPU host if you want the fast path — set MEMKV_RDMA_DEVICES=mlx5_0,mlx5_1 to bind them.

Step 1: bring up MemKV

Use the standard MemKV deployment flow. Once the cluster is up, each node listens on TCP :9900 for the wire protocol and on HTTP :9901 for admin. These are the defaults; the admin port is always data_port + 1.
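
Before moving on, it can help to confirm that both ports are reachable from the GPU host. The following is a minimal sketch using only the Python standard library; the helper name check_ports is made up for this page and is not part of MemKV. It only proves that a TCP connect succeeds, not that the node is healthy or that the auth key is accepted.

```python
import socket


def check_ports(servers, timeout=2.0):
    """Return {"host:port": bool} for a plain TCP connect to each address."""
    results = {}
    for addr in servers:
        host, _, port = addr.rpartition(":")
        try:
            # A successful connect only shows that something is listening.
            with socket.create_connection((host, int(port)), timeout=timeout):
                results[addr] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[addr] = False
    return results
```

For example, check_ports(["host-a:9900", "host-a:9901"]) probes the data and admin ports of one node, where host-a stands in for your own node name.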

Step 2: install the plugin wheel

The memkv_sglang wheel is a Python package with a native extension. Download the build that matches your platform and install it into the same Python environment sglang runs in:

curl -LO https://dl.minio.io/aistor/memkv/release/linux-amd64/memkv_sglang-latest-cp39-abi3-linux_x86_64.whl
pip install ./memkv_sglang-latest-cp39-abi3-linux_x86_64.whl

Inside a container, mount the wheel directory and pip-install at startup (see the docker run example in Step 4).

Step 3: configure the MemKV connection

The plugin reads the standard MemKV config chain — MEMKV_CONFIG yaml first, then MEMKV_* env vars. For most operators the env vars are enough:

export MEMKV_SERVERS="host-a:9900,host-b:9900"
export MEMKV_AUTH_KEY="<64-hex-char auth key>"
export MEMKV_TRANSPORT=tcp           # or auto / rdma
export MEMKV_LICENSE=/path/to/minio.license
export MEMKV_RDMA_DEVICES="mlx5_0,mlx5_1"     # required for the RDMA fast path
export MEMKV_STAGING_SIZE_MB=1024             # per-rank staging pool
export MEMKV_STAGING_SLOT_MB=16               # per-op staging slot

MEMKV_TRANSPORT=tcp is the right default inside containers without /dev/infiniband exposed. For the RDMA fast path inside Docker, add --device=/dev/infiniband and --cap-add=IPC_LOCK --ulimit memlock=-1 to the run command and set MEMKV_TRANSPORT=auto (or rdma to fail loudly if no HCA is visible). MEMKV_RDMA_DEVICES is required when RDMA is selected: without it the plugin has no HCA to bind and falls back to TCP.
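
A quick pre-flight check on these variables can catch the most common mistakes before sglang starts. The sketch below is an assumption-laden helper, not part of the plugin: check_memkv_env is a hypothetical name, it looks only at environment variables (so settings supplied through MEMKV_CONFIG are invisible to it), and it treats auto the same as rdma when checking for MEMKV_RDMA_DEVICES.

```python
import os
import re


def check_memkv_env(env=None):
    """Return a list of problems found in the MEMKV_* variables (empty = looks sane)."""
    env = os.environ if env is None else env
    problems = []

    servers = [s.strip() for s in env.get("MEMKV_SERVERS", "").split(",") if s.strip()]
    if not servers:
        problems.append("MEMKV_SERVERS is empty")

    # A 32-byte key, hex-encoded, is exactly 64 hex characters.
    if not re.fullmatch(r"[0-9a-fA-F]{64}", env.get("MEMKV_AUTH_KEY", "")):
        problems.append("MEMKV_AUTH_KEY is not 64 hex characters")

    transport = env.get("MEMKV_TRANSPORT")
    if transport is not None and transport not in ("tcp", "auto", "rdma"):
        problems.append(f"MEMKV_TRANSPORT={transport!r} is not tcp, auto, or rdma")
    # Assumption: flag a missing device list for both auto and rdma.
    if transport in ("auto", "rdma") and not env.get("MEMKV_RDMA_DEVICES"):
        problems.append("RDMA requested but MEMKV_RDMA_DEVICES is unset")

    if not env.get("MEMKV_LICENSE"):
        problems.append("MEMKV_LICENSE is unset")
    return problems
```

Calling check_memkv_env() with no arguments inspects the current process environment; passing a dict makes it easy to test a candidate configuration without exporting anything.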

The staging knobs control how much pinned host memory each rank reserves for RDMA staging buffers. With TP=8, the per-host commitment is 8 × MEMKV_STAGING_SIZE_MB.
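
As a worked example of that arithmetic, the sketch below computes the pinned-memory commitment for the values shown above. The function names are illustrative only, and the slot count assumes the pool is carved into fixed-size slots of MEMKV_STAGING_SLOT_MB, which this page implies but does not spell out.

```python
def staging_commitment_mb(tp_size, staging_size_mb):
    """Pinned host memory reserved across all TP ranks on one host, in MB."""
    return tp_size * staging_size_mb


def slots_per_rank(staging_size_mb, staging_slot_mb):
    """Assumed number of fixed-size staging slots in one rank's pool."""
    return staging_size_mb // staging_slot_mb


# Values from the export block above, with TP=8.
total_mb = staging_commitment_mb(tp_size=8, staging_size_mb=1024)
print(total_mb, "MB pinned per host")             # 8192 MB, i.e. 8 GiB
print(slots_per_rank(1024, 16), "slots per rank")  # 64
```

Budget this on top of the HiCache host cache itself, since both come out of pinned CPU RAM.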

Step 4: launch sglang with HiCache pointed at MemKV

Tell sglang to enable HiCache and select the dynamic storage backend pointing at the plugin's class:

docker run -d --name sglang-memkv \
    --runtime=nvidia --net=host --shm-size=64g \
    -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    -e MEMKV_SERVERS="host-a:9900,host-b:9900" \
    -e MEMKV_AUTH_KEY="$AUTH_KEY" \
    -e MEMKV_TRANSPORT=tcp \
    -e MEMKV_LICENSE=/minio.license \
    -v /path/to/models:/inference-models:ro \
    -v /path/to/minio.license:/minio.license:ro \
    -v /path/to/wheels:/plugins:ro \
    --entrypoint bash \
    lmsysorg/sglang:<tag> -lc '
      pip install /plugins/memkv_sglang-*.whl &&
      python -m sglang.launch_server \
        --model-path /inference-models/<model-dir> \
        --host 0.0.0.0 --port 8810 \
        --tp <N> \
        --trust-remote-code \
        --enable-hierarchical-cache \
        --hicache-storage-backend dynamic \
        --hicache-storage-backend-extra-config "{\"backend_name\":\"memkv\",\"module_path\":\"memkv_sglang.backend\",\"class_name\":\"MemKVHiCacheStorage\"}"
    '
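
Hand-escaping the JSON for --hicache-storage-backend-extra-config is easy to get wrong. One option, sketched here with the Python standard library, is to generate the string and let shlex do the quoting. The three keys and values are the ones from the command above; nothing else is assumed about the plugin.

```python
import json
import shlex

extra_config = {
    "backend_name": "memkv",
    "module_path": "memkv_sglang.backend",
    "class_name": "MemKVHiCacheStorage",
}

# Compact JSON, then quote it so a POSIX shell passes it through as one argument.
extra_config_json = json.dumps(extra_config, separators=(",", ":"))
flag = "--hicache-storage-backend-extra-config " + shlex.quote(extra_config_json)
print(flag)
```

shlex.quote wraps the value in single quotes, which suits a direct shell invocation. Inside the single-quoted bash -lc script used in the docker run example, single quotes cannot nest, which is why that example uses backslash-escaped double quotes instead.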

A clean startup logs the following lines (per TP rank, in order):

  • Successfully installed memkv-sglang-<version> — pip ran fine.
  • Creating dynamic storage backend 'memkv' (memkv_sglang.backend.MemKVHiCacheStorage) — sglang loaded the plugin module.
  • INFO Creating memkv engine config=PluginConfig {...} — the plugin constructed its Engine.
  • INFO memkv-client license verified plan=... — the license is valid.
  • INFO Server mapped to rail server="..." — each configured MemKV server is registered with the router.
  • INFO MemKVHiCacheStorage ready (servers=[...], rdma=..., suffix=...) — the plugin is fully online.
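
The startup sequence above can be checked mechanically. The sketch below searches captured log text for a fragment of each line in order and reports the first one that is missing. The fragments are copied from the list above; the function name is made up for illustration, and real log output may carry timestamps or prefixes around these fragments.

```python
EXPECTED_MARKERS = [
    "Successfully installed memkv-sglang-",
    "Creating dynamic storage backend 'memkv'",
    "Creating memkv engine",
    "memkv-client license verified",
    "Server mapped to rail",
    "MemKVHiCacheStorage ready",
]


def first_missing_marker(log_text, markers=EXPECTED_MARKERS):
    """Return the first marker that does not appear in order, or None if all do."""
    pos = 0
    for marker in markers:
        idx = log_text.find(marker, pos)
        if idx < 0:
            return marker
        # Continue searching after this match so ordering is enforced.
        pos = idx + len(marker)
    return None
```

Feeding it the output of docker logs sglang-memkv shows where startup stopped: a result of "memkv-client license verified", for example, points at the license mount or MEMKV_LICENSE.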

The suffix carries the model name plus, for non-MLA models, the TP rank/size; for MLA models the TP dimensions are dropped on purpose (MLA already produces TP-shape-independent keys). When pipeline parallelism is used (pp_size > 1), the PP coordinates are appended too. Net effect: multiple sglang processes pointed at the same MemKV cluster do not collide on keys.
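
The suffix rules can be illustrated with a short sketch. The string layout here (underscores, tpXofY, ppXofY) is invented for illustration; the plugin's actual suffix format is not documented on this page and may differ. Only the rules themselves come from the paragraph above.

```python
def build_suffix(model, tp_rank, tp_size, is_mla=False, pp_rank=0, pp_size=1):
    """Hypothetical suffix builder that mirrors the rules described above."""
    parts = [model]
    if not is_mla:
        # Non-MLA KV pages depend on the TP shard, so rank and size go into the key.
        parts.append(f"tp{tp_rank}of{tp_size}")
    if pp_size > 1:
        # PP coordinates are appended only when pipeline parallelism is in use.
        parts.append(f"pp{pp_rank}of{pp_size}")
    return "_".join(parts)


print(build_suffix("llama", 3, 8))               # llama_tp3of8
print(build_suffix("dsv3", 3, 8, is_mla=True))   # dsv3
```

Under these rules, two MLA deployments of the same model produce the same suffix regardless of TP size, while non-MLA deployments share keys only when their TP rank and size match.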

How HiCache uses MemKV

  • Writes (host → MemKV): when HiCache decides to back a populated host-cache page up to storage (controlled by the --hicache-write-policy flag), the plugin's batched store call fires. Each page becomes one MemKV put keyed by the page hash plus the (model, tp_rank, tp_size) suffix.
  • Reads (MemKV → host): when the radix tree finds a prefix node whose host pages have been evicted, HiCache asks the plugin to refill them. The plugin fetches bytes from MemKV and writes them into the host-cache buffer before prefill resumes.
  • Existence checks: before triggering a multi-page restore, the controller asks the backend which pages are present. The plugin groups keys by primary server and issues one batched exists call per server.
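
The existence-check pattern (group keys by primary server, then one batched call per server) can be sketched as follows. Everything named here is hypothetical: primary_server uses a simple hash-modulo rule as a stand-in for MemKV's real placement logic, and exists_on_server is a caller-supplied function standing in for the wire call.

```python
import hashlib
from collections import defaultdict


def primary_server(key, servers):
    """Stand-in placement rule (hash modulo); MemKV's real routing may differ."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def batched_exists(keys, servers, exists_on_server):
    """Return one bool per key, issuing a single exists call per server."""
    groups = defaultdict(list)
    for key in keys:
        groups[primary_server(key, servers)].append(key)

    present = {}
    for server, group in groups.items():
        # exists_on_server(server, keys) -> list of bools, same order as keys
        flags = exists_on_server(server, group)
        present.update(zip(group, flags))

    # Restore the caller's original key order.
    return [present[key] for key in keys]
```

The point of the grouping is that a restore spanning many pages costs at most one round trip per server rather than one per page.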

To force traffic through the storage backend during testing, shrink the host cache below the working set so that eviction becomes routine. Note that --hicache-size must remain larger than the device cache, per sglang's protocol requirement.

What this integration buys

  • Capacity beyond per-replica RAM. Once the host cache fills, evicted pages flow to MemKV instead of being dropped. The aggregate prefix cache becomes (host RAM × replicas) + MemKV cluster.
  • Cross-replica sharing. sglang processes pointed at the same MemKV cluster share one chunk pool, keyed by (model, tp_rank, tp_size). Two replicas whose key suffixes match (same model and, for non-MLA models, the same TP rank and size) can read each other's evicted pages.
  • Durability. On-drive shards survive sglang restarts. (See Operational notes for the cold-restart caveat.)
  • HMAC auth. Every op is HMAC-authenticated with the cluster-wide shared key.

Operational notes

  • No cold-restart auto-warming. HiCache builds its radix tree lazily from request traffic, so a fresh process never asks the backend whether prior keys exist for the same (model, tp_rank, tp_size) — it just re-prefills. The bytes are still in MemKV; the engine doesn't ask. This is sglang behavior, not the plugin's. Workaround: warm the radix with synthetic traffic on startup, or wait for sglang to add a discover-on-startup flow.
  • Per-rank connections. With TP=N you will see N TCP sessions per MemKV server. The plugin does not share the client across ranks because each rank lives in its own scheduler subprocess.
  • License is mandatory. The plugin verifies the license at construction; without one it raises at load time and sglang aborts startup. Mount the license file and set MEMKV_LICENSE (or configure via MEMKV_CONFIG).
  • Hybrid models (Mamba + MLA). The combination of HiCache, Mamba, and MLA is still experimental in upstream sglang and currently fails inside sglang's own kernels. Pure-attention models (Llama, Qwen2.5, Mistral, Gemma) work cleanly. Mamba support in HiCache has to land on the sglang side.
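
For the cold-restart workaround mentioned in the first note, a warm-up pass can be as simple as replaying known prefixes against the server right after startup. The sketch below posts each prompt to sglang's native /generate endpoint with a one-token budget. The function names, the port (8810, from the launch example), and the choice of prompts are assumptions; which prefixes are worth warming depends on your workload.

```python
import json
import urllib.request


def build_warmup_request(base_url, prompt):
    """Build a POST to /generate that prefills the prompt and emits one token."""
    payload = {"text": prompt, "sampling_params": {"max_new_tokens": 1}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm(base_url, prompts, send=urllib.request.urlopen):
    """Send each prompt once and return how many requests completed."""
    done = 0
    for prompt in prompts:
        # `send` is injectable so the loop can be exercised without a live server.
        with send(build_warmup_request(base_url, prompt), timeout=300) as resp:
            resp.read()
        done += 1
    return done
```

A typical call would be warm("http://127.0.0.1:8810", shared_system_prompts), where shared_system_prompts is your own list of long, frequently reused prefixes. This rebuilds the radix tree through ordinary request traffic, as the note describes.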

Roadmap

  • Cross-restart cold start. Either an sglang-side discover-on-startup pass, or an out-of-band warming script that walks expected prefix keys and asks the plugin to load them.
  • RDMA fast path inside Docker. With /dev/infiniband and the matching capabilities exposed to the container, the plugin flips onto an RDMA READ direct-into-buffer path for reads.
  • Pre-registered host pool MR so RDMA reads do not pay a per-call buffer registration cost.

References