llama.cpp + MemKV

Durable KV store for llama-server backed by MemKV. Multi-turn chats, multi-tenant deployments, and agent loops resume in milliseconds instead of re-prefilling tokens.

llama-server keeps per-session KV state live in GPU memory only. When the populating request finishes, that state is gone — every next turn re-prefills the same tokens.

The slot save/restore HTTP API addresses this by writing per-session KV state to disk and reading it back on demand. The default v1 format writes one opaque blob per session and rewrites the whole file on every save; overhead grows linearly with session size.

We've shipped a v2:

  • Manifest plus content-addressed chunks (xxh3-64). The manifest is a few kilobytes; chunks are deduplicated across saves and sessions.
  • A pluggable backend ABI loaded via dlopen. In-tree backend writes to local disk; the MemKV backend ships as libkv_store_memkv.{so,dylib} and writes chunks to a MemKV cluster.
  • EXISTS-skip + pipelined batch_put collapse a 96-chunk save into one EXISTS round-trip plus only the missing chunks. A prefetch hint collapses restore into one batched fetch.
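The manifest-plus-chunks idea above can be sketched in a few lines. This is a minimal illustration, not the llama.cpp implementation: the store is a plain dict, the function names are hypothetical, and an 8-byte BLAKE2b digest from the standard library stands in for xxh3-64.

```python
import hashlib

def chunk_id(data: bytes) -> str:
    # Stand-in for xxh3-64 (assumption): 8-byte BLAKE2b digest from the stdlib.
    return hashlib.blake2b(data, digest_size=8).hexdigest()

def save_session(store: dict, windows: list) -> tuple:
    """Store each window once, keyed by content hash.

    Returns (manifest, chunks_written). The manifest is just the ordered
    list of chunk ids, which is why it stays small.
    """
    manifest = [chunk_id(w) for w in windows]
    written = 0
    for cid, w in zip(manifest, windows):
        if cid not in store:   # chunk already present: skip the upload
            store[cid] = w
            written += 1
    return manifest, written

def restore_session(store: dict, manifest: list) -> list:
    # Rebuild the session by looking up every chunk id in order.
    return [store[cid] for cid in manifest]

store = {}
turn1 = [b"system-prompt-window", b"user-turn-1"]
turn2 = turn1 + [b"user-turn-2"]      # next turn only appends a tail

m1, w1 = save_session(store, turn1)   # writes 2 chunks
m2, w2 = save_session(store, turn2)   # writes only the new tail chunk
```

The second save writes one chunk because the first two hashes are already in the store; the same mechanism is what lets separate sessions share a common prefix.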

The contract is the kv_store_v1 ABI — a vendor-neutral C interface any engine can consume and any storage vendor can implement. MemKV is one reference backend; nothing prevents Redis, S3, FoundationDB, or a homegrown shard layer from shipping another.
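To make the "any vendor can implement it" point concrete, here is a rough Python rendering of the operations this page names for the ABI (chunk put/get, manifest put/get/delete). The real contract is a C header; the class and method names below are assumptions chosen for illustration, and the in-memory class is only a toy backend.

```python
from typing import Optional, Protocol

class ChunkStore(Protocol):
    """Hypothetical Python view of the chunk-store operations."""
    def chunk_put(self, chunk_id: str, data: bytes) -> None: ...
    def chunk_get(self, chunk_id: str) -> Optional[bytes]: ...
    def manifest_put(self, session: str, manifest: bytes) -> None: ...
    def manifest_get(self, session: str) -> Optional[bytes]: ...
    def manifest_delete(self, session: str) -> None: ...

class InMemoryStore:
    """Toy backend; a Redis or S3 backend would expose the same methods."""
    def __init__(self) -> None:
        self.chunks = {}
        self.manifests = {}

    def chunk_put(self, chunk_id: str, data: bytes) -> None:
        self.chunks[chunk_id] = data

    def chunk_get(self, chunk_id: str) -> Optional[bytes]:
        return self.chunks.get(chunk_id)

    def manifest_put(self, session: str, manifest: bytes) -> None:
        self.manifests[session] = manifest

    def manifest_get(self, session: str) -> Optional[bytes]:
        return self.manifests.get(session)

    def manifest_delete(self, session: str) -> None:
        self.manifests.pop(session, None)

def roundtrip(store: ChunkStore) -> Optional[bytes]:
    # Engine-side code depends only on the interface, never on the backend.
    store.chunk_put("abc", b"kv-bytes")
    store.manifest_put("session-1", b"abc")
    cid = store.manifest_get("session-1").decode()
    return store.chunk_get(cid)

result = roundtrip(InMemoryStore())
```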

Why MemKV is the natural backend

Nothing in the llama-server build is MemKV-specific: the chunk-store ABI is a C header (eight function pointers in v2, including chunk put/get, manifest put/get/delete, and an optional prefetch) that any K/V system can implement. MemKV fits well for four reasons:

  • Latency — chunk PUTs and the EXISTS pre-check ride one long-lived TCP connection. Loopback: cold save ~10 ms, warm save ~1 ms.
  • Sharing — llama-server processes pointed at the same cluster share one chunk pool, so a recurring system prompt is stored once.
  • Durability — on-disk shards survive llama-server restarts; sessions reload across crashes.
  • HMAC auth — chunks and manifests inherit the same per-message HMAC as every other MemKV op.

Today the libkv_store_memkv backend speaks TCP only — the chunk ABI is small and synchronous, so the simpler transport is the right fit. An RDMA-direct chunk path is on the roadmap below.

What changes vs the default --slot-save-format legacy

The default v1 path writes a single binary blob per session. Every save rewrites the whole file even when only the tail changed. There is no sharing across sessions.

Property                       v1 (default)                 v2 + MemKV
Unit of save                   1 file per session           manifest + N content-addr chunks
Per-turn cost                  O(session) bytes rewritten   O(new tail) chunks PUT
Cross-session prefix dedup     no                           yes (xxh3-64)
Survives restart               yes (local file)             yes (MemKV cluster)
Pluggable backend              no                           yes (dlopen ABI)
Remote / shared store          no                           yes
HMAC-authenticated transport   no                           yes (MemKV-native)

Setup

Three pieces:

  1. A running MemKV cluster (one node or many).
  2. The llama.cpp build that contains the v2 slot save path.
  3. The libkv_store_memkv cdylib next to llama-server (or on the loader's search path).

MemKV server

Any standard MemKV deployment works. For local development:

memkv start --config /etc/memkv/config.yaml \
            --license /path/to/minio.license \
            --log-file /var/log/memkv.log

The auth key in config.yaml's network.auth_key is the 32-byte HMAC key the backend will use; export it for llama-server below.
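Since the key is 32 bytes and is exported as 64 hex characters, a quick sanity check before starting llama-server can catch a truncated or mistyped value. The helper below is a hypothetical convenience script, not part of MemKV or llama.cpp.

```python
import secrets

def parse_auth_key(hex_key: str) -> bytes:
    """Decode a hex auth key and verify that it is exactly 32 bytes."""
    key = bytes.fromhex(hex_key.strip())   # raises ValueError on non-hex input
    if len(key) != 32:
        raise ValueError(f"expected 32 bytes (64 hex chars), got {len(key)} bytes")
    return key

# Generate a fresh random key in the same format, e.g. for a dev cluster.
example_hex = secrets.token_hex(32)        # 64 hex characters
key = parse_auth_key(example_hex)
```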

llama.cpp with v2 slot save

Build from github.com/minio/llama.cpp, branch feat/v2-chunked-slot-save:

git clone --branch feat/v2-chunked-slot-save https://github.com/minio/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_METAL=ON      # or -DGGML_CUDA=ON on Linux
cmake --build build --target llama-server

The two new flags:

  • --slot-save-format {legacy|chunked} — pick the on-disk format. Default is legacy (v1 single-blob).
  • --slot-chunk-tokens N — token-window size for chunked. Default 256. Smaller windows give better cross-session dedup at the cost of more chunk objects.

MemKV backend cdylib

Download the prebuilt cdylib for your platform:

# Linux amd64
curl -LO https://dl.minio.io/aistor/memkv/release/linux-amd64/libkv_store_memkv.so

# Linux arm64
curl -LO https://dl.minio.io/aistor/memkv/release/linux-arm64/libkv_store_memkv.so

# macOS (Apple Silicon)
curl -LO https://dl.minio.io/aistor/memkv/release/darwin-arm64/libkv_store_memkv.dylib

Place the file on the dynamic loader's search path. Two ways:

  • Set KV_STORE_LIBRARY_PATH=/path/to/dir (recommended on macOS — DYLD_LIBRARY_PATH is stripped from many child processes).
  • Or copy the file next to the llama-server binary.

Running it

export MEMKV_AUTH_KEY=<64 hex chars matching network.auth_key>
export KV_STORE_LIBRARY_PATH=/path/to/dir/with/libkv_store_memkv

llama-server \
    -m model.gguf \
    --slot-save-path memkv://127.0.0.1:9900/llama \
    --slot-save-format chunked \
    --slot-chunk-tokens 256

The memkv:// URL tells llama-server to dlopen libkv_store_memkv and route chunk PUTs/GETs through it. The path component (llama above) is the namespace prefix used for the manifest and chunk keys, so multiple llama-server tenants on one cluster do not collide.
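As an illustration of how the URL breaks down, the sketch below splits it into host, port, and namespace with the standard library. The function name and the key layout in the last line are assumptions for illustration; the actual key format used by the backend is not documented here.

```python
from urllib.parse import urlsplit

def parse_slot_save_path(url: str) -> tuple:
    """Split a memkv:// URL into (host, port, namespace)."""
    parts = urlsplit(url)
    if parts.scheme != "memkv":
        raise ValueError(f"not a memkv:// URL: {url!r}")
    namespace = parts.path.strip("/")      # "llama" in the example above
    return parts.hostname, parts.port, namespace

host, port, namespace = parse_slot_save_path("memkv://127.0.0.1:9900/llama")

# Hypothetical key layout: prefixing every key with the namespace is what
# keeps two tenants on one cluster from colliding.
example_chunk_key = f"{namespace}/chunk/0123456789abcdef"
```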

What you get

Numbers below are from a Mac mini (M4, 10 cores, 16 GiB) running Qwen2.5-0.5B-Instruct against a co-located MemKV server (8 file-mode shards on internal Apple Fabric SSD, TCP loopback). W is the --slot-chunk-tokens window size; W = 64 tokens unless a row says otherwise.

Per-save bandwidth

save type                                   bytes on the wire
v1 (legacy) at turn 3 (257 tokens)          3,150,428
v2 chunked manifest at turn 3               5,728
v2 chunked, 5-session shared-prefix save    1,952

The manifest is hundreds of times smaller than a v1 save of the same session because it carries hashes, not tensor bytes.

Save and restore wall time

operation                       v1 (local-fs)   v2 + MemKV (loopback)
cold save (96 new chunks)       0.5–0.7 ms      10–16 ms
warm save (all chunks dedup)    -               ~1 ms
restore (256-token session)     0.3 ms          2.6–4.3 ms

The warm-save floor is the point of the design. In a multi-turn chat the new tail at turn N+1 differs from turn N by at most a few chunks; EXISTS classifies all the others as already-on-server in one round trip and the manifest pin is a single PUT.
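The operation count behind that claim can be written down as a small planning function. This is a sketch under stated assumptions: chunk ids are placeholder strings, the function name is hypothetical, and the set passed in stands for whatever the single batched EXISTS call reports as already present.

```python
def plan_save(manifest: list, server_has: set) -> dict:
    """Count the network operations for one save.

    manifest:   ordered chunk ids for the session being saved
    server_has: chunk ids the server reported present (one batched EXISTS)
    """
    unique_ids = dict.fromkeys(manifest)   # de-duplicate, keep order
    missing = [cid for cid in unique_ids if cid not in server_has]
    return {
        "exists_round_trips": 1,           # one batched pre-check
        "chunk_puts": len(missing),        # only chunks the server lacks
        "manifest_puts": 1,                # the manifest is a single PUT
        "missing": missing,
    }

turn_n = [f"c{i:02d}" for i in range(96)]
cold = plan_save(turn_n, set())            # nothing on the server yet
warm = plan_save(turn_n, set(turn_n))      # every chunk already stored

turn_n1 = turn_n + ["c96", "c97"]          # next turn adds two tail chunks
tail = plan_save(turn_n1, set(turn_n))
```

In the cold case all 96 chunks are uploaded; in the warm case none are, which leaves one EXISTS round trip and one manifest PUT.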

Cumulative storage (4-turn chat)

format               total bytes after turn 3
v1 (legacy)          8,221,488 B
v2 chunked, W=64     3,505,856 B
v2 chunked, W=8      3,336,896 B

A 4-turn chat shrinks 57% in v2 at W=64, because every turn shares chunks with the previous turn instead of writing a fresh full blob.
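The percentage follows directly from the byte totals reported for the 4-turn chat; the short check below recomputes it for both window sizes. The helper name is only for illustration.

```python
V1_BYTES = 8_221_488                        # v1 (legacy) total after turn 3
V2_BYTES = {64: 3_505_856, 8: 3_336_896}    # v2 chunked totals, keyed by W

def reduction_pct(baseline: int, value: int) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (1 - value / baseline)

reductions = {w: reduction_pct(V1_BYTES, b) for w, b in V2_BYTES.items()}
for w, pct in reductions.items():
    print(f"W={w}: {pct:.1f}% smaller than v1")
```

This gives roughly 57% at W=64 and roughly 59% at W=8.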

Cross-session prefix dedup

Five sessions sharing a 25-token system prompt:

format               total bytes after 5 sessions
v1 (legacy)          4,210,988 B
v2 chunked, W=64     4,212,288 B
v2 chunked, W=8      2,554,560 B

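W must be no larger than the shared prefix for any window to dedup. At W=64 the 25-token prefix never fills a window, so v2 stores slightly more than v1; at W=8 the prefix spans three full windows and v2 saves 39% over v1.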
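The window arithmetic is simple integer division. The sketch below assumes only full windows inside the shared prefix can be shared, as described above, and reuses the byte totals reported for the five-session test.

```python
def full_windows(prefix_tokens: int, window: int) -> int:
    """Number of complete W-token windows that fit inside the shared prefix."""
    return prefix_tokens // window

PREFIX_TOKENS = 25

windows_w8 = full_windows(PREFIX_TOKENS, 8)     # three full windows
windows_w64 = full_windows(PREFIX_TOKENS, 64)   # none: prefix is shorter than W
shareable_tokens_w8 = windows_w8 * 8            # 24 of the 25 prefix tokens

# Saving at W=8, from the reported totals (v1 vs v2 chunked, W=8).
saving_pct_w8 = 100.0 * (1 - 2_554_560 / 4_210_988)
```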

Operational notes

  • Single TCP connection per llama-server — the kv-store abstraction caches the open store across save/restore calls, so the TCP connect is paid once per llama-server process.
  • No GC — orphaned chunks (manifest deleted, chunks live) accumulate on the MemKV side. A periodic refcount sweep is on the roadmap; for short-lived sessions this is not yet a problem.
  • License — the chunk-store backend uses the same MemKV cluster as other workloads; license bookkeeping happens server-side and the backend does not require its own license file.
  • v_trans=true models — Some architectures use a transposed V layout. v2 emits one chunk per layer's V slab in that case (no token-window split on V), which preserves correctness and same-token dedup but loses cross-prefix dedup on V. Models with v_trans=false (most Qwen, Llama, Mistral, Gemma) get the full design.
  • MRoPE / per-cell-ext models — not yet supported by the v2 parser; llama-server falls back to v1 transparently for those.

Roadmap

  • Bigger pipelined window in batch_put — a 32-deep or 64-deep window shaves several milliseconds off cold saves.
  • BLAKE3 as an alternate hash_alg for users who want collision resistance against an adversary, not just against accidents.
  • An RDMA-direct chunk path that uses MemKV's existing RDMA Read/Write primitives instead of TCP batch_put. With the chunk laid out in a registered slab, the hot path becomes one-sided RDMA reads.
  • Reference-counted manifests, so that deleting a session walks its chunk table and decrements each chunk's count; when a count reaches zero, the chunk is removed.
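The reference-counting item is a roadmap idea, not shipped behavior, so the following is only one way it could work. The class and method names are hypothetical, and everything lives in memory to keep the sketch self-contained.

```python
from collections import Counter

class RefCountedStore:
    """Toy store where chunks live only while some manifest references them."""
    def __init__(self) -> None:
        self.chunks = {}         # chunk id -> bytes
        self.refs = Counter()    # chunk id -> number of manifests using it
        self.manifests = {}      # session -> list of chunk ids

    def put_manifest(self, session: str, chunks: dict) -> None:
        old = self.manifests.get(session, [])
        for cid, data in chunks.items():
            self.chunks.setdefault(cid, data)
            self.refs[cid] += 1
        self.manifests[session] = list(chunks)
        self._release(old)       # release the old manifest only after the new one is pinned

    def delete_manifest(self, session: str) -> None:
        self._release(self.manifests.pop(session, []))

    def _release(self, chunk_ids: list) -> None:
        for cid in chunk_ids:
            self.refs[cid] -= 1
            if self.refs[cid] == 0:   # last reference gone: drop the chunk
                del self.refs[cid]
                del self.chunks[cid]

store = RefCountedStore()
store.put_manifest("a", {"prefix": b"P", "a-tail": b"A"})
store.put_manifest("b", {"prefix": b"P", "b-tail": b"B"})
store.delete_manifest("a")       # "prefix" survives because session b uses it
after_first_delete = set(store.chunks)
store.delete_manifest("b")       # nothing references any chunk now
```

Pinning the new manifest before releasing the old one keeps a chunk shared by both versions from being dropped and re-uploaded during an update.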
