llama.cpp + MemKV
Durable KV store for llama-server backed by MemKV. Multi-turn chats, multi-tenant deployments, and agent loops resume in milliseconds instead of re-prefilling tokens.
llama-server keeps per-session KV state live in GPU memory only.
When the populating request finishes, that state is gone, and every
subsequent turn re-prefills the same tokens.
The slot save/restore HTTP API addresses this by writing per-session KV state to disk and reading it back on demand. The default v1 format writes one opaque blob per session and rewrites the whole file on every save; overhead grows linearly with session size.
We've shipped a v2:
- Manifest plus content-addressed chunks (xxh3-64). The manifest is a few kilobytes; chunks are deduplicated across saves and sessions.
- A pluggable backend ABI loaded via dlopen. In-tree backend writes to local disk; the MemKV backend ships as libkv_store_memkv.{so,dylib} and writes chunks to a MemKV cluster.
- EXISTS-skip + pipelined batch_put collapse a 96-chunk save into one EXISTS round-trip plus only the missing chunks. A prefetch hint collapses restore into one batched fetch.
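The manifest-plus-chunks idea in the list above can be sketched in a few lines. This is a simplified illustration, not the llama.cpp implementation: it splits a byte string into fixed-size pieces rather than per-layer token windows, it uses an 8-byte BLAKE2b digest from the Python standard library as a stand-in for xxh3-64, and the names `chunk_key`, `save`, and `restore` are hypothetical.

```python
import hashlib


def chunk_key(data: bytes) -> str:
    # Stand-in hash: 8-byte BLAKE2b from the standard library, NOT xxh3-64.
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def save(store: dict, state: bytes, chunk_size: int):
    """Store chunks not seen before; return (manifest, bytes_written)."""
    manifest, written = [], 0
    for off in range(0, len(state), chunk_size):
        chunk = state[off:off + chunk_size]
        key = chunk_key(chunk)
        if key not in store:          # dedup: identical content is stored once
            store[key] = chunk
            written += len(chunk)
        manifest.append(key)          # the manifest carries hashes, not payload
    return manifest, written


def restore(store: dict, manifest: list) -> bytes:
    """Rebuild the saved state by concatenating chunks in manifest order."""
    return b"".join(store[key] for key in manifest)


store = {}
turn1 = b"".join(bytes([i]) * 256 for i in range(4))  # four distinct 256-byte chunks
turn2 = turn1 + bytes([9]) * 256                       # same prefix plus a new tail
manifest1, written1 = save(store, turn1, 256)
manifest2, written2 = save(store, turn2, 256)
print(written1, written2)  # 1024 256
```

The second save writes only the new tail chunk, while its manifest still lists all five hashes, which is why the manifest stays small relative to the payload.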
The contract is the kv_store_v1 ABI — a
vendor-neutral C interface any engine can consume and any storage
vendor can implement. MemKV is one reference backend; nothing
prevents Redis, S3, FoundationDB, or a homegrown shard layer from
shipping another.
Why MemKV is the natural backend
The v2 design has nothing MemKV-specific in the llama-server build — the chunk-store ABI is a C header (eight function pointers in v2: chunk put/get, manifest put/get/delete, optional prefetch) and any K/V system can implement it. Why MemKV fits well:
- Latency — chunk PUTs and the EXISTS pre-check ride one long-lived TCP connection. Loopback: cold save ~10 ms, warm save ~1 ms.
- Sharing — llama-server processes pointed at the same cluster share one chunk pool, so a recurring system prompt is stored once.
- Durability — on-disk shards survive llama-server restarts; sessions reload across crashes.
- HMAC auth — chunks and manifests inherit the same per-message HMAC as every other MemKV op.
Today the libkv_store_memkv backend speaks TCP only — the chunk ABI is small
and synchronous, so the simpler transport is the right fit. An RDMA-direct
chunk path is on the roadmap below.
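To make the shape of the contract concrete, here is a toy in-memory backend covering the operations this section names (chunk put/get, manifest put/get/delete, optional prefetch). It is a sketch under stated assumptions: the real ABI is a C header of function pointers, and the method names and signatures below are hypothetical, not the kv_store_v1 symbols.

```python
from typing import Dict, Iterable, List, Optional


class InMemoryBackend:
    """Toy backend; method names are illustrative, not the real ABI."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}
        self.manifests: Dict[str, List[str]] = {}

    def chunk_put(self, key: str, data: bytes) -> None:
        # Content-addressed: a key that already exists is left untouched.
        self.chunks.setdefault(key, data)

    def chunk_get(self, key: str) -> Optional[bytes]:
        return self.chunks.get(key)

    def manifest_put(self, session: str, keys: List[str]) -> None:
        self.manifests[session] = list(keys)

    def manifest_get(self, session: str) -> Optional[List[str]]:
        return self.manifests.get(session)

    def manifest_delete(self, session: str) -> bool:
        # Removes only the manifest; chunks stay behind (no GC in this toy).
        return self.manifests.pop(session, None) is not None

    def prefetch(self, keys: Iterable[str]) -> None:
        # Optional hint; an in-memory store has nothing to warm up.
        pass


backend = InMemoryBackend()
backend.chunk_put("aa", b"system prompt KV")
backend.chunk_put("bb", b"user turn KV")
backend.manifest_put("session-1", ["aa", "bb"])
backend.manifest_put("session-2", ["aa"])      # shares chunk "aa"
backend.prefetch(backend.manifest_get("session-1"))
restored = b"".join(backend.chunk_get(k) for k in backend.manifest_get("session-1"))
```

A Redis or S3 backend would presumably fill in the same handful of operations against its own client library, with the engine side left unchanged.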
What changes vs the default --slot-save-format legacy
The default v1 path writes a single binary blob per session. Every save rewrites the whole file even when only the tail changed. There is no sharing across sessions.
| | v1 (default) | v2 + MemKV |
|---|---|---|
| Unit of save | 1 file per session | manifest + N content-addr chunks |
| Per-turn cost | O(session) bytes rewritten | O(new tail) chunks PUT |
| Cross-session prefix dedup | no | yes (xxh3-64) |
| Survives restart | yes (local file) | yes (MemKV cluster) |
| Pluggable backend | no | yes (dlopen ABI) |
| Remote / shared store | no | yes |
| HMAC-authenticated transport | no | yes (MemKV-native) |
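The "Per-turn cost" row can be illustrated with an idealized model. The sketch below assumes an append-only session, a constant number of KV bytes per token (the `12_000` figure is made up for illustration), and that a partially filled tail window is rewritten when it grows; manifest bytes are ignored. The function names are hypothetical.

```python
def v1_bytes_written(turn_tokens, bytes_per_token):
    """v1 rewrites the whole session blob on every save."""
    return [n * bytes_per_token for n in turn_tokens]


def v2_bytes_written(turn_tokens, bytes_per_token, window):
    """v2 writes only windows whose content changed since the previous save."""
    written, prev = [], 0
    for n in turn_tokens:
        # Windows that were already full at the previous save are unchanged.
        first_changed = (prev // window) * window
        written.append((n - first_changed) * bytes_per_token)
        prev = n
    return written


turns = [256, 512, 768, 1024]       # session length in tokens after each turn
BYTES_PER_TOKEN = 12_000            # hypothetical, model-dependent
v1 = v1_bytes_written(turns, BYTES_PER_TOKEN)
v2 = v2_bytes_written(turns, BYTES_PER_TOKEN, window=64)
print(sum(v1), sum(v2))             # 30720000 12288000
```

Under this model the v1 total grows quadratically with the number of turns, while the v2 total tracks the final session size.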
Setup
Three pieces:
- A running MemKV cluster (one node or many).
- The llama.cpp build that contains the v2 slot save path.
- The libkv_store_memkv cdylib next to llama-server (or on the loader's search path).
MemKV server
Any standard MemKV deployment works. For local development:
memkv start --config /etc/memkv/config.yaml \
--license /path/to/minio.license \
--log-file /var/log/memkv.log

The auth key in config.yaml's network.auth_key is the 32-byte HMAC
key the backend will use; export it for llama-server below.
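A 32-byte key is 64 hexadecimal characters when hex-encoded. The sketch below checks that a hex string decodes to exactly 32 bytes and shows what a per-message HMAC looks like in general. The digest choice (SHA-256) and the message framing are assumptions for illustration; the document does not specify MemKV's wire format, and the helper names are hypothetical.

```python
import hashlib
import hmac
import secrets


def parse_auth_key(hex_key: str) -> bytes:
    """Decode a hex-encoded key and insist on exactly 32 bytes."""
    key = bytes.fromhex(hex_key.strip())
    if len(key) != 32:
        raise ValueError(f"expected 32 bytes (64 hex chars), got {len(key)} bytes")
    return key


def sign(key: bytes, message: bytes) -> bytes:
    # Assumption: HMAC-SHA256. The actual MemKV digest is not stated here.
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    # compare_digest avoids leaking timing information.
    return hmac.compare_digest(sign(key, message), tag)


hex_key = secrets.token_hex(32)          # 32 random bytes -> 64 hex characters
key = parse_auth_key(hex_key)
tag = sign(key, b"PUT chunk 0123")
print(len(hex_key), verify(key, b"PUT chunk 0123", tag))  # 64 True
```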
llama.cpp with v2 slot save
Build from github.com/minio/llama.cpp, branch
feat/v2-chunked-slot-save:
git clone --branch feat/v2-chunked-slot-save https://github.com/minio/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_METAL=ON # or -DGGML_CUDA=ON on Linux
cmake --build build --target llama-server

The two new flags:
- --slot-save-format {legacy|chunked}: pick the on-disk format. Default is legacy (v1 single-blob).
- --slot-chunk-tokens N: token-window size for chunked. Default 256. Smaller windows give better cross-session dedup at the cost of more chunk objects.
MemKV backend cdylib
Download the prebuilt cdylib for your platform:
# Linux amd64
curl -LO https://dl.minio.io/aistor/memkv/release/linux-amd64/libkv_store_memkv.so
# Linux arm64
curl -LO https://dl.minio.io/aistor/memkv/release/linux-arm64/libkv_store_memkv.so
# macOS (Apple Silicon)
curl -LO https://dl.minio.io/aistor/memkv/release/darwin-arm64/libkv_store_memkv.dylib

Place the file on the dynamic loader's search path. Two ways:
- Set KV_STORE_LIBRARY_PATH=/path/to/dir (recommended on macOS, where DYLD_LIBRARY_PATH is stripped from many child processes).
- Or copy the file next to the llama-server binary.
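As a rough mental model of the two placement options, the sketch below lists where the library file would be looked for. The ordering (environment variable first, then the binary's directory) is an assumption rather than documented behaviour, and `candidate_paths` is a hypothetical helper, not part of llama-server.

```python
import os
import sys
from pathlib import PurePosixPath


def candidate_paths(server_binary, env=None, platform=None):
    """Return the places this sketch would look for the backend library."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    suffix = "dylib" if platform == "darwin" else "so"
    name = f"libkv_store_memkv.{suffix}"

    dirs = []
    if env.get("KV_STORE_LIBRARY_PATH"):
        dirs.append(PurePosixPath(env["KV_STORE_LIBRARY_PATH"]))
    # Assumed fallback: the directory that holds the llama-server binary.
    dirs.append(PurePosixPath(server_binary).parent)
    return [d / name for d in dirs]


paths = candidate_paths(
    "/opt/llama/bin/llama-server",
    env={"KV_STORE_LIBRARY_PATH": "/opt/kv"},
    platform="linux",
)
for p in paths:
    print(p)
```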
Running it
export MEMKV_AUTH_KEY=<64 hex chars matching network.auth_key>
export KV_STORE_LIBRARY_PATH=/path/to/dir/with/libkv_store_memkv
llama-server \
-m model.gguf \
--slot-save-path memkv://127.0.0.1:9900/llama \
--slot-save-format chunked \
--slot-chunk-tokens 256

The memkv:// URL tells llama-server to dlopen
libkv_store_memkv and route chunk PUTs/GETs through it. The path
component (llama above) is the namespace prefix used for the
manifest and chunk keys, so multiple llama-server tenants on one
cluster do not collide.
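The URL breaks down into host, port, and namespace, which can be sketched with the standard library's `urlsplit`. The key layout shown (`<namespace>/chunk/<hash>` and `<namespace>/manifest/<session>`) is hypothetical and only illustrates how a namespace prefix keeps tenants apart; the backend's actual key scheme is not described here.

```python
from urllib.parse import urlsplit


def parse_slot_save_path(url: str):
    """Split a memkv:// URL into (host, port, namespace)."""
    parts = urlsplit(url)
    if parts.scheme != "memkv":
        raise ValueError(f"not a memkv:// URL: {url!r}")
    namespace = parts.path.strip("/")
    return parts.hostname, parts.port, namespace


# Hypothetical key layout, for illustration only.
def chunk_key(namespace: str, chunk_hash: str) -> str:
    return f"{namespace}/chunk/{chunk_hash}"


def manifest_key(namespace: str, session: str) -> str:
    return f"{namespace}/manifest/{session}"


host, port, ns = parse_slot_save_path("memkv://127.0.0.1:9900/llama")
print(host, port, ns)                       # 127.0.0.1 9900 llama
print(manifest_key(ns, "chat-42"))          # llama/manifest/chat-42
print(manifest_key("tenant-b", "chat-42"))  # tenant-b/manifest/chat-42
```

Two tenants that use the same session name end up with different keys as long as their namespaces differ.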
What you get
Numbers below are from a Mac mini (M4, 10 cores, 16 GiB) running Qwen2.5-0.5B-Instruct against a co-located MemKV server (8 file-mode shards on internal Apple Fabric SSD, TCP loopback). W, the --slot-chunk-tokens window, is 64 tokens unless noted otherwise.
Per-save bandwidth
| save type | bytes on the wire |
|---|---|
| v1 (legacy) at turn 3 (257 tokens) | 3,150,428 |
| v2 chunked manifest at turn 3 | 5,728 |
| v2 chunked, 5-session shared-prefix save | 1,952 |
The manifest is roughly 550 times smaller than a v1 save of the same session (5,728 bytes versus 3,150,428) because it carries hashes, not tensor bytes.
Save and restore wall time
| operation | v1 (local-fs) | v2 + MemKV (loopback) |
|---|---|---|
| cold save (96 new chunks) | 0.5–0.7 ms | 10–16 ms |
| warm save (all chunks dedup) | — | ~1 ms |
| restore (256-token session) | 0.3 ms | 2.6–4.3 ms |
The warm-save floor is the point of the design. In a multi-turn chat the new tail at turn N+1 differs from turn N by at most a few chunks; EXISTS classifies all the others as already-on-server in one round trip and the manifest pin is a single PUT.
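The save flow described above (one EXISTS check, upload only the missing chunks, then a single manifest PUT) can be modelled with a fake server that counts round trips. This is an illustration of the protocol shape only: `FakeServer` and `save` are hypothetical names, and the pipelined batch_put is counted as a single round trip for simplicity.

```python
class FakeServer:
    """In-memory stand-in for a chunk server that counts round trips."""

    def __init__(self):
        self.chunks = {}
        self.manifests = {}
        self.round_trips = 0

    def exists(self, keys):
        self.round_trips += 1
        return {k for k in keys if k in self.chunks}

    def batch_put(self, items):
        self.round_trips += 1        # modelled as one pipelined batch
        self.chunks.update(items)

    def manifest_put(self, session, keys):
        self.round_trips += 1
        self.manifests[session] = list(keys)


def save(server, session, chunks):
    """Upload only the chunks the server lacks; return how many were sent."""
    present = server.exists(list(chunks))
    missing = {k: v for k, v in chunks.items() if k not in present}
    if missing:
        server.batch_put(missing)
    server.manifest_put(session, list(chunks))
    return len(missing)


server = FakeServer()
session_chunks = {f"h{i:02d}": bytes([i]) * 16 for i in range(96)}

cold_sent = save(server, "chat", session_chunks)   # every chunk is new
cold_trips = server.round_trips

warm_sent = save(server, "chat", session_chunks)   # nothing changed
warm_trips = server.round_trips - cold_trips

session_chunks["h96"] = b"new tail" * 2            # next turn adds one chunk
next_sent = save(server, "chat", session_chunks)
print(cold_sent, warm_sent, next_sent)             # 96 0 1
```

In this model a warm save costs two round trips regardless of session size, which is the behaviour the warm-save row in the table reflects.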
Cumulative storage (4-turn chat)
| format | total bytes after turn 3 |
|---|---|
| v1 (legacy) | 8,221,488 B |
| v2 chunked, W=64 | 3,505,856 B |
| v2 chunked, W=8 | 3,336,896 B |
A 4-turn chat shrinks 57% in v2 at W=64 (59% at W=8): every turn shares chunks with the previous turn instead of writing a fresh full blob.
Cross-session prefix dedup
Five sessions sharing a 25-token system prompt:
| format | total bytes after 5 sessions |
|---|---|
| v1 (legacy) | 4,210,988 B |
| v2 chunked, W=64 | 4,212,288 B |
| v2 chunked, W=8 | 2,554,560 B |
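W must be smaller than the shared prefix to dedup. At W=64 the 25-token prefix never fills a window, so nothing is shared and v2 stores slightly more than v1; at W=8 the prefix spans three full windows and v2 saves 39% over v1.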
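The window arithmetic and the savings figure can be checked directly from the table. The helper names below are hypothetical; the byte counts are the ones reported above.

```python
def full_windows(prefix_tokens: int, window: int) -> int:
    """Number of complete token windows covered by a shared prefix."""
    return prefix_tokens // window


def savings_pct(v1_bytes: int, v2_bytes: int) -> float:
    """Storage saved by v2 relative to v1, in percent (negative = larger)."""
    return 100.0 * (1 - v2_bytes / v1_bytes)


V1_TOTAL = 4_210_988
V2_W64 = 4_212_288
V2_W8 = 2_554_560

print(full_windows(25, 64))   # 0 -> no shareable window
print(full_windows(25, 8))    # 3 -> 24 of the 25 prefix tokens dedup
print(round(savings_pct(V1_TOTAL, V2_W8)))   # 39
print(V2_W64 - V1_TOTAL)      # 1300 bytes larger than v1 at W=64
```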
Operational notes
- Single TCP connection per llama-server: the kv-store abstraction caches the open store across save/restore calls, so the TCP connect is paid once per llama-server process.
- No GC: orphaned chunks (manifest deleted, chunks live) accumulate on the MemKV side. A periodic refcount sweep is on the roadmap; for short-lived sessions this is not yet a problem.
- License: the chunk-store backend uses the same MemKV cluster as other workloads; license bookkeeping happens server-side and the backend does not require its own license file.
- v_trans=true models: some architectures use a transposed V layout. v2 emits one chunk per layer's V slab in that case (no token-window split on V), which preserves correctness and same-token dedup but loses cross-prefix dedup on V. Models with v_trans=false (most Qwen, Llama, Mistral, Gemma) get the full design.
- MRoPE / per-cell-ext models: not yet supported by the v2 parser; llama-server falls back to v1 transparently for those.
Roadmap
- Bigger pipelined window in batch_put: a 32-deep or 64-deep window shaves several milliseconds off cold saves.
- BLAKE3 as an alternate hash_alg for users who want collision resistance against an adversary, not just against accidents.
- An RDMA-direct chunk path that uses MemKV's existing RDMA Read/Write primitives instead of TCP batch_put. With the chunk laid out in a registered slab, the hot path becomes one-sided RDMA reads.
- Reference-counted manifests, so deleting a session walks its chunk table and decrements each chunk's count; when a count reaches zero, the chunk goes away.
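The reference-counting item is a roadmap idea, not shipped behaviour, but the mechanism is easy to picture. The sketch below is one possible shape for it, with hypothetical names throughout: each manifest holds one reference per chunk it lists, and a chunk is removed when its count drops to zero.

```python
from collections import Counter


class RefCountedStore:
    """Toy chunk store where manifests hold references to chunks."""

    def __init__(self):
        self.chunks = {}
        self.refs = Counter()
        self.manifests = {}

    def put_session(self, session, chunks):
        # Take references for the new manifest first, then release the old
        # one, so chunks shared between the two versions are never dropped.
        for key, data in chunks.items():
            self.chunks.setdefault(key, data)
            self.refs[key] += 1
        old = self.manifests.get(session, [])
        self.manifests[session] = list(chunks)
        self._release(old)

    def delete_session(self, session):
        self._release(self.manifests.pop(session, []))

    def _release(self, keys):
        for key in keys:
            self.refs[key] -= 1
            if self.refs[key] == 0:      # last reference gone: free the chunk
                del self.refs[key]
                del self.chunks[key]


store = RefCountedStore()
store.put_session("a", {"p": b"shared prefix", "x": b"tail of a"})
store.put_session("b", {"p": b"shared prefix", "y": b"tail of b"})
store.delete_session("a")
print(sorted(store.chunks))   # ['p', 'y']
store.delete_session("b")
print(sorted(store.chunks))   # []
```

The shared prefix chunk survives the first delete because session "b" still references it, and disappears only with the last manifest.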
References
- llama.cpp v2 chunked slot save design: docs/kv-store-format.md
- llama.cpp v2 implementation: tools/server/slot_v2.{h,cpp} and tools/server/kv_store_abi.h
- MemKV kv-store backend: the kv-store-memkv crate, shipped with the MemKV release
- Upstream llama.cpp project