KV Store ABI
Vendor-neutral C ABI (kv_store_v1) for inference engines to persist KV state through any pluggable backend — a small dlopen contract that storage vendors can implement once and ship to llama.cpp and other consumers.
kv_store_v1 is a small C ABI between an inference engine (consumer)
and a storage backend (vendor). The consumer says "put these bytes
under this hash" and "give me back the bytes for this hash"; the
backend implements those primitives over any durable substrate that
holds bytes by key — local filesystem, MemKV, Redis, S3, FoundationDB,
NVMe-over-fabrics.
This page is the spec. The reference consumer is the llama.cpp fork
(feat/v2-chunked-slot-save); the reference backend is
kv-store-memkv.
A new vendor can implement the ABI without reading either codebase.
Conceptual model
The ABI is two namespaces and seven required function pointers (plus one optional hint, prefetch_chunks, added in vtable version 2):
- Chunks: content-addressed, immutable. Keyed by raw hash bytes the consumer chose (today: 8-byte xxh3-64). Putting an already-present chunk is a no-op. Backends may buffer puts and only flush them at the next manifest write.
- Manifests: name-addressed, mutable, atomic. A manifest is the consumer's record of which chunks comprise one persistent object. When the manifest is visible to a reader, every chunk it references must already be visible too.
Save flow inside the consumer:
```
for each chunk to write:
    backend.put_chunk(hash, data)
backend.put_manifest(name, manifest_blob)
```
Restore flow:
```
manifest_blob = backend.get_manifest(name)
manifest = decode(manifest_blob)
backend.prefetch_chunks(manifest.hashes)   # optional, vtable v2
for each chunk hash in manifest:
    data = backend.get_chunk(hash)
    ... reconstruct ...
```
Everything else is the consumer's business: how it computes hashes, what's inside a chunk, what the manifest layout looks like. The backend treats both as opaque bytes.
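Because the manifest layout is entirely the consumer's choice, here is one hypothetical encoding, for illustration only (it is not the llama.cpp fork's actual format): a little-endian 32-bit chunk count followed by the hashes back to back, assuming today's 8-byte keys. `manifest_encode` and `manifest_decode` are made-up names. Keeping the hashes contiguous means the decoded array already has the layout prefetch_chunks expects.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 8 /* assumption: 8-byte xxh3-64 keys, as used today */

/* Hypothetical layout: [u32 little-endian count][count * HASH_LEN hash bytes]. */
static uint8_t *manifest_encode(const uint8_t *hashes, uint32_t n, size_t *out_len) {
    size_t len = 4 + (size_t)n * HASH_LEN;
    uint8_t *buf = malloc(len);
    if (!buf) return NULL;
    for (int i = 0; i < 4; i++) buf[i] = (uint8_t)(n >> (8 * i)); /* LE count */
    if (n) memcpy(buf + 4, hashes, (size_t)n * HASH_LEN);
    *out_len = len;
    return buf;
}

/* Returns a pointer to the contiguous hash array inside blob (borrowed: valid
 * only while blob lives), or NULL if blob is malformed. */
static const uint8_t *manifest_decode(const uint8_t *blob, size_t len, uint32_t *n_out) {
    if (len < 4) return NULL;
    uint32_t n = (uint32_t)blob[0] | (uint32_t)blob[1] << 8 |
                 (uint32_t)blob[2] << 16 | (uint32_t)blob[3] << 24;
    size_t body = len - 4;
    if (body % HASH_LEN != 0 || body / HASH_LEN != n) return NULL;
    *n_out = n;
    return blob + 4;
}
```

The backend never parses this blob; it only stores it under the manifest name and returns it byte-for-byte.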
The vtable
```c
#include <stddef.h>
#include <stdint.h>

typedef struct kv_store_v1 kv_store_v1;

typedef struct {
    uint32_t version;  // 1 today; 2 if prefetch_chunks is non-NULL

    kv_store_v1 * (*open)(const char * uri);
    void          (*close)(kv_store_v1 * self);

    int (*put_chunk)(kv_store_v1 * self,
                     const uint8_t * hash, size_t hash_len,
                     const uint8_t * data, size_t data_len);
    int (*get_chunk)(kv_store_v1 * self,
                     const uint8_t * hash, size_t hash_len,
                     uint8_t ** out_data, size_t * out_len);

    int (*put_manifest)(kv_store_v1 * self,
                        const char * name,
                        const uint8_t * data, size_t data_len);
    int (*get_manifest)(kv_store_v1 * self,
                        const char * name,
                        uint8_t ** out_data, size_t * out_len);
    int (*delete_manifest)(kv_store_v1 * self, const char * name);

    /* version 2 */
    int (*prefetch_chunks)(kv_store_v1 * self,
                           const uint8_t * hashes,
                           size_t hash_len, size_t n_hashes);
} kv_store_vtable;
```
A backend ships as a shared object exporting one symbol:
```c
const kv_store_vtable * kv_store_get_vtable(void);
```
The consumer does dlopen("libkv_store_<scheme>.{so,dylib}"),
dlsym("kv_store_get_vtable"), calls it once to obtain the vtable,
then vtable.open(uri) to spin up an instance.
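A minimal consumer-side sketch of that sequence, assuming POSIX `dlfcn.h` (older glibc also needs `-ldl`). The vtable typedef is repeated from the header so the block compiles on its own; `kv_load_vtable` is a hypothetical helper. On success the library is intentionally left loaded, because the returned vtable points into it.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct kv_store_v1 kv_store_v1;
typedef struct {
    uint32_t version;
    kv_store_v1 *(*open)(const char *uri);
    void (*close)(kv_store_v1 *self);
    int (*put_chunk)(kv_store_v1 *, const uint8_t *, size_t, const uint8_t *, size_t);
    int (*get_chunk)(kv_store_v1 *, const uint8_t *, size_t, uint8_t **, size_t *);
    int (*put_manifest)(kv_store_v1 *, const char *, const uint8_t *, size_t);
    int (*get_manifest)(kv_store_v1 *, const char *, uint8_t **, size_t *);
    int (*delete_manifest)(kv_store_v1 *, const char *);
    int (*prefetch_chunks)(kv_store_v1 *, const uint8_t *, size_t, size_t); /* v2 */
} kv_store_vtable;

typedef const kv_store_vtable *(*kv_get_vtable_fn)(void);

/* Loads libname, resolves kv_store_get_vtable, and returns the vtable or NULL. */
static const kv_store_vtable *kv_load_vtable(const char *libname) {
    void *lib = dlopen(libname, RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        fprintf(stderr, "kv_store: dlopen(%s) failed: %s\n", libname, dlerror());
        return NULL;
    }
    /* POSIX allows converting dlsym's void * result to a function pointer. */
    kv_get_vtable_fn get = (kv_get_vtable_fn)dlsym(lib, "kv_store_get_vtable");
    if (!get) {
        fprintf(stderr, "kv_store: %s does not export kv_store_get_vtable\n", libname);
        dlclose(lib);
        return NULL;
    }
    const kv_store_vtable *vt = get(); /* called exactly once */
    if (!vt || vt->version < 1 || !vt->open) {
        dlclose(lib);
        return NULL;
    }
    return vt; /* lib stays loaded for the lifetime of the process */
}
```

A caller would then invoke `vt->open(uri)` and treat NULL from either step as the generic dlopen-or-open failure.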
Method semantics
open(uri) -> kv_store_v1 *
Construct a backend instance from a URI. The URI shape is backend-specific; the only requirement is that the consumer passes through exactly what it received on its CLI.
Examples:
```
memkv://10.0.0.1:9900/llama-prod
redis://cache.svc:6379/3
s3://my-bucket/kv/
```
Return NULL on failure. Errors should be logged to stderr by the
backend; the consumer reports a generic dlopen-or-open failure.
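Each backend parses its own URI. As a sketch, a parser for the `memkv://<host>:<port>/<namespace>` shape might look like this; `memkv_uri` and `parse_memkv_uri` are hypothetical names, the buffer sizes are arbitrary, and bracketed IPv6 hosts are not handled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char host[256];
    unsigned port;
    char ns[256]; /* namespace, e.g. "llama-prod" */
} memkv_uri;

static int parse_impl(const char *uri, memkv_uri *out) {
    static const char prefix[] = "memkv://";
    size_t plen = sizeof prefix - 1;
    if (strncmp(uri, prefix, plen) != 0) return -1;
    const char *auth = uri + plen;
    const char *colon = strchr(auth, ':');
    const char *slash = strchr(auth, '/');
    if (!colon || !slash || colon > slash) return -1;

    size_t hlen = (size_t)(colon - auth);
    if (hlen == 0 || hlen >= sizeof out->host) return -1;
    memcpy(out->host, auth, hlen);
    out->host[hlen] = '\0';

    char *end = NULL;
    unsigned long port = strtoul(colon + 1, &end, 10);
    if (end != slash || port == 0 || port > 65535) return -1;
    out->port = (unsigned)port;

    size_t nlen = strlen(slash + 1);
    if (nlen == 0 || nlen >= sizeof out->ns) return -1;
    memcpy(out->ns, slash + 1, nlen + 1); /* includes the terminating NUL */
    return 0;
}

/* Per the open() contract: log details to stderr; the caller then returns NULL. */
static int parse_memkv_uri(const char *uri, memkv_uri *out) {
    int rc = parse_impl(uri, out);
    if (rc != 0) fprintf(stderr, "kv_store_memkv: cannot parse URI '%s'\n", uri);
    return rc;
}
```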
close(self)
Release every resource the backend opened. The handle MUST NOT be
used after close. Idempotency is a courtesy: the consumer will not
call close twice, but it may pass NULL if open failed mid-flight, so
close should treat a NULL handle as a no-op.
put_chunk(self, hash, hash_len, data, data_len) -> int
Store data (length data_len) under the binary key hash (length
hash_len). Idempotent: putting an already-present hash is a no-op
on the wire and on disk.
Returns 0 on success, 1 if the chunk already existed (the
consumer counts these as dedup hits), <0 on error.
The backend MAY buffer the put in memory and flush it at the next
put_manifest call. Two consequences for the consumer:
- A put is not guaranteed to be visible to a different reader until the matching put_manifest has returned.
- Calls MUST be ordered: every put_chunk for a save MUST complete before the matching put_manifest is issued.

The MemKV reference backend buffers and flushes; the in-tree local-fs backend writes through immediately.
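A toy in-memory sketch of this contract (fixed capacities, linear scans, chunks never freed; nothing like a production backend). put_chunk reports 1 for any hash it has already seen, pending or durable, and otherwise copies the borrowed bytes into a pending buffer; put_manifest flushes that buffer before publishing, so a reader never observes a manifest whose chunks are still pending. The `toy_` names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HASH 32
#define MAX_CHUNKS 64

typedef struct { uint8_t hash[MAX_HASH]; size_t hash_len; uint8_t *data; size_t len; } toy_chunk;

typedef struct kv_store_v1 {
    toy_chunk durable[MAX_CHUNKS]; size_t n_durable; /* visible to every reader */
    toy_chunk pending[MAX_CHUNKS]; size_t n_pending; /* buffered, not yet flushed */
} kv_store_v1;

static int toy_find(const toy_chunk *set, size_t n, const uint8_t *h, size_t hl) {
    for (size_t i = 0; i < n; i++)
        if (set[i].hash_len == hl && memcmp(set[i].hash, h, hl) == 0) return 1;
    return 0;
}

static int toy_put_chunk(kv_store_v1 *s, const uint8_t *h, size_t hl,
                         const uint8_t *d, size_t dl) {
    if (hl == 0 || hl > MAX_HASH) return -1;
    if (toy_find(s->durable, s->n_durable, h, hl) ||
        toy_find(s->pending, s->n_pending, h, hl))
        return 1; /* dedup hit: no side effects */
    if (s->n_durable + s->n_pending >= MAX_CHUNKS) return -1;
    toy_chunk *c = &s->pending[s->n_pending];
    c->data = malloc(dl ? dl : 1); /* copy: the input is only borrowed */
    if (!c->data) return -1;
    if (dl) memcpy(c->data, d, dl);
    memcpy(c->hash, h, hl);
    c->hash_len = hl;
    c->len = dl;
    s->n_pending++;
    return 0;
}

/* In this toy, other readers only ever consult the durable set. */
static int toy_chunk_visible(const kv_store_v1 *s, const uint8_t *h, size_t hl) {
    return toy_find(s->durable, s->n_durable, h, hl);
}

static int toy_put_manifest(kv_store_v1 *s, const char *name,
                            const uint8_t *data, size_t data_len) {
    /* 1. flush every buffered chunk first ... */
    for (size_t i = 0; i < s->n_pending; i++) s->durable[s->n_durable++] = s->pending[i];
    s->n_pending = 0;
    /* 2. ... then publish the manifest atomically (storage elided in this toy). */
    (void)name; (void)data; (void)data_len;
    return 0;
}
```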
get_chunk(self, hash, hash_len, out_data, out_len) -> int
Fetch the bytes stored under hash. Returns 0 on success, <0 on
error or missing chunk.
On success, the backend allocates *out_data with the C malloc
function and writes *out_len. The consumer is responsible for
freeing the buffer when done.
The buffer's bytes MUST be byte-identical to what was passed in
to put_chunk. Backends that compress, encrypt, or shard internally
must reverse those transformations transparently.
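The ownership rule in practice, as a sketch with hypothetical helper names: the backend hands out a fresh malloc'd copy, and the consumer releases it with free. The copy allocates at least one byte because malloc(0) may legally return NULL, which would be indistinguishable from an allocation failure.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Backend side: give the consumer its own malloc'd copy of the stored bytes. */
static int copy_out(const uint8_t *stored, size_t len, uint8_t **out_data, size_t *out_len) {
    uint8_t *buf = malloc(len ? len : 1); /* avoid malloc(0) returning NULL */
    if (!buf) return -1;
    if (len) memcpy(buf, stored, len);
    *out_data = buf;
    *out_len = len;
    return 0;
}

/* Consumer side: use the bytes, then release them with the C free(). */
static size_t consume_chunk(const uint8_t *stored, size_t len) {
    uint8_t *data = NULL;
    size_t n = 0;
    if (copy_out(stored, len, &data, &n) < 0) return 0;
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += data[i]; /* stand-in for reconstruction */
    free(data);
    return sum;
}
```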
put_manifest(self, name, data, data_len) -> int
Atomically publish data under the string key name. After this
returns successfully, every chunk previously put_chunk'd on this
handle MUST be readable.
Atomicity is the point: a concurrent reader either sees the previous
value of the manifest (if any) or the new value, never a partial
write. On a local FS this is tmp + rename(2); on MemKV it's a
single PUT.
Returns 0 on success, <0 on error.
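A minimal POSIX sketch of the local-FS recipe: write to a temporary file next to the target, fsync it, then rename(2) it over the final name (rename is only atomic within one filesystem, hence the same directory). The `.tmp.<pid>` suffix and the function name are assumptions. A production backend would probably also fsync the parent directory so the rename itself survives a crash, and would need a per-call unique suffix if several threads in one process could publish the same name at once.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Publish data at path so readers see either the old file or the new one. */
static int fs_put_manifest(const char *path, const uint8_t *data, size_t len) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("kv_store: open tmp"); return -1; }

    size_t off = 0;
    while (off < len) { /* write(2) may write fewer bytes than requested */
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0 && errno == EINTR) continue;
        if (w < 0) { perror("kv_store: write"); close(fd); unlink(tmp); return -1; }
        off += (size_t)w;
    }

    int rc = fsync(fd); /* bytes are on disk before the name points at them */
    if (close(fd) != 0) rc = -1;
    if (rc != 0) { perror("kv_store: fsync/close"); unlink(tmp); return -1; }

    if (rename(tmp, path) != 0) { perror("kv_store: rename"); unlink(tmp); return -1; }
    return 0;
}
```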
get_manifest(self, name, out_data, out_len) -> int
Same shape as get_chunk but keyed by name (a NUL-terminated C
string). Memory ownership: *out_data is malloc'd by the backend
and free'd by the consumer.
delete_manifest(self, name) -> int
Best-effort delete of the manifest under name. Not finding the key
is success, not failure. Whether to delete the chunks the manifest
referenced is not specified by the ABI — most backends leave
chunks alone (they may be referenced from other manifests) and rely
on a separate refcount/GC pass. The consumer MUST be prepared for
either behaviour.
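On a local filesystem, "not finding the key is success" maps naturally onto treating ENOENT from unlink(2) as a successful delete. A sketch with a hypothetical helper name; it takes the conservative route and leaves chunk files for a separate GC pass.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Best-effort delete: a manifest that is already gone counts as deleted.
 * Chunks it referenced are left in place; they may be shared. */
static int fs_delete_manifest(const char *path) {
    if (unlink(path) == 0 || errno == ENOENT) return 0;
    fprintf(stderr, "kv_store: unlink(%s): %s\n", path, strerror(errno));
    return -1;
}
```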
prefetch_chunks(self, hashes, hash_len, n_hashes) -> int (vtable v2)
Hint that the consumer is about to issue get_chunk for each of
n_hashes chunks (laid out as n_hashes contiguous hash_len-byte
keys in hashes). The backend may batch-fetch them in one round-trip
and serve subsequent get_chunk calls from a local cache.
Returns 0 on success, <0 on error. Failure is non-fatal — the
consumer falls back to per-chunk get_chunk calls.
May be NULL on a v1 vtable. Consumers MUST check
vtable.version >= 2 && vtable.prefetch_chunks != NULL before
calling.
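A consumer-side sketch of that guard. The version test comes first and short-circuits: a backend built against a v1-only header may not have the prefetch_chunks field at all, so reading it is only safe once version >= 2 is established. `kv_try_prefetch`, `kv_hash_at`, and the always-succeeding `demo_prefetch` stand-in are hypothetical; the vtable typedef is repeated so the block compiles alone.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct kv_store_v1 kv_store_v1;
typedef struct {
    uint32_t version;
    kv_store_v1 *(*open)(const char *uri);
    void (*close)(kv_store_v1 *self);
    int (*put_chunk)(kv_store_v1 *, const uint8_t *, size_t, const uint8_t *, size_t);
    int (*get_chunk)(kv_store_v1 *, const uint8_t *, size_t, uint8_t **, size_t *);
    int (*put_manifest)(kv_store_v1 *, const char *, const uint8_t *, size_t);
    int (*get_manifest)(kv_store_v1 *, const char *, uint8_t **, size_t *);
    int (*delete_manifest)(kv_store_v1 *, const char *);
    int (*prefetch_chunks)(kv_store_v1 *, const uint8_t *, size_t, size_t); /* v2 */
} kv_store_vtable;

/* Returns 1 if the backend accepted the hint, 0 if the consumer should just
 * issue per-chunk get_chunk calls (no v2 support, nothing to fetch, or failure). */
static int kv_try_prefetch(const kv_store_vtable *vt, kv_store_v1 *s,
                           const uint8_t *hashes, size_t hash_len, size_t n_hashes) {
    if (vt->version < 2 || vt->prefetch_chunks == NULL) return 0;
    if (n_hashes == 0) return 0;
    return vt->prefetch_chunks(s, hashes, hash_len, n_hashes) == 0;
}

/* Key i of the contiguous array starts at hashes + i * hash_len. */
static const uint8_t *kv_hash_at(const uint8_t *hashes, size_t hash_len, size_t i) {
    return hashes + i * hash_len;
}

/* Stand-in v2 backend hint that accepts every request (illustration only). */
static int demo_prefetch(kv_store_v1 *s, const uint8_t *h, size_t hl, size_t n) {
    (void)s; (void)h; (void)hl; (void)n;
    return 0;
}
```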
Memory ownership
| pointer | lifetime |
|---|---|
| input hash / data / name | borrowed for the call only; backend MUST NOT retain after return |
| output *out_data | allocated by backend with malloc; consumer frees with free |
| kv_store_v1 * handle | owned by consumer; opaque to anyone else; closed once with close |
Backends in non-C languages (Rust, Go, C++) MUST funnel the output
allocation through the C malloc so the consumer's free works.
The reference Rust backend declares malloc explicitly through an
extern "C" block (extern "C" { fn malloc(size: usize) -> *mut c_void; }).
Threading
A single kv_store_v1 handle MUST be safe to call from multiple
threads concurrently. In practice the consumer (llama-server's slot
queue, etc.) typically uses one handle from a serialized worker
thread, so a coarse lock around each entry point is enough; the ABI
does not require fine-grained concurrency. The backend still cannot
assume single-threaded access.
open and close are called once per backend lifetime by the
consumer; threading on those is not a concern.
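Given that access pattern, one coarse mutex per handle is a simple way to satisfy the thread-safety requirement. A sketch with POSIX threads; the handle's contents and the `locked_put_chunk` body are placeholders, and the `run_threads` driver exists only to exercise the lock. A networked backend might release the lock around slow I/O if contention ever mattered.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct kv_store_v1 {
    pthread_mutex_t mu; /* guards every field below */
    size_t n_puts;      /* placeholder for real backend state */
} kv_store_v1;

static int locked_put_chunk(kv_store_v1 *s, const uint8_t *h, size_t hl,
                            const uint8_t *d, size_t dl) {
    (void)h; (void)hl; (void)d; (void)dl;
    pthread_mutex_lock(&s->mu);
    s->n_puts++; /* real lookup / buffering would happen here, under the lock */
    pthread_mutex_unlock(&s->mu);
    return 0;
}

/* Demo driver: several threads hammering one shared handle. */
static void *worker(void *arg) {
    static const uint8_t h[8] = {0};
    for (int i = 0; i < 1000; i++) locked_put_chunk(arg, h, sizeof h, h, sizeof h);
    return NULL;
}

static size_t run_threads(kv_store_v1 *s, int n) {
    pthread_t t[16];
    int started = 0;
    if (n > 16) n = 16;
    for (int i = 0; i < n; i++)
        if (pthread_create(&t[started], NULL, worker, s) == 0) started++;
    for (int i = 0; i < started; i++) pthread_join(t[i], NULL);
    return s->n_puts;
}
```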
Versioning
The version field on the vtable is the wire-version of the ABI.
- version = 1: open, close, put_chunk, get_chunk, put_manifest, get_manifest, delete_manifest. prefetch_chunks is NULL, or absent entirely if the backend was built against a v1-only header.
- version = 2: adds prefetch_chunks as the trailing field.
New methods are always added at the end of the struct so a v1-only
consumer reading a v2 vtable still gets the v1 layout correctly. New
consumers MUST check version >= N (and that the function pointer
is non-NULL) before invoking any method added in version N.
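Breaking changes (renaming methods, changing signatures) bump the
struct identity entirely (kv_store_v2 etc.), and the symbol the
consumer resolves with dlsym becomes kv_store_get_vtable_v2. This is
separate from the vtable's version field, which only grows by
appending methods. The current ABI is kv_store_v1 (vtable version up
to 2), and we expect to stay there.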
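The append-only rule can be checked at compile time. This sketch declares two hypothetical snapshots of the struct, one per header revision, and uses C11 static assertions to confirm that every v1 field keeps its offset and that the v2 addition lies past the end of the v1 struct, which is why a v1-only consumer reads a v2 vtable correctly.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct kv_store_v1 kv_store_v1;

/* Hypothetical snapshot: the vtable as a v1-only header declares it. */
typedef struct {
    uint32_t version;
    kv_store_v1 *(*open)(const char *uri);
    void (*close)(kv_store_v1 *self);
    int (*put_chunk)(kv_store_v1 *, const uint8_t *, size_t, const uint8_t *, size_t);
    int (*get_chunk)(kv_store_v1 *, const uint8_t *, size_t, uint8_t **, size_t *);
    int (*put_manifest)(kv_store_v1 *, const char *, const uint8_t *, size_t);
    int (*get_manifest)(kv_store_v1 *, const char *, uint8_t **, size_t *);
    int (*delete_manifest)(kv_store_v1 *, const char *);
} vtable_v1_layout;

/* Hypothetical snapshot: the same struct after v2 appended one method. */
typedef struct {
    uint32_t version;
    kv_store_v1 *(*open)(const char *uri);
    void (*close)(kv_store_v1 *self);
    int (*put_chunk)(kv_store_v1 *, const uint8_t *, size_t, const uint8_t *, size_t);
    int (*get_chunk)(kv_store_v1 *, const uint8_t *, size_t, uint8_t **, size_t *);
    int (*put_manifest)(kv_store_v1 *, const char *, const uint8_t *, size_t);
    int (*get_manifest)(kv_store_v1 *, const char *, uint8_t **, size_t *);
    int (*delete_manifest)(kv_store_v1 *, const char *);
    int (*prefetch_chunks)(kv_store_v1 *, const uint8_t *, size_t, size_t);
} vtable_v2_layout;

#define SAME_OFFSET(f)                                                             \
    _Static_assert(offsetof(vtable_v1_layout, f) == offsetof(vtable_v2_layout, f), \
                   #f " must not move between versions")

SAME_OFFSET(version);
SAME_OFFSET(open);
SAME_OFFSET(close);
SAME_OFFSET(put_chunk);
SAME_OFFSET(get_chunk);
SAME_OFFSET(put_manifest);
SAME_OFFSET(get_manifest);
SAME_OFFSET(delete_manifest);

/* The new field starts at or beyond the end of the v1 struct. */
_Static_assert(offsetof(vtable_v2_layout, prefetch_chunks) >= sizeof(vtable_v1_layout),
               "prefetch_chunks must be appended, not inserted");
```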
URI conventions
- Scheme is the backend identifier and MUST match the cdylib name: scheme foo → libkv_store_foo.{so,dylib}.
- Authority and path are backend-specific.
- The consumer MAY trim a trailing / from the URI before calling open. The reference llama.cpp fork strips one trailing /.
The consumer constructs the URI by concatenating --slot-save-path
(or its equivalent) with a per-object basename. Backends that want
to host multiple tenants on one URI prefix must split the namespace
themselves.
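A sketch of the consumer-side handling these rules imply: strip at most one trailing `/` (as the reference fork does), take everything before `://` as the scheme, and derive the library file name from it, choosing `.dylib` on Apple platforms and `.so` elsewhere. The function names and fixed-size output buffers are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#ifdef __APPLE__
#define KV_LIB_SUFFIX ".dylib"
#else
#define KV_LIB_SUFFIX ".so"
#endif

/* Copies uri into out, dropping one trailing '/'. Returns 0, or -1 if too long. */
static int kv_trim_uri(const char *uri, char *out, size_t cap) {
    size_t n = strlen(uri);
    if (n > 0 && uri[n - 1] == '/') n--;
    if (n >= cap) return -1;
    memcpy(out, uri, n);
    out[n] = '\0';
    return 0;
}

/* "foo://..." -> "libkv_store_foo.so" (or .dylib). Returns -1 if there is no scheme. */
static int kv_libname_for_uri(const char *uri, char *out, size_t cap) {
    const char *sep = strstr(uri, "://");
    if (!sep || sep == uri) return -1;
    int n = snprintf(out, cap, "libkv_store_%.*s" KV_LIB_SUFFIX, (int)(sep - uri), uri);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

A plain path with no `://` yields -1 here; the reference fork routes that case to its in-tree local FS backend instead of loading a plugin.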
Library naming and loading
```
scheme   "foo"
cdylib   libkv_store_foo.so       (Linux)
         libkv_store_foo.dylib    (macOS)
symbol   kv_store_get_vtable
```
The consumer searches:
- $KV_STORE_LIBRARY_PATH/<libname>, if the env var is set.
- The system dynamic loader path (LD_LIBRARY_PATH, RPATH, /usr/lib, etc.). On macOS, DYLD_LIBRARY_PATH is stripped from many child processes, so consumers that target macOS should honour KV_STORE_LIBRARY_PATH to provide an explicit fallback.
A backend MAY ship multiple cdylibs under one scheme (one per build variant); the consumer uses whichever is found first.
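A sketch of that search order, assuming `KV_STORE_LIBRARY_PATH` names a single directory (the text does not say whether a colon-separated list is allowed). The explicit path is tried first; then the bare file name goes to dlopen, which, because it contains no `/`, consults the platform loader path. Both function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KV_PATH_MAX 4096

/* Fills out[] with the paths to try, in search order; returns how many. */
static int kv_candidate_paths(const char *libname, char out[2][KV_PATH_MAX]) {
    int n = 0;
    const char *dir = getenv("KV_STORE_LIBRARY_PATH");
    if (dir && *dir) {
        int w = snprintf(out[n], KV_PATH_MAX, "%s/%s", dir, libname);
        if (w > 0 && w < KV_PATH_MAX) n++;
    }
    /* A name without '/' makes dlopen search the dynamic loader path. */
    int w2 = snprintf(out[n], KV_PATH_MAX, "%s", libname);
    if (w2 > 0 && w2 < KV_PATH_MAX) n++;
    return n;
}

/* Returns the first candidate that loads, or NULL after logging. */
static void *kv_open_backend_library(const char *libname) {
    char paths[2][KV_PATH_MAX];
    int n = kv_candidate_paths(libname, paths);
    for (int i = 0; i < n; i++) {
        void *lib = dlopen(paths[i], RTLD_NOW | RTLD_LOCAL);
        if (lib) return lib; /* first hit wins */
    }
    fprintf(stderr, "kv_store: could not load %s\n", libname);
    return NULL;
}
```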
Error reporting
All methods return int:
- 0: success.
- 1: put_chunk only; the chunk already existed (idempotent hit).
- <0: failure. The exact value is unspecified; backends should log details to stderr. Consumers treat any negative return as a generic backend failure and fall back to a degraded path (e.g. the legacy save format) where possible.
Backends MUST NOT panic across the FFI boundary. Rust backends use
std::panic::catch_unwind at every entry point and convert panics
to a logged error plus a <0 return.
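During a save, a consumer typically folds the three return classes into "keep going" or "fall back". A sketch with hypothetical names; treating undefined positive codes (anything other than 0 or 1) as failure is a conservative assumption, not something the ABI specifies.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t written;    /* put_chunk returned 0 */
    size_t dedup_hits; /* put_chunk returned 1 */
} save_stats;

/* Records one put_chunk result; returns false if the save should fall back. */
static bool account_put_chunk(int rc, save_stats *st) {
    if (rc == 0) { st->written++; return true; }
    if (rc == 1) { st->dedup_hits++; return true; }
    /* rc < 0 (or an undefined code): the exact value carries no meaning. */
    return false;
}
```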
Reference implementations
- Local FS: in-tree in the llama.cpp fork (tools/server/slot_v2.cpp ⇒ local_fs_store). Writes the manifest at <dir>/<name> and chunks at <dir>/chunks/<hash[0]>/<hash>, with a 16-way subdirectory fanout. Atomic via tmp + rename. Default when --slot-save-path is a plain filesystem path (no ://).
- MemKV: out-of-tree, shipped with the MemKV release as the kv-store-memkv crate. Cdylib libkv_store_memkv.{so,dylib}. URI: memkv://<host>:<port>/<namespace>. Auth via the env var MEMKV_AUTH_KEY. Buffered puts are flushed via EXISTS + batch_put. Implements prefetch_chunks via batch_get.
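A sketch of how a local-FS chunk path could be built. The text above only gives `<dir>/chunks/<hash[0]>/<hash>` with 16-way fanout, so this assumes `<hash>` is the lowercase hex encoding of the raw hash bytes and the fanout directory is its first hex digit (16 possible values). `chunk_path` is a hypothetical name; check slot_v2.cpp before depending on the exact format.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Builds "<dir>/chunks/<first hex digit>/<hex hash>". Returns 0, or -1 if it won't fit. */
static int chunk_path(const char *dir, const uint8_t *hash, size_t hash_len,
                      char *out, size_t cap) {
    static const char hex[] = "0123456789abcdef";
    char h[129]; /* up to 64 hash bytes as hex, plus NUL */
    if (hash_len == 0 || hash_len > 64) return -1;
    for (size_t i = 0; i < hash_len; i++) {
        h[2 * i] = hex[hash[i] >> 4];
        h[2 * i + 1] = hex[hash[i] & 0x0f];
    }
    h[2 * hash_len] = '\0';
    int n = snprintf(out, cap, "%s/chunks/%c/%s", dir, h[0], h);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```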
Writing a new backend
Minimal C skeleton:
```c
#include "kv_store_abi.h"
#include <stdlib.h>
#include <string.h>

struct kv_store_v1 {
    int placeholder; /* whatever the backend needs: connections, buffers, a lock */
};

static kv_store_v1 *backend_open(const char *uri) {
    /* parse uri and connect; on failure log to stderr and return NULL */
    (void)uri;
    return calloc(1, sizeof(kv_store_v1)); /* heap handle, released in close */
}

static void backend_close(kv_store_v1 *s) {
    free(s); /* release everything; must tolerate NULL (free(NULL) is a no-op) */
}

static int backend_put_chunk(kv_store_v1 *s,
                             const uint8_t *h, size_t hl,
                             const uint8_t *d, size_t dl) {
    /* if hash already present: return 1 */
    /* else: copy d (borrowed for this call only), store it, return 0 */
    (void)s; (void)h; (void)hl; (void)d; (void)dl;
    return -1; /* not implemented yet */
}

static int backend_get_chunk(kv_store_v1 *s,
                             const uint8_t *h, size_t hl,
                             uint8_t **out, size_t *out_len) {
    const uint8_t *stored = NULL; /* look up h here ... */
    size_t n = 0;                 /* ... and its stored length */
    (void)s; (void)h; (void)hl;
    if (!stored) return -1;           /* missing chunk */
    uint8_t *buf = malloc(n ? n : 1); /* malloc family only, never C++ new */
    if (!buf) return -1;
    memcpy(buf, stored, n);
    *out = buf;
    *out_len = n;
    return 0;
}

/* put_manifest, get_manifest, delete_manifest similarly */

static const kv_store_vtable VT = {
    .version         = 1,    /* or 2 if prefetch_chunks is implemented */
    .open            = backend_open,
    .close           = backend_close,
    .put_chunk       = backend_put_chunk,
    .get_chunk       = backend_get_chunk,
    .put_manifest    = backend_put_manifest,
    .get_manifest    = backend_get_manifest,
    .delete_manifest = backend_delete_manifest,
    .prefetch_chunks = NULL, /* or fill at v2 */
};

const kv_store_vtable *kv_store_get_vtable(void) { return &VT; }
```
Build as a shared library, expose kv_store_get_vtable, and place the
result on the consumer's loader path. No registration step, no
configuration file: dlopen does the rest.
For Rust backends, see the kv-store-memkv crate in the MemKV
repository as a working example: it exports the same vtable from a
#[repr(C)] struct with static lifetime, wraps every entry point
in std::panic::catch_unwind, and routes output buffers through a
manual malloc for symmetry with the consumer's free.
Vendor checklist
Before shipping a backend, verify:
- kv_store_get_vtable is the only exported symbol the consumer needs (everything else can be hidden).
- The vtable's version is set to the highest level you implement; prefetch_chunks is NULL if you stayed at v1.
- put_chunk is idempotent on duplicate hashes (returns 1, no side effects).
- put_manifest is atomic against concurrent readers.
- get_chunk and get_manifest allocate output via malloc so consumers can free.
- Panics or unhandled exceptions cannot cross the FFI boundary.
- The cdylib is named libkv_store_<scheme>.{so,dylib} exactly.
- Auth credentials, region, and other secrets are configured via environment variables, not the URI (so they don't end up in process listings or log lines).
Why this exists
KV state offload is the single largest prefill latency win for long-context inference on workstation- and laptop-class hardware. Mac mini and Mac Studio hosts running llama.cpp hit the same wall as servers — fresh requests re-prefill the same 5–30k tokens — but the heavyweight transfer abstractions GPU-stack engines use don't fit.
This ABI is the smallest seam that lets a storage vendor ship one
plugin an inference engine can dlopen: a vtable of eight function
pointers in v2 (seven required, plus the optional prefetch_chunks
hint), two namespaces, no opinions about the bytes. The reference
consumer today is minio/llama.cpp on TCP-only hosts; the same plugin
works for any future consumer that adopts the contract.
The same MemKV cluster serves both workstation deployments and larger fleets, so a system prompt populated on a developer's Mac mini and one populated server-side share the same chunk pool (modulo model-id matching).
References
- Canonical header: kv_store_abi.h
- KV store on-disk layout: llama.cpp + MemKV
- MemKV backend source: the kv-store-memkv crate ships with the MemKV release