KV Store ABI

Vendor-neutral C ABI (kv_store_v1) for inference engines to persist KV state through any pluggable backend — a small dlopen contract that storage vendors can implement once and ship to llama.cpp and other consumers.

kv_store_v1 is a small C ABI between an inference engine (consumer) and a storage backend (vendor). The consumer says "put these bytes under this hash" and "give me back the bytes for this hash"; the backend implements those primitives over any durable substrate that holds bytes by key — local filesystem, MemKV, Redis, S3, FoundationDB, NVMe-over-fabrics.

This page is the spec. The reference consumer is the llama.cpp fork (feat/v2-chunked-slot-save); the reference backend is kv-store-memkv. A new vendor can implement the ABI without reading either codebase.

Conceptual model

The ABI is two namespaces and seven function pointers, plus an optional eighth (prefetch_chunks) in vtable v2:

  • Chunks — content-addressed, immutable. Keyed by raw hash bytes the consumer chose (today: 8-byte xxh3-64). Putting an already-present chunk is a no-op. Backends may buffer puts and only flush them at the next manifest write.
  • Manifests — name-addressed, mutable, atomic. A manifest is the consumer's record of which chunks comprise one persistent object. When the manifest is visible to a reader, every chunk it references must already be visible too.

Save flow inside the consumer:

for each chunk to write:
    backend.put_chunk(hash, data)
backend.put_manifest(name, manifest_blob)

Restore flow:

manifest_blob = backend.get_manifest(name)
manifest = decode(manifest_blob)
backend.prefetch_chunks(manifest.hashes)        # optional, vtable v2
for each chunk hash in manifest:
    data = backend.get_chunk(hash)
    ... reconstruct ...

Everything else is the consumer's business: how it computes hashes, what's inside a chunk, what the manifest layout looks like. The backend treats both as opaque bytes.
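As a rough illustration, the two flows above might look like this in C. This is a sketch under assumptions, not the reference consumer's code: save_object, restore_object, the recording stub backend, and the manifest layout (chunk hashes simply concatenated) are all hypothetical, and the optional prefetch hint is left out for brevity. The ABI types are repeated from the vtable section so the block compiles on its own.

```c
#include <stdint.h>
#include <stdlib.h>

/* ABI types, repeated from the vtable section so the sketch compiles alone. */
typedef struct kv_store_v1 kv_store_v1;
typedef struct {
    uint32_t version;
    kv_store_v1 *(*open)(const char *);
    void (*close)(kv_store_v1 *);
    int (*put_chunk)(kv_store_v1 *, const uint8_t *, size_t, const uint8_t *, size_t);
    int (*get_chunk)(kv_store_v1 *, const uint8_t *, size_t, uint8_t **, size_t *);
    int (*put_manifest)(kv_store_v1 *, const char *, const uint8_t *, size_t);
    int (*get_manifest)(kv_store_v1 *, const char *, uint8_t **, size_t *);
    int (*delete_manifest)(kv_store_v1 *, const char *);
    int (*prefetch_chunks)(kv_store_v1 *, const uint8_t *, size_t, size_t);
} kv_store_vtable;

enum { HASH_LEN = 8 };  /* 8-byte xxh3-64 keys, as the consumer uses today */

/* Save: every put_chunk lands first, then one put_manifest publishes.
 * Hypothetical manifest layout: the chunk hashes, concatenated. */
int save_object(const kv_store_vtable *vt, kv_store_v1 *h, const char *name,
                const uint8_t *hashes, const uint8_t *const *chunks,
                const size_t *lens, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* 0 = stored, 1 = already present; only negatives are errors. */
        if (vt->put_chunk(h, hashes + i * HASH_LEN, HASH_LEN, chunks[i], lens[i]) < 0)
            return -1;
    }
    return vt->put_manifest(h, name, hashes, n * HASH_LEN);
}

/* Restore: read the manifest, then fetch each chunk it lists.
 * Returns the total number of chunk bytes fetched, or -1 on failure. */
long restore_object(const kv_store_vtable *vt, kv_store_v1 *h, const char *name) {
    uint8_t *manifest = NULL;
    size_t manifest_len = 0;
    long total = 0;
    if (vt->get_manifest(h, name, &manifest, &manifest_len) < 0) return -1;
    for (size_t off = 0; off + HASH_LEN <= manifest_len; off += HASH_LEN) {
        uint8_t *data = NULL;
        size_t len = 0;
        if (vt->get_chunk(h, manifest + off, HASH_LEN, &data, &len) < 0) {
            total = -1;
            break;
        }
        total += (long)len;   /* a real consumer would reconstruct state here */
        free(data);           /* output buffers are malloc'd by the backend */
    }
    free(manifest);
    return total;
}

/* Recording stub standing in for a real backend: it only logs call order
 * ('c' put_chunk, 'm' put_manifest, 'M' get_manifest, 'g' get_chunk). */
struct kv_store_v1 { char log[16]; size_t n; };

static void note(kv_store_v1 *s, char c) {
    if (s->n + 1 < sizeof s->log) s->log[s->n++] = c;
}
static int stub_put_chunk(kv_store_v1 *s, const uint8_t *h, size_t hl,
                          const uint8_t *d, size_t dl) {
    (void)h; (void)hl; (void)d; (void)dl;
    note(s, 'c');
    return 0;
}
static int stub_put_manifest(kv_store_v1 *s, const char *name,
                             const uint8_t *d, size_t dl) {
    (void)name; (void)d; (void)dl;
    note(s, 'm');
    return 0;
}
static int stub_get_manifest(kv_store_v1 *s, const char *name,
                             uint8_t **out, size_t *out_len) {
    (void)name;
    note(s, 'M');
    *out = calloc(2, HASH_LEN);      /* pretend the manifest lists two chunks */
    *out_len = *out ? 2 * HASH_LEN : 0;
    return *out ? 0 : -1;
}
static int stub_get_chunk(kv_store_v1 *s, const uint8_t *h, size_t hl,
                          uint8_t **out, size_t *out_len) {
    (void)h; (void)hl;
    note(s, 'g');
    *out = malloc(3);                /* pretend every chunk is 3 bytes */
    *out_len = 3;
    return *out ? 0 : -1;
}
const kv_store_vtable STUB = {
    .version = 1,
    .put_chunk = stub_put_chunk, .get_chunk = stub_get_chunk,
    .put_manifest = stub_put_manifest, .get_manifest = stub_get_manifest,
};
```

Running save_object and then restore_object against STUB leaves the call order in the stub's log, which makes the ordering rule visible: all chunk puts come before the manifest put, and the manifest read comes before any chunk read.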

The vtable

typedef struct kv_store_v1 kv_store_v1;

typedef struct {
    uint32_t version;            // 1 today; 2 if prefetch_chunks is non-NULL

    kv_store_v1 * (*open)(const char * uri);
    void          (*close)(kv_store_v1 * self);

    int (*put_chunk)(kv_store_v1 * self,
                     const uint8_t * hash, size_t hash_len,
                     const uint8_t * data, size_t data_len);

    int (*get_chunk)(kv_store_v1 * self,
                     const uint8_t * hash, size_t hash_len,
                     uint8_t ** out_data, size_t * out_len);

    int (*put_manifest)(kv_store_v1 * self,
                        const char * name,
                        const uint8_t * data, size_t data_len);

    int (*get_manifest)(kv_store_v1 * self,
                        const char * name,
                        uint8_t ** out_data, size_t * out_len);

    int (*delete_manifest)(kv_store_v1 * self, const char * name);

    /* version 2 */
    int (*prefetch_chunks)(kv_store_v1 * self,
                           const uint8_t * hashes,
                           size_t hash_len, size_t n_hashes);
} kv_store_vtable;

A backend ships as a shared object exporting one symbol:

const kv_store_vtable * kv_store_get_vtable(void);

The consumer does dlopen("libkv_store_<scheme>.{so,dylib}"), dlsym("kv_store_get_vtable"), calls it once to obtain the vtable, then vtable.open(uri) to spin up an instance.
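A minimal sketch of that loading sequence, assuming a POSIX host with dlfcn.h. The helper name load_backend and its error messages are hypothetical; the ABI types are repeated so the block compiles standalone. On older glibc versions, linking may additionally need -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct kv_store_v1 kv_store_v1;
typedef struct {
    uint32_t version;
    kv_store_v1 *(*open)(const char *);
    void (*close)(kv_store_v1 *);
    int (*put_chunk)(kv_store_v1 *, const uint8_t *, size_t, const uint8_t *, size_t);
    int (*get_chunk)(kv_store_v1 *, const uint8_t *, size_t, uint8_t **, size_t *);
    int (*put_manifest)(kv_store_v1 *, const char *, const uint8_t *, size_t);
    int (*get_manifest)(kv_store_v1 *, const char *, uint8_t **, size_t *);
    int (*delete_manifest)(kv_store_v1 *, const char *);
    int (*prefetch_chunks)(kv_store_v1 *, const uint8_t *, size_t, size_t);
} kv_store_vtable;

typedef const kv_store_vtable *(*kv_store_get_vtable_fn)(void);

/* Load a backend library and return its vtable, or NULL on failure.
 * On success the dlopen handle is deliberately left open: the vtable and
 * its function pointers live inside the loaded library. */
const kv_store_vtable *load_backend(const char *lib_path) {
    void *lib = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
    if (lib == NULL) {
        fprintf(stderr, "kv_store: dlopen failed: %s\n", dlerror());
        return NULL;
    }
    kv_store_get_vtable_fn get_vtable;
    /* POSIX-recommended idiom for turning dlsym's void * into a function pointer. */
    *(void **)(&get_vtable) = dlsym(lib, "kv_store_get_vtable");
    if (get_vtable == NULL) {
        fprintf(stderr, "kv_store: %s does not export kv_store_get_vtable\n", lib_path);
        dlclose(lib);
        return NULL;
    }
    const kv_store_vtable *vt = get_vtable();   /* called once */
    if (vt == NULL || vt->version < 1) {
        dlclose(lib);
        return NULL;
    }
    return vt;
}
```

A consumer would then call vt->open(uri) to obtain a handle and treat a NULL return as a failed open.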

Method semantics

open(uri) -> kv_store_v1 *

Construct a backend instance from a URI. The URI shape is backend-specific; the only requirement is that the consumer passes through exactly what it received on its CLI.

Examples:

  • memkv://10.0.0.1:9900/llama-prod
  • redis://cache.svc:6379/3
  • s3://my-bucket/kv/

Return NULL on failure. Errors should be logged to stderr by the backend; the consumer reports a generic dlopen-or-open failure.

close(self)

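Release every resource the backend opened. The handle MUST NOT be used after close. The consumer will not call close twice on the same handle, but it may pass NULL if open failed mid-flight, so backends should treat close(NULL) as a no-op. Tolerating a repeated close is a courtesy, not a requirement.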

put_chunk(self, hash, hash_len, data, data_len) -> int

Store data (length data_len) under the binary key hash (length hash_len). Idempotent: putting an already-present hash is a no-op on the wire and on disk.

Returns 0 on success, 1 if the chunk already existed (the consumer counts these as dedup hits), <0 on error.

The backend MAY buffer the put in memory and flush it at the next put_manifest call. Two consequences for the consumer:

  1. A put is not guaranteed to be visible to a different reader until the matching put_manifest has returned.
  2. Calls MUST be ordered: every put_chunk for a save MUST land before the matching put_manifest.

The MemKV reference backend buffers and flushes; the in-tree local-fs backend writes through immediately.

get_chunk(self, hash, hash_len, out_data, out_len) -> int

Fetch the bytes stored under hash. Returns 0 on success, <0 on error or missing chunk.

On success, the backend allocates *out_data with the C malloc function and writes *out_len. The consumer is responsible for freeing the buffer when done.

The buffer's bytes MUST be byte-identical to what was passed in to put_chunk. Backends that compress, encrypt, or shard internally must reverse those transformations transparently.

put_manifest(self, name, data, data_len) -> int

Atomically publish data under the string key name. After this returns successfully, every chunk previously put_chunk'd on this handle MUST be readable.

Atomicity is the point: a concurrent reader either sees the previous value of the manifest (if any) or the new value, never a partial write. On a local FS this is tmp + rename(2); on MemKV it's a single PUT.

Returns 0 on success, <0 on error.
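For a filesystem-backed store, the tmp + rename approach mentioned above could be sketched as follows. The function name write_manifest_atomic and the ".tmp" suffix are assumptions for illustration. The sketch relies on POSIX rename semantics (replacing an existing file atomically), does not fsync for durability, and does not guard against two writers using the same temporary name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write the manifest to "<path>.tmp", then rename it over <path>.
 * A concurrent reader opening <path> sees either the old file or the
 * complete new one, never a partially written manifest.
 * Returns 0 on success, -1 on failure. */
int write_manifest_atomic(const char *path, const uint8_t *data, size_t len) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;   /* path too long */

    FILE *f = fopen(tmp, "wb");
    if (f == NULL) return -1;

    int ok = fwrite(data, 1, len, f) == len;
    if (fclose(f) != 0) ok = 0;        /* fclose flushes; a failure means lost bytes */

    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);                   /* leave no stray temp file behind */
        return -1;
    }
    return 0;
}
```

A production backend would likely also fsync the file and its directory before reporting success, and use a unique temporary name per writer.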

get_manifest(self, name, out_data, out_len) -> int

Same shape as get_chunk but keyed by name (a NUL-terminated C string). Memory ownership: *out_data is malloc'd by the backend and free'd by the consumer.

delete_manifest(self, name) -> int

Best-effort delete of the manifest under name. Not finding the key is success, not failure. Whether to delete the chunks the manifest referenced is not specified by the ABI — most backends leave chunks alone (they may be referenced from other manifests) and rely on a separate refcount/GC pass. The consumer MUST be prepared for either behaviour.

prefetch_chunks(self, hashes, hash_len, n_hashes) -> int (vtable v2)

Hint that the consumer is about to issue get_chunk for each of n_hashes chunks (laid out as n_hashes contiguous hash_len-byte keys in hashes). The backend may batch-fetch them in one round-trip and serve subsequent get_chunk calls from a local cache.

Returns 0 on success, <0 on error. Failure is non-fatal — the consumer falls back to per-chunk get_chunk calls.

May be NULL on a v1 vtable. Consumers MUST check vtable.version >= 2 && vtable.prefetch_chunks != NULL before calling.
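The guard and the key layout can be sketched like this. The helper names try_prefetch and nth_hash are hypothetical; the ABI types are repeated so the block compiles standalone. Note that the version check comes first, so the trailing field is never read on a v1 vtable.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct kv_store_v1 kv_store_v1;
typedef struct {
    uint32_t version;
    kv_store_v1 *(*open)(const char *);
    void (*close)(kv_store_v1 *);
    int (*put_chunk)(kv_store_v1 *, const uint8_t *, size_t, const uint8_t *, size_t);
    int (*get_chunk)(kv_store_v1 *, const uint8_t *, size_t, uint8_t **, size_t *);
    int (*put_manifest)(kv_store_v1 *, const char *, const uint8_t *, size_t);
    int (*get_manifest)(kv_store_v1 *, const char *, uint8_t **, size_t *);
    int (*delete_manifest)(kv_store_v1 *, const char *);
    int (*prefetch_chunks)(kv_store_v1 *, const uint8_t *, size_t, size_t);
} kv_store_vtable;

/* Consumer side: issue the hint only when the backend supports it.
 * Returns 1 if the hint was issued and accepted, 0 if it was skipped or
 * refused. Either way the caller proceeds with per-chunk get_chunk calls. */
int try_prefetch(const kv_store_vtable *vt, kv_store_v1 *h,
                 const uint8_t *hashes, size_t hash_len, size_t n_hashes) {
    if (vt->version < 2 || vt->prefetch_chunks == NULL) return 0;
    return vt->prefetch_chunks(h, hashes, hash_len, n_hashes) == 0;
}

/* Backend side: the i-th key in the contiguous hashes array. */
const uint8_t *nth_hash(const uint8_t *hashes, size_t hash_len, size_t i) {
    return hashes + i * hash_len;
}
```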

Memory ownership

pointer                      lifetime
input hash / data / name     borrowed for the call only; backend MUST NOT retain after return
output *out_data             allocated by backend with malloc; consumer frees with free
kv_store_v1 * handle         owned by consumer; opaque to anyone else; closed once with close

Backends in non-C languages (Rust, Go, C++) MUST funnel the output allocation through the C malloc so the consumer's free works. The reference Rust backend declares it explicitly, along the lines of extern "C" { fn malloc(size: usize) -> *mut c_void; }.
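On the backend side, the output rule usually reduces to one small helper that copies internal bytes into a fresh malloc'd buffer. The sketch below uses a hypothetical name, kv_emit_output, and is not taken from either reference implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes from backend-owned storage into a buffer the consumer
 * can release with free(). Returns 0 on success, -1 on allocation failure. */
int kv_emit_output(const uint8_t *src, size_t len,
                   uint8_t **out_data, size_t *out_len) {
    /* malloc(0) may legally return NULL; allocate at least one byte so a
     * NULL result always means failure. */
    uint8_t *buf = malloc(len ? len : 1);
    if (buf == NULL) return -1;
    if (len) memcpy(buf, src, len);
    *out_data = buf;
    *out_len = len;
    return 0;
}
```

The consumer's side of the contract is then simply to read *out_data for *out_len bytes and call free(*out_data) exactly once.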

Threading

A single kv_store_v1 handle MUST be safe to call from multiple threads concurrently; the backend cannot assume single-threaded access. In practice the consumer (llama-server's slot queue, etc.) typically drives one handle from a serialized worker thread, so the ABI does not require fine-grained concurrency primitives: a coarse lock around shared state is enough.

open and close are called once per backend lifetime by the consumer; threading on those is not a concern.
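One simple way to meet the threading requirement is a single mutex inside the handle, taken at every entry point. The following is a sketch assuming POSIX threads; the handle layout and the chunks_stored counter are hypothetical stand-ins for real backend state.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct kv_store_v1 kv_store_v1;

/* Hypothetical handle: one mutex guarding all mutable backend state. */
struct kv_store_v1 {
    pthread_mutex_t lock;
    size_t chunks_stored;   /* stand-in for real shared state */
};

int backend_put_chunk(kv_store_v1 *self,
                      const uint8_t *hash, size_t hash_len,
                      const uint8_t *data, size_t data_len) {
    (void)hash; (void)hash_len; (void)data; (void)data_len;
    if (pthread_mutex_lock(&self->lock) != 0) return -1;
    self->chunks_stored++;              /* real storage work would go here */
    pthread_mutex_unlock(&self->lock);
    return 0;
}
```

Because the typical consumer serializes calls anyway, the lock is rarely contended, which is why a coarse lock is a reasonable starting point.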

Versioning

The version field on the vtable is the wire-version of the ABI.

  • version = 1 — open, close, put_chunk, get_chunk, put_manifest, get_manifest, delete_manifest. prefetch_chunks is NULL or undefined.
  • version = 2 — adds prefetch_chunks as the trailing field.

New methods are always added at the end of the struct so a v1-only consumer reading a v2 vtable still gets the v1 layout correctly. New consumers MUST check version >= N (and that the function pointer is non-NULL) before invoking any method added in version N.

Breaking changes (renaming methods, changing signatures) bump the struct identity entirely (kv_store_v2 etc.), and the exported symbol the consumer looks up becomes kv_store_get_vtable_v2. The current ABI is v1 and we expect to stay there.

URI conventions

  • Scheme is the backend identifier and MUST match the cdylib name: scheme foo → libkv_store_foo.{so,dylib}.
  • Authority and path are backend-specific.
  • The consumer MAY trim a trailing / on the URI before calling open. The reference llama.cpp fork strips one trailing /.

The consumer constructs the URI by concatenating --slot-save-path (or its equivalent) with a per-object basename. Backends that want to host multiple tenants on one URI prefix must split the namespace themselves.

Library naming and loading

scheme   "foo"
cdylib   libkv_store_foo.so       (Linux)
         libkv_store_foo.dylib    (macOS)
symbol   kv_store_get_vtable
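A consumer could derive the library file name from a URI roughly as follows. The function name kv_lib_name_for_uri and the KV_LIB_EXT macro are hypothetical; treating a string without "://" as "no plugin" follows the local-FS default described under Reference implementations.

```c
#include <stdio.h>
#include <string.h>

#ifdef __APPLE__
#define KV_LIB_EXT "dylib"
#else
#define KV_LIB_EXT "so"
#endif

/* Write "libkv_store_<scheme>.<ext>" into out.
 * Returns 0 on success, -1 if uri has no usable scheme (for example a
 * plain filesystem path) or out is too small. */
int kv_lib_name_for_uri(const char *uri, char *out, size_t out_size) {
    const char *sep = strstr(uri, "://");
    if (sep == NULL || sep == uri) return -1;
    size_t scheme_len = (size_t)(sep - uri);
    /* Refuse anything that could smuggle a path into the library name. */
    if (memchr(uri, '/', scheme_len) != NULL) return -1;
    int n = snprintf(out, out_size, "libkv_store_%.*s.%s",
                     (int)scheme_len, uri, KV_LIB_EXT);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```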

The consumer searches:

  1. $KV_STORE_LIBRARY_PATH/<libname> if the env var is set.
  2. The system dynamic loader path (LD_LIBRARY_PATH, RPATH, /usr/lib, etc.). On macOS, DYLD_LIBRARY_PATH is stripped from many child processes — consumers that target macOS should honour KV_STORE_LIBRARY_PATH to provide an explicit fallback.

A backend MAY ship multiple cdylibs under one scheme (one per build variant); the consumer uses whichever is found first.
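The search order above can be expressed as an ordered list of strings to try with dlopen. In this sketch the function name kv_search_candidates and the KV_PATH_MAX limit are hypothetical. It relies on standard dlopen behaviour: a name without a slash is resolved through the system loader path.

```c
#include <stdio.h>
#include <stdlib.h>

enum { KV_PATH_MAX = 1024 };

/* Fill cands with the names to try, in order, and return how many were
 * written (0, 1, or 2):
 *   1. $KV_STORE_LIBRARY_PATH/<libname>, if the variable is set;
 *   2. the bare <libname>, which dlopen resolves via the loader path. */
int kv_search_candidates(const char *libname, char cands[2][KV_PATH_MAX]) {
    int count = 0;
    const char *dir = getenv("KV_STORE_LIBRARY_PATH");
    if (dir != NULL && dir[0] != '\0') {
        int n = snprintf(cands[count], KV_PATH_MAX, "%s/%s", dir, libname);
        if (n > 0 && n < KV_PATH_MAX) count++;
    }
    int n = snprintf(cands[count], KV_PATH_MAX, "%s", libname);
    if (n > 0 && n < KV_PATH_MAX) count++;
    return count;
}
```

The consumer would pass each candidate to dlopen in turn and stop at the first one that loads.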

Error reporting

All methods return int:

  • 0 — success.
  • 1 — put_chunk only: chunk already existed (idempotent hit).
  • <0 — failure. The exact value is unspecified; backends should log details to stderr. Consumers treat any negative return as a generic backend failure and fall back to a degraded path (e.g. legacy save format) where possible.

Backends MUST NOT panic across the FFI boundary. Rust backends use std::panic::catch_unwind at every entry point and convert panics to a logged error plus a <0 return.
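On the consumer side, the return-code convention can be folded into a small tally. The type and function names here are hypothetical. Counting an unexpected positive value as a failure is this sketch's own conservative choice, since the spec only defines 0, 1, and negative returns.

```c
#include <stddef.h>

typedef struct {
    size_t stored;      /* put_chunk returned 0 */
    size_t dedup_hits;  /* put_chunk returned 1 */
    size_t failed;      /* negative, or a value the spec does not define */
} kv_put_stats;

/* Record one put_chunk return value.
 * Returns nonzero when the caller should abandon the save and fall back
 * to its degraded path. */
int kv_tally_put(kv_put_stats *st, int rc) {
    if (rc == 0) { st->stored++; return 0; }
    if (rc == 1) { st->dedup_hits++; return 0; }
    st->failed++;
    return 1;
}
```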

Reference implementations

  • Local FS — in-tree in the llama.cpp fork (tools/server/slot_v2.cpp → local_fs_store). Writes the manifest at <dir>/<name> and chunks at <dir>/chunks/<hash[0]>/<hash>, with 16-way subdirectory fanout. Atomic via tmp + rename. Default when --slot-save-path is a plain filesystem path (no ://).

  • MemKV — out-of-tree, shipped with the MemKV release as the kv-store-memkv crate. Cdylib libkv_store_memkv.{so,dylib}. URI: memkv://<host>:<port>/<namespace>. Auth via env MEMKV_AUTH_KEY. Buffered puts flushed via EXISTS + batch_put. Implements prefetch_chunks via batch_get.
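To make the local-FS chunk layout concrete, here is one possible reading of it as code. This is an assumption, not a transcription of the reference implementation: it supposes the hash is rendered as lowercase hex and that <hash[0]> is the first hex digit, which is what would produce a 16-way fanout. The function name kv_chunk_path is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Build "<dir>/chunks/<first hex digit>/<hex hash>" into out.
 * Assumed layout (see lead-in); returns 0 on success, -1 on bad input
 * or if out is too small. */
int kv_chunk_path(const char *dir, const uint8_t *hash, size_t hash_len,
                  char *out, size_t out_size) {
    static const char HEX[] = "0123456789abcdef";
    char hex[2 * 64 + 1];
    if (hash_len == 0 || hash_len > 64) return -1;
    for (size_t i = 0; i < hash_len; i++) {
        hex[2 * i]     = HEX[hash[i] >> 4];
        hex[2 * i + 1] = HEX[hash[i] & 0x0f];
    }
    hex[2 * hash_len] = '\0';
    int n = snprintf(out, out_size, "%s/chunks/%c/%s", dir, hex[0], hex);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```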

Writing a new backend

Minimal C skeleton:

#include "kv_store_abi.h"
#include <stdlib.h>

struct kv_store_v1 {
    /* whatever the backend needs */
};

static kv_store_v1 *backend_open(const char *uri) {
    /* parse uri, connect, return a heap-allocated handle */
    return NULL;    /* placeholder: NULL tells the consumer that open failed */
}
static void backend_close(kv_store_v1 *s) { /* release */ }

static int backend_put_chunk(kv_store_v1 *s,
                             const uint8_t *h, size_t hl,
                             const uint8_t *d, size_t dl) {
    /* if hash already present: return 1 */
    /* else: store and return 0 */
}

static int backend_get_chunk(kv_store_v1 *s,
                             const uint8_t *h, size_t hl,
                             uint8_t **out, size_t *out_len) {
    void *buf = malloc(...);   /* C allocator only (the consumer calls free); never new[] or a custom arena */
    /* fill buf */
    *out = buf;
    *out_len = ...;
    return 0;
}

/* put_manifest, get_manifest, delete_manifest similarly */

static const kv_store_vtable VT = {
    .version = 1,    /* or 2 if prefetch_chunks is implemented */
    .open = backend_open,
    .close = backend_close,
    .put_chunk = backend_put_chunk,
    .get_chunk = backend_get_chunk,
    .put_manifest = backend_put_manifest,
    .get_manifest = backend_get_manifest,
    .delete_manifest = backend_delete_manifest,
    .prefetch_chunks = NULL,    /* or fill at v2 */
};

const kv_store_vtable *kv_store_get_vtable(void) { return &VT; }

Build as a shared library, expose kv_store_get_vtable, place the result on the consumer's loader path. No registration step, no configuration file — dlopen does the rest.

For Rust backends, see the kv-store-memkv crate in the MemKV repository as a working example: it exports the same vtable from a #[repr(C)] struct with static lifetime, wraps every entry point in std::panic::catch_unwind, and routes output buffers through a manual malloc for symmetry with the consumer's free.

Vendor checklist

Before shipping a backend, verify:

  • kv_store_get_vtable is the only exported symbol the consumer needs (everything else can be hidden).
  • The vtable's version is set to the highest level you implement; prefetch_chunks is NULL if you stayed at v1.
  • put_chunk is idempotent on duplicate hashes (returns 1, no side effects).
  • put_manifest is atomic against concurrent readers.
  • get_chunk and get_manifest allocate output via malloc so consumers can free.
  • Panics or unhandled exceptions cannot cross the FFI boundary.
  • The cdylib is named libkv_store_<scheme>.{so,dylib} exactly.
  • Auth credentials, region, and other secrets are configured via environment variables, not the URI (so they don't end up in process listings or log lines).

Why this exists

KV state offload is the single largest prefill latency win for long-context inference on workstation- and laptop-class hardware. Mac mini and Mac Studio hosts running llama.cpp hit the same wall as servers — fresh requests re-prefill the same 5–30k tokens — but the heavyweight transfer abstractions GPU-stack engines use don't fit.

This ABI is the smallest seam that lets a storage vendor ship one plugin an inference engine can dlopen: a vtable of eight function pointers in v2 (seven required, plus the optional prefetch_chunks hint), two namespaces, no opinions about the bytes. The reference consumer today is minio/llama.cpp on TCP-only hosts; the same plugin works for any future consumer that adopts the contract.

The same MemKV cluster serves both workstation deployments and larger fleets, so a system prompt populated on a developer's Mac mini and one populated server-side share the same chunk pool (modulo model-id matching).
