Configuration
Reference for the server config (/etc/memkv/config.yaml) and the client config consumed by the NIXL plugin and LD_PRELOAD shim via MEMKV_CONFIG.
You rarely need to handcraft the server config — memkv setup detects drives
and NICs and emits a working config.yaml to stdout; hand-tune from there.
See Quick Start. The client config (near the
bottom of this page) is a separate yaml read by the NIXL plugin and
LD_PRELOAD shim.
Minimal config
A handful of fields vary between deployments; everything else has a sane default. Only override when you have a reason (covered below).
network:
  address: 0.0.0.0:9900 # data plane; admin auto-binds at port + 1
memory:
  block_size: 2 MiB
  max_size: 247 GiB # RDMA-pinned RAM (size to your box)
rdma:
  device: mlx5_0
storage:
  mode: direct # direct | file | memory
  block_size: 4 MiB
  drives:
    - { media: /dev/nvme0n1 }
    - { media: /dev/nvme1n1 }
    - { media: /dev/nvme2n1 }

That's the whole working config for a typical bare-metal node. The sections below explain each field, when the default is right, and what to think about before changing it.
Network
network:
  address: 0.0.0.0:9900
  # tls_cert: /etc/memkv/tls/fullchain.pem
  # tls_key: /etc/memkv/tls/privkey.pem
  # auth_key: <64 hex chars, 32 bytes>

address — The single listen address for the server. The TCP listener carries the RDMA bootstrap (Connect) and, for clients without an RDMA NIC, the inline-bulk codes (TcpPut / TcpGet / TcpDelete + Exists); RDMA-capable clients move steady-state control onto RC SEND/RECV after bootstrap. The admin HTTP(S) endpoint binds at the same IP, port + 1 (so 0.0.0.0:9901 when address is 0.0.0.0:9900). Change the bind IP to scope memkv to a specific NIC on multi-homed hosts; change the port if 9900/9901 conflicts.
tls_cert / tls_key — Set both to enable HTTPS on the admin endpoint. Use whenever the admin port is reachable beyond a trust boundary (cross-zone scrapes, ingress controllers). Setting only one is rejected at startup.
auth_key — HMAC-SHA256 shared key, 64 hex chars (32 bytes), used to authenticate every wire message between client and server. Required — the server refuses to start without one (loopback-only dev hosts can omit the key; the server derives a deterministic dev key from $HOME instead and logs a warning). Generate with openssl rand -hex 32. Can also be supplied via the MEMKV_AUTH_KEY env var (which overrides the yaml field if both are set); the Helm chart always sources it from a Kubernetes Secret. Clients must use the same key. Drift window between peers is ±60 s, so keep clocks in sync (chrony / ntpd).
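As a rough illustration of the key format described above, the sketch below generates a 32-byte key as 64 hex characters (the same shape `openssl rand -hex 32` produces) and checks a candidate value before it goes into a config file. The function names are hypothetical, and accepting upper-case hex digits is an assumption; memkv's own validation may be stricter.

```python
import re
import secrets

# 64 hex characters = 32 bytes. Accepting upper-case digits is an assumption.
HEX_KEY = re.compile(r"[0-9a-fA-F]{64}")


def generate_auth_key() -> str:
    """Return 32 random bytes encoded as 64 hex characters."""
    return secrets.token_hex(32)


def is_valid_auth_key(key: str) -> bool:
    """True if key is exactly 64 hex characters."""
    return HEX_KEY.fullmatch(key) is not None


key = generate_auth_key()
print(len(key), is_valid_auth_key(key))
```

Running a check like this in a deployment script catches truncated or mistyped keys before the server or client refuses to start.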
Memory
memory:
  block_size: 2 MiB
  max_size: 247 GiB

Sizes accept human units — B, KiB, MB, MiB, GB, GiB, TB, TiB, with optional whitespace.
max_size — The single most important tuning knob. This is RDMA-pinned host memory used as the staging area for all client I/O. Size it to ~75–85% of free RAM on the box, after subtracting OS, page cache, and other co-resident services. The default of 1 GiB is for development only — production deployments need hundreds of gigabytes.
block_size — Allocator slab inside the pool. Default 2 MiB matches the typical RDMA WR size and rarely needs changing. Raise only if you measure internal fragmentation hurting cache locality on very large objects. The implied block-count cap is max_size / block_size.
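To make the sizing arithmetic concrete, here is a minimal sketch of a size parser for the units listed above, plus the implied block-count cap (max_size / block_size). Treating MB/GB/TB as decimal and KiB/MiB/GiB/TiB as binary is an assumption, not confirmed memkv behaviour, and `parse_size` and `block_count_cap` are hypothetical helper names.

```python
import re

# Multipliers for the units the page lists. Decimal MB/GB/TB versus
# binary KiB/MiB/GiB/TiB is an assumption made for this sketch.
UNITS = {
    "B": 1,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
    "MB": 10**6, "GB": 10**9, "TB": 10**12,
}


def parse_size(text: str) -> int:
    """Parse strings such as '2 MiB' or '247GiB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def block_count_cap(max_size: str, block_size: str) -> int:
    """Number of whole blocks that fit in the pool."""
    return parse_size(max_size) // parse_size(block_size)


print(block_count_cap("247 GiB", "2 MiB"))  # 126464
```

With the values from the minimal config, a 247 GiB pool of 2 MiB blocks holds 126,464 blocks.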
RDMA
rdma:
  device: mlx5_0
  port: 1
  gid_index: 0 # 0 = auto-detect the routable RoCEv2 GID
  traffic_class: 0
  service_level: 0
  mtu: 4096
  cq_depth: 4096
  timeout: 14
  retry_count: 7
  num_dcis: 256
  dc_key: 0
  num_worker_threads: 64
  num_responder_threads: 16

These defaults are tuned for a typical LAN-attached RoCE/IB cluster. Most operators only set device.
device — RDMA NIC to bind to. mlx5_0 is the first Mellanox/NVIDIA HCA. Set this when you have multiple HCAs and want to pin memkv to a specific one (e.g., dedicating mlx5_0 to storage and mlx5_1 to a separate fabric). memkv setup auto-detects.
port — Physical port on the HCA. Leave at 1 unless you have a dual-port card (e.g., ConnectX-7) and want memkv on the second port. Port enumeration is per-HCA, not per-host.
gid_index — Selects which GID entry the QPs use. 0 auto-detects the routable RoCEv2 GID (the IPv4-mapped entry) from the NIC's GID table. This matters for lossless RoCE: GID index 0 on mlx5 is the RoCEv1 (Ethernet) entry, which has no IP header and therefore no DSCP, so a DSCP-trust fabric leaves it on the best-effort queue regardless of traffic_class. Auto-detect avoids that trap. Set a non-zero value to pin a specific GID entry (mixed RoCEv1/v2 environments, or VLAN-tagged subnets); if no RoCEv2 GID is found, it falls back to index 0 with a warning.
traffic_class — RoCEv2 traffic class (the IP DSCP/ToS byte stamped on egress). 0 auto-detects it from the NIC's DCB config: MemKV reads the lossless PFC priority and the DSCP mapped to it, so RDMA lands in the same lossless+ECN queue your fabric configured. Set a non-zero value (the ToS byte, i.e. DSCP << 2 — e.g. 104 for DSCP 26) to pin it explicitly and skip detection. If detection finds nothing the value stays 0 (best-effort) and a warning is logged. Also settable via MEMKV_RDMA_TRAFFIC_CLASS.
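When pinning traffic_class by hand, the value is the ToS byte rather than the DSCP itself. The conversion described above can be sketched as follows; the helper names are illustrative only.

```python
def tos_from_dscp(dscp: int) -> int:
    """Convert a 6-bit DSCP value to the ToS byte (DSCP << 2)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def dscp_from_tos(tos: int) -> int:
    """Recover the DSCP from a ToS byte, dropping the two ECN bits."""
    return tos >> 2


print(tos_from_dscp(26))  # 104
```

So a fabric that maps DSCP 26 to its lossless queue corresponds to traffic_class: 104.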
service_level — RoCEv2 service level (the AH SL, 0–7) stamped on egress alongside traffic_class. On mlx5 the egress scheduling class (which PFC queue the traffic uses, hence the lossless behaviour) follows the SL, not the DSCP — so the SL must match the lossless PFC priority or RDMA stays on the best-effort queue even with the right DSCP. 0 takes the priority auto-detected with traffic_class; set a non-zero value to pin it when traffic_class is also pinned. Also settable via MEMKV_RDMA_SERVICE_LEVEL.
mtu — RDMA path MTU. 4096 requires jumbo frames end-to-end (HCA + switch + peer). Drop to 1024 or 2048 only if your fabric can't carry 4K MTU — performance scales with MTU, so size up wherever the network allows. The value must not exceed the port's negotiated active MTU: a path MTU larger than the link drops every RDMA data packet, and MemKV refuses such a configuration at startup rather than stalling the data path. Set the same value on the client (MEMKV_RDMA_MTU) and the server, and configure the fabric to match — see the RoCEv2 setup runbook.
cq_depth — Completion queue size. Sized for num_dcis × inflight margin. Raise if you see "CQ overflow" in logs under burst load; otherwise leave alone.
timeout — Local ACK timeout per QP, expressed as 4.096 µs × 2^timeout. 14 ≈ 67 ms, generous for LAN where lost packets are rare. Lower (e.g., 12 ≈ 17 ms) on ultra-low-latency fabrics for faster failure detection; raise on lossy or oversubscribed networks.
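The timeout encoding is exponential, so small changes to the setting move the actual delay a long way. A short sketch of the formula above (the function name is hypothetical):

```python
def ack_timeout_ms(timeout: int) -> float:
    """Local ACK timeout in milliseconds: 4.096 µs × 2^timeout."""
    return 4.096e-3 * (2 ** timeout)


for setting in (12, 14, 16):
    print(setting, round(ack_timeout_ms(setting), 1), "ms")
```

Each step up doubles the timeout, so going from 14 to 16 quadruples it.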
retry_count — Number of retransmissions before a connection is declared dead. 7 is the RDMA spec maximum. Drop to 3–4 if you want faster client failover; the default favors resilience over snappiness.
num_dcis — Size of the DC Initiator pool (mlx5 only). DCIs are acquired round-robin per client request. 256 saturates most NICs. Increase if you see DCI contention under very high concurrency.
dc_key — 24-bit shared key between server and client for DC fabric-level auth. 0 means no key check (open). Set to a deployment-specific value when you want fabric-level isolation between memkv clusters sharing the same RoCE network.
num_worker_threads — Parallelism for processing client writes/reads after RDMA arrival. Scale with available cores; if these threads peg CPU, raise — if they idle while QPs back up, the bottleneck is elsewhere.
num_responder_threads — Parallelism for posting response SENDs back to clients. Usually the smaller pool; raise only if SEND posting becomes a hot path.
Storage
storage:
  mode: direct # direct | file | memory
  block_size: 4 MiB
  flap_threshold: 5
  drives:
    - media: /dev/nvme0n1
      max_size: 2 TiB # cap usage to a subset of the physical drive
    - media: /dev/nvme1n1 # max_size omitted → use the full device
  # max_size: 16 GiB # only used when mode: memory (no drives)

mode —
- direct (default): one block lives on one drive. Raw NVMe via O_DIRECT + io_uring. Fastest path, no parity overhead, no cross-drive redundancy. Linux only.
- memory: RAM-only backend, no persistence. For development, benchmarks, and ephemeral caches. Capacity comes from the top-level storage.max_size.
- file: mmap-backed regular files. No O_DIRECT, no NVMe ioctls — runs on any filesystem that supports mmap. Intended for single-node dev, CI, and the GPU-host co-located case where a spare NVMe isn't available. Bandwidth is bounded by the host filesystem; use direct for production. Each drives entry becomes one backing file, preallocated at max_size (default 4 GiB).
block_size — Object granularity on drive. 4 MiB balances metadata overhead (smaller = more index entries) against internal fragmentation (larger = more wasted space on sub-block writes). Drop to 1 MiB for many-small-objects workloads, raise to 8–16 MiB for streaming or large-object workloads.
flap_threshold — Failures tolerated before a drive is blacklisted and removed from the rotation. 5 is a reasonable middle ground. Lower (1–2) for paranoid fast-fail in production; raise for known-flaky drives you want to keep in service while you swap them. Ignored in file and memory modes.
drives[].media — Path to the NVMe device (mode: direct) or backing file (mode: file). For direct, raw block devices (/dev/nvmeXn1) — drives are formatted on first use, don't include OS drives. For file, regular file paths — files are created on first open at max_size.
drives[].max_size — Cap on capacity used by this drive. Omit to use the full physical capacity for direct, or the 4 GiB default for file. Useful when you want memkv to use only a subset of a large drive.
storage.max_size — Top-level capacity for mode: memory. Sizes the in-RAM keyspace. Ignored for direct and file.
Logging
logging:
  level: info # trace | debug | info | warn | error
  format: json # json | text

level — info is right for production. Drop to warn or error if log volume is a concern; raise to debug or trace only when debugging — trace is very chatty.
format — json for log shippers (Loki, ELK, Datadog). text for tailing locally during development.
Client configuration (MEMKV_CONFIG)
The NIXL plugin (libplugin_MEMKV.so) and the LD_PRELOAD shim
(libmemkv_preload.so) load their settings from memkv-client. The
canonical layout is a yaml file referenced by the MEMKV_CONFIG env
var, with MEMKV_* env vars layered on top as overrides.
Resolution order, lowest to highest priority:
- Built-in defaults (one local server, one rail, 256 MB cache, 8 conns)
- MEMKV_CONFIG yaml file, if the env var is set
- MEMKV_* env vars
# /etc/memkv/client.yaml — pointed at by `export MEMKV_CONFIG=...`
servers:
  - 10.0.0.1:9900
  - 10.0.0.2:9900
rdma_devices:
  - mlx5_0
  - mlx5_1
bind_addresses: # optional, one per server slot; ~ = bind any
  - 192.168.1.10
  - 192.168.1.11
cache_size_mb: 256
connect_timeout_ms: 5000
gid_index: 0
num_connections: 8
license: /etc/memkv/minio.license
auth_key: <64 hex chars, 32 bytes>
# transport: auto # rdma | tcp; default auto

The client uses the configured server address verbatim — the TCP data port carries the entire wire protocol (control, RDMA bootstrap, and inline-bulk codes for hosts without RDMA). There's no separate transport target; the client picks how to use it.
A complete reference copy ships at
deploy/examples/client-config.yaml.
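The layering described above (defaults, then the yaml file, then env vars) can be sketched as a dictionary merge. This is an illustration only, not memkv-client's actual loader: the yaml file is assumed to be parsed into a dict already, comma-separated list env vars are an assumption, and only a few fields are modelled.

```python
from typing import Any, Mapping

DEFAULTS = {
    "servers": ["127.0.0.1:9900"],
    "rdma_devices": ["mlx5_0"],
    "bind_addresses": [None],
    "cache_size_mb": 256,
    "connect_timeout_ms": 5000,
    "num_connections": 8,
}

# Integer fields and the env vars that override them.
INT_OVERRIDES = (
    ("cache_size_mb", "MEMKV_CACHE_SIZE_MB"),
    ("connect_timeout_ms", "MEMKV_CONNECT_TIMEOUT_MS"),
    ("num_connections", "MEMKV_NUM_CONNECTIONS"),
)


def resolve_client_config(yaml_cfg: Mapping[str, Any],
                          env: Mapping[str, str]) -> dict:
    cfg = {**DEFAULTS, **yaml_cfg}  # yaml overrides built-in defaults
    if "MEMKV_SERVERS" in env:
        cfg["servers"] = env["MEMKV_SERVERS"].split(",")
        # Overriding servers resets bind_addresses to per-server None.
        cfg["bind_addresses"] = [None] * len(cfg["servers"])
    if "MEMKV_BIND_ADDRESSES" in env:
        # An empty entry means "bind any".
        cfg["bind_addresses"] = [
            addr or None for addr in env["MEMKV_BIND_ADDRESSES"].split(",")
        ]
    for field, var in INT_OVERRIDES:
        if var in env:
            cfg[field] = int(env[var])
    cfg["num_connections"] = max(1, cfg["num_connections"])  # clamp to >= 1
    return cfg


cfg = resolve_client_config(
    {"servers": ["10.0.0.1:9900", "10.0.0.2:9900"],
     "bind_addresses": ["192.168.1.10", "192.168.1.11"]},
    {"MEMKV_SERVERS": "10.0.0.3:9900", "MEMKV_NUM_CONNECTIONS": "0"},
)
print(cfg["servers"], cfg["bind_addresses"], cfg["num_connections"])
```

In this run the env var replaces both yaml servers, the yaml bind addresses are discarded with them, and the out-of-range connection count is clamped to 1.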
Field-by-field
| Field | Type | Default | Override env var | Notes |
|---|---|---|---|---|
| servers | [host:port, …] | 127.0.0.1:9900 | MEMKV_SERVERS | Data-plane endpoints. Setting MEMKV_SERVERS resets bind_addresses to per-server None; set MEMKV_BIND_ADDRESSES (separately) to pin source IPs after the override. |
| rdma_devices | [device, …] | mlx5_0 | MEMKV_RDMA_DEVICES / MEMKV_RDMA_DEVICE | One device per server slot; MEMKV_RDMA_DEVICE (singular) is the single-rail alias. |
| bind_addresses | [ip-or-null, …] | [~] | MEMKV_BIND_ADDRESSES | Source IP per server. ~ (yaml null) or empty env entry means "bind any". Length must match servers. |
| cache_size_mb | usize | 256 | MEMKV_CACHE_SIZE_MB | Per-engine MR cache. |
| connect_timeout_ms | u64 | 5000 | MEMKV_CONNECT_TIMEOUT_MS | Initial TCP connect timeout per server. |
| gid_index | u8 | 0 | — | RDMA GID index. No env var; set in yaml or accept the default. |
| traffic_class | u8 | 0 | MEMKV_RDMA_TRAFFIC_CLASS | RoCEv2 traffic class (ToS byte). 0 auto-detects the lossless DSCP from the NIC's DCB config; non-zero pins it (DSCP << 2, e.g. 104 for DSCP 26). |
| service_level | u8 | 0 | MEMKV_RDMA_SERVICE_LEVEL | RoCEv2 service level (AH SL). On mlx5 the egress scheduling class follows the SL, so it must match the lossless priority. 0 takes the priority auto-detected with traffic_class. |
| mtu | u32 | 4096 | MEMKV_RDMA_MTU | RDMA path MTU (256/512/1024/2048/4096). Must match the server's rdma.mtu and not exceed the port's active MTU; startup is refused otherwise. See the RoCEv2 runbook. |
| num_connections | usize | 8 | MEMKV_NUM_CONNECTIONS | Per-rail connection-pool size, clamped to >= 1. |
| license | string (JWT \| path) | — | MEMKV_LICENSE | Highest-priority license source. Inline JWT or path to a license file. See License. |
| auth_key | string (hex) | — | MEMKV_AUTH_KEY | Required. 64 hex chars (32 bytes). Must match the server's network.auth_key. See Authentication. |
| transport | auto \| rdma \| tcp | auto (Linux) / tcp (macOS) | MEMKV_TRANSPORT | Which transport the client uses. Defaults to tcp on macOS / non-Linux because the RDMA stack isn't compiled in there. See Transport selection. |
Transport selection
memkv-client speaks two data transports — RDMA and TCP — and both
are fully supported. RDMA delivers the lowest latency where an HCA
and a clean fabric are available; TCP is a first-class high-throughput
path, not a degraded shim, for the many environments where RDMA isn't
viable (routed or multi-hop networks, cloud, mixed NICs). transport
controls which one the client uses:
- auto (default): try RDMA on the configured rdma_devices first, fall back to TCP if no RDMA NIC is present at boot or any in-flight RDMA op fails. The fallback latches once — subsequent requests go straight to TCP. macOS / hosts without an HCA land in TCP from startup. The downgrade is logged at WARN.
- rdma: strict. The client errors at startup if rdma_devices is empty or the build is non-Linux (macOS / arm64 dev box). In-flight RDMA failures propagate instead of silently switching transports — you get a clear error pointing at the misconfiguration. Use this in production where RDMA must be the actual data path.
- tcp: skip RDMA entirely; no QP setup. The configured server address is the TCP target directly. To sustain throughput on fast links the client opens num_connections sockets per server and pipelines batched reads and writes across them, so TCP is a production-viable transport — not only for co-located dev hosts (e.g. Mac Mini) but for any deployment where RDMA can't reach end to end.
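The three modes above amount to a small state machine. The sketch below models only the selection and latching behaviour; `TransportSelector` and `rdma_available` are hypothetical names, and the real client's internals may differ.

```python
class TransportSelector:
    """Models which transport a client would use under each mode."""

    def __init__(self, mode: str, rdma_available: bool):
        if mode not in ("auto", "rdma", "tcp"):
            raise ValueError(f"unknown transport: {mode!r}")
        if mode == "rdma" and not rdma_available:
            # Strict mode: fail at startup rather than downgrade.
            raise RuntimeError("transport=rdma but no RDMA device is usable")
        self.mode = mode
        self.active = "rdma" if mode != "tcp" and rdma_available else "tcp"

    def on_rdma_failure(self, error: Exception) -> None:
        if self.mode == "rdma":
            raise error  # strict: propagate, never switch transports
        self.active = "tcp"  # auto: latch to TCP for all later requests


selector = TransportSelector("auto", rdma_available=True)
print(selector.active)  # rdma
selector.on_rdma_failure(IOError("RDMA op failed"))
print(selector.active)  # tcp
```

Once the auto selector has latched to TCP it never returns to RDMA in this model, which mirrors the "latches once" behaviour described above.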
The server's TCP listener carries the RDMA bootstrap (Connect)
and, for clients without an RDMA NIC, the inline-bulk codes
(TcpPut / TcpGet / TcpDelete + Exists). RDMA-capable
clients move every other control message onto RC SEND/RECV after
bootstrap. Transport choice is the client's decision.
Port convention. The server binds the data port as TCP; clients
use the configured host:port directly. The admin endpoint
auto-binds at port + 1 with the same host IP — no separate knob.
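For scripts that need to derive the admin endpoint from the data address, the port convention can be sketched as below. `admin_address` is a hypothetical helper, not part of any memkv tooling.

```python
def admin_address(data_address: str) -> str:
    """Return the admin endpoint: same host, data port + 1."""
    host, _, port = data_address.rpartition(":")
    return f"{host}:{int(port) + 1}"


print(admin_address("0.0.0.0:9900"))  # 0.0.0.0:9901
```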
Authentication
Every wire message between client and server is signed with
HMAC-SHA256 over (header || ts_ns) and verified on receipt. The shared key is the auth_key field above (or
the MEMKV_AUTH_KEY env override). Drift window is ±60 s — keep
clocks in sync with chrony / ntpd. There is no unauthenticated mode;
the client refuses to start without a key, and the server refuses to
start without one either (loopback-only deployments are an exception:
both sides derive a deterministic dev key from $HOME).
Generate a key with openssl rand -hex 32. The same value goes on
both sides — server (network.auth_key in /etc/memkv/config.yaml or
MEMKV_AUTH_KEY) and every client (yaml auth_key: here, or
MEMKV_AUTH_KEY env). Keys never appear in logs.
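The signing scheme above can be sketched with the Python standard library. This is an illustration of HMAC-SHA256 over (header || ts_ns) with a ±60 s window, not the actual wire format: encoding ts_ns as an 8-byte big-endian integer and the function names are assumptions.

```python
import hashlib
import hmac
import struct
import time

DRIFT_NS = 60 * 1_000_000_000  # ±60 s drift window


def sign(key_hex: str, header: bytes, ts_ns: int) -> bytes:
    """HMAC-SHA256 over header || ts_ns (ts_ns encoding is assumed)."""
    key = bytes.fromhex(key_hex)
    message = header + struct.pack(">Q", ts_ns)
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key_hex: str, header: bytes, ts_ns: int, tag: bytes,
           now_ns=None) -> bool:
    """Reject stale timestamps first, then compare tags in constant time."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    if abs(now_ns - ts_ns) > DRIFT_NS:
        return False
    return hmac.compare_digest(sign(key_hex, header, ts_ns), tag)


key = "00" * 32  # placeholder key for the example only
ts = time.time_ns()
tag = sign(key, b"example-header", ts)
print(verify(key, b"example-header", ts, tag))
```

A message signed on a host whose clock is more than 60 s off fails verification even with the right key, which is why clock sync matters here.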
License
memkv-client verifies a license at startup. The lookup chain is
the yaml license: field → MEMKV_LICENSE → AISTOR_LICENSE →
MINIO_LICENSE → ./minio.license → $HOME/.<binary_name>/minio.license.
Accepted plans: Free, Enterprise, EnterpriseLite, EnterprisePlus.
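The lookup chain above is a first-match search. A minimal sketch, assuming each source is simply skipped when unset or missing (the function name and return shape are hypothetical, and no license verification is modelled):

```python
from pathlib import Path


def find_license(yaml_license, env, binary_name: str,
                 cwd: Path, home: Path):
    """Return (source, value) for the first license source found, else None."""
    if yaml_license:
        return ("yaml", yaml_license)
    for var in ("MEMKV_LICENSE", "AISTOR_LICENSE", "MINIO_LICENSE"):
        if env.get(var):
            return (var, env[var])
    for path in (cwd / "minio.license",
                 home / f".{binary_name}" / "minio.license"):
        if path.is_file():
            return ("file", str(path))
    return None


found = find_license(
    None,
    {"MINIO_LICENSE": "/opt/minio.license", "AISTOR_LICENSE": "/opt/aistor.license"},
    "memkv",
    Path.cwd(),
    Path.home(),
)
print(found)  # ('AISTOR_LICENSE', '/opt/aistor.license')
```

AISTOR_LICENSE wins over MINIO_LICENSE here because it comes earlier in the chain; the file paths in the example are placeholders.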
Getting a license. All MemKV licenses — Free and paid — are issued
by MinIO SUBNET. Visit min.io/pricing to
request a Free-tier license for evaluation or non-production use, or
to start an Enterprise / EnterpriseLite / EnterprisePlus conversation
with the MinIO team. The license file you receive is a JWT; point
license: (or MEMKV_LICENSE) at either the inline token or the
file path.
Free tier limits. The engine refuses to start under a Free license outside these bounds:
- A single entry in servers. Scale-out across servers requires Enterprise / EnterpriseLite / EnterprisePlus.
- A single entry in storage.drives. memory mode is exempt (no drives semantics).
- ≤ 32 TiB total capacity. For file and memory modes this is checked against the configured drives[].max_size / storage.max_size at boot. For direct it's checked against the physical NVMe capacity once the drives are opened. Either failure surfaces as a clear startup error.
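A pre-flight check for the Free tier limits above could look like the sketch below. It is a simplification: the function name, its arguments, and the error strings are hypothetical, and for direct mode the caller is assumed to pass physical drive capacities.

```python
TIB = 2**40
FREE_MAX_BYTES = 32 * TIB


def free_tier_violations(num_servers: int, mode: str,
                         drive_bytes: list, memory_bytes: int = 0) -> list:
    """Return a list of Free tier limits the configuration would break."""
    problems = []
    if num_servers > 1:
        problems.append("more than one entry in servers")
    # memory mode has no drives, so the single-drive rule does not apply.
    if mode != "memory" and len(drive_bytes) > 1:
        problems.append("more than one entry in storage.drives")
    total = memory_bytes if mode == "memory" else sum(drive_bytes)
    if total > FREE_MAX_BYTES:
        problems.append("total capacity exceeds 32 TiB")
    return problems


print(free_tier_violations(1, "direct", [2 * TIB]))           # []
print(free_tier_violations(2, "file", [4 * 2**30, 4 * 2**30]))
```

The second call reports both the extra server and the extra drive.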
Environment variables
A flat index of every MEMKV_* variable. Client resolution is defaults → MEMKV_CONFIG yaml → env vars (env wins). The server reads /etc/memkv/config.yaml plus the two overrides below.
Server
| Variable | Type | Default | Notes |
|---|---|---|---|
| MEMKV_AUTH_KEY | hex (64 chars) | — | Overrides network.auth_key from yaml. Loopback-only dev hosts may omit it; the server then derives a deterministic dev key from $HOME and warns. |
| MEMKV_LICENSE | JWT \| path | — | Server license-lookup chain. See License. Free tier is limited to a single remote server. |
AISTOR_LICENSE is consulted next, then MINIO_LICENSE for compatibility with the broader MinIO stack — both are non-memkv-specific.
LD_PRELOAD shim
The client variables (MEMKV_SERVERS, MEMKV_AUTH_KEY, MEMKV_TRANSPORT, MEMKV_LICENSE, MEMKV_RDMA_DEVICES, MEMKV_BIND_ADDRESSES, MEMKV_CACHE_SIZE_MB, MEMKV_CONNECT_TIMEOUT_MS, MEMKV_NUM_CONNECTIONS, MEMKV_RDMA_TRAFFIC_CLASS, MEMKV_RDMA_SERVICE_LEVEL) and their yaml counterparts live in the Client configuration table above. One extra variable applies only to the LD_PRELOAD shim:
| Variable | Type | Default | Notes |
|---|---|---|---|
| MEMKV_CACHE_DIR_PREFIX | path | unset | LD_PRELOAD shim only. File I/O at or under this path is rerouted into a memkv cluster instead of hitting the local filesystem. With the var unset (or empty) the shim is a no-op. Match is exact or strictly under (no /foo matching /foo-other). |
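The prefix rule for MEMKV_CACHE_DIR_PREFIX ("exact or strictly under") differs from a plain string-prefix test. A minimal sketch of the rule follows; the function name is hypothetical, and ignoring a trailing slash on the prefix is an assumption.

```python
def is_rerouted(path: str, prefix: str) -> bool:
    """True if path equals prefix or lies strictly under it."""
    if not prefix:
        return False  # unset or empty: the shim is a no-op
    prefix = prefix.rstrip("/") or "/"
    # Compare on a path-component boundary, not a raw string prefix.
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


print(is_rerouted("/foo/cache.bin", "/foo"))    # True
print(is_rerouted("/foo-other/cache.bin", "/foo"))  # False
```

A bare `path.startswith(prefix)` would wrongly capture /foo-other, which is the case the table calls out.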
memkv admin
| Variable | Type | Default | Notes |
|---|---|---|---|
| MEMKV_SERVERS | host:port[,host:port,…] | http://127.0.0.1:9901 | Admin endpoints, comma-separated. Falls back to the default when neither --servers nor the env is set. See CLI reference. |
Advanced tuning
Defaults are tuned for ConnectX-7 / mlx5 hardware on Linux and are the right answer for almost everyone. Override these only when you have measured a specific bottleneck — bad values cause silent latency or throughput regressions, not loud errors.
| Variable | Type | Default | Scope | Notes |
|---|---|---|---|---|
| MEMKV_RDMA_TIMEOUT_SECS | u64 (sec) | 30 | client | Per-op timeout for RDMA send-CQ waits. Hitting this marks the connection unhealthy and propagates an error. |
| MEMKV_CQ_BATCH_SPINS | u32 | 64 | client | Number of busy-poll iterations on the CQ before backing off to a sleep. Higher values trade CPU for latency. |
| MEMKV_CQ_POLL_DELAY_NS | u64 (ns) | 5000 | client | Sleep between poll batches once MEMKV_CQ_BATCH_SPINS is exhausted. 0 keeps the thread spinning indefinitely (only sane on a dedicated core). |
| MEMKV_MAX_BATCH_SIZE | usize | 64 | client | Ops per BatchRead / BatchWrite control message. Clamped to >= 1. See KV cache sizing. |
| MEMKV_STAGING_SIZE_MB | usize (MiB) | 256 | client | Total staging-buffer pool used for batched writes, per client. Raise if you see batch staging slot exhausted warnings under sustained burst load. With TP=N, the per-host commitment is N × this value. |
| MEMKV_STAGING_SLOT_MB | usize (MiB) | 16 | client | Per-payload slot size inside the staging pool. Cap on the largest single payload that can ride a batched write without falling back to direct. |
These are read once on first use and cached for the lifetime of the process — changing them at runtime has no effect.
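Before raising the staging variables it helps to work out what they commit per host. The sketch below assumes the pool divides evenly into fixed-size slots (pool size / slot size), which the table implies but does not state; `staging_plan` is a hypothetical helper.

```python
def staging_plan(staging_size_mb: int = 256, slot_mb: int = 16,
                 tensor_parallel: int = 1) -> dict:
    """Estimate slot count per client and staging memory per host."""
    return {
        # Assumption: the pool is carved into equal slots.
        "slots_per_client": staging_size_mb // slot_mb,
        # Documented: with TP=N the per-host commitment is N × pool size.
        "host_commitment_mb": tensor_parallel * staging_size_mb,
    }


print(staging_plan(tensor_parallel=8))
# {'slots_per_client': 16, 'host_commitment_mb': 2048}
```

With the defaults and TP=8, each client would have 16 slots and the host would commit 2 GiB to staging.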
Observability
Two client-side env vars enable optional telemetry exporters. They are also documented in Monitoring.
| Variable | Type | Default | Scope | Notes |
|---|---|---|---|---|
| MEMKV_PROMETHEUS_BIND | host:port | unset | client | Bind a Prometheus exposition endpoint inside the client. Off by default; useful inside the plugin host process when no separate exporter runs. |
| MEMKV_OTEL_ENDPOINT | URL | unset | client | OTLP endpoint for client-side traces and metrics. Off by default. |
Debug
| Variable | Type | Default | Scope | Notes |
|---|---|---|---|---|
| MEMKV_SLAB_DOUBLE_FREE_TRACE | 1 \| true \| TRUE \| yes | unset | server | Capture a backtrace on slab double-free events in the server's RDMA memory pool. Caps at 8 traces total to keep logs sane. Off by default — only enable when investigating a suspected double-free. |
CLI Reference
Complete reference for the memkv command-line interface — every subcommand, flag, default, and exit-code semantics.
RoCEv2 setup runbook
End-to-end setup for lossless RoCEv2 — switch PFC/ECN/DSCP, host DCB, jumbo-frame MTU, and the MemKV settings that match them — plus verification and troubleshooting.