Configuration

Reference for the server config (/etc/memkv/config.yaml) and the client config consumed by the NIXL plugin and LD_PRELOAD shim via MEMKV_CONFIG.

You rarely need to handcraft the server config — memkv setup detects drives and NICs and emits a working config.yaml to stdout; hand-tune from there. See Quick Start. The client config (near the bottom of this page) is a separate yaml read by the NIXL plugin and LD_PRELOAD shim.

Minimal config

A handful of fields vary between deployments; everything else has a sane default. Only override when you have a reason (covered below).

network:
  address: 0.0.0.0:9900 # data plane; admin auto-binds at port + 1

memory:
  block_size: 2 MiB
  max_size: 247 GiB # RDMA-pinned RAM (size to your box)

rdma:
  device: mlx5_0

storage:
  mode: direct # direct | file | memory
  block_size: 4 MiB
  drives:
    - { media: /dev/nvme0n1 }
    - { media: /dev/nvme1n1 }
    - { media: /dev/nvme2n1 }

That's the whole working config for a typical bare-metal node. The sections below explain each field, when the default is right, and what to think about before changing it.

Network

network:
  address: 0.0.0.0:9900
  # tls_cert: /etc/memkv/tls/fullchain.pem
  # tls_key:  /etc/memkv/tls/privkey.pem
  # auth_key: <64 hex chars, 32 bytes>

address — The single listen address for the server. The TCP listener carries the RDMA bootstrap (Connect) and, for clients without an RDMA NIC, the inline-bulk codes (TcpPut / TcpGet / TcpDelete + Exists); RDMA-capable clients move steady-state control onto RC SEND/RECV after bootstrap. The admin HTTP(S) endpoint binds at the same IP, port + 1 (so 0.0.0.0:9901 when address is 0.0.0.0:9900). Change the bind IP to scope memkv to a specific NIC on multi-homed hosts; change the port if 9900/9901 conflicts.
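The port + 1 rule can be written out for firewall or scrape-target generation. This is a minimal sketch; `admin_address` is a hypothetical helper, not part of MemKV, and it assumes a plain `host:port` string.

```python
def admin_address(data_address: str) -> str:
    """Derive the admin endpoint (same IP, port + 1) from network.address."""
    host, sep, port = data_address.rpartition(":")
    if not sep:
        raise ValueError(f"expected host:port, got {data_address!r}")
    return f"{host}:{int(port) + 1}"


print(admin_address("0.0.0.0:9900"))  # 0.0.0.0:9901
```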

tls_cert / tls_key — Set both to enable HTTPS on the admin endpoint. Use whenever the admin port is reachable beyond a trust boundary (cross-zone scrapes, ingress controllers). Setting only one is rejected at startup.

auth_key — HMAC-SHA256 shared key, 64 hex chars (32 bytes), used to authenticate every wire message between client and server. Required — the server refuses to start without one (loopback-only dev hosts can omit the key; the server derives a deterministic dev key from $HOME instead and logs a warning). Generate with openssl rand -hex 32. Can also be supplied via the MEMKV_AUTH_KEY env var (which overrides the yaml field if both are set); the Helm chart always sources it from a Kubernetes Secret. Clients must use the same key. Drift window between peers is ±60 s, so keep clocks in sync (chrony / ntpd).
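If openssl is not at hand, a key of the same shape can be produced with the Python standard library. The sketch below is illustrative: both helper names are hypothetical, and accepting upper-case hex digits is an assumption about the parser rather than documented behaviour.

```python
import re
import secrets

# Exactly 64 hex characters (32 bytes). Upper-case acceptance is an assumption.
_HEX64 = re.compile(r"[0-9a-fA-F]{64}")


def generate_auth_key() -> str:
    """Return 32 random bytes as 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def looks_like_auth_key(value: str) -> bool:
    """True if value has the 64-hex-char shape described above."""
    return _HEX64.fullmatch(value) is not None


key = generate_auth_key()
print(len(key), looks_like_auth_key(key))  # 64 True
```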

Memory

memory:
  block_size: 2 MiB
  max_size: 247 GiB

Sizes accept human units — B, KiB, MB, MiB, GB, GiB, TB, TiB, with optional whitespace.

max_size — The single most important tuning knob. This is RDMA-pinned host memory used as the staging area for all client I/O. Size it to ~75–85% of free RAM on the box, after subtracting OS, page cache, and other co-resident services. The default of 1 GiB is for development only — production deployments need hundreds of gigabytes.

block_size — Allocator slab inside the pool. Default 2 MiB matches the typical RDMA WR size and rarely needs changing. Raise only if you measure internal fragmentation hurting cache locality on very large objects. The implied block-count cap is max_size / block_size.
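As a worked example of the two rules above, the sketch below computes the 75–85% sizing band and the implied block-count cap. The 300 GiB free-RAM figure is a made-up input, and the helper names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def pool_size_range(free_ram_bytes: int) -> tuple[int, int]:
    """Return the 75% and 85% bounds of free RAM, in bytes."""
    return free_ram_bytes * 75 // 100, free_ram_bytes * 85 // 100


def block_count_cap(max_size_bytes: int, block_size_bytes: int) -> int:
    """Implied block-count cap: max_size / block_size."""
    return max_size_bytes // block_size_bytes


low, high = pool_size_range(300 * GIB)      # hypothetical box with 300 GiB free
print(low // GIB, high // GIB)              # 225 255
print(block_count_cap(247 * GIB, 2 * MIB))  # 126464
```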

RDMA

rdma:
  device: mlx5_0
  port: 1
  gid_index: 0 # 0 = auto-detect the routable RoCEv2 GID
  traffic_class: 0
  service_level: 0
  mtu: 4096
  cq_depth: 4096
  timeout: 14
  retry_count: 7
  num_dcis: 256
  dc_key: 0
  num_worker_threads: 64
  num_responder_threads: 16

These defaults are tuned for a typical LAN-attached RoCE/IB cluster. Most operators only set device.

device — RDMA NIC to bind to. mlx5_0 is the first Mellanox/NVIDIA HCA. Set this when you have multiple HCAs and want to pin memkv to a specific one (e.g., dedicating mlx5_0 to storage and mlx5_1 to a separate fabric). memkv setup auto-detects.

port — Physical port on the HCA. Leave at 1 unless you have a dual-port card (e.g., ConnectX-7) and want memkv on the second port. Port enumeration is per-HCA, not per-host.

gid_index — Selects which GID entry the QPs use. 0 auto-detects the routable RoCEv2 GID (the IPv4-mapped entry) from the NIC's GID table; if no RoCEv2 GID is found, it falls back to index 0 with a warning. This matters for lossless RoCE: the literal GID index 0 on mlx5 is the RoCEv1 (Ethernet) entry, which has no IP header and therefore no DSCP, so a DSCP-trust fabric leaves it on the best-effort queue regardless of traffic_class. Auto-detect avoids that trap. Set a non-zero value to pin a specific GID entry (mixed RoCEv1/v2 environments, or VLAN-tagged subnets).

traffic_class — RoCEv2 traffic class (the IP DSCP/ToS byte stamped on egress). 0 auto-detects it from the NIC's DCB config: MemKV reads the lossless PFC priority and the DSCP mapped to it, so RDMA lands in the same lossless+ECN queue your fabric configured. Set a non-zero value (the ToS byte, i.e. DSCP << 2 — e.g. 104 for DSCP 26) to pin it explicitly and skip detection. If detection finds nothing the value stays 0 (best-effort) and a warning is logged. Also settable via MEMKV_RDMA_TRAFFIC_CLASS.
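The DSCP-to-ToS conversion is a two-bit shift, because DSCP occupies the upper six bits of the ToS byte and ECN the lower two. A small sketch with hypothetical helper names:

```python
def tos_from_dscp(dscp: int) -> int:
    """Return the ToS byte to put in traffic_class for a given DSCP (0-63)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value (0-63)")
    return dscp << 2


def dscp_from_tos(tos: int) -> int:
    """Recover the DSCP from a ToS byte, discarding the two ECN bits."""
    return tos >> 2


print(tos_from_dscp(26))   # 104
print(dscp_from_tos(104))  # 26
```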

service_level — RoCEv2 service level (the AH SL, 0–7) stamped on egress alongside traffic_class. On mlx5 the egress scheduling class (which PFC queue the traffic uses, hence the lossless behaviour) follows the SL, not the DSCP — so the SL must match the lossless PFC priority or RDMA stays on the best-effort queue even with the right DSCP. 0 takes the priority auto-detected with traffic_class; set a non-zero value to pin it when traffic_class is also pinned. Also settable via MEMKV_RDMA_SERVICE_LEVEL.

mtu — RDMA path MTU. 4096 requires jumbo frames end-to-end (HCA + switch + peer). Drop to 1024 or 2048 only if your fabric can't carry 4K MTU — performance scales with MTU, so size up wherever the network allows. The value must not exceed the port's negotiated active MTU: a path MTU larger than the link drops every RDMA data packet, and MemKV refuses such a configuration at startup rather than stalling the data path. Set the same value on the client (MEMKV_RDMA_MTU) and the server, and configure the fabric to match — see the RoCEv2 setup runbook.
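The startup rule described above amounts to two comparisons. The sketch below is not MemKV's code: `check_path_mtu` is a hypothetical helper, and the list of allowed values is taken from the client field table.

```python
VALID_PATH_MTUS = (256, 512, 1024, 2048, 4096)


def check_path_mtu(configured: int, active: int) -> int:
    """Reject a path MTU that is not a legal value or exceeds the port's active MTU."""
    if configured not in VALID_PATH_MTUS:
        raise ValueError(f"mtu must be one of {VALID_PATH_MTUS}, got {configured}")
    if configured > active:
        raise ValueError(f"mtu {configured} exceeds the port's active MTU {active}")
    return configured


print(check_path_mtu(1024, 4096))  # 1024
```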

cq_depth — Completion queue size. Sized for num_dcis × inflight margin. Raise if you see "CQ overflow" in logs under burst load; otherwise leave alone.

timeout — Local ACK timeout per QP, expressed as 4.096µs × 2^timeout. 14 ≈ 67 ms — generous for LAN where lost packets are rare. Lower (e.g., 12 ≈ 17 ms) on ultra-low-latency fabrics for faster failure detection; raise on lossy or oversubscribed networks.
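To compare candidate values, the formula can be evaluated directly. A short sketch; `ack_timeout_ms` is a hypothetical helper name.

```python
def ack_timeout_ms(timeout: int) -> float:
    """Local ACK timeout in milliseconds: 4.096 us x 2^timeout."""
    return 4.096e-3 * (2 ** timeout)


for value in (12, 14, 16):
    print(value, round(ack_timeout_ms(value), 2))
```

With these inputs the loop shows roughly 16.78 ms, 67.11 ms and 268.44 ms, so each step of 1 doubles the wait.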

retry_count — Number of retransmissions before a connection is declared dead. 7 is the RDMA spec maximum. Drop to 3–4 if you want faster client failover; the default favors resilience over snappiness.

num_dcis — Size of the DC Initiator pool (mlx5 only). DCIs are acquired round-robin per client request. 256 saturates most NICs. Increase if you see DCI contention under very high concurrency.

dc_key — 24-bit shared key between server and client for DC fabric-level auth. 0 means no key check (open). Set to a deployment-specific value when you want fabric-level isolation between memkv clusters sharing the same RoCE network.

num_worker_threads — Parallelism for processing client writes/reads after RDMA arrival. Scale with available cores; if these threads peg CPU, raise — if they idle while QPs back up, the bottleneck is elsewhere.

num_responder_threads — Parallelism for posting response SENDs back to clients. Usually the smaller pool; raise only if SEND posting becomes a hot path.

Storage

storage:
  mode: direct # direct | file | memory
  block_size: 4 MiB
  flap_threshold: 5
  drives:
    - media: /dev/nvme0n1
      max_size: 2 TiB # cap usage to a subset of the physical drive
    - media: /dev/nvme1n1 # max_size omitted → use the full device
  # max_size: 16 GiB           # only used when mode: memory (no drives)

mode

  • direct (default): one block lives on one drive. Raw NVMe via O_DIRECT + io_uring. Fastest path, no parity overhead, no cross-drive redundancy. Linux only.
  • memory: RAM-only backend, no persistence. For development, benchmarks, and ephemeral caches. Capacity comes from the top-level storage.max_size.
  • file: mmap-backed regular files. No O_DIRECT, no NVMe ioctls — runs on any filesystem that supports mmap. Intended for single-node dev, CI, and the GPU-host co-located case where a spare NVMe isn't available. Bandwidth is bounded by the host filesystem; use direct for production. Each drives entry becomes one backing file, preallocated at max_size (default 4 GiB).

block_size — Object granularity on drive. 4 MiB balances metadata overhead (smaller = more index entries) against internal fragmentation (larger = more wasted space on sub-block writes). Drop to 1 MiB for many-small-objects workloads, raise to 8–16 MiB for streaming or large-object workloads.
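The trade-off can be made concrete with a little arithmetic. The sketch below assumes one index entry per block and that an object always occupies whole blocks; both are simplifications for illustration, not a description of the on-drive format.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def index_entries(capacity_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks, and so index entries, for a given capacity (assumed 1:1)."""
    return capacity_bytes // block_size_bytes


def wasted_bytes(object_bytes: int, block_size_bytes: int) -> int:
    """Unused tail space when an object is rounded up to whole blocks (assumption)."""
    blocks = -(-object_bytes // block_size_bytes)  # ceiling division
    return blocks * block_size_bytes - object_bytes


for size_mib in (1, 4, 16):
    print(size_mib, index_entries(2 * TIB, size_mib * MIB),
          wasted_bytes(5 * MIB, size_mib * MIB) // MIB)
```

For a 2 TiB drive this gives 2,097,152 entries at 1 MiB, 524,288 at 4 MiB and 131,072 at 16 MiB, while a 5 MiB object wastes 0, 3 and 11 MiB respectively.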

flap_threshold — Failures tolerated before a drive is blacklisted and removed from the rotation. 5 is a reasonable middle ground. Lower (1–2) for paranoid fast-fail in production; raise for known-flaky drives you want to keep in service while you swap them. Ignored in file and memory modes.

drives[].media — Path to the NVMe device (mode: direct) or backing file (mode: file). For direct, raw block devices (/dev/nvmeXn1) — drives are formatted on first use, don't include OS drives. For file, regular file paths — files are created on first open at max_size.

drives[].max_size — Cap on capacity used by this drive. Omit to use the full physical capacity for direct, or the 4 GiB default for file. Useful when you want memkv to use only a subset of a large drive.

storage.max_size — Top-level capacity for mode: memory. Sizes the in-RAM keyspace. Ignored for direct and file.

Logging

logging:
  level: info # trace | debug | info | warn | error
  format: json # json | text

level — info is right for production. Drop to warn or error if log volume is a concern; raise to debug or trace only when debugging — trace is very chatty.

format — json for log shippers (Loki, ELK, Datadog). text for tailing locally during development.

Client configuration (MEMKV_CONFIG)

The NIXL plugin (libplugin_MEMKV.so) and the LD_PRELOAD shim (libmemkv_preload.so) load their settings from memkv-client. The canonical layout is a yaml file referenced by the MEMKV_CONFIG env var, with MEMKV_* env vars layered on top as overrides.

Resolution order, lowest to highest priority:

  1. Built-in defaults (one local server, one rail, 256 MB cache, 8 conns)
  2. MEMKV_CONFIG yaml file, if the env var is set
  3. MEMKV_* env vars

# /etc/memkv/client.yaml — pointed at by `export MEMKV_CONFIG=...`
servers:
  - 10.0.0.1:9900
  - 10.0.0.2:9900
rdma_devices:
  - mlx5_0
  - mlx5_1
bind_addresses: # optional, one per server slot; ~ = bind any
  - 192.168.1.10
  - 192.168.1.11
cache_size_mb: 256
connect_timeout_ms: 5000
gid_index: 0
num_connections: 8
license: /etc/memkv/minio.license
auth_key: <64 hex chars, 32 bytes>
# transport: auto       # rdma | tcp; default auto

The client uses the configured server address verbatim — the TCP data port carries the entire wire protocol (control, RDMA bootstrap, and inline-bulk codes for hosts without RDMA). There's no separate transport target; the client picks how to use it.

A complete reference copy ships at deploy/examples/client-config.yaml.
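The three-layer resolution order described above can be sketched as a plain dictionary merge. This is an illustration, not the memkv-client loader: `resolve` is a hypothetical function, the yaml layer is passed in already parsed, only three env vars are handled, and a comma-separated MEMKV_SERVERS is an assumption for the client side.

```python
DEFAULTS = {
    "servers": ["127.0.0.1:9900"],
    "rdma_devices": ["mlx5_0"],
    "bind_addresses": [None],
    "cache_size_mb": 256,
    "num_connections": 8,
}


def resolve(yaml_cfg: dict, env: dict) -> dict:
    """Layer defaults, then the yaml file, then MEMKV_* env vars (env wins)."""
    cfg = dict(DEFAULTS)
    cfg.update(yaml_cfg)
    if "MEMKV_SERVERS" in env:
        cfg["servers"] = env["MEMKV_SERVERS"].split(",")  # assumed separator
        # Overriding servers resets bind_addresses to per-server None.
        cfg["bind_addresses"] = [None] * len(cfg["servers"])
    if "MEMKV_CACHE_SIZE_MB" in env:
        cfg["cache_size_mb"] = int(env["MEMKV_CACHE_SIZE_MB"])
    if "MEMKV_NUM_CONNECTIONS" in env:
        cfg["num_connections"] = max(1, int(env["MEMKV_NUM_CONNECTIONS"]))
    return cfg


cfg = resolve(
    {"servers": ["10.0.0.1:9900", "10.0.0.2:9900"], "cache_size_mb": 512},
    {"MEMKV_SERVERS": "10.0.0.3:9900"},
)
print(cfg["servers"], cfg["bind_addresses"], cfg["cache_size_mb"])
```

In practice `env` would be `os.environ`; passing it explicitly keeps the sketch easy to test.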

Field-by-field

| Field | Type | Default | Override env var | Notes |
|---|---|---|---|---|
| servers | [host:port, …] | 127.0.0.1:9900 | MEMKV_SERVERS | Data-plane endpoints. Setting MEMKV_SERVERS resets bind_addresses to per-server None; set MEMKV_BIND_ADDRESSES (separately) to pin source IPs after the override. |
| rdma_devices | [device, …] | mlx5_0 | MEMKV_RDMA_DEVICES / MEMKV_RDMA_DEVICE | One device per server slot; MEMKV_RDMA_DEVICE (singular) is the single-rail alias. |
| bind_addresses | [ip-or-null, …] | [~] | MEMKV_BIND_ADDRESSES | Source IP per server. ~ (yaml null) or empty env entry means "bind any". Length must match servers. |
| cache_size_mb | usize | 256 | MEMKV_CACHE_SIZE_MB | Per-engine MR cache. |
| connect_timeout_ms | u64 | 5000 | MEMKV_CONNECT_TIMEOUT_MS | Initial TCP connect timeout per server. |
| gid_index | u8 | 0 | | RDMA GID index. No env var; set in yaml or accept the default. |
| traffic_class | u8 | 0 | MEMKV_RDMA_TRAFFIC_CLASS | RoCEv2 traffic class (ToS byte). 0 auto-detects the lossless DSCP from the NIC's DCB config; non-zero pins it (DSCP << 2, e.g. 104 for DSCP 26). |
| service_level | u8 | 0 | MEMKV_RDMA_SERVICE_LEVEL | RoCEv2 service level (AH SL). On mlx5 the egress scheduling class follows the SL, so it must match the lossless priority. 0 takes the priority auto-detected with traffic_class. |
| mtu | u32 | 4096 | MEMKV_RDMA_MTU | RDMA path MTU (256/512/1024/2048/4096). Must match the server's rdma.mtu and not exceed the port's active MTU; startup is refused otherwise. See the RoCEv2 runbook. |
| num_connections | usize | 8 | MEMKV_NUM_CONNECTIONS | Per-rail connection-pool size, clamped to >= 1. |
| license | string (JWT or path) | | MEMKV_LICENSE | Highest-priority license source. Inline JWT or path to a license file. See License. |
| auth_key | string (hex) | | MEMKV_AUTH_KEY | Required. 64 hex chars (32 bytes). Must match the server's network.auth_key. See Authentication. |
| transport | auto, rdma, or tcp | auto (Linux) / tcp (macOS) | MEMKV_TRANSPORT | Which transport the client uses. Defaults to tcp on macOS / non-Linux because the RDMA stack isn't compiled in there. See Transport selection. |

Transport selection

memkv-client speaks two data transports — RDMA and TCP — and both are fully supported. RDMA delivers the lowest latency where an HCA and a clean fabric are available; TCP is a first-class high-throughput path, not a degraded shim, for the many environments where RDMA isn't viable (routed or multi-hop networks, cloud, mixed NICs). transport controls which one the client uses:

  • auto (default): try RDMA on the configured rdma_devices first, fall back to TCP if no RDMA NIC is present at boot or any in-flight RDMA op fails. The fallback latches once — subsequent requests go straight to TCP. macOS / hosts without an HCA land in TCP from startup. The downgrade is logged at WARN.
  • rdma: strict. The client errors at startup if rdma_devices is empty or the build is non-Linux (macOS / arm64 dev box). In-flight RDMA failures propagate instead of silently switching transports — you get a clear error pointing at the misconfiguration. Use this in production where RDMA must be the actual data path.
  • tcp: skip RDMA entirely; no QP setup. The configured server address is the TCP target directly. To sustain throughput on fast links the client opens num_connections sockets per server and pipelines batched reads and writes across them, so TCP is a production-viable transport — not only for co-located dev hosts (e.g. Mac Mini) but for any deployment where RDMA can't reach end to end.
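The three modes above behave like a small state machine with a one-way latch. The sketch below models that behaviour only; `TransportSelector` is a hypothetical class, not the memkv-client API.

```python
class TransportSelector:
    """Models auto / rdma / tcp selection with a one-way fallback latch."""

    def __init__(self, mode: str, rdma_available: bool):
        if mode not in ("auto", "rdma", "tcp"):
            raise ValueError(f"unknown transport {mode!r}")
        if mode == "rdma" and not rdma_available:
            # Strict mode errors at startup instead of downgrading.
            raise RuntimeError("transport=rdma but no usable RDMA device")
        self.mode = mode
        self.active = "rdma" if mode != "tcp" and rdma_available else "tcp"

    def on_rdma_failure(self, err: Exception) -> None:
        """Handle a failed in-flight RDMA op according to the configured mode."""
        if self.mode == "rdma":
            raise err              # strict: propagate, never switch
        self.active = "tcp"        # auto: latch to TCP for all later requests


selector = TransportSelector("auto", rdma_available=True)
selector.on_rdma_failure(RuntimeError("QP error"))
print(selector.active)  # tcp
```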

The server's TCP listener carries the RDMA bootstrap (Connect) and, for clients without an RDMA NIC, the inline-bulk codes (TcpPut / TcpGet / TcpDelete + Exists). RDMA-capable clients move every other control message onto RC SEND/RECV after bootstrap. Transport choice is the client's decision.

Port convention. The server binds the data port as TCP; clients use the configured host:port directly. The admin endpoint auto-binds at port + 1 with the same host IP — no separate knob.

Authentication

Every wire message between client and server is signed with HMAC-SHA256 over (header || ts_ns) and verified on receipt. The shared key is the auth_key field above (or the MEMKV_AUTH_KEY env override). Drift window is ±60 s — keep clocks in sync with chrony / ntpd. There is no unauthenticated mode; the client refuses to start without a key, and the server refuses to start without one either (loopback-only deployments are an exception: both sides derive a deterministic dev key from $HOME).

Generate a key with openssl rand -hex 32. The same value goes on both sides — server (network.auth_key in /etc/memkv/config.yaml or MEMKV_AUTH_KEY) and every client (yaml auth_key: here, or MEMKV_AUTH_KEY env). Keys never appear in logs.
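The signing scheme can be illustrated with the standard hmac module. This is a sketch under stated assumptions: the real header layout and the byte encoding of ts_ns are not specified on this page, so the 8-byte big-endian timestamp and the function names below are hypothetical.

```python
import hashlib
import hmac
import struct

DRIFT_NS = 60 * 1_000_000_000  # +/- 60 s drift window


def sign(key: bytes, header: bytes, ts_ns: int) -> bytes:
    """HMAC-SHA256 over header || ts_ns (timestamp encoding is an assumption)."""
    return hmac.new(key, header + struct.pack(">Q", ts_ns), hashlib.sha256).digest()


def verify(key: bytes, header: bytes, ts_ns: int, tag: bytes, now_ns: int) -> bool:
    """Reject messages outside the drift window, then compare tags in constant time."""
    if abs(now_ns - ts_ns) > DRIFT_NS:
        return False
    return hmac.compare_digest(sign(key, header, ts_ns), tag)


key = bytes.fromhex("ab" * 32)        # placeholder key, never use in production
ts = 1_700_000_000 * 1_000_000_000
tag = sign(key, b"example-header", ts)
print(verify(key, b"example-header", ts, tag, now_ns=ts + 5 * 1_000_000_000))
```

The drift check is why clock sync matters: a correctly signed message still fails once the peers' clocks differ by more than 60 s.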

License

memkv-client verifies a license at startup. The lookup chain is the yaml license: field → MEMKV_LICENSE → AISTOR_LICENSE → MINIO_LICENSE → ./minio.license → $HOME/.<binary_name>/minio.license.
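A first-match-wins walk over that chain might look like the sketch below. `find_license` is a hypothetical helper; it only locates a candidate and does not parse or verify the JWT.

```python
from pathlib import Path
from typing import Optional


def find_license(yaml_license: Optional[str], env: dict,
                 cwd: Path, home: Path, binary_name: str) -> Optional[str]:
    """Return the first license source in the lookup chain, or None."""
    # 1-4: yaml field, then the three env vars, in priority order.
    for value in (yaml_license, env.get("MEMKV_LICENSE"),
                  env.get("AISTOR_LICENSE"), env.get("MINIO_LICENSE")):
        if value:
            return value
    # 5-6: well-known file locations.
    for path in (cwd / "minio.license", home / f".{binary_name}" / "minio.license"):
        if path.is_file():
            return str(path)
    return None


print(find_license(None, {"MINIO_LICENSE": "/etc/memkv/minio.license"},
                   Path("."), Path.home(), "memkv"))
```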

Accepted plans: Free, Enterprise, EnterpriseLite, EnterprisePlus.

Getting a license. All MemKV licenses — Free and paid — are issued by MinIO SUBNET. Visit min.io/pricing to request a Free-tier license for evaluation or non-production use, or to start an Enterprise / EnterpriseLite / EnterprisePlus conversation with the MinIO team. The license file you receive is a JWT; point license: (or MEMKV_LICENSE) at either the inline token or the file path.

Free tier limits. The engine refuses to start under a Free license outside these bounds:

  • A single entry in servers. Scale-out across servers requires Enterprise / EnterpriseLite / EnterprisePlus.
  • A single entry in storage.drives. memory mode is exempt (no drives semantics).
  • ≤ 32 TiB total capacity. For file and memory modes this is checked against the configured drives[].max_size / storage.max_size at boot. For direct it's checked against the physical NVMe capacity once the drives are opened. Either failure surfaces as a clear startup error.

Environment variables

A flat index of every MEMKV_* variable. Client resolution is defaults → MEMKV_CONFIG yaml → env vars (env wins). The server reads /etc/memkv/config.yaml plus the two overrides below.

Server

| Variable | Type | Default | Notes |
|---|---|---|---|
| MEMKV_AUTH_KEY | hex (64 chars) | | Overrides network.auth_key from yaml. Loopback-only dev hosts may omit it; the server then derives a deterministic dev key from $HOME and warns. |
| MEMKV_LICENSE | JWT or path | | Server license-lookup chain. See License. Free tier is limited to a single remote server. |

AISTOR_LICENSE is consulted next, then MINIO_LICENSE for compatibility with the broader MinIO stack — both are non-memkv-specific.

LD_PRELOAD shim

The client variables (MEMKV_SERVERS, MEMKV_AUTH_KEY, MEMKV_TRANSPORT, MEMKV_LICENSE, MEMKV_RDMA_DEVICES, MEMKV_RDMA_DEVICE, MEMKV_BIND_ADDRESSES, MEMKV_CACHE_SIZE_MB, MEMKV_CONNECT_TIMEOUT_MS, MEMKV_NUM_CONNECTIONS, MEMKV_RDMA_TRAFFIC_CLASS, MEMKV_RDMA_SERVICE_LEVEL, MEMKV_RDMA_MTU) and their yaml counterparts live in the Client configuration table above. One extra variable applies only to the LD_PRELOAD shim:

| Variable | Type | Default | Notes |
|---|---|---|---|
| MEMKV_CACHE_DIR_PREFIX | path | unset | LD_PRELOAD shim only. File I/O at or under this path is rerouted into a memkv cluster instead of hitting the local filesystem. With the var unset (or empty) the shim is a no-op. Match is exact or strictly under (no /foo matching /foo-other). |
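The "exact or strictly under" rule is a path-component match rather than a string-prefix match. A minimal sketch follows; `is_rerouted` is a hypothetical name, and normalising paths before comparing is an assumption about the shim.

```python
import posixpath


def is_rerouted(path: str, prefix: str) -> bool:
    """True if path equals the prefix or lies strictly under it."""
    if not prefix:                      # unset or empty: the shim is a no-op
        return False
    candidate = posixpath.normpath(path)
    root = posixpath.normpath(prefix)
    # Appending "/" stops /foo from matching /foo-other.
    return candidate == root or candidate.startswith(root.rstrip("/") + "/")


print(is_rerouted("/foo/bar", "/foo"))    # True
print(is_rerouted("/foo-other", "/foo"))  # False
```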

memkv admin

| Variable | Type | Default | Notes |
|---|---|---|---|
| MEMKV_SERVERS | host:port[,host:port,…] | http://127.0.0.1:9901 | Admin endpoints, comma-separated. Falls back to the default when neither --servers nor the env is set. See CLI reference. |

Advanced tuning

Defaults are tuned for ConnectX-7 / mlx5 hardware on Linux and are the right answer for almost everyone. Override these only when you have measured a specific bottleneck — bad values cause silent latency or throughput regressions, not loud errors.

| Variable | Type | Default | Scope | Notes |
|---|---|---|---|---|
| MEMKV_RDMA_TIMEOUT_SECS | u64 (sec) | 30 | client | Per-op timeout for RDMA send-CQ waits. Hitting this marks the connection unhealthy and propagates an error. |
| MEMKV_CQ_BATCH_SPINS | u32 | 64 | client | Number of busy-poll iterations on the CQ before backing off to a sleep. Higher values trade CPU for latency. |
| MEMKV_CQ_POLL_DELAY_NS | u64 (ns) | 5000 | client | Sleep between poll batches once MEMKV_CQ_BATCH_SPINS is exhausted. 0 keeps the thread spinning indefinitely (only sane on a dedicated core). |
| MEMKV_MAX_BATCH_SIZE | usize | 64 | client | Ops per BatchRead / BatchWrite control message. Clamped to >= 1. See KV cache sizing. |
| MEMKV_STAGING_SIZE_MB | usize (MiB) | 256 | client | Total staging-buffer pool used for batched writes, per client. Raise if you see batch staging slot exhausted warnings under sustained burst load. With TP=N, the per-host commitment is N × this value. |
| MEMKV_STAGING_SLOT_MB | usize (MiB) | 16 | client | Per-payload slot size inside the staging pool. Cap on the largest single payload that can ride a batched write without falling back to direct. |

These are read once on first use and cached for the lifetime of the process — changing them at runtime has no effect.
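For the two staging variables, a quick calculation shows what the defaults imply. The sketch assumes the pool is carved into equal-sized slots, which this page does not state outright, and uses TP=8 as a made-up example; the helper names are hypothetical.

```python
def staging_slots(pool_mb: int = 256, slot_mb: int = 16) -> int:
    """Slots available per client, assuming the pool is split into equal slots."""
    return pool_mb // slot_mb


def per_host_commitment_mb(tp: int, pool_mb: int = 256) -> int:
    """Staging memory committed on one host running tp clients (N x pool size)."""
    return tp * pool_mb


print(staging_slots())            # 16
print(per_host_commitment_mb(8))  # 2048
```

Under that assumption, doubling MEMKV_STAGING_SLOT_MB without raising MEMKV_STAGING_SIZE_MB halves the number of payloads that can be staged at once.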

Observability

Two client-side env vars enable optional telemetry exporters. They are also documented in Monitoring.

| Variable | Type | Default | Scope | Notes |
|---|---|---|---|---|
| MEMKV_PROMETHEUS_BIND | host:port | unset | client | Bind a Prometheus exposition endpoint inside the client. Off by default; useful inside the plugin host process when no separate exporter runs. |
| MEMKV_OTEL_ENDPOINT | URL | unset | client | OTLP endpoint for client-side traces and metrics. Off by default. |

Debug

| Variable | Type | Default | Scope | Notes |
|---|---|---|---|---|
| MEMKV_SLAB_DOUBLE_FREE_TRACE | 1, true, TRUE, or yes | unset | server | Capture a backtrace on slab double-free events in the server's RDMA memory pool. Caps at 8 traces total to keep logs sane. Off by default; only enable when investigating a suspected double-free. |