Monitoring

Health endpoints, Prometheus metrics, status API, and Kubernetes probes for MemKV.

MemKV exposes health endpoints and Prometheus metrics on the admin HTTP endpoint (default: 0.0.0.0:9901). The client library ships separate exporters that are off by default.

Client-side exporters

memkv-client (the NIXL plugin and LD_PRELOAD shim) emits its own metrics — request latencies, transport state, batch sizing. Pick a sink by setting one of these env vars; with neither set, observability is disabled and no work is done in the hot path.

| Variable | Type | Default | Notes |
| --- | --- | --- | --- |
| MEMKV_OTEL_ENDPOINT | URL | unset | OTLP/gRPC collector to ship spans + metrics to (e.g. http://otel-collector:4317). Falls back to OTEL_EXPORTER_OTLP_ENDPOINT. Requires the client built with the otel feature. |
| MEMKV_PROMETHEUS_BIND | host:port | unset | Bind address for an in-process Prometheus scrape endpoint at /metrics. Use when the client runs inside a long-lived process you can scrape directly. Requires the prometheus feature. |

If both are set, the OTLP path wins. The exporter is installed once per process; subsequent installation attempts are no-ops.
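
The precedence described above can be sketched as a small resolver. This is an illustration of the documented rules, not memkv-client's actual code: the function name `resolve_sink` is hypothetical, and treating an empty value as unset, and letting OTEL_EXPORTER_OTLP_ENDPOINT alone enable the OTLP sink, are assumptions.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_sink(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (sink, target) for the exporter this environment selects, or None."""
    # Assumption: an empty string counts as unset, and the generic
    # OTEL_EXPORTER_OTLP_ENDPOINT fallback is enough to enable OTLP.
    otel = env.get("MEMKV_OTEL_ENDPOINT") or env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if otel:
        return ("otlp", otel)  # OTLP wins when both sinks are configured
    bind = env.get("MEMKV_PROMETHEUS_BIND")
    if bind:
        return ("prometheus", bind)
    return None  # neither set: observability stays disabled


if __name__ == "__main__":
    print(resolve_sink(os.environ))
```

Running a check like this in a deployment's entrypoint is one way to confirm which sink a given pod spec would select before the client starts.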

Health Endpoints

All admin endpoints are versioned under /v1/.

| Endpoint | Purpose | Response |
| --- | --- | --- |
| GET /v1/health | Liveness probe | JSON {status, rdma_active, drives_online, drives_offline, drives_total} (macOS: {status} only) |
| GET /v1/ready | Readiness probe | ok (200) |
| GET /v1/status | Detailed status | JSON with memory, storage, RDMA stats |
| GET /v1/metrics | Prometheus metrics | text/plain Prometheus format |

curl http://localhost:9901/v1/health
# {"status":"healthy","rdma_active":true,"drives_online":12,"drives_offline":0,"drives_total":12}

The status field is one of healthy, degraded, or unhealthy. The HTTP status code mirrors the body: 200 for healthy, 503 for unhealthy.
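
A script that consumes /v1/health can key off the body rather than the HTTP code alone. The sketch below is a minimal example under stated assumptions: `interpret_health` and its alert rule are hypothetical, and the optional fields are read defensively because the macOS build returns {status} only.

```python
import json


def interpret_health(http_status: int, body: str) -> dict:
    """Summarize a /v1/health response for an alerting script."""
    doc = json.loads(body)
    status = doc.get("status", "unknown")
    summary = {
        "status": status,
        "http_ok": http_status == 200,
        # These fields are absent on macOS, where the body is {status} only.
        "rdma_active": doc.get("rdma_active"),
        "drives_offline": doc.get("drives_offline"),
    }
    # Hypothetical alert rule: flag anything other than "healthy",
    # and any offline drive even if the node still reports healthy.
    summary["alert"] = status != "healthy" or bool(doc.get("drives_offline"))
    return summary


sample = ('{"status":"healthy","rdma_active":true,"drives_online":12,'
          '"drives_offline":0,"drives_total":12}')
print(interpret_health(200, sample))
```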

memkv admin status --json mirrors the full /v1/status shape below, including the storage capacity totals (raw_bytes_total, usable_bytes_total, used_bytes_total) and the per-device byte fields. The plain memkv admin status table surfaces storage fill as used / usable plus a percentage.

Status Endpoint

curl http://localhost:9901/v1/status
{
  "memory": {
    "used_bytes": 1073741824,
    "total_bytes": 265289728000,
    "slabs_allocated": 512,
    "slabs_total": 126464
  },
  "storage": {
    "num_devices": 24,
    "block_size_bytes": 4194304,
    "total_blocks": 1000,
    "raw_bytes_total": 384000000000000,
    "usable_bytes_total": 383979687936000,
    "used_bytes_total": 176160768,
    "devices": [
      {
        "device_id": 0,
        "path": "/dev/nvme0n1",
        "total_slots": 3815461,
        "allocated_slots": 42,
        "raw_capacity_bytes": 16000000000000,
        "usable_bytes": 15999153635328,
        "metadata_bytes": 846364672,
        "used_bytes": 176160768
      }
    ]
  },
  "rdma": {
    "active_connections": 8
  },
  "blocks_total": 1000,
  "blocks_persisted": 950,
  "uptime_seconds": 3612
}
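
The fill figures that the plain status table reports can be derived from this JSON. The following is a sketch, not the CLI's implementation: the helper names are hypothetical, the input is a trimmed copy of the example response, and the two per-device identities (used = allocated_slots * block_size, usable = raw minus metadata) are taken from the metric descriptions on this page.

```python
import json

# Trimmed copy of the example response above (one device shown).
STATUS_JSON = """
{
  "memory": {"used_bytes": 1073741824, "total_bytes": 265289728000,
             "slabs_allocated": 512, "slabs_total": 126464},
  "storage": {
    "block_size_bytes": 4194304,
    "usable_bytes_total": 383979687936000,
    "used_bytes_total": 176160768,
    "devices": [{"device_id": 0, "allocated_slots": 42,
                 "raw_capacity_bytes": 16000000000000,
                 "usable_bytes": 15999153635328,
                 "metadata_bytes": 846364672,
                 "used_bytes": 176160768}]
  },
  "blocks_total": 1000,
  "blocks_persisted": 950
}
"""


def ratio(part, whole):
    """Fraction part/whole, or 0.0 when the denominator is zero."""
    return part / whole if whole else 0.0


def utilization(status):
    """Fill fractions for memory and storage, plus the persisted share of blocks."""
    mem, sto = status["memory"], status["storage"]
    return {
        "memory_fill": ratio(mem["used_bytes"], mem["total_bytes"]),
        "storage_fill": ratio(sto["used_bytes_total"], sto["usable_bytes_total"]),
        "persisted": ratio(status["blocks_persisted"], status["blocks_total"]),
    }


def device_is_consistent(dev, block_size):
    """Check the two per-device byte identities described on this page."""
    used_ok = dev["used_bytes"] == dev["allocated_slots"] * block_size
    usable_ok = dev["usable_bytes"] == dev["raw_capacity_bytes"] - dev["metadata_bytes"]
    return used_ok and usable_ok


status = json.loads(STATUS_JSON)
report = utilization(status)
print({name: f"{value:.2%}" for name, value in report.items()})
```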

Prometheus Metrics

Request Operations

| Metric | Type | Description |
| --- | --- | --- |
| memkv_allocate_total | Counter | Total allocate requests |
| memkv_lookup_total | Counter | Total lookup requests |
| memkv_delete_total | Counter | Total delete requests |
| memkv_commit_total | Counter | Total commit requests |
| memkv_read_total | Counter | Total direct-IO read requests |
| memkv_read_bytes_total | Counter | Total bytes served on direct-IO reads |
| memkv_write_total | Counter | Total direct-IO write requests |
| memkv_write_bytes_total | Counter | Total bytes persisted on direct-IO writes |
| memkv_connect_total | Counter | Total RDMA connect requests |
| memkv_exists_total | Counter | Total exists requests |
| memkv_not_found_total | Counter | Total not-found responses |
| memkv_no_space_total | Counter | Total no-space responses |
| memkv_invalid_requests_total | Counter | Total invalid/malformed requests |
| memkv_request_latency_us | Histogram | End-to-end request latency |
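
Counters only ever increase, so they are most useful as rates between two scrapes. The sketch below parses unlabelled samples from the Prometheus text format and computes per-second rates, treating a decrease as a counter reset. The helper names and the sample values are made up for illustration; in practice a Prometheus server does this with rate().

```python
def parse_samples(text: str) -> dict:
    """Parse unlabelled 'name value' lines from Prometheus text exposition."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.partition(" ")
        if "{" in name:
            continue  # labelled series are out of scope for this sketch
        samples[name] = float(value.split()[0])  # ignore optional timestamp
    return samples


def per_second(prev: dict, curr: dict, interval_s: float) -> dict:
    """Per-second rate for every *_total counter present in both scrapes."""
    rates = {}
    for name, now in curr.items():
        if not name.endswith("_total") or name not in prev:
            continue
        delta = now - prev[name]
        if delta < 0:
            delta = now  # counter reset (e.g. server restart): count from zero
        rates[name] = delta / interval_s
    return rates


# Two hypothetical scrapes taken 15 seconds apart.
first = parse_samples("memkv_lookup_total 100\nmemkv_not_found_total 10\n")
second = parse_samples("memkv_lookup_total 400\nmemkv_not_found_total 25\n")
print(per_second(first, second, 15.0))
```

Dividing the memkv_not_found_total rate by the memkv_lookup_total rate gives a miss fraction for the same window.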

RDMA Transport

| Metric | Type | Description |
| --- | --- | --- |
| memkv_rdma_active | Gauge | 1 if the RDMA data path is active, 0 if it has fallen back to local-only |
| memkv_rdma_connections_active | Gauge | Active RDMA connections |
| memkv_rdma_read_latency_us | Histogram | RDMA read operation latency |
| memkv_rdma_write_latency_us | Histogram | RDMA write operation latency |
| memkv_rdma_errors_total | Counter | RDMA transport errors (CQ/post failures) |
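
Latency histograms such as memkv_rdma_read_latency_us are usually read as quantiles. Assuming they are exposed as standard Prometheus cumulative buckets (series ending in _bucket with an le label), a quantile can be estimated by linear interpolation inside the bucket that contains the target rank, which is the same idea PromQL's histogram_quantile() uses. The function name and the bucket bounds below are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The list must be non-empty and its last bucket should be the +Inf bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper) or in_bucket == 0:
                return lower_bound  # cannot interpolate here
            # Assume observations are spread evenly within the bucket.
            return lower_bound + (upper - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper, count
    return lower_bound


# Hypothetical cumulative buckets, bounds in microseconds.
example = [(10.0, 50), (50.0, 90), (100.0, 100), (math.inf, 100)]
print(estimate_quantile(0.5, example), estimate_quantile(0.99, example))
```

The estimate is only as fine as the bucket layout, so a p99 that lands in a wide bucket should be read as a range rather than a point value.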

Storage Devices

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| memkv_device_slots_used | Gauge | device | Allocated slots per device |
| memkv_device_capacity_raw_bytes | Gauge | device | Raw NVMe capacity per device in bytes |
| memkv_device_capacity_usable_bytes | Gauge | device | Per-device bytes available for user data (raw minus MemKV metadata regions) |
| memkv_device_bytes_used | Gauge | device | Per-device bytes currently allocated to blocks (allocated_slots * block_size) |
| memkv_storage_raw_bytes | Gauge | | Sum of raw NVMe capacity across all devices |
| memkv_storage_usable_bytes | Gauge | | Sum of usable capacity (data region only) across all devices |
| memkv_storage_used_bytes | Gauge | | Sum of bytes allocated to blocks across all devices |
| memkv_trim_extents_total | Counter | device | Extents trimmed per device |
| memkv_trim_bytes_total | Counter | device | Bytes trimmed per device |
| memkv_trim_latency_us | Histogram | device | TRIM batch latency |
| memkv_trim_errors_total | Counter | device | TRIM errors per device |
| memkv_trim_pending | Gauge | | Pending TRIM requests |
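
Per-device fill is the ratio of memkv_device_bytes_used to memkv_device_capacity_usable_bytes for the same device label. The sketch below computes it from scrape text. It assumes device is the only label on these series and that its values look like "0" and "1"; both are guesses for illustration, and the numbers are made up.

```python
import re

# Matches lines of the form: name{device="X"} value
SAMPLE_RE = re.compile(r'^(\w+)\{device="([^"]*)"\}\s+(\S+)')


def device_fill(text: str) -> dict:
    """Map each device label to used / usable bytes."""
    used, usable = {}, {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, unlabelled series, other label sets
        name, device, value = match.group(1), match.group(2), float(match.group(3))
        if name == "memkv_device_bytes_used":
            used[device] = value
        elif name == "memkv_device_capacity_usable_bytes":
            usable[device] = value
    # Skip devices with missing or zero usable capacity.
    return {dev: used[dev] / usable[dev] for dev in used if usable.get(dev)}


scrape = """\
# TYPE memkv_device_bytes_used gauge
memkv_device_bytes_used{device="0"} 50
memkv_device_bytes_used{device="1"} 150
memkv_device_capacity_usable_bytes{device="0"} 200
memkv_device_capacity_usable_bytes{device="1"} 200
memkv_storage_used_bytes 200
"""
print(device_fill(scrape))
```

Comparing the per-device values against each other is a quick way to spot uneven placement that the aggregate memkv_storage_used_bytes gauge would hide.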

Memory and Blocks

| Metric | Type | Description |
| --- | --- | --- |
| memkv_blocks_total | Gauge | Total blocks in index |
| memkv_blocks_persisted | Gauge | Blocks persisted to storage |
| memkv_blocks_by_state | Gauge | Block count by state |
| memkv_slabs_allocated | Gauge | Allocated memory slabs |
| memkv_slabs_total | Gauge | Total memory slabs |
| memkv_memory_used_bytes | Gauge | Memory used for blocks |
| memkv_memory_total_bytes | Gauge | Total memory available |
| memkv_uptime_seconds | Gauge | Server uptime in seconds (since process start) |

Admin API

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| memkv_admin_requests_total | Counter | path | Admin endpoint requests by path |

Storage Pipeline (hot-path tracing)

Fine-grained timings inside the write pipeline — useful for diagnosing sustained-write behaviour under load.

| Metric | Type | Description |
| --- | --- | --- |
| memkv_persist_us | Histogram | Direct-write end-to-end wall time (request entry → persist return) |
| memkv_persist_pre_wait_us | Histogram | Entry to just before the writer-worker handoff (extent lock, journal Alloc, …) |
| memkv_writer_wait_us | Histogram | Wait time on the io_uring writer-worker channel |
| memkv_persist_post_wait_us | Histogram | After writer-worker completion to persist return (journal Commit, dirty queue) |
| memkv_iouring_write_us | Histogram | io_uring write submit-to-completion latency |
| memkv_iouring_fsync_us | Histogram | io_uring fsync submit-to-completion latency |
| memkv_iouring_fsync_batch_size | Histogram | Data writes drained per fsync completion |
| memkv_allocate_duration_us | Histogram | FreeList slot allocate time |
| memkv_journal_flush_duration_us | Histogram | Journal buffer flush (fdatasync) time |
| memkv_journal_pending_entries | Gauge | Un-flushed journal entries across devices |
| memkv_dirty_queue_depth | Gauge | Pending entries in per-device dirty queues |
| memkv_btree_cache_hits_total | Counter | BTree leaf cache hits |
| memkv_btree_cache_misses_total | Counter | BTree leaf cache misses |
| memkv_btree_leaf_miss_us | Histogram | BTree leaf miss-path resolution time |
| memkv_freelist_cas_retries_total | Counter | FreeList head CAS retry count |
| memkv_slab_double_free_total | Counter | Bounce-buffer slabs freed more than once |
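
One way to use these timings is to ask what share of the mean persist time each stage accounts for. The sketch below assumes the histograms expose the standard Prometheus _sum and _count series, and that the three wait stages are roughly sequential segments of memkv_persist_us; neither is confirmed here, so the remainder is reported as "unaccounted" rather than forced to zero. Helper names and sample values are hypothetical.

```python
STAGES = (
    "memkv_persist_pre_wait_us",
    "memkv_writer_wait_us",
    "memkv_persist_post_wait_us",
)


def histogram_mean(samples: dict, name: str) -> float:
    """Mean observation of a histogram from its _sum and _count series."""
    count = samples.get(name + "_count", 0.0)
    return samples.get(name + "_sum", 0.0) / count if count else 0.0


def persist_breakdown(samples: dict) -> dict:
    """Share of the mean persist time spent in each stage."""
    total = histogram_mean(samples, "memkv_persist_us")
    if total == 0:
        return {}  # no writes observed yet
    shares = {stage: histogram_mean(samples, stage) / total for stage in STAGES}
    shares["unaccounted"] = 1.0 - sum(shares.values())
    return shares


# Made-up values: 10 writes averaging 100 us end to end.
scrape = {
    "memkv_persist_us_sum": 1000.0, "memkv_persist_us_count": 10.0,
    "memkv_persist_pre_wait_us_sum": 100.0, "memkv_persist_pre_wait_us_count": 10.0,
    "memkv_writer_wait_us_sum": 700.0, "memkv_writer_wait_us_count": 10.0,
    "memkv_persist_post_wait_us_sum": 150.0, "memkv_persist_post_wait_us_count": 10.0,
}
for stage, share in persist_breakdown(scrape).items():
    print(f"{stage}: {share:.0%}")
```

Because _sum and _count accumulate from process start, feeding the function the difference between two scrapes gives the breakdown for that window instead of the lifetime average.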

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /v1/health
    port: 9901
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /v1/ready
    port: 9901
  initialDelaySeconds: 5
  periodSeconds: 5
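
To reason about how these probes behave, it helps to model the kubelet's rule: an httpGet probe succeeds on any status from 200 to 399, and a liveness failure triggers a restart only after failureThreshold consecutive failures (Kubernetes defaults to 3 when the field is omitted, as it is above). The sketch below replays a sequence of /v1/health status codes against that rule. The function name and the sequence are hypothetical.

```python
from typing import Iterable, Optional


def first_restart(codes: Iterable[int], failure_threshold: int = 3) -> Optional[int]:
    """Index of the probe at which the liveness restart would trigger, or None."""
    failures = 0
    for index, code in enumerate(codes):
        if 200 <= code < 400:
            failures = 0  # any success resets the consecutive-failure count
        else:
            failures += 1
            if failures >= failure_threshold:
                return index
    return None


# One probe every periodSeconds (10 s above): a two-probe blip recovers,
# a three-probe outage does not.
observed = [200, 503, 503, 200, 503, 503, 503]
print(first_restart(observed))  # 6
```

With periodSeconds: 10 and the default threshold, a node has to report unhealthy for three consecutive probes before the container is restarted, so short unhealthy intervals do not cause restarts.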