Monitoring

Health endpoints, Prometheus metrics, status API, and Kubernetes probes for MemKV.

MemKV exposes health endpoints and Prometheus metrics on the admin HTTP endpoint (default: 0.0.0.0:9901). The client library ships separate exporters that are off by default.

Client-side exporters

`memkv-client` (the NIXL plugin and `LD_PRELOAD` shim) emits its own metrics: request latencies, transport state, and batch sizing. Pick a sink by setting one of these environment variables; with neither set, observability is disabled and adds no work to the hot path.

Variable	Type	Default	Notes
`MEMKV_OTEL_ENDPOINT`	URL	unset	OTLP/gRPC collector to ship spans + metrics to (e.g. `http://otel-collector:4317`). Falls back to `OTEL_EXPORTER_OTLP_ENDPOINT`. Requires the client built with the `otel` feature.
`MEMKV_PROMETHEUS_BIND`	`host:port`	unset	Bind address for an in-process Prometheus scrape endpoint at `/metrics`. Use when the client runs inside a long-lived process you can scrape directly. Requires the `prometheus` feature.

If both are set, the OTLP path wins. The exporter is installed once per process; subsequent installation attempts are no-ops.
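The precedence described above can be sketched in a few lines of Python. This is an illustration, not the client's actual code: the function name `pick_sink` and the returned tuple shape are hypothetical, and the sketch assumes the `OTEL_EXPORTER_OTLP_ENDPOINT` fallback is consulted before the Prometheus bind address, which the text does not state explicitly.

```python
import os
from typing import Mapping, Optional, Tuple


def pick_sink(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (sink, target), or None when observability stays disabled.

    Hypothetical helper: mirrors the documented rules, not the real client.
    """
    # MEMKV_OTEL_ENDPOINT first, then the standard OTLP variable as fallback.
    # Assumption: the fallback also takes priority over the Prometheus bind.
    otlp = env.get("MEMKV_OTEL_ENDPOINT") or env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if otlp:
        return ("otlp", otlp)

    prom = env.get("MEMKV_PROMETHEUS_BIND")
    if prom:
        return ("prometheus", prom)

    # Neither variable set: no exporter is installed.
    return None


if __name__ == "__main__":
    print(pick_sink(os.environ))
```

Running the script inside the client's environment shows which sink would be selected under these assumptions, which can help when a deployment sets both variables by accident.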

Health Endpoints

All admin endpoints are versioned under /v1/.

Endpoint	Purpose	Response
`GET /v1/health`	Liveness probe	JSON `{status, rdma_active, drives_online, drives_offline, drives_total}` (macOS: `{status}` only)
`GET /v1/ready`	Readiness probe	`ok` (200)
`GET /v1/status`	Detailed status	JSON with memory, storage, RDMA stats
`GET /v1/metrics`	Prometheus metrics	`text/plain` Prometheus format

curl http://localhost:9901/v1/health
# {"status":"healthy","rdma_active":true,"drives_online":12,"drives_offline":0,"drives_total":12}

`status` is one of `healthy`, `degraded`, or `unhealthy`. The HTTP status code mirrors the body: 200 for `healthy`, 503 for `unhealthy`.
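A monitoring script can read both the HTTP code and the body. The following is a minimal sketch using only the Python standard library; the helper names `classify` and `check` are hypothetical, and treating any non-200 code as `unhealthy` when the body cannot be parsed is an assumption of this example rather than documented behaviour.

```python
import json
import urllib.error
import urllib.request

KNOWN = ("healthy", "degraded", "unhealthy")


def classify(http_code: int, body: str) -> str:
    """Prefer the status field in the body; fall back to the HTTP code."""
    try:
        status = json.loads(body).get("status")
    except (ValueError, AttributeError):
        # Body was not JSON, or was JSON but not an object.
        status = None
    if status in KNOWN:
        return status
    # Assumption: without a usable body, only 200 counts as healthy.
    return "healthy" if http_code == 200 else "unhealthy"


def check(url: str = "http://localhost:9901/v1/health", timeout: float = 2.0) -> str:
    """Fetch the health endpoint; connection errors propagate to the caller."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx codes such as 503; the body is still readable.
        return classify(err.code, err.read().decode())
```

Keeping `classify` separate from the network call makes the decision logic easy to unit-test against recorded responses.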

`memkv admin status --json` mirrors the full `/v1/status` shape below, including the storage capacity totals (`raw_bytes_total`, `usable_bytes_total`, `used_bytes_total`) and the per-device byte fields. The plain `memkv admin status` table shows storage fill as used / usable plus a percentage.

Status Endpoint

curl http://localhost:9901/v1/status

{
  "memory": {
    "used_bytes": 1073741824,
    "total_bytes": 265289728000,
    "slabs_allocated": 512,
    "slabs_total": 126464
  },
  "storage": {
    "num_devices": 24,
    "block_size_bytes": 4194304,
    "total_blocks": 1000,
    "raw_bytes_total": 384000000000000,
    "usable_bytes_total": 383979687936000,
    "used_bytes_total": 176160768,
    "devices": [
      {
        "device_id": 0,
        "path": "/dev/nvme0n1",
        "total_slots": 3815461,
        "allocated_slots": 42,
        "raw_capacity_bytes": 16000000000000,
        "usable_bytes": 15999153635328,
        "metadata_bytes": 846364672,
        "used_bytes": 176160768
      }
    ]
  },
  "rdma": {
    "active_connections": 8
  },
  "blocks_total": 1000,
  "blocks_persisted": 950,
  "uptime_seconds": 3612
}
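The capacity fields in this response are related by simple arithmetic, so a consumer can derive the fill percentage and sanity-check each device. The sketch below assumes the relationships given in the metric descriptions on this page (used bytes equal `allocated_slots * block_size_bytes`, and usable bytes equal raw capacity minus metadata); the function name `storage_fill` and its output keys are hypothetical.

```python
def storage_fill(status: dict) -> dict:
    """Derive fill percentages and consistency checks from a /v1/status body."""
    storage = status["storage"]
    block = storage["block_size_bytes"]
    usable_total = storage["usable_bytes_total"]
    used_total = storage["used_bytes_total"]

    devices = {}
    for dev in storage["devices"]:
        usable = dev["usable_bytes"]
        devices[dev["path"]] = {
            # Fill is used / usable, expressed as a percentage.
            "fill_pct": 100.0 * dev["used_bytes"] / usable if usable else 0.0,
            # used_bytes should equal allocated_slots * block_size_bytes.
            "used_matches_slots": dev["used_bytes"] == dev["allocated_slots"] * block,
            # usable_bytes should equal raw capacity minus metadata.
            "usable_matches_raw": usable == dev["raw_capacity_bytes"] - dev["metadata_bytes"],
        }

    return {
        "fill_pct": 100.0 * used_total / usable_total if usable_total else 0.0,
        "devices": devices,
    }
```

For the sample device above, 42 allocated slots at 4194304 bytes each give 176160768 used bytes, and 16000000000000 raw bytes minus 846364672 metadata bytes give 15999153635328 usable bytes, so both checks pass.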

Prometheus Metrics

Request Operations

Metric	Type	Description
`memkv_allocate_total`	Counter	Total allocate requests
`memkv_lookup_total`	Counter	Total lookup requests
`memkv_delete_total`	Counter	Total delete requests
`memkv_commit_total`	Counter	Total commit requests
`memkv_read_total`	Counter	Total direct-IO read requests
`memkv_read_bytes_total`	Counter	Total bytes served on direct-IO reads
`memkv_write_total`	Counter	Total direct-IO write requests
`memkv_write_bytes_total`	Counter	Total bytes persisted on direct-IO writes
`memkv_connect_total`	Counter	Total RDMA connect requests
`memkv_exists_total`	Counter	Total exists requests
`memkv_not_found_total`	Counter	Total not-found responses
`memkv_no_space_total`	Counter	Total no-space responses
`memkv_invalid_requests_total`	Counter	Total invalid/malformed requests
`memkv_request_latency_us`	Histogram	End-to-end request latency

RDMA Transport

Metric	Type	Description
`memkv_rdma_active`	Gauge	1 if the RDMA data path is active, 0 if the server fell back to local-only
`memkv_rdma_connections_active`	Gauge	Active RDMA connections
`memkv_rdma_put_latency_us`	Histogram	RDMA put latency (server RDMA read from client on a put)
`memkv_rdma_get_latency_us`	Histogram	RDMA get latency (server RDMA write to client on a get)
`memkv_rdma_errors_total`	Counter	RDMA transport errors (CQ/post failures)

TCP Transport

Metric	Type	Description
`memkv_tcp_put_total`	Counter	TCP put data frames served (inline put or streamed chunk)
`memkv_tcp_get_total`	Counter	TCP get data frames served (inline get or streamed chunk)
`memkv_tcp_put_bytes_total`	Counter	Payload bytes accepted by TCP puts
`memkv_tcp_get_bytes_total`	Counter	Payload bytes served by TCP gets
`memkv_tcp_put_latency_us`	Histogram	Server-side TCP put handling latency (excludes socket write)
`memkv_tcp_get_latency_us`	Histogram	Server-side TCP get handling latency (excludes socket write)

Storage Devices

Metric	Type	Labels	Description
`memkv_device_slots_used`	Gauge	device	Allocated slots per device
`memkv_device_capacity_raw_bytes`	Gauge	device	Raw NVMe capacity per device in bytes
`memkv_device_capacity_usable_bytes`	Gauge	device	Per-device bytes available for user data (raw minus MemKV metadata regions)
`memkv_device_bytes_used`	Gauge	device	Per-device bytes currently allocated to blocks (`allocated_slots * block_size`)
`memkv_storage_raw_bytes`	Gauge	—	Sum of raw NVMe capacity across all devices
`memkv_storage_usable_bytes`	Gauge	—	Sum of usable capacity (data region only) across all devices
`memkv_storage_used_bytes`	Gauge	—	Sum of bytes allocated to blocks across all devices
`memkv_trim_extents_total`	Counter	device	Extents trimmed per device
`memkv_trim_bytes_total`	Counter	device	Bytes trimmed per device
`memkv_trim_latency_us`	Histogram	device	TRIM batch latency
`memkv_trim_errors_total`	Counter	device	TRIM errors per device
`memkv_trim_pending`	Gauge	—	Pending TRIM requests

Memory and Blocks

Metric	Type	Description
`memkv_blocks_total`	Gauge	Total blocks in index
`memkv_blocks_persisted`	Gauge	Blocks persisted to storage
`memkv_blocks_by_state`	Gauge	Block count by state
`memkv_slabs_allocated`	Gauge	Allocated memory slabs
`memkv_slabs_total`	Gauge	Total memory slabs
`memkv_memory_used_bytes`	Gauge	Memory used for blocks
`memkv_memory_total_bytes`	Gauge	Total memory available
`memkv_uptime_seconds`	Gauge	Server uptime in seconds (since process start)

Admin API

Metric	Type	Labels	Description
`memkv_admin_requests_total`	Counter	path	Admin endpoint requests by path

Storage Pipeline (hot-path tracing)

Fine-grained timings inside the write pipeline — useful for diagnosing sustained-write behaviour under load.

Metric	Type	Description
`memkv_persist_us`	Histogram	Direct-write end-to-end wall time (request entry → persist return)
`memkv_persist_pre_wait_us`	Histogram	Entry to just before the writer-worker handoff (extent lock, journal Alloc, …)
`memkv_writer_wait_us`	Histogram	Wait time on the io_uring writer-worker channel
`memkv_persist_post_wait_us`	Histogram	After writer-worker completion to persist return (journal Commit, dirty queue)
`memkv_iouring_write_us`	Histogram	io_uring write submit-to-completion latency
`memkv_iouring_fsync_us`	Histogram	io_uring fsync submit-to-completion latency
`memkv_iouring_fsync_batch_size`	Histogram	Data writes drained per fsync completion
`memkv_allocate_duration_us`	Histogram	FreeList slot allocate time
`memkv_journal_flush_duration_us`	Histogram	Journal buffer flush (fdatasync) time
`memkv_journal_pending_entries`	Gauge	Un-flushed journal entries across devices
`memkv_dirty_queue_depth`	Gauge	Pending entries in per-device dirty queues
`memkv_btree_cache_hits_total`	Counter	BTree leaf cache hits
`memkv_btree_cache_misses_total`	Counter	BTree leaf cache misses
`memkv_btree_leaf_miss_us`	Histogram	BTree leaf miss-path resolution time
`memkv_freelist_cas_retries_total`	Counter	FreeList head CAS retry count
`memkv_slab_double_free_total`	Counter	Bounce-buffer slabs freed more than once
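Counters such as the BTree cache pair are usually most useful as a ratio. As a rough sketch, the Python below parses Prometheus text-format output (as served by `GET /v1/metrics`) and computes a cache hit ratio from `memkv_btree_cache_hits_total` and `memkv_btree_cache_misses_total`. The helper names are hypothetical, the parser handles only simple sample lines, and it assumes these two counters are exposed without labels.

```python
from typing import Dict, Optional


def parse_samples(text: str) -> Dict[str, float]:
    """Map each series (name plus optional label set) to its sample value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "}" in line:
            # Labelled series: everything up to the closing brace is the key.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            # rest[0] is the value; an optional trailing timestamp is ignored.
            samples[series] = float(rest[0])
    return samples


def btree_hit_ratio(samples: Dict[str, float]) -> Optional[float]:
    """Return hits / (hits + misses), or None before any lookups occur."""
    hits = samples.get("memkv_btree_cache_hits_total", 0.0)
    misses = samples.get("memkv_btree_cache_misses_total", 0.0)
    total = hits + misses
    return hits / total if total else None
```

Because counters only increase, a ratio computed from one scrape covers the whole process lifetime; subtracting two scrapes before dividing gives the ratio for the interval between them.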

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /v1/health
    port: 9901
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /v1/ready
    port: 9901
  initialDelaySeconds: 5
  periodSeconds: 5
