Healthcheck Probes
Each AIStor server process exposes unauthenticated endpoints for simple healthchecks that probe server uptime and deployment high availability. These endpoints return an HTTP status code indicating whether the underlying resource is healthy or satisfies read/write quorum. The server exposes no other data through these endpoints.
AIStor liveness
Use the following endpoint to test if the specified AIStor server is up and ready to serve requests:
curl -I https://aistor.example.net:9000/minio/health/live
Replace https://aistor.example.net:9000 with the DNS hostname and port of the server to check.
A response code of 200 OK indicates the server is online and functional.
Any other HTTP status code indicates an issue reaching the server, such as a transient network failure or potential downtime.
Use this endpoint with load balancer healthcheck probes to ensure that client operations route only to healthy nodes.
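For example, a minimal external probe script might look like the following sketch. The hostname, timeout, and exit codes are illustrative assumptions, not part of AIStor; most load balancers implement an equivalent check natively.

```sh
#!/usr/bin/env sh
# Sketch of a liveness probe: succeed only if the endpoint returns HTTP 200.
# ENDPOINT, the 5-second timeout, and the exit codes are assumed example values.
ENDPOINT="https://aistor.example.net:9000/minio/health/live"

STATUS=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$ENDPOINT")

if [ "$STATUS" = "200" ]; then
  echo "live: $ENDPOINT returned 200"
  exit 0
else
  echo "not live: $ENDPOINT returned HTTP $STATUS" >&2
  exit 1
fi
```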
A failed healthcheck probe alone does not indicate that the server is offline, only that the probing host could not reach it.
Consider configuring a Prometheus alert using the minio_cluster_servers_offline_total metric to detect whether one or more AIStor servers are offline.
Cluster write readiness
Use the following endpoint to check if the local AIStor node views the cluster as ‘ready’ to process write operations:
curl -I https://aistor.example.net:9000/minio/health/cluster
Replace https://aistor.example.net:9000 with the DNS hostname and port of any server in the deployment to check.
For clusters using a load balancer to manage incoming connections, specify the hostname of the load balancer.
The target node queries its peers for their current drive health and status. A ‘ready’ response indicates that enough peer nodes responded as fully initialized with sufficient healthy drives to support write quorum.
This endpoint alone cannot determine the uptime status of the target node or its peers. For detecting and alerting on node downtime, configure Prometheus alerts using one of the following V3 metrics to detect potential issues or errors on the cluster:
- `minio_cluster_servers_offline_total` to alert if one or more servers are offline.
- `minio_server_drive_free_bytes` to alert if the deployment is running low on free drive space.
Distributed readiness check
Use the distributed=true query parameter to verify cluster health from all nodes’ perspectives:
curl -I "https://aistor.example.net:9000/minio/health/cluster?distributed=true"
A distributed readiness check performs a fan-out call to all peer nodes requesting they each perform their own health check. The response then indicates whether all peer nodes agree on cluster readiness, instead of relying on only the single local node’s view.
Combine distributed=true with maintenance=true to verify if a specific node can be safely taken offline while ensuring all other nodes see the cluster as healthy:
curl -I "https://aistor.example.net:9000/minio/health/cluster?distributed=true&maintenance=true"
Response codes and headers
The endpoint returns one of the following HTTP codes:
| HTTP Code | Description |
|---|---|
| 200 OK | Sufficient online nodes and healthy drives for write operations. |
| 503 Service Unavailable | Insufficient online nodes or healthy drives for write operations. |
The response includes the following headers:
| Header | Description |
|---|---|
| X-Minio-Write-Quorum | Number of drives required to satisfy write quorum |
| X-Minio-Storage-Class-Defaults | true if using default storage class settings |
| X-Minio-Healing-Drives | Number of drives currently healing (only present if greater than 0) |
| X-Minio-Server-Status | Reason for failure or degraded state (see Understanding 503 responses) |
The result of the probe alone does not determine the health of the target server or the cluster’s ability to process operations. It indicates only the targeted node’s local view of cluster health and status. For example, if a network partition exists between the node and a peer, the returned status reports that peer as ‘offline’ even if the peer is otherwise healthy and servicing requests. Similarly, the node may not have fully initialized at the time of the healthcheck while the remainder of the cluster is healthy and operational.
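As an illustrative sketch, the status line and quorum-related headers can be pulled from a single response with curl and grep. The hostname is the documentation example, and the parsing approach is an assumption, not an official tool:

```sh
# Sketch: capture the write-readiness status code and quorum-related headers.
URL="https://aistor.example.net:9000/minio/health/cluster"

# -s silences the progress meter; -i includes response headers in the output.
RESPONSE=$(curl -s -i "$URL")

echo "$RESPONSE" | head -n 1                          # e.g. HTTP/1.1 200 OK or HTTP/1.1 503 Service Unavailable
echo "$RESPONSE" | grep -i '^X-Minio-Write-Quorum'    # drives required for write quorum
echo "$RESPONSE" | grep -i '^X-Minio-Server-Status'   # reason, present on failed or degraded responses
```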
Cluster read readiness
Use the following endpoint to check if the local AIStor node views the cluster as ‘ready’ to process read operations:
curl -I https://aistor.example.net:9000/minio/health/cluster/read
Replace https://aistor.example.net:9000 with the DNS hostname and port of a server in the deployment to check.
For clusters using a load balancer to manage incoming connections, specify the hostname of the load balancer.
The target node queries its peers for their current drive health and status. A ‘ready’ response indicates that enough peer nodes responded as fully initialized with healthy drives to support read quorum.
This endpoint alone cannot determine the uptime status of any given peer node. For detecting and alerting on node downtime, configure Prometheus alerts using one of the following metrics to detect potential issues or errors on the cluster:
- `minio_cluster_servers_offline_total` to alert if one or more servers are offline.
- `minio_server_drive_free_bytes` to alert if the deployment is running low on free drive space.
Response codes and headers
The endpoint returns one of the following HTTP codes:
| HTTP Code | Description |
|---|---|
| 200 OK | Sufficient online nodes and healthy drives for read operations. |
| 503 Service Unavailable | Insufficient online nodes or healthy drives for read operations. |
The response includes the following headers:
| Header | Description |
|---|---|
| X-Minio-Read-Quorum | Number of drives required to satisfy read quorum |
| X-Minio-Storage-Class-Defaults | true if using default storage class settings |
| X-Minio-Healing-Drives | Number of drives currently healing (only present if greater than 0) |
| X-Minio-Server-Status | Reason for failure or degraded state (see Understanding 503 responses) |
The result of the probe alone does not determine the health of the target server or the cluster’s ability to process operations. It indicates only the targeted node’s local view of cluster health and status. For example, if a network partition exists between the node and a peer, the returned status reports that peer as ‘offline’ even if the peer is otherwise healthy and servicing requests. Similarly, the node may not have fully initialized at the time of the healthcheck while the remainder of the cluster is healthy and operational.
Cluster maintenance check
Use the following endpoint to test if an AIStor deployment can maintain both read and write quorum if the target node is taken down for maintenance:
curl -I "https://aistor.example.net:9000/minio/health/cluster?maintenance=true"
Replace https://aistor.example.net:9000 with the DNS hostname and port of a server in the deployment to check.
Response codes
| HTTP Code | Description |
|---|---|
| 200 OK | Deployment can maintain quorum if this server goes offline. |
| 412 Precondition Failed | Returned only when maintenance=true. Indicates the deployment will lose quorum if this server goes offline. |
| 503 Service Unavailable | Server is unavailable (check the X-Minio-Server-Status header for the reason). |
The response alone does not indicate the health or availability of the node. It only indicates whether the cluster can tolerate taking the node offline for maintenance operations.
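A pre-maintenance gate in a rolling update script might branch on these codes as in the following sketch. The hostname and the decision to abort on any non-200 response are assumptions:

```sh
# Sketch: proceed with maintenance on this node only if the rest of the
# deployment can keep quorum without it.
NODE="https://aistor.example.net:9000"

CODE=$(curl -s -o /dev/null -w '%{http_code}' "$NODE/minio/health/cluster?maintenance=true")

case "$CODE" in
  200) echo "safe to take $NODE offline for maintenance" ;;
  412) echo "quorum would be lost; do not take $NODE offline" >&2; exit 1 ;;
  *)   echo "node unavailable or degraded (HTTP $CODE); check X-Minio-Server-Status" >&2; exit 1 ;;
esac
```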
Understanding 503 responses
The cluster healthcheck endpoints return 503 Service Unavailable if the target server could not verify the health status of the cluster.
This includes scenarios in which the target server has not fully initialized or has encountered an error preventing startup.
The X-Minio-Server-Status response header contains the specific reason for the failure.
Some failures may clear given sufficient time for the server to start up or retry startup operations.
If the status persists beyond the normal startup time, check the server logs for errors.
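Because some of these states clear once initialization finishes, a startup gate might poll the endpoint and surface the reported status while waiting, as in the following sketch; the polling interval and overall timeout are arbitrary assumptions:

```sh
# Sketch: wait up to roughly two minutes for the cluster healthcheck to return 200,
# printing the X-Minio-Server-Status reason while it still returns 503.
URL="https://aistor.example.net:9000/minio/health/cluster"

for attempt in $(seq 1 24); do
  HEADERS=$(curl -s -D - -o /dev/null "$URL")          # -D - writes response headers to stdout
  CODE=$(echo "$HEADERS" | head -n 1 | awk '{print $2}')
  if [ "$CODE" = "200" ]; then
    echo "cluster ready"
    exit 0
  fi
  echo "attempt $attempt: HTTP ${CODE:-none}"
  echo "$HEADERS" | grep -i '^X-Minio-Server-Status' || echo "no status header (possible quorum failure)"
  sleep 5
done

echo "cluster did not become ready; check server logs" >&2
exit 1
```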
The following table lists status values returned as part of the X-Minio-Server-Status header:
| Status | Cause | Resolution |
|---|---|---|
| offline | Server is starting up, restarting, or failed to start due to configuration errors. | Wait for startup to complete. If the status persists, check server logs and verify drive mounts. |
| bucket-metadata-offline | Server is loading bucket metadata, or drives containing metadata are unavailable. | Wait for metadata loading to complete. Check drive health and server logs for metadata errors. |
| iam-offline | Server is loading IAM policies, or IAM data is being synchronized across the cluster. | Wait for IAM initialization. Check server logs for IAM-related errors. |
| restarting | Server is restarting. | Wait for the server to restart and complete initialization. Check server logs for errors. |
| license-offline | Server failed the license check. | Ensure the license is up-to-date, valid, and associated with an active SUBNET account. |
| license-readonly | License has expired. | Update the license for the node. |
| grid-offline | Server networking layer is not initialized, or the internal RPC port is blocked by a firewall. | Wait for startup to complete. Verify firewall rules allow inter-node traffic. |
| grid-none-online | Network partition, all other nodes offline, or DNS resolution failures. | Verify network connectivity between nodes. Check that peer nodes are running and DNS resolves correctly. |
For readiness queries with the distributed flag, the header supports the following additional values:
| Status | Cause | Resolution |
|---|---|---|
| peer-unreachable:{HOST} | The target server could not connect to the specified {HOST}. | Check network connectivity between the target and the peer. |
| peer-no-response:{HOST} | The target server did not receive a response from the specified {HOST}. | Check logs of the remote peer for errors or issues which would prevent a timely response. |
| peer-unhealthy:{HOST} | The {HOST} reported itself as unhealthy. | Check logs of the remote peer to determine the source of the health issues. |
| distributed-fanout-failed:{ERROR} | The RPC fanout query to peer nodes failed with {ERROR}. | Check network connectivity between the target and its peers. |
Quorum failures
When a 503 response has no X-Minio-Server-Status header, the cluster does not have sufficient drives online to meet quorum requirements.
This occurs when too many drives are offline due to hardware failures, maintenance, or unavailable mount points.
Use mc admin info to identify offline drives.
The X-Minio-Healing-Drives response header indicates if healing is in progress on replacement drives.
The number of drives required for quorum depends on the erasure code parity configuration.
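As a quick illustration, assuming a deployment alias named myaistor has already been registered with mc alias set:

```sh
# Show per-server and per-drive status, including offline drives, for the
# deployment behind the placeholder alias "myaistor".
mc admin info myaistor

# The X-Minio-Healing-Drives header appears only while replacement drives are healing.
curl -sI https://aistor.example.net:9000/minio/health/cluster | grep -i '^X-Minio-Healing-Drives'
```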