Alerts

AIStor Server publishes cluster and node metrics using the Prometheus Data Model. You can use any scraping tool to pull metrics data from AIStor Server for further analysis and alerting.

This page provides guidance on baseline P0 and P1 alerts to use when building alerting rules and infrastructure around the available V2 or V3 metrics. These alerts assume infrastructure similar to or exceeding our reference hardware guidelines. Modify them to reflect your needs, or open a SUBNET issue for further guidance.

Alerts for V3 metrics

The following alerts use AIStor’s v3 metrics API. The examples are written in the Prometheus alerting rule format for use with Prometheus and Alertmanager.
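
The rule snippets on this page show only the individual alert bodies. As a minimal sketch, one of the rules from this page (ClusterNodeOffline) wrapped in a Prometheus rule group might look like the following; the file and group names are illustrative, so organize the rules however fits your alerting workflow.

# aistor-alerts.yml (illustrative file name), loaded via rule_files in prometheus.yml
groups:
  - name: aistor-p0                      # illustrative group name
    rules:
      - alert: ClusterNodeOffline        # rule body copied from the P0 section below
        expr: minio_cluster_health_nodes_offline_count > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} MinIO node(s) are offline"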

P0 Alerts

P0 or Priority 0 alerts indicate a critical scenario that requires the highest priority in response and remediation.

Write quorum loss imminent

This alert triggers if the loss of any single additional drive in an erasure set would result in write quorum loss.

Loss of write quorum prevents all write operations to the affected erasure set, causing immediate application failures.

Modifying the for duration can account for transient drive failure and restoration, where longer periods allow normal operations to resume before the alert triggers. However, given the risk of data loss due to drive failure, shorter time frames ensure rapid identification of this critical issue.

alert: ErasureSetNearingQuorumLoss
expr: |
  minio_cluster_erasure_set_write_tolerance <= 1
for: 1m
labels:
  severity: critical
annotations:
  summary: "Erasure set {{ $labels.pool_id }}/{{ $labels.set_id }} operating at minimum capacity"
  impact: "No redundancy buffer. One more drive failure will cause data unavailability."
  action: "Investigate and restore at least 1/2 drives necessary for write quorum. Monitor closely."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/erasure-set

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.
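
As a minimal sketch, a scrape job against the root v3 endpoint might look like the following. The job name and target address are illustrative, and the bearer_token line assumes your deployment requires authenticated metrics access; omit or adjust it if it does not.

scrape_configs:
  - job_name: aistor-v3                       # illustrative job name
    metrics_path: /minio/metrics/v3           # root v3 endpoint covering all metric groups
    scheme: https                             # use http if the deployment does not terminate TLS
    bearer_token: REPLACE_WITH_METRICS_TOKEN  # assumption: metrics require a bearer token
    static_configs:
      - targets: ["aistor.example.net:9000"]  # illustrative endpoint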

Erasure set offline drives exceed 1/2 write quorum

This alert triggers if the number of offline erasure set drives exceeds 1/2 the write quorum value for the cluster.

Loss of write quorum prevents all write operations to the affected erasure set, causing immediate application failures.

Modifying the for duration can account for transient drive failure and restoration, where longer periods allow normal operations to resume before the alert triggers.

alert: ErasureSetQuorumLossImminent
expr: |
  minio_cluster_erasure_set_write_tolerance <=
  floor(minio_cluster_erasure_set_write_quorum/2)  
for: 5m
labels:
  severity: critical
annotations:
  summary: "Erasure set {{ $labels.pool_id }}/{{ $labels.set_id }} at 1/2 write availability"
  impact: "No redundancy buffer. One more drive failure will cause data unavailability."
  action: "Investigate offline drives immediately. Plan drive replacement. Monitor closely."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/erasure-set

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Cluster node offline

This alert triggers if any node in the cluster goes offline. An offline node reduces cluster capacity and may impact quorum for erasure sets on that node.

Modifying the for duration can account for transient node connectivity issues.

Modifying the node count threshold above 0 may apply to very large clusters (1000+ nodes). For example, you can alert on a percentage of total nodes in the cluster based on your infrastructure's fault tolerance; see the variant sketch after the baseline rule below. Consult with engineering through SUBNET before modifying this to a nonzero value.

alert: ClusterNodeOffline
expr: minio_cluster_health_nodes_offline_count > 0
for: 2m
labels:
  severity: critical
annotations:
  summary: "{{ $value }} MinIO node(s) are offline"
  impact: "Reduced cluster capacity. Potential quorum risk if additional failures occur."
  action: "Identify offline nodes. Check node health, network connectivity, and process status."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/health

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Increasing server error rate

This alert triggers if the rate of 5xx errors increases beyond the configured threshold. 5xx errors indicate that the AIStor server has failed to process a client request.

This alert only fires if the elevated 5xx error rate is sustained for a set period of time, capturing large spikes rather than isolated errors. Depending on the infrastructure and workload, some level of 5xx errors may be common or normal.

The alert sets a rate threshold of 1 over a 5 minute window, sustained for at least 2 minutes. This corresponds to 1 error/second, or 300 errors per 5 minutes. You can tune the threshold lower (0.5) or higher (5) depending on the size of the cluster and the typical volume of 5xx errors during normal operations.

alert: HighServerErrorRate
expr: rate(minio_api_requests_5xx_errors_total[5m]) > 1
for: 2m
labels:
  severity: critical
annotations:
  summary: "High 5xx error rate on {{ $labels.server }}: {{ $value | humanize }} errors/sec"
  impact: "Server-side failures causing workload failures. SLA violations likely."
  action: "Check {{ $labels.server }} logs immediately. Investigate resource exhaustion, drive failures, or software issues."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/api/requests

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Increasing site replication error rate

This alert triggers if the rate of site replication errors increases beyond the configured threshold. Site replication errors indicate that the cluster cannot replicate data successfully to one or more configured peers.

The alert sets a rate threshold of 1 over a 5 minute window, sustained for at least 2 minutes. This corresponds to 1 error/second, or 300 errors per 5 minutes. You can tune the threshold lower (0.5) or higher (5) depending on the size of the cluster and the typical volume of site replication errors during normal operations.

alert: HighReplicationErrorRate
expr: rate(minio_cluster_replication_errors_total[5m]) > 1
for: 2m
labels:
  severity: critical
annotations:
  summary: "Cluster replication errors exceeding {{ $value | humanize }}/s"
  impact: "Data not synchronized. RPO violated. Disaster recovery capability compromised."
  action: "Check replication target availability. Verify credentials and network connectivity."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/replication

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Bucket replication failures

This alert triggers if the rate of bucket replication errors increases beyond the configured threshold. Bucket replication errors indicate that the cluster cannot replicate data successfully to one or more configured peers.

The alert sets a rate threshold of 1 over a 5 minute window, sustained for at least 2 minutes. This corresponds to 1 error/second, or 300 errors per 5 minutes. You can tune the threshold lower (0.5) or higher (5) depending on the size of the cluster and the typical volume of bucket replication errors during normal operations.

alert: HighBucketReplicationErrorRate
expr: rate(minio_bucket_replication_total_failed_count[5m]) > 1
for: 2m
labels:
  severity: critical
annotations:
  summary: "Bucket replication errors for {{ $labels.bucket }} exceeding {{ $value | humanize }}/s"
  impact: "Data not synchronized. RPO violated. Disaster recovery capability compromised."
  action: "Check replication target availability. Verify credentials and network connectivity."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/bucket/replication

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Site replication queue growing

This alert triggers based on the combination of replication queue size and growth:

  • > 10000: Queue of 10k objects represents ~1-10GB of unreplicated data depending on average object size
  • deriv[10m] > 0: Queue showing growth
  • for: 10m: Queue growth sustained over time

Small-object workloads can use larger queue values, such as > 50000 for 50K objects (~25-50GB). Large-object workloads can use smaller queue values, such as > 1000 for 1K objects (~10-100GB). See the variant sketch after the baseline rule below.

Unbounded queue growth indicates increasing replication lag and may risk SLA/SLO around data safety/restoration.

alert: ReplicationQueueGrowing
expr: |
  minio_cluster_replication_queued_count > 10000 and
  deriv(minio_cluster_replication_queued_count[10m]) > 0  
for: 10m
labels:
  severity: critical
annotations:
  summary: "Replication queue to {{ $labels.endpoint }} growing: {{ $value }} objects queued"
  impact: "Increasing replication lag. Extended RPO. Risk of memory exhaustion."
  action: "Check replication bandwidth and target capacity. Investigate if target {{ $labels.endpoint }} is saturated."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/replication

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Storage capacity nearing production stoppage

This alert triggers if the free capacity of the cluster drops below 15%. At full capacity the cluster cannot process any further write operations. Treat this alert as a warning of imminent downtime due to storage exhaustion.

alert: StorageCapacityExhausted
expr: |
  (minio_cluster_health_capacity_usable_free_bytes /
   minio_cluster_health_capacity_usable_total_bytes) < 0.15  
for: 10m
labels:
  severity: critical
annotations:
  summary: "Cluster storage {{ $value | humanizePercentage }} free (below 15%)"
  impact: "Storage nearly exhausted, approaching production stoppage."
  action: "Immediately implement capacity recovery, such as removing aged data or migrating workloads to other clusters."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/health

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Storage capacity critically low

This alert triggers if the free capacity of the cluster drops below 30%.

As capacity decreases further, the cluster may exhibit performance degradation as both storage space and inode availability drop. At full capacity the cluster cannot process any further write operations.

The 30% threshold provides a buffer to allow for planning and execution of expansion, ILM cleanup, or batch cleanup operations.

alert: StorageCapacityCritical
expr: |
  (minio_cluster_health_capacity_usable_free_bytes /
   minio_cluster_health_capacity_usable_total_bytes) < 0.30  
for: 10m
labels:
  severity: critical
annotations:
  summary: "Cluster storage {{ $value | humanizePercentage }} free (below 30%)"
  impact: "Approaching storage exhaustion. Write operations will fail when full."
  action: "Plan capacity expansion. Review data lifecycle policies. Estimate time to full based on growth rate."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/health

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Storage capacity rapidly decreasing

This alert triggers if free capacity decreases rapidly over an hour, with the rate sustained for a 30 minute period. This may indicate that a client process or workload has exceeded normal operations.

The threshold value of 1 corresponds to 1GB/hour; the expr includes a conversion factor of 1024 * 1024 * 1024 to convert bytes to gigabytes. Modify this value based on the typical capacity consumption of normal operations. The following table provides a quick reference for rates and their threshold values:

Rate          Threshold Value
1GB/hour      1
10GB/hour     10
100GB/hour    100
1TB/hour      1024

alert: StorageCapacityDecreasing
expr: |
  delta(minio_cluster_health_capacity_usable_free_bytes[1h]) / (1024 * 1024 * 1024) < -1
for: 30m
labels:
  severity: critical
annotations:
  summary: "Cluster storage decreasing rapidly (>1GB/hour)"
  impact: "Faster than expected capacity consumption. May run out sooner than planned."
  action: "Investigate workload changes. Check for data growth anomalies. Accelerate capacity planning."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/health

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Storage capacity rapidly increasing

This alert triggers if free capacity increases rapidly over an hour, with the rate sustained for a 30 minute period. This may help capture an unexpected deletion process, such as an over-broad DeleteObject command.

The threshold value of 1 corresponds to 1GB/hour; the expr includes a conversion factor of 1024 * 1024 * 1024 to convert bytes to gigabytes. Modify this value based on the typical free-capacity increase from combined client, ILM, and batch delete operations. The following table provides a quick reference for rates and their threshold values:

Rate          Threshold Value
1GB/hour      1
10GB/hour     10
100GB/hour    100
1TB/hour      1024

alert: StorageFreeSpaceIncreasing
expr: |
  delta(minio_cluster_health_capacity_usable_free_bytes[1h]) / (1024 * 1024 * 1024) > 1
for: 30m
labels:
  severity: critical
annotations:
  summary: "Cluster free space increasing rapidly (>1GB/hour)"
  impact: "Unexpected data deletion. Potential data loss from misconfigured ILM or errant delete operations."
  action: "Investigate deletion activity. Check ILM policies and recent DeleteObject operations."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/health

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Key Management Server unavailable

This alert triggers if the MinIO Key Management Server (KMS) goes offline. MinIO KMS supports Server-Side Encryption (SSE) in AIStor deployments. An offline KMS prevents encryption and decryption operations for SSE-enabled buckets or objects.

This alert does not apply to clusters running without SSE enabled.

alert: KMSUnavailable
expr: minio_kms_online == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "KMS is offline on {{ $labels.server }}"
  impact: "Cannot decrypt encrypted objects. Complete data unavailability for encrypted buckets."
  action: "Verify KMS service availability. Check network connectivity to KMS endpoint. Verify credentials."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/kms

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Audit log messaging failure

This alert triggers if the rate of audit log errors exceeds 1 over a 5 minute window, sustained for at least 2 minutes. This corresponds to 1 error/second, or 300 errors per 5 minutes. You can tune the threshold lower (0.5) or higher (5) depending on the size of the cluster and the typical volume of audit errors during normal operations.

Loss of audit log messages can indicate a critical failure in the path between AIStor and the configured log target. Investigate and remediate connectivity or configuration issues to ensure delivery of audit messages.

alert: HighAuditLogErrorRate
expr: rate(minio_audit_failed_messages[5m]) > 1
for: 2m
labels:
  severity: critical
annotations:
  summary: "High audit log error rate on {{ $labels.server }}: {{ $value | humanize }} errors/sec"
  impact: "Potential loss of audit log events and violation of auditing policies"
  action: "Check {{ $labels.server }} logs immediately. Investigate networking issues or availability of configured remote audit target."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/audit

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

P1 Alerts

P1 or Priority 1 alerts indicate a scenario that requires high priority in response and remediation.
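
The P1 rules below set severity: warning, while the P0 rules above set severity: critical. If you route alerts through Alertmanager, a minimal sketch of severity-based routing might look like the following; the receiver names are illustrative and carry no integrations here, so adapt them to your paging and ticketing systems.

route:
  receiver: ops-tickets            # illustrative default receiver for warning-level (P1) alerts
  routes:
    - matchers:
        - severity = "critical"    # P0 alerts above page the on-call rotation
      receiver: oncall-pager
receivers:
  - name: oncall-pager             # add pagerduty_configs, webhook_configs, etc. as appropriate
  - name: ops-tickets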

Storage capacity low

This alert triggers if the free capacity of the cluster drops below 40%.

As capacity decreases further, the cluster may exhibit performance degradation as both storage space and inode availability drop. At full capacity the cluster cannot process any further write operations.

The 40% threshold provides sufficient time to plan capacity reclamation operations or cluster expansion.

alert: StorageCapacityLow
expr: |
  (minio_cluster_health_capacity_usable_free_bytes /
   minio_cluster_health_capacity_usable_total_bytes) < 0.40  
for: 10m
labels:
  severity: warning
annotations:
  summary: "Cluster storage {{ $value | humanizePercentage }} free (below 40%)"
  impact: "Approaching storage exhaustion. Write operations will fail when full."
  action: "Plan capacity expansion. Review data lifecycle policies. Estimate time to full based on growth rate."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/health

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Goroutine count exceeding normal levels

This alert triggers if the number of active goroutines for an AIStor process exceeds 10,000.

High goroutine counts may indicate a leak, hung process, or other issue. Investigation and remediation can prevent node failure.

You can tune the value of 10000 lower or higher based on the average number of goroutines in the cluster during normal operations.

alert: GoroutineCountHigh
expr: minio_system_process_go_routine_total > 10000
for: 10m
labels:
  severity: warning
annotations:
  summary: "Node {{ $labels.server }} has {{ $value }} goroutines (threshold: 10000)"
  impact: "Potential memory leak. Risk of OOM and node crash."
  action: "Review {{ $labels.server }} logs for stuck connections or hanging operations. Consider controlled restart if trend continues."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/system/process

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Goroutine count rapidly increasing

This alert triggers if the number of active goroutines increases over a set period of time.

High goroutine counts may indicate a leak, hung process, or other issue. Investigation and remediation can prevent node failure.

The alert sets a rate of 10 over a 5 minute window, sustained for at least 10 minutes. This corresponds to 10 goroutines/second, or 3,000 goroutines per 5 minutes. You can tune the rate lower or higher depending on the size of the cluster and the typical goroutine activity during normal operations.

alert: GoroutineCountRapidlyIncreasing
expr: deriv(minio_system_process_go_routine_total[5m]) > 10
for: 10m
labels:
  severity: warning
annotations:
  summary: "Goroutine count on {{ $labels.server }} increasing at {{ $value | humanize }}/sec"
  impact: "Potential memory leak. Risk of OOM and node crash."
  action: "Review {{ $labels.server }} logs for stuck connections or hanging operations. Consider controlled restart if trend continues."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/system/process

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Increasing client error rate

This alert triggers if the rate of 4xx errors increases beyond the configured threshold. 4xx errors indicate faulty client operations, such as malformed requests or bad/expired credentials.

This alert only fires if the elevated 4xx error rate is sustained for a set period of time, capturing large spikes rather than isolated errors. Depending on the infrastructure and workload, some level of 4xx errors may be common or normal.

The alert sets a rate threshold of 1 over a 5 minute window, sustained for at least 2 minutes. This corresponds to 1 error/second, or 300 errors per 5 minutes. You can tune the threshold lower (0.5) or higher (5) depending on the size of the cluster and the typical volume of 4xx errors during normal operations.

alert: HighClientErrorRate
expr: rate(minio_api_requests_4xx_errors_total[5m]) > 1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High 4xx error rate on {{ $labels.server }}: {{ $value | humanize }} errors/sec"
  impact: "Client applications experiencing errors. May indicate misconfiguration or authentication issues."
  action: "Review error types in logs. Check for authentication failures, missing buckets, or invalid requests."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/api/requests

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Erasure set health degraded

This alert triggers if any erasure set has any offline drives.

While erasure sets can operate in a degraded state, ops teams should plan to inspect and repair/replace the failed infrastructure at the earliest opportunity. Multiple failed drives or nodes in an erasure set may cause quorum loss.

You can increase the for duration beyond 15m to account for transient failures. Environments with transient failures longer than 30m should investigate and resolve the underlying condition.

alert: ErasureSetDegraded
expr: minio_cluster_erasure_set_health == 0
for: 15m
labels:
  severity: warning
annotations:
  summary: "Erasure set {{ $labels.pool_id }}/{{ $labels.set_id }} is degraded"
  impact: "Reduced redundancy. Additional drive failure could cause data unavailability."
  action: "Check drive status in pool {{ $labels.pool_id }}, set {{ $labels.set_id }}. Plan drive replacement if offline."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/cluster/erasure-set

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Drive offline or unavailable

This alert triggers for any offline drive in the cluster. It complements the erasure set degraded alert by providing more direct information on which drive has failed. Due to erasure set distribution semantics, a large number of offline drives does not necessarily indicate data loss. However, the underlying issue still requires remediation to mitigate the risk of further interruption or potential loss of quorum or data.

You can increase the for duration beyond 10m to account for transient failures. Environments with transient failures longer than 30m should investigate and resolve the underlying condition.

alert: DriveOffline
expr: minio_system_drive_health == 0
for: 10m
labels:
  severity: warning
annotations:
  summary: "Drive {{ $labels.drive }} at index {{ $labels.drive_index }} in server {{$labels.server}} is offline."
  impact: "Reduced redundancy. Additional drive failure could cause data unavailability."
  action: "Check drive in {{ $labels.server }} - {{ $labels.drive }}. Plan drive replacement if drive is unrecoverable."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/system/drive

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Memory usage high

This alert triggers if the percentage of total memory used exceeds 90%. Memory exhaustion can cause the kernel to invoke the OOM killer, which in turn may kill the AIStor server process. Monitoring memory pressure enables proactive intervention.

You can decrease the threshold to create a more aggressive warning in systems known to have memory pressure issues.

alert: MemoryUsageHigh
expr: minio_system_memory_used_perc > 90
for: 10m
labels:
  severity: warning
annotations:
  summary: "Memory usage on {{ $labels.server }} at {{ $value }}%"
  impact: "Risk of OOM condition. Process may be killed by kernel."
  action: "Check memory-consuming processes. Review goroutine count. Consider increasing memory or reducing workload."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/system/memory

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Memory usage rapidly increasing

This alert triggers if the percentage of total memory used increases by more than 1.25% in 15 minutes (5% in an hour) while system memory usage is already above 50%. This provides early warning of a sudden increase in memory usage, where further exhaustion could lead to the OS invoking the OOM killer.

alert: MemoryUsageIncreasing
expr: |
  delta(minio_system_memory_used_perc[15m]) > 1.25 and
  minio_system_memory_used_perc > 50  
for: 10m
labels:
  severity: warning
annotations:
  summary: "Memory usage on {{ $labels.server }} increasing rapidly ({{ $value }}%/15min)"
  impact: "Memory leak or workload increase. Approaching OOM condition."
  action: "Investigate memory growth on {{ $labels.server }}. Check for goroutine leaks or cache growth."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/system/memory

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

Scanner stalled or not running

This alert triggers if the time since the last scanner activity exceeds 48 hours (172800 seconds), sustained for a 1 hour period. This indicates that the scanner may have stalled or stopped running for an extended period of time.

alert: ScannerStalled
expr: minio_scanner_last_activity_seconds > 172800
for: 1h
labels:
  severity: warning
annotations:
  summary: "Scanner inactive on {{ $labels.server }} for {{ $value | humanizeDuration }}"
  impact: "Usage data may be stale. Metadata consistency checks not running."
  action: "Check scanner process status on {{ $labels.server }}. Review system logs for errors."

This alert requires the Prometheus scraping configuration to capture the metrics provided by the following API endpoint(s):

  • /minio/metrics/v3/scanner

A scrape job using the root /minio/metrics/v3 endpoint satisfies the above requirement.

File descriptor exhaustion

This alert triggers if the number of open file descriptors exceeds 90% of the process limit. File descriptor exhaustion prevents further operations from accessing resources like files, network sockets, and pipes.

alert: FileDescriptorExhaustion
expr: (minio_system_process_file_descriptor_open_total / minio_system_process_file_descriptor_limit_total) > 0.90
for: 2m
labels:
  severity: warning
annotations:
  summary: "MinIO process on {{ $labels.server }} using {{ $value | printf '%.2f' }}% of available file descriptors"
  impact: "File descriptor exhaustion prevents opening new connections or files, resulting in halting of further operations on the server"
  action: "Increase ulimit on server or investigate potential connection leaks."