Monitoring and alerting using Prometheus
AIStor Server publishes cluster, node, bucket, and resource metrics using the Prometheus Data Model. The procedure on this page documents the following:
- Configuring a Prometheus service to scrape and display metrics from an AIStor Server deployment
- Configuring an alert rule on a MinIO AIStor metric to trigger an AlertManager action
This tutorial uses metrics version 2. You can also use metrics version 3, which is recommended for new deployments. For more information about version 3, see Metrics and alerts.
Configure Prometheus to collect and alert using MinIO AIStor metrics
1) Generate the scrape configuration
Use the mc admin prometheus generate command to generate the scrape configuration for use by Prometheus in making scraping requests:
AIStor Server deployment
The following command generates a scrape configuration for cluster-level metrics of the MinIO AIStor deployment.
mc admin prometheus generate ALIAS
Replace ALIAS with the alias of the deployment.
The command returns output similar to the following:
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/cluster
  scheme: https
  static_configs:
  - targets: [aistor.example.net]
```
Nodes
The following command generates a scrape configuration for node-level metrics of the MinIO AIStor deployment.
mc admin prometheus generate ALIAS node
Replace ALIAS with the alias of the deployment.
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job-node
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/node
  scheme: https
  static_configs:
  - targets: [aistor-1.example.net, aistor-2.example.net, aistor-N.example.net]
```
Buckets
The following command generates a scrape configuration for bucket metrics of the MinIO AIStor deployment.
mc admin prometheus generate ALIAS bucket
Replace ALIAS with the alias of the MinIO AIStor deployment.
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job-bucket
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/bucket
  scheme: https
  static_configs:
  - targets: [aistor.example.net]
```
Resources
The following command generates a scrape configuration for resource metrics of the MinIO AIStor deployment.
mc admin prometheus generate ALIAS resource
Replace ALIAS with the alias of the deployment.
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job-resource
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/resource
  scheme: https
  static_configs:
  - targets: [aistor.example.net]
```
- Set an appropriate `scrape_interval` value to ensure each scraping operation completes before the next one begins. The recommended value is 60 seconds. Some deployments require a longer scrape interval due to the number of metrics being scraped. To reduce the load on your MinIO AIStor and Prometheus servers, choose the longest interval that meets your monitoring requirements.
- Set the `job_name` to a value associated with the MinIO AIStor deployment. Use a unique value to isolate the deployment's metrics from any others collected by that Prometheus service.
- MinIO AIStor deployments started with `MINIO_PROMETHEUS_AUTH_TYPE` set to `"public"` can omit the `bearer_token` field.
- Set the `scheme` to `http` for MinIO AIStor deployments not using TLS.
- Set the `targets` array with a hostname that resolves to the MinIO AIStor deployment. This can be any single node, or a load balancer or proxy that handles connections to the MinIO AIStor nodes.
  For MinIO AIStor deployments on Kubernetes, when Prometheus runs in the same cluster, you can specify the service DNS name of the `minio` service. Otherwise, specify the ingress or load balancer endpoint configured to route connections to and from the MinIO AIStor deployment.
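As an illustration of the adjustments listed above, the following sketch shows a cluster scrape job for a deployment that is assumed to run without TLS and to have been started with `MINIO_PROMETHEUS_AUTH_TYPE` set to `"public"`. The job name `aistor-dev`, the hostname, and the port are hypothetical placeholders, not values produced by the generate command.

```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-dev             # hypothetical; pick a name unique to the deployment
  # bearer_token omitted: assumes MINIO_PROMETHEUS_AUTH_TYPE="public"
  metrics_path: /minio/v2/metrics/cluster
  scheme: http                     # assumes the deployment does not use TLS
  static_configs:
  - targets: [aistor-dev.example.net:9000]   # hypothetical host and port
```

For a TLS-enabled deployment with authenticated metrics, keep `scheme: https` and the generated `bearer_token` as shown in the command output.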
2) Restart Prometheus with the updated configuration
Append the desired scrape_configs job generated in the previous step to the configuration file:
Cluster
Cluster metrics aggregate node-level metrics and, where appropriate, attach labels to metrics for the originating node.
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/cluster
  scheme: https
  static_configs:
  - targets: [aistor.example.net]
```
Nodes
Node metrics support node-level monitoring. List every MinIO AIStor node in the targets array for this configuration.
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job-node
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/node
  scheme: https
  static_configs:
  - targets: [aistor-1.example.net, aistor-2.example.net, aistor-N.example.net]
```
Bucket
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job-bucket
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/bucket
  scheme: https
  static_configs:
  - targets: [aistor.example.net]
```
Resource
```yaml
global:
  scrape_interval: 60s

scrape_configs:
- job_name: aistor-job-resource
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/resource
  scheme: https
  static_configs:
  - targets: [aistor.example.net]
```
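Each generated snippet carries its own global and scrape_configs keys, while a Prometheus configuration file takes a single global block and a single scrape_configs list. When collecting more than one metric type, the merged file might look like the following sketch. It assumes the cluster and node jobs from the previous step and reuses the same placeholder TOKEN and hostnames.

```yaml
global:
  scrape_interval: 60s

scrape_configs:
# Cluster-level metrics, scraped through a single endpoint
- job_name: aistor-job
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/cluster
  scheme: https
  static_configs:
  - targets: [aistor.example.net]

# Node-level metrics, scraped from every node
- job_name: aistor-job-node
  bearer_token: TOKEN
  metrics_path: /minio/v2/metrics/node
  scheme: https
  static_configs:
  - targets: [aistor-1.example.net, aistor-2.example.net, aistor-N.example.net]
```

The promtool utility that ships with Prometheus provides a check config subcommand that can validate the merged file before you restart the service.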
Start the Prometheus cluster using the configuration file:
prometheus --config.file=prometheus.yaml
3) Analyze collected metrics
Prometheus includes an expression browser. You can execute queries here to analyze the collected metrics.
Examples
The following query examples return the samples that Prometheus collected over the last five minutes for a scrape job named aistor-job:
minio_node_drive_free_bytes{job="aistor-job"}[5m]
minio_node_drive_free_inodes{job="aistor-job"}[5m]
minio_node_drive_latency_us{job="aistor-job"}[5m]
minio_node_drive_offline_total{job="aistor-job"}[5m]
minio_node_drive_online_total{job="aistor-job"}[5m]
minio_node_drive_total{job="aistor-job"}[5m]
minio_node_drive_total_bytes{job="aistor-job"}[5m]
minio_node_drive_used_bytes{job="aistor-job"}[5m]
minio_node_drive_errors_timeout{job="aistor-job"}[5m]
minio_node_drive_errors_availability{job="aistor-job"}[5m]
minio_node_drive_io_waiting{job="aistor-job"}[5m]
Recommended metrics
MinIO recommends the following as a basic set of metrics to monitor.
See Metrics and alerts for information about all available metrics.
| Metric | Description |
|---|---|
| `minio_node_drive_free_bytes` | Total storage available on a drive. |
| `minio_node_drive_free_inodes` | Total free inodes. |
| `minio_node_drive_latency_us` | Average last minute latency in µs for drive API storage operations. |
| `minio_node_drive_offline_total` | Total drives offline in this node. |
| `minio_node_drive_online_total` | Total drives online in this node. |
| `minio_node_drive_total` | Total drives in this node. |
| `minio_node_drive_total_bytes` | Total storage on a drive. |
| `minio_node_drive_used_bytes` | Total storage used on a drive. |
| `minio_node_drive_errors_timeout` | Total number of drive timeout errors since server start. |
| `minio_node_drive_errors_availability` | Total number of drive I/O errors, permission denied and timeouts since server start. |
| `minio_node_drive_io_waiting` | Total number of I/O operations waiting on drive. |
4) Configure an alert rule using MinIO AIStor metrics
You must configure alert rules on the Prometheus deployment to trigger alerts based on collected MinIO AIStor metrics.
The following example alert rule file provides a baseline of alerts for a MinIO AIStor deployment. You can modify these examples or use them as guidance in building your own alerts.
```yaml
groups:
- name: aistor-alerts
  rules:
  - alert: NodesOffline
    expr: avg_over_time(minio_cluster_nodes_offline_total{job="aistor-job"}[5m]) > 0
    for: 10m
    labels:
      severity: warn
    annotations:
      summary: "Node down in MinIO AIStor deployment"
      description: "Node(s) in cluster {{ $labels.instance }} offline for more than 5 minutes"
  - alert: DisksOffline
    expr: avg_over_time(minio_cluster_drive_offline_total{job="aistor-job"}[5m]) > 0
    for: 10m
    labels:
      severity: warn
    annotations:
      summary: "Disks down in MinIO AIStor deployment"
      description: "Disk(s) in cluster {{ $labels.instance }} offline for more than 5 minutes"
```
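The same rule format can cover the recommended drive metrics from the previous step. The following is a minimal sketch of a capacity alert, not an official MinIO rule: the group name, the alert name, and the 10% threshold are assumptions chosen for illustration, and the expression assumes that the free and total byte metrics carry matching labels for each drive.

```yaml
groups:
- name: aistor-capacity-alerts      # hypothetical group name
  rules:
  - alert: DriveSpaceLow            # hypothetical alert name
    # Fraction of free space per drive; the 0.10 threshold is an assumed example value
    expr: minio_node_drive_free_bytes{job="aistor-job"} / minio_node_drive_total_bytes{job="aistor-job"} < 0.10
    for: 10m
    labels:
      severity: warn
    annotations:
      summary: "Drive low on free space in MinIO AIStor deployment"
      description: "A drive reported by {{ $labels.instance }} has less than 10% free space"
```

Adjust the threshold and the `for` duration to match the capacity planning of your own deployment.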
In the Prometheus configuration, specify the path to the alert file in the rule_files key:
rule_files:
- aistor-alerting.yml
Once triggered, Prometheus sends the alert to the configured AlertManager service.
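For Prometheus to deliver alerts, its configuration file also needs to name the AlertManager endpoint. The following sketch shows how the rule file and an AlertManager target could sit together in the Prometheus configuration. The hostname alertmanager.example.net is a hypothetical placeholder, and the port assumes an AlertManager service listening on its default port.

```yaml
rule_files:
- aistor-alerting.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets: [alertmanager.example.net:9093]   # hypothetical host; 9093 is the AlertManager default port
```

These keys sit at the top level of the configuration file, alongside the global and scrape_configs keys.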