Monitoring and alerting using Prometheus

AIStor Server publishes cluster, node, bucket, and resource metrics using the Prometheus Data Model. The procedure on this page documents the following:

  • Configuring a Prometheus service to scrape and display metrics from an AIStor Server deployment
  • Configuring an alert rule on a MinIO AIStor metric to trigger an AlertManager action

This tutorial uses metrics version 2. You can also use metrics version 3, which is recommended for new deployments. For more information about version 3, see Metrics and alerts.

Configure Prometheus to collect and alert using MinIO AIStor metrics

1) Create a dedicated access key for Prometheus

mc admin prometheus generate signs the bearer token locally using the credentials of the mc alias you point it at. Whatever access key signs the token is the identity AIStor Server uses to authorize scrape requests.

Create a dedicated access key whose only permitted action is scraping metrics, then use that access key to generate the scrape configuration. Do not use root credentials or an administrative user for this purpose.

Save the following policy document to a file, for example prometheus-scrape.json:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "admin:Prometheus"
            ]
        }
    ]
}

The admin:Prometheus action is the only permission required to scrape the metrics endpoints. The policy grants nothing else, so the resulting bearer token cannot read or modify object data, manage users, or perform any other administrative operation.

Use mc admin accesskey create to attach the policy inline to a new access key:

mc admin accesskey create ALIAS/                       \
    --name "prometheus-scrape"                         \
    --description "Used by Prometheus to scrape metrics" \
    --policy /path/to/prometheus-scrape.json
  • Replace ALIAS with the alias of a user with permission to create access keys.
  • The command outputs an AccessKey and SecretKey. Record both values; the secret key is shown only once.

The inline policy attached to an access key cannot grant access to any action the parent user does not already have. Create the access key from an identity that has admin:Prometheus (or admin:*).

Add a separate mc alias for the new access key. This isolates the Prometheus credentials from your administrative alias:

mc alias set myaistor-prometheus https://aistor.example.net PROMETHEUS_ACCESS_KEY PROMETHEUS_SECRET_KEY

Use this alias in the next step when generating the scrape configuration.

2) Generate the scrape configuration

Run the mc admin prometheus generate command against the alias for the dedicated Prometheus access key to generate the scrape configuration that Prometheus uses for its scrape requests. The bearer token in the output is signed with that access key and only authorizes the admin:Prometheus action.

AIStor Server deployment

The following command generates a scrape configuration for cluster-level metrics of the MinIO AIStor deployment.

mc admin prometheus generate ALIAS

Replace ALIAS with the alias of the dedicated Prometheus access key, for example myaistor-prometheus.

The command returns output similar to the following:

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/cluster
     scheme: https
     static_configs:
     - targets: [aistor.example.net]

Nodes

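The following command generates a scrape configuration for node-level metrics of the MinIO AIStor deployment.

mc admin prometheus generate ALIAS node

Replace ALIAS with the alias of the dedicated Prometheus access key, for example myaistor-prometheus.

The command returns output similar to the following: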

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job-node
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/node
     scheme: https
     static_configs:
     - targets: [aistor-1.example.net, aistor-2.example.net, aistor-N.example.net]

Buckets

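The following command generates a scrape configuration for bucket metrics of the MinIO AIStor deployment.

mc admin prometheus generate ALIAS bucket

Replace ALIAS with the alias of the dedicated Prometheus access key, for example myaistor-prometheus.

The command returns output similar to the following: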

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job-bucket
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/bucket
     scheme: https
     static_configs:
     - targets: [aistor.example.net]

Resources

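The following command generates a scrape configuration for resource metrics of the MinIO AIStor deployment.

mc admin prometheus generate ALIAS resource

Replace ALIAS with the alias of the dedicated Prometheus access key, for example myaistor-prometheus.

The command returns output similar to the following: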

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job-resource
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/resource
     scheme: https
     static_configs:
     - targets: [aistor.example.net]

For any of the generated configurations, review the following settings:

  • Set an appropriate scrape_interval value to ensure each scraping operation completes before the next one begins. The recommended value is 60 seconds.

    Some deployments require a longer scrape interval due to the number of metrics being scraped. To reduce the load on your MinIO AIStor and Prometheus servers, choose the longest interval that meets your monitoring requirements.

  • Set the job_name to a value associated with the MinIO AIStor deployment.

    Use a unique value to ensure isolation of the deployment metrics from any others collected by that Prometheus service.

  • MinIO AIStor deployments started with MINIO_PROMETHEUS_AUTH_TYPE set to "public" can omit the bearer_token field.

  • Set the scheme to http for MinIO AIStor deployments not using TLS.

  • Set the targets array with a hostname that resolves to the MinIO AIStor deployment.

    This can be any single node, or a load balancer/proxy which handles connections to the MinIO AIStor nodes.

    For MinIO AIStor deployments on Kubernetes, when Prometheus runs in the same Kubernetes cluster, you can specify the service DNS name for the minio service. Otherwise, specify the ingress or load balancer endpoint configured to route connections to the MinIO AIStor deployment.
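
As an illustration of how these settings combine, the following sketch shows a cluster job for a deployment that does not use TLS, runs with MINIO_PROMETHEUS_AUTH_TYPE set to "public", and is reached through a Kubernetes service. The job name, the namespace in the service DNS name, and port 9000 are assumptions made for this example, not values produced by mc admin prometheus generate.

```yaml
# Sketch only: job name, namespace, and port are hypothetical.
scrape_configs:
   - job_name: aistor-job-dev
     # bearer_token omitted: assumes MINIO_PROMETHEUS_AUTH_TYPE="public"
     metrics_path: /minio/v2/metrics/cluster
     # http because this example deployment does not use TLS
     scheme: http
     static_configs:
     # Service DNS name of the minio service in an assumed namespace
     - targets: ['minio.aistor-tenant.svc.cluster.local:9000']
```

Leaving the metrics endpoint unauthenticated exposes it to any client that can reach the service, so the bearer_token form from the generated output is likely the safer default outside of test environments.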

3) Restart Prometheus with the updated configuration

Append the desired scrape_configs job generated in the previous step to the configuration file:

Cluster

Cluster metrics aggregate node-level metrics and, where appropriate, attach labels to metrics for the originating node.

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/cluster
     scheme: https
     static_configs:
     - targets: [aistor.example.net]

Nodes

Node metrics are specific to node-level monitoring. List every MinIO AIStor node in the targets array for this configuration.

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job-node
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/node
     scheme: https
     static_configs:
     - targets: [aistor-1.example.net, aistor-2.example.net, aistor-N.example.net]

Bucket

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job-bucket
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/bucket
     scheme: https
     static_configs:
     - targets: [aistor.example.net]

Resource

global:
   scrape_interval: 60s

scrape_configs:
   - job_name: aistor-job-resource
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/resource
     scheme: https
     static_configs:
     - targets: [aistor.example.net]
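
Each sample above repeats the global block because it mirrors the generator output. If you add more than one job to the same file, the file needs only a single global block and a single scrape_configs list. A minimal sketch that combines the cluster and node jobs, reusing the placeholder token and hostnames from the samples above:

```yaml
global:
   scrape_interval: 60s

scrape_configs:
   # Cluster-level metrics, scraped through a single endpoint
   - job_name: aistor-job
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/cluster
     scheme: https
     static_configs:
     - targets: [aistor.example.net]

   # Node-level metrics, scraped from every node
   - job_name: aistor-job-node
     bearer_token: TOKEN
     metrics_path: /minio/v2/metrics/node
     scheme: https
     static_configs:
     - targets: [aistor-1.example.net, aistor-2.example.net, aistor-N.example.net]
```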

Start or restart the Prometheus service using the configuration file:

prometheus --config.file=prometheus.yaml

4) Analyze collected metrics

Prometheus includes an expression browser. You can execute queries here to analyze the collected metrics.

Examples

The following example queries return the samples that Prometheus collected over the last five minutes for a scrape job named aistor-job:

minio_node_drive_free_bytes{job="aistor-job"}[5m]
minio_node_drive_free_inodes{job="aistor-job"}[5m]

minio_node_drive_latency_us{job="aistor-job"}[5m]

minio_node_drive_offline_total{job="aistor-job"}[5m]
minio_node_drive_online_total{job="aistor-job"}[5m]

minio_node_drive_total{job="aistor-job"}[5m]

minio_node_drive_total_bytes{job="aistor-job"}[5m]
minio_node_drive_used_bytes{job="aistor-job"}[5m]

minio_node_drive_errors_timeout{job="aistor-job"}[5m]
minio_node_drive_errors_availability{job="aistor-job"}[5m]

minio_node_drive_io_waiting{job="aistor-job"}[5m]

MinIO recommends the following as a basic set of metrics to monitor.

See Metrics and alerts for information about all available metrics.

Metric                                Description
minio_node_drive_free_bytes           Total storage available on a drive.
minio_node_drive_free_inodes          Total free inodes on a drive.
minio_node_drive_latency_us           Average latency over the last minute, in µs, for drive API storage operations.
minio_node_drive_offline_total        Total drives offline in this node.
minio_node_drive_online_total         Total drives online in this node.
minio_node_drive_total                Total drives in this node.
minio_node_drive_total_bytes          Total storage on a drive.
minio_node_drive_used_bytes           Total storage used on a drive.
minio_node_drive_errors_timeout       Total number of drive timeout errors accumulated over the lifetime of the server. Persists across server restarts.
minio_node_drive_errors_availability  Total number of drive I/O errors, permission-denied errors, and timeouts accumulated over the lifetime of the server. Persists across server restarts.
minio_node_drive_io_waiting           Total number of I/O operations waiting on a drive.

5) Configure an alert rule using MinIO AIStor metrics

Configure alert rules on the Prometheus deployment to trigger alerts based on the collected MinIO AIStor metrics.

The following example alert rule file provides a baseline of alerts for a MinIO AIStor deployment. You can modify these examples or use them as guidance when building your own alerts.

groups:
- name: aistor-alerts
  rules:
  - alert: NodesOffline
    expr: avg_over_time(minio_cluster_nodes_offline_total{job="aistor-job"}[5m]) > 0
    for: 10m
    labels:
      severity: warn
    annotations:
      summary: "Node down in MinIO AIStor deployment"
      description: "Node(s) in cluster {{ $labels.instance }} offline for more than 5 minutes"

  - alert: DisksOffline
    expr: avg_over_time(minio_cluster_drive_offline_total{job="aistor-job"}[5m]) > 0
    for: 10m
    labels:
      severity: warn
    annotations:
      summary: "Disks down in MinIO AIStor deployment"
      description: "Disk(s) in cluster {{ $labels.instance }} offline for more than 5 minutes"
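
The same pattern extends to the drive metrics listed in the previous step. The following is a sketch of a capacity alert, not a rule shipped with MinIO AIStor: the group name, the alert name, and the 80% threshold are assumptions chosen for illustration. You could keep it in its own rule file or merge the rule into the group above.

```yaml
groups:
- name: aistor-capacity-alerts   # hypothetical group name
  rules:
  - alert: DriveNearlyFull       # hypothetical alert name
    # Fraction of each drive in use; 0.8 (80%) is an assumed threshold.
    expr: minio_node_drive_used_bytes{job="aistor-job"} / minio_node_drive_total_bytes{job="aistor-job"} > 0.8
    for: 10m
    labels:
      severity: warn
    annotations:
      summary: "Drive nearly full in MinIO AIStor deployment"
      description: "A drive reported by {{ $labels.instance }} has been more than 80% full for 10 minutes"
```

The division pairs the two series drive by drive because both metrics carry the same label set, so the expression yields one ratio per drive.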

Save the rules to a file, for example aistor-alerting.yml. In the Prometheus configuration, specify the path to that file in the rule_files key:

rule_files:
- aistor-alerting.yml

Once triggered, Prometheus sends the alert to the configured AlertManager service.
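
Prometheus can only forward alerts if its configuration names an AlertManager endpoint. A minimal sketch of that block, assuming an AlertManager instance at the hypothetical address alertmanager.example.net listening on port 9093, the AlertManager default:

```yaml
alerting:
  alertmanagers:
  - static_configs:
    # Hypothetical host; replace with your AlertManager address
    - targets: ['alertmanager.example.net:9093']
```

Routing, grouping, and notification receivers are configured in AlertManager itself rather than in the Prometheus configuration file.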

6) Visualize metrics in Grafana

You can visualize the collected metrics in Grafana using the MinIO AIStor dashboards. MinIO ships ready-to-import Grafana dashboards in JSON format for the most common monitoring needs:

Dashboard            Description
Customer operations  Essential deployment health in a single view: cluster, node, and drive status, capacity trends, request and error rates, and object growth. Recommended starting point.
Node                 Node-specific metrics for drive health, storage usage, latency, and errors.
Bucket               Per-bucket metrics including S3 API operations, replication status, object distribution, and resource usage.
Replication          Node-level and cluster-level replication activity, including worker counts, transfer rates, queue depths, and failures.

A MinIO AIStor dashboard is also published on the Grafana dashboard portal. You can import it directly by ID.

To import a dashboard:

  1. In Grafana, select Dashboards > Import.
  2. Upload the dashboard JSON file or paste its contents. For the published dashboard, enter the dashboard ID 13502 instead.
  3. Select the Prometheus data source that scrapes the MinIO AIStor metrics, then complete the import.
  4. If the dashboard uses a job variable, set it to the job_name you configured in the scrape configuration, for example aistor-job.

These dashboards are provided as starting points. Adjust the time ranges, refresh intervals, and alert thresholds to match your monitoring requirements.