S.M.A.R.T. Drive Health Monitoring

MinIO AIStor provides S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) disk health monitoring for proactive detection of drive failures. SMART monitoring collects health telemetry directly from NVMe and SATA drives using native Linux ioctl calls with no external dependencies.

Enabling SMART metrics

Prometheus metrics (v3)

SMART metrics are available at the /system/drive/smart collector path. See Metrics v3 Reference for the full metric listing.

Metric                                            Description
minio_system_drive_smart_status                   Health status (0=unknown, 1=healthy, 2=warning, 3=critical)
minio_system_drive_smart_temperature_celsius      Drive temperature in Celsius
minio_system_drive_smart_power_on_hours           Total power-on hours
minio_system_drive_smart_power_cycles             Power cycle count
minio_system_drive_smart_failure_risk             Estimated annual failure rate (0.0-1.0+)
minio_system_drive_smart_available_spare_percent  NVMe available spare capacity
minio_system_drive_smart_percentage_used          NVMe percentage of endurance used
minio_system_drive_smart_media_errors             NVMe media error count
minio_system_drive_smart_reallocated_sectors      SATA reallocated sector count
minio_system_drive_smart_pending_sectors          SATA pending sector count
minio_system_drive_smart_offline_uncorrectable    SATA offline uncorrectable sectors

All metrics include labels: drive, pool_index, set_index, drive_index, server.
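
As a rough illustration of consuming these metrics outside Prometheus, the sketch below parses exposition-format text and maps the numeric status to its name. The sample text and all label values in it are made up for illustration, and the parser is deliberately minimal (it does not handle label values containing a closing brace).

```python
import re

# Hypothetical scrape output; every label value here is invented.
SAMPLE = '''\
# TYPE minio_system_drive_smart_status gauge
minio_system_drive_smart_status{drive="/mnt/drive1",pool_index="0",set_index="0",drive_index="0",server="node1:9000"} 1
minio_system_drive_smart_status{drive="/mnt/drive2",pool_index="0",set_index="0",drive_index="1",server="node1:9000"} 3
minio_system_drive_smart_temperature_celsius{drive="/mnt/drive1",pool_index="0",set_index="0",drive_index="0",server="node1:9000"} 41
'''

# Status codes as documented for minio_system_drive_smart_status.
STATUS_NAMES = {0: "unknown", 1: "healthy", 2: "warning", 3: "critical"}

LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def drive_statuses(text):
    """Return {(server, drive): status name} for every smart_status sample."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if not m or m.group("name") != "minio_system_drive_smart_status":
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        code = int(float(m.group("value")))
        out[(labels.get("server"), labels.get("drive"))] = STATUS_NAMES.get(code, "unknown")
    return out
```

Calling drive_statuses(SAMPLE) yields one entry per drive, keyed by the server and drive labels, which is enough to drive a simple report or alert.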

Admin API (v4)

Request SMART data by adding ?smart=true to the drives query endpoint:

GET /minio/admin/v4/query/drives?smart=true

SMART data is returned in the metrics.smart field of each drive resource.
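
The following sketch shows one way a client might build this request URL and then separate drive resources that carry SMART data from those that do not. The host name, the helper names, and the "id" key in the sample are hypothetical; only the path, the smart=true parameter, and the metrics.smart field come from this page. Request authentication and the shape of the response envelope are not shown.

```python
from urllib.parse import urlencode, urlunsplit

DRIVES_QUERY_PATH = "/minio/admin/v4/query/drives"

def drives_query_url(host, smart=True, scheme="https"):
    """Build the drives query URL, optionally requesting SMART data."""
    query = urlencode({"smart": "true"}) if smart else ""
    return urlunsplit((scheme, host, DRIVES_QUERY_PATH, query, ""))

def split_by_smart(drive_resources):
    """Split decoded drive resources into (with SMART data, without SMART data).

    drive_resources is assumed to be a list of dicts already extracted
    from the JSON response; each may or may not contain metrics.smart.
    """
    with_smart, without_smart = [], []
    for res in drive_resources:
        smart = (res.get("metrics") or {}).get("smart")
        (with_smart if smart else without_smart).append(res)
    return with_smart, without_smart

# Hypothetical decoded resources; "id" is an illustrative key only.
sample = [
    {"id": "d1", "metrics": {"smart": {"failureRisk": 0.025}}},
    {"id": "d2", "metrics": {}},
]
```

Drives that land in the second list are worth checking against the permissions and troubleshooting notes, since SMART data is omitted silently when it cannot be collected.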

Health status classification

Drives are classified into three health states. The status metric also defines 0 (unknown) for drives without a classification.

Status    Description                           Action
healthy   No concerning indicators              None, drive operating normally
warning   Early signs of potential issues       Plan replacement within weeks/months
critical  High probability of imminent failure  Replace immediately

Monitored attributes

NVMe drives

Attribute         Critical threshold  Description
Critical Warning  Any bit set         Controller-reported critical condition
Available Spare   Below threshold     Remaining spare capacity percentage
Percentage Used   > 100%              Endurance consumed (can exceed 100%)
Media Errors      > 0                 Uncorrectable media errors
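
The NVMe critical conditions can be expressed as a small check. This is a sketch under stated assumptions, not AIStor's implementation: the function and parameter names are hypothetical, and it assumes the spare threshold is the value the drive itself reports alongside the available spare percentage.

```python
def nvme_critical_reasons(critical_warning, available_spare, spare_threshold,
                          percentage_used, media_errors):
    """Return the list of critical conditions that apply to an NVMe drive.

    critical_warning is the controller's bit field; the spare values and
    percentage_used are percentages; media_errors is a count.
    """
    reasons = []
    if critical_warning != 0:
        reasons.append("critical warning bit set")
    if available_spare < spare_threshold:
        reasons.append("available spare below threshold")
    if percentage_used > 100:
        reasons.append("endurance exceeded")
    if media_errors > 0:
        reasons.append("media errors reported")
    return reasons
```

An empty list means none of the critical conditions in the table apply; it does not by itself distinguish healthy from warning.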

SATA drives

Attribute                          Warning  Critical  Annual failure rate
Reallocated Sectors (ID 5)         1-10     > 10      2.5% - 30%+
Current Pending Sectors (ID 197)   1-5      > 5       3.5% - 25%+
Offline Uncorrectable (ID 198)     1        > 1       5% - 35%+
Spin Retry Count (ID 10)           -        > 0       15%+
Command Timeout (ID 188)           1-100    > 100     Variable

Thresholds are based on Google and Backblaze large-scale drive failure studies.
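
To make the threshold table concrete, here is a minimal sketch that classifies raw attribute values against it. The names are hypothetical, and combining attributes by taking the worst individual state is an assumption; this page does not specify how AIStor combines them.

```python
# Attribute ID -> (lowest raw value that counts as warning,
#                  raw value above which the attribute is critical).
# None means the table defines no warning range for that attribute.
SATA_THRESHOLDS = {
    5: (1, 10),      # Reallocated Sectors
    197: (1, 5),     # Current Pending Sectors
    198: (1, 1),     # Offline Uncorrectable
    10: (None, 0),   # Spin Retry Count
    188: (1, 100),   # Command Timeout
}

SEVERITY = ["healthy", "warning", "critical"]

def classify_attribute(attr_id, raw):
    """Classify one raw attribute value using the table above."""
    warn_from, crit_above = SATA_THRESHOLDS[attr_id]
    if raw > crit_above:
        return "critical"
    if warn_from is not None and raw >= warn_from:
        return "warning"
    return "healthy"

def classify_drive(raw_values):
    """Return the worst state across all known attributes (an assumption)."""
    worst = "healthy"
    for attr_id, raw in raw_values.items():
        if attr_id not in SATA_THRESHOLDS:
            continue  # ignore attributes the table does not cover
        state = classify_attribute(attr_id, raw)
        if SEVERITY.index(state) > SEVERITY.index(worst):
            worst = state
    return worst
```

For example, a drive with 3 reallocated sectors and nothing else classifies as warning, while any non-zero spin retry count makes it critical.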

Failure risk score

The failureRisk field provides an estimated annual failure rate (AFR) as a decimal:

  • 0.0 - Baseline failure rate (healthy drive)
  • 0.025 - 2.5% annual failure probability
  • 0.30 - 30% annual failure probability (critical)

This enables automated alerting and replacement scheduling based on quantified risk.
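
One way to use the score for scheduling is sketched below: rank drives by risk and estimate how many failures to expect in a year by summing the per-drive rates. The function names, the drive names, and the 0.025 default cutoff are illustrative assumptions, and the sum is only a rough planning figure.

```python
def replacement_queue(risks, threshold=0.025):
    """Return drives whose failureRisk is at or above threshold, riskiest first.

    risks maps a drive name to its failureRisk value.
    """
    flagged = [(risk, drive) for drive, risk in risks.items() if risk >= threshold]
    flagged.sort(key=lambda item: item[0], reverse=True)
    return [drive for _, drive in flagged]

def expected_annual_failures(risks):
    """Approximate expected failures per year as the sum of per-drive AFRs."""
    return sum(risks.values())

# Hypothetical fleet using the example values from the list above.
fleet = {"drive-a": 0.0, "drive-b": 0.025, "drive-c": 0.30}
```

With this sample fleet, drive-c is queued ahead of drive-b and drive-a is not queued at all.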

Permissions

SMART data collection requires elevated privileges:

  • Root access, or
  • Linux capabilities: CAP_SYS_ADMIN and CAP_DAC_OVERRIDE

If permissions are insufficient, SMART data is omitted from responses without failing the request.
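
Because missing privileges do not produce an error, it can help to check the process capabilities directly. The sketch below decodes the CapEff mask found in a Linux /proc/<pid>/status file and reports which of the two required capabilities are absent. The bit numbers are the standard Linux capability numbers; the function names are hypothetical.

```python
# Linux capability bit numbers (from linux/capability.h).
CAP_DAC_OVERRIDE = 1
CAP_SYS_ADMIN = 21

REQUIRED = {"CAP_SYS_ADMIN": CAP_SYS_ADMIN, "CAP_DAC_OVERRIDE": CAP_DAC_OVERRIDE}

def effective_caps(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")

def missing_smart_caps(mask):
    """Return the names of required capabilities not set in mask, sorted."""
    return sorted(name for name, bit in REQUIRED.items() if not mask & (1 << bit))
```

In practice, status_text would be read from /proc/<pid>/status for the running server process; an empty result means both capabilities are effective.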

Distributed collection

In multi-node deployments, SMART data is collected from each node via the internal RPC framework. The coordinator node aggregates responses, providing cluster-wide visibility through a single API call.

Best practices

  1. Enable SMART in monitoring dashboards - Track temperature and health trends over time.
  2. Alert on warning status - Plan proactive replacements before failure occurs.
  3. Alert immediately on critical - Drives may fail within days or weeks.
  4. Monitor temperature - Sustained high temperatures accelerate wear.
  5. Track reallocated sectors trend - An increasing count indicates active degradation.
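
For the last practice, a trend can be reduced to a growth rate. The sketch below assumes samples of minio_system_drive_smart_reallocated_sectors collected over time as (unix seconds, count) pairs; the function name and sample format are hypothetical.

```python
SECONDS_PER_DAY = 86400

def growth_per_day(samples):
    """Return the average increase in count per day between first and last sample.

    samples is a chronological list of (unix_seconds, count) pairs.
    Returns 0.0 when there are fewer than two samples or no time has elapsed.
    """
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (c1 - c0) / ((t1 - t0) / SECONDS_PER_DAY)
```

A sustained positive rate is the signal of interest; a count that is non-zero but flat suggests past damage rather than ongoing degradation.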

Troubleshooting

SMART data missing for some drives

  • Verify the MinIO AIStor process has the required capabilities.
  • Check that the drive supports SMART (virtual or cloud drives may not).
  • Ensure the drive is NVMe or SATA. SAS drives use a different protocol.

Temperature shows 0

  • NVMe: Temperature sensor not implemented on the controller.
  • SATA: Drive does not report the temperature attribute.

High failure risk but drive seems fine

SMART predicts probability, not certainty. Even drives that appear healthy have elevated failure rates once they develop bad sectors. Treat a high risk score as an early warning, not a false positive.