S.M.A.R.T. Drive Health Monitoring
MinIO AIStor provides S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) disk health monitoring for proactive detection of drive failures. SMART monitoring collects health telemetry directly from NVMe and SATA drives using native Linux ioctl calls with no external dependencies.
Enabling SMART metrics
Prometheus metrics (v3)
SMART metrics are available at the /system/drive/smart collector path.
See Metrics v3 Reference for the full metric listing.
| Metric | Description |
|---|---|
| minio_system_drive_smart_status | Health status (0=unknown, 1=healthy, 2=warning, 3=critical) |
| minio_system_drive_smart_temperature_celsius | Drive temperature in Celsius |
| minio_system_drive_smart_power_on_hours | Total power-on hours |
| minio_system_drive_smart_power_cycles | Power cycle count |
| minio_system_drive_smart_failure_risk | Estimated annual failure rate (0.0-1.0+) |
| minio_system_drive_smart_available_spare_percent | NVMe available spare capacity |
| minio_system_drive_smart_percentage_used | NVMe percentage of endurance used |
| minio_system_drive_smart_media_errors | NVMe media error count |
| minio_system_drive_smart_reallocated_sectors | SATA reallocated sector count |
| minio_system_drive_smart_pending_sectors | SATA pending sector count |
| minio_system_drive_smart_offline_uncorrectable | SATA offline uncorrectable sectors |
All metrics include labels: drive, pool_index, set_index, drive_index, server.
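As a rough illustration of how these metrics and labels could be consumed, the sketch below parses Prometheus text-format output and lists drives reporting a warning or critical status. The sample scrape text and its label values are invented for the example, and the small regex parser is a simplification that only handles labelled samples without timestamps; a real deployment would more likely query Prometheus itself.

```python
import re

# Sample scrape output. Metric and label names come from the table above;
# the label values are made up for illustration.
SAMPLE = '''\
# HELP minio_system_drive_smart_status Health status
# TYPE minio_system_drive_smart_status gauge
minio_system_drive_smart_status{drive="/mnt/disk1",pool_index="0",set_index="0",drive_index="0",server="node1:9000"} 1
minio_system_drive_smart_status{drive="/mnt/disk2",pool_index="0",set_index="0",drive_index="1",server="node1:9000"} 3
minio_system_drive_smart_temperature_celsius{drive="/mnt/disk2",pool_index="0",set_index="0",drive_index="1",server="node1:9000"} 41
'''

SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip shapes this simplified parser does not handle
        labels = dict(LABEL.findall(match.group("labels")))
        yield match.group("name"), labels, float(match.group("value"))


def unhealthy_drives(text):
    """Return (server, drive, status) for drives at warning (2) or critical (3)."""
    return [
        (labels.get("server"), labels.get("drive"), int(value))
        for name, labels, value in parse_samples(text)
        if name == "minio_system_drive_smart_status" and value >= 2
    ]


print(unhealthy_drives(SAMPLE))  # [('node1:9000', '/mnt/disk2', 3)]
```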
Admin API (v4)
Request SMART data by adding ?smart=true to the drives query endpoint:
GET /minio/admin/v4/query/drives?smart=true
SMART data is returned in the metrics.smart field of each drive resource.
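A client consuming this response might separate drives that carry SMART data from those that do not, since the field can be absent. The following is a minimal sketch that takes an already-decoded list of drive resources; only the metrics.smart path comes from this page. The "id" key, the sample values, and the shape of the surrounding response envelope are assumptions made for illustration.

```python
def split_by_smart(drives):
    """Split drive resources into (with_smart, without_smart) lists.

    `drives` is an iterable of dicts decoded from the JSON response.
    A drive counts as having SMART data when metrics.smart is present
    and non-empty.
    """
    with_smart, without_smart = [], []
    for drive in drives:
        smart = (drive.get("metrics") or {}).get("smart")
        (with_smart if smart else without_smart).append(drive)
    return with_smart, without_smart


# Hypothetical drive resources; "id" is an illustrative key, not a documented one.
drives = [
    {"id": "disk1", "metrics": {"smart": {"failureRisk": 0.025}}},
    {"id": "disk2", "metrics": {}},
    {"id": "disk3"},
]

ok, missing = split_by_smart(drives)
print([d["id"] for d in ok])       # ['disk1']
print([d["id"] for d in missing])  # ['disk2', 'disk3']
```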
Health status classification
Drives are classified into three health states:
| Status | Description | Action |
|---|---|---|
| healthy | No concerning indicators | None, drive operating normally |
| warning | Early signs of potential issues | Plan replacement within weeks/months |
| critical | High probability of imminent failure | Replace immediately |
Monitored attributes
NVMe drives
| Attribute | Critical threshold | Description |
|---|---|---|
| Critical Warning | Any bit set | Controller-reported critical condition |
| Available Spare | Below threshold | Remaining spare capacity percentage |
| Percentage Used | > 100% | Endurance consumed (can exceed 100%) |
| Media Errors | > 0 | Uncorrectable media errors |
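The NVMe critical conditions in the table can be read as four independent checks. The sketch below is one possible encoding, not the AIStor implementation: the NvmeHealth type and its field names are hypothetical, and it assumes the spare threshold is the value the drive itself reports in its SMART log.

```python
from dataclasses import dataclass


@dataclass
class NvmeHealth:
    """Hypothetical container for the NVMe attributes listed above."""
    critical_warning: int   # bit field; any non-zero value is a critical condition
    available_spare: int    # remaining spare capacity, percent
    spare_threshold: int    # percent; assumed to be reported by the drive
    percentage_used: int    # endurance consumed, may exceed 100
    media_errors: int       # uncorrectable media error count


def nvme_critical_reasons(h):
    """Return the critical conditions that apply; an empty list means none."""
    reasons = []
    if h.critical_warning != 0:
        reasons.append("critical warning bit set")
    if h.available_spare < h.spare_threshold:
        reasons.append("available spare below threshold")
    if h.percentage_used > 100:
        reasons.append("rated endurance exceeded")
    if h.media_errors > 0:
        reasons.append("media errors reported")
    return reasons


worn = NvmeHealth(critical_warning=0, available_spare=5, spare_threshold=10,
                  percentage_used=101, media_errors=0)
print(nvme_critical_reasons(worn))
# ['available spare below threshold', 'rated endurance exceeded']
```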
SATA drives
| Attribute | Warning | Critical | Annual failure rate |
|---|---|---|---|
| Reallocated Sectors (ID 5) | 1-10 | > 10 | 2.5% - 30%+ |
| Current Pending Sectors (ID 197) | 1-5 | > 5 | 3.5% - 25%+ |
| Offline Uncorrectable (ID 198) | 1 | > 1 | 5% - 35%+ |
| Spin Retry Count (ID 10) | - | > 0 | 15%+ |
| Command Timeout (ID 188) | 1-100 | > 100 | Variable |
Thresholds are based on Google and Backblaze large-scale drive failure studies.
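To make the warning and critical ranges concrete, here is a minimal sketch that applies the table to raw attribute values. The threshold numbers are taken from the table; the function names and the rule that a drive takes the worst status of any single attribute are assumptions for illustration, not a description of how AIStor combines them.

```python
# attribute ID -> (lowest warning value or None, value above which it is critical)
THRESHOLDS = {
    5: (1, 10),       # Reallocated Sectors
    197: (1, 5),      # Current Pending Sectors
    198: (1, 1),      # Offline Uncorrectable
    10: (None, 0),    # Spin Retry Count: no warning band
    188: (1, 100),    # Command Timeout
}
SEVERITY = ["healthy", "warning", "critical"]


def classify_attribute(attr_id, raw):
    """Classify one raw attribute value against the table."""
    warn_min, crit_above = THRESHOLDS[attr_id]
    if raw > crit_above:
        return "critical"
    if warn_min is not None and raw >= warn_min:
        return "warning"
    return "healthy"


def classify_drive(raw_values):
    """Return the worst status across known attributes (assumed combination rule)."""
    worst = "healthy"
    for attr_id, raw in raw_values.items():
        if attr_id not in THRESHOLDS:
            continue  # ignore attributes the table does not cover
        status = classify_attribute(attr_id, raw)
        if SEVERITY.index(status) > SEVERITY.index(worst):
            worst = status
    return worst


print(classify_drive({5: 3, 197: 0}))   # warning
print(classify_drive({5: 3, 197: 6}))   # critical
```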
Failure risk score
The failureRisk field, exported to Prometheus as minio_system_drive_smart_failure_risk, provides an estimated annual failure rate (AFR) as a decimal:
- 0.0: Baseline failure rate (healthy drive)
- 0.025: 2.5% annual failure probability
- 0.30: 30% annual failure probability (critical)
This enables automated alerting and replacement scheduling based on quantified risk.
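One way such scheduling could work is to order drives by their risk score and replace the riskiest first. The sketch below assumes the scores have already been collected into a mapping keyed by a drive identifier; the identifiers are made up, and the default threshold of 0.025 is borrowed from the example values above rather than being a documented default.

```python
def replacement_queue(risks, threshold=0.025):
    """Return drive identifiers at or above `threshold`, highest risk first.

    `risks` maps a drive identifier to its failureRisk value.
    """
    flagged = [drive for drive, risk in risks.items() if risk >= threshold]
    return sorted(flagged, key=lambda drive: risks[drive], reverse=True)


# Hypothetical fleet snapshot.
risks = {
    "node1:/mnt/disk1": 0.0,
    "node1:/mnt/disk2": 0.30,
    "node2:/mnt/disk1": 0.025,
}

print(replacement_queue(risks))
# ['node1:/mnt/disk2', 'node2:/mnt/disk1']
```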
Permissions
SMART data collection requires elevated privileges:
- Root access, or
- Linux capabilities: CAP_SYS_ADMIN and CAP_DAC_OVERRIDE
If permissions are insufficient, SMART data is omitted from responses without failing the request.
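Because missing permissions fail silently, it can help to check a process's effective capabilities directly. The sketch below reads the CapEff mask that Linux exposes in /proc/<pid>/status and tests the two capability bits; the bit numbers are the standard Linux values, while the function names are illustrative. Pass the PID of the AIStor server process to check it rather than the current process.

```python
# Standard Linux capability bit numbers (see capabilities(7)).
CAP_DAC_OVERRIDE = 1
CAP_SYS_ADMIN = 21


def effective_caps(status_text):
    """Extract the effective capability mask from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def has_caps(mask, *caps):
    """Return True if every capability bit in `caps` is set in `mask`."""
    return all(mask & (1 << cap) for cap in caps)


def can_collect_smart(pid="self"):
    """Check whether the given process holds both required capabilities."""
    with open(f"/proc/{pid}/status") as f:
        mask = effective_caps(f.read())
    return has_caps(mask, CAP_SYS_ADMIN, CAP_DAC_OVERRIDE)


# Example with a captured status snippet instead of a live process.
snippet = "Name:\tminio\nCapEff:\t0000000000200002\n"
print(has_caps(effective_caps(snippet), CAP_SYS_ADMIN, CAP_DAC_OVERRIDE))  # True
```

A process running as root normally holds the full capability set, so the same check covers the root case unless capabilities were explicitly dropped.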
Distributed collection
In multi-node deployments, SMART data is collected from each node via the internal RPC framework. The coordinator node aggregates responses, providing cluster-wide visibility through a single API call.
Best practices
- Enable SMART in monitoring dashboards - Track temperature and health trends over time.
- Alert on warning status - Plan proactive replacements before failure occurs.
- Alert immediately on critical - Drives may fail within days or weeks.
- Monitor temperature - Sustained high temperatures accelerate wear.
- Track reallocated sectors trend - An increasing count indicates active degradation.
Troubleshooting
SMART data missing for some drives
- Verify the MinIO AIStor process has the required capabilities.
- Check that the drive supports SMART (virtual or cloud drives may not).
- Ensure the drive is NVMe or SATA. SAS drives use a different protocol.
Temperature shows 0
- NVMe: Temperature sensor not implemented on the controller.
- SATA: Drive does not report the temperature attribute.
High failure risk but drive seems fine
SMART predicts probability, not certainty. Even healthy-seeming drives with bad sectors have elevated failure rates. Treat as an early warning, not a false positive.