S.M.A.R.T. Drive Health Monitoring

MinIO AIStor provides S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) disk health monitoring for proactive detection of drive failures. SMART monitoring collects health telemetry directly from NVMe and SATA drives using native Linux ioctl calls with no external dependencies.

Enabling SMART metrics

Prometheus metrics (v3)

SMART metrics are available at the /system/drive/smart collector path. See Metrics v3 Reference for the full metric listing.

Metric	Description
`minio_system_drive_smart_status`	Health status (0=unknown, 1=healthy, 2=warning, 3=critical)
`minio_system_drive_smart_temperature_celsius`	Drive temperature in Celsius
`minio_system_drive_smart_power_on_hours`	Total power-on hours
`minio_system_drive_smart_power_cycles`	Power cycle count
`minio_system_drive_smart_failure_risk`	Estimated annual failure rate (0.0-1.0+)
`minio_system_drive_smart_available_spare_percent`	NVMe available spare capacity
`minio_system_drive_smart_percentage_used`	NVMe percentage of endurance used
`minio_system_drive_smart_media_errors`	NVMe media error count
`minio_system_drive_smart_reallocated_sectors`	SATA reallocated sector count
`minio_system_drive_smart_pending_sectors`	SATA pending sector count
`minio_system_drive_smart_offline_uncorrectable`	SATA offline uncorrectable sectors

All metrics include labels: drive, pool_index, set_index, drive_index, server.
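As an illustration of how these metrics and labels might be consumed, the following minimal sketch parses scraped text in the Prometheus exposition format and lists drives whose status is warning or critical. The sample lines and all label values are made up for the example; only the metric names and label keys come from the table above.

```python
import re

# Illustrative scrape output; the label values are invented for this example.
SAMPLE = '''\
minio_system_drive_smart_status{drive="/mnt/drive1",pool_index="0",set_index="0",drive_index="0",server="node1:9000"} 1
minio_system_drive_smart_status{drive="/mnt/drive2",pool_index="0",set_index="0",drive_index="1",server="node1:9000"} 3
minio_system_drive_smart_temperature_celsius{drive="/mnt/drive1",pool_index="0",set_index="0",drive_index="0",server="node1:9000"} 41
'''

LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (metric_name, labels_dict, value) tuples for labelled samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue  # this sketch ignores samples without labels
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def unhealthy_drives(text):
    """List (server, drive, status) for drives reporting warning (2) or critical (3)."""
    return [
        (labels["server"], labels["drive"], int(value))
        for name, labels, value in parse_samples(text)
        if name == "minio_system_drive_smart_status" and value >= 2
    ]


print(unhealthy_drives(SAMPLE))
```

In practice a Prometheus alerting rule would usually replace this kind of ad hoc parsing; the sketch only shows how the status values and labels fit together.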

Admin API (v4)

Request SMART data by adding ?smart=true to the drives query endpoint:

GET /minio/admin/v4/query/drives?smart=true

SMART data is returned in the metrics.smart field of each drive resource.
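The sketch below shows one way a client might pull the metrics.smart field out of already-decoded drive resources. The overall response shape is an assumption: the payload is treated as a plain list of drive objects, and the "path" key used to identify each drive is hypothetical. Only the metrics.smart nesting and the failureRisk field name come from this page. Authentication and the HTTP request itself are left out.

```python
import json


def smart_by_drive(drives):
    """Map each drive resource to its metrics.smart object, or None if absent.

    SMART data can be missing (for example when the server lacks the
    required privileges), so a missing field is not treated as an error.
    """
    result = {}
    for index, drive in enumerate(drives):
        metrics = drive.get("metrics") or {}
        # "path" is a hypothetical identifier; fall back to the list index.
        key = drive.get("path", index)
        result[key] = metrics.get("smart")
    return result


# Made-up payload for illustration only.
payload = json.loads(
    '[{"path": "/mnt/drive1", "metrics": {"smart": {"failureRisk": 0.025}}},'
    ' {"path": "/mnt/drive2", "metrics": {}}]'
)
print(smart_by_drive(payload))
```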

Health status classification

Drives are classified into three health states. The status metric additionally reports 0 (unknown) when no classification is available:

Status	Description	Action
`healthy`	No concerning indicators	None, drive operating normally
`warning`	Early signs of potential issues	Plan replacement within weeks/months
`critical`	High probability of imminent failure	Replace immediately
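A small lookup table is often enough to turn the numeric minio_system_drive_smart_status value into the status name and recommended action from the table above. This is a sketch; the fallback text for the unknown state is an assumption, since the page does not define an action for it.

```python
# Numeric values as documented for minio_system_drive_smart_status.
STATUS_NAMES = {0: "unknown", 1: "healthy", 2: "warning", 3: "critical"}

# Actions copied from the health status table.
ACTIONS = {
    "healthy": "None, drive operating normally",
    "warning": "Plan replacement within weeks/months",
    "critical": "Replace immediately",
}


def describe_status(value):
    """Return (status_name, recommended_action) for a numeric status value."""
    name = STATUS_NAMES.get(int(value), "unknown")
    # The fallback wording below is an assumption, not documented behavior.
    action = ACTIONS.get(name, "No SMART assessment available")
    return name, action


print(describe_status(3))
```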

Monitored attributes

NVMe drives

Attribute	Critical threshold	Description
Critical Warning	Any bit set	Controller-reported critical condition
Available Spare	Below threshold	Remaining spare capacity percentage
Percentage Used	> 100%	Endurance consumed (can exceed 100%)
Media Errors	> 0	Uncorrectable media errors
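The NVMe critical conditions in the table can be expressed as a handful of comparisons. The following is a minimal sketch, not the AIStor implementation: the NvmeHealth type and its field names are hypothetical, and the spare threshold is assumed to be the value the drive itself reports alongside its available spare.

```python
from dataclasses import dataclass


@dataclass
class NvmeHealth:
    """Hypothetical container for the NVMe attributes listed above."""
    critical_warning: int   # bitfield reported by the controller
    available_spare: int    # remaining spare capacity, percent
    spare_threshold: int    # assumed: drive-reported spare threshold, percent
    percentage_used: int    # endurance consumed, may exceed 100
    media_errors: int       # uncorrectable media errors


def nvme_critical_reasons(health):
    """Return the list of critical conditions that apply (empty if none)."""
    reasons = []
    if health.critical_warning != 0:
        reasons.append("critical warning bit set")
    if health.available_spare < health.spare_threshold:
        reasons.append("available spare below threshold")
    if health.percentage_used > 100:
        reasons.append("percentage used above 100")
    if health.media_errors > 0:
        reasons.append("media errors present")
    return reasons


print(nvme_critical_reasons(NvmeHealth(0, 5, 10, 101, 2)))
```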

SATA drives

Attribute	Warning	Critical	Annual failure rate
Reallocated Sectors (ID 5)	1-10	> 10	2.5% - 30%+
Current Pending Sectors (ID 197)	1-5	> 5	3.5% - 25%+
Offline Uncorrectable (ID 198)	1	> 1	5% - 35%+
Spin Retry Count (ID 10)	-	> 0	15%+
Command Timeout (ID 188)	1-100	> 100	Variable

Thresholds are based on Google and Backblaze large-scale drive failure studies.
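To make the warning and critical bands concrete, the sketch below encodes the table as a lookup keyed by attribute ID and classifies raw values against it. Taking the worst attribute as the overall drive status is an assumption made for this example; the page does not describe how individual attributes are combined.

```python
# attribute ID -> (lowest warning value or None, critical if value exceeds this)
THRESHOLDS = {
    5: (1, 10),      # Reallocated Sectors: warning 1-10, critical > 10
    197: (1, 5),     # Current Pending Sectors: warning 1-5, critical > 5
    198: (1, 1),     # Offline Uncorrectable: warning 1, critical > 1
    10: (None, 0),   # Spin Retry Count: no warning band, critical > 0
    188: (1, 100),   # Command Timeout: warning 1-100, critical > 100
}

SEVERITY = ["healthy", "warning", "critical"]


def classify_attribute(attr_id, raw_value):
    """Classify one raw attribute value using the table thresholds."""
    warn_min, crit_above = THRESHOLDS[attr_id]
    if raw_value > crit_above:
        return "critical"
    if warn_min is not None and raw_value >= warn_min:
        return "warning"
    return "healthy"


def classify_drive(raw_values):
    """Assumed aggregation: the drive status is its worst attribute status."""
    worst = "healthy"
    for attr_id, raw_value in raw_values.items():
        if attr_id not in THRESHOLDS:
            continue  # attributes outside the table are ignored in this sketch
        status = classify_attribute(attr_id, raw_value)
        if SEVERITY.index(status) > SEVERITY.index(worst):
            worst = status
    return worst


print(classify_drive({5: 3, 197: 0, 198: 0}))
```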

Failure risk score

The failureRisk field provides an estimated annual failure rate (AFR) as a decimal:

0.0 - Baseline failure rate (healthy drive)
0.025 - 2.5% annual failure probability
0.30 - 30% annual failure probability (critical)

This enables automated alerting and replacement scheduling based on quantified risk.
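As a sketch of what risk-based scheduling could look like, the example below ranks drives by failureRisk and estimates the expected number of failures per year across a set of drives. The drive names and the 0.025 cutoff are illustrative assumptions. Summing per-drive AFR values gives an expected count only when each value is read as a probability, so values above 1.0 are capped here.

```python
def replacement_queue(risks, threshold=0.025):
    """Return (drive, risk) pairs at or above the threshold, riskiest first."""
    flagged = [(drive, risk) for drive, risk in risks.items() if risk >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def expected_annual_failures(risks):
    """Expected number of drives failing within a year.

    Each risk is treated as a per-drive annual failure probability and
    capped at 1.0, since the reported value can exceed 1.0.
    """
    return sum(min(risk, 1.0) for risk in risks.values())


# Illustrative values taken from the examples listed above.
risks = {"drive1": 0.0, "drive2": 0.025, "drive3": 0.30}
print(replacement_queue(risks))
print(expected_annual_failures(risks))
```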

Permissions

SMART data collection requires elevated privileges:

Root access, or
Linux capabilities: CAP_SYS_ADMIN and CAP_DAC_OVERRIDE

If permissions are insufficient, SMART data is omitted from responses without failing the request.
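One way to check whether a running process holds these capabilities is to decode the CapEff bitmask from its /proc/<pid>/status file. The sketch below takes the file contents as a string so it can be tried without a live process; the bit positions are the standard values from linux/capability.h, and the helper names are made up for this example.

```python
# Capability bit numbers from linux/capability.h.
CAP_DAC_OVERRIDE = 1
CAP_SYS_ADMIN = 21

REQUIRED = {"CAP_SYS_ADMIN": CAP_SYS_ADMIN, "CAP_DAC_OVERRIDE": CAP_DAC_OVERRIDE}


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("CapEff line not found")


def missing_capabilities(status_text):
    """Return the names of required capabilities absent from the mask."""
    mask = parse_cap_eff(status_text)
    return sorted(name for name, bit in REQUIRED.items() if not mask & (1 << bit))


# Example status excerpt with both capability bits set (0x200002).
example = "Name:\tminio\nCapEff:\t0000000000200002\n"
print(missing_capabilities(example))
```

On a live system the text would come from reading /proc/<pid>/status for the server process. A process running as root normally has all bits set.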

Distributed collection

In multi-node deployments, SMART data is collected from each node via the internal RPC framework. The coordinator node aggregates responses, providing cluster-wide visibility through a single API call.

Best practices

Enable SMART in monitoring dashboards - Track temperature and health trends over time.
Alert on warning status - Plan proactive replacements before failure occurs.
Alert immediately on critical - Drives may fail within days or weeks.
Monitor temperature - Sustained high temperatures accelerate wear.
Track reallocated sectors trend - An increasing count indicates active degradation.
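The trend check in the last item can be sketched as a comparison over a window of periodic samples of minio_system_drive_smart_reallocated_sectors. The function name and the idea of comparing only the first and last samples are assumptions chosen for brevity; a monitoring system would typically express this as a query over a time range instead.

```python
def reallocated_growth(samples):
    """Return how much the reallocated sector count grew across the window.

    `samples` is a chronologically ordered list of counts for one drive.
    A positive result suggests active degradation.
    """
    if len(samples) < 2:
        return 0  # not enough data to detect a trend
    return samples[-1] - samples[0]


# Illustrative daily samples for two drives.
print(reallocated_growth([4, 4, 4, 4]))
print(reallocated_growth([4, 6, 9, 15]))
```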

Troubleshooting

SMART data missing for some drives

Verify the MinIO AIStor process has the required capabilities.
Check that the drive supports SMART (virtual or cloud drives may not).
Ensure the drive is NVMe or SATA. SAS drives use a different protocol.

Temperature shows 0

NVMe: Temperature sensor not implemented on the controller.
SATA: Drive does not report the temperature attribute.

High failure risk but drive seems fine

SMART predicts probability, not certainty. Even healthy-seeming drives with bad sectors have elevated failure rates. Treat as an early warning, not a false positive.