Data Recovery
Distributed AIStor deployments rely on Erasure Coding to provide built-in tolerance for multiple drive or node failures. Depending on the deployment topology and the selected erasure code parity, AIStor can tolerate the loss of up to half the drives or nodes in the deployment while maintaining read access (“read quorum”) to objects.
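To make the quorum arithmetic concrete, the following sketch computes read and write quorum for a single hypothetical erasure set, using the common rule that a set of N drives with parity M retains read quorum with N - M healthy drives and write quorum with N - M drives (or N - M + 1 when M equals N/2). The function and its name are illustrative assumptions, not part of any AIStor API.

```python
# Illustrative quorum math for one erasure set of `set_size` drives with
# `parity` parity shards. Hypothetical helper, not an AIStor API.

def quorums(set_size: int, parity: int) -> tuple[int, int]:
    """Return (read_quorum, write_quorum) for a single erasure set.

    Read quorum equals the number of data shards (set_size - parity).
    Write quorum matches read quorum, except at maximum parity
    (parity == set_size // 2), where one extra drive is required to
    avoid split-brain writes.
    """
    if not 0 < parity <= set_size // 2:
        raise ValueError("parity must be between 1 and half the set size")
    read_quorum = set_size - parity
    write_quorum = read_quorum if parity < set_size // 2 else read_quorum + 1
    return read_quorum, write_quorum

# Example: a 16-drive erasure set with EC:4 parity tolerates 4 lost drives
# for both reads and writes.
print(quorums(16, 4))   # (12, 12)
print(quorums(16, 8))   # (8, 9) -- maximum parity needs one extra drive for writes
```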
The following table lists the typical types of failure in an AIStor deployment and summarizes how to recover from each:
| Failure Type | Description |
|---|---|
| Drive Failure | AIStor supports hot-swapping failed drives with new healthy drives. |
| Node Failure | AIStor detects when a node rejoins the deployment and shortly afterward begins proactively healing the data previously stored on that node. |
| Site Failure | AIStor Site Replication supports complete resynchronization of buckets, objects, and replication-eligible configuration settings after total site loss. |
Because AIStor can operate in a degraded state without significant performance loss, administrators can schedule hardware replacement in proportion to the rate of hardware failure. A “normal” failure rate (a single failed drive or node) may allow a relaxed replacement timeframe, while a “critical” failure rate (multiple failed drives or nodes) may require a faster response.
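One way to reason about that distinction is to compare the number of failed drives in an erasure set against its parity, as in the sketch below. The severity labels and thresholds are assumptions chosen for illustration, not AIStor-defined values.

```python
# Hypothetical severity classification for scheduling drive replacement.
# Labels and thresholds are illustrative assumptions, not AIStor values.

def failure_severity(failed_drives: int, parity: int) -> str:
    """Classify how urgently failed drives in one erasure set need replacing."""
    if failed_drives == 0:
        return "healthy"    # no action needed
    if failed_drives == 1:
        return "normal"     # schedule replacement at convenience
    if failed_drives < parity:
        return "elevated"   # replace soon; remaining tolerance is shrinking
    return "critical"       # at or past parity: one more loss risks read quorum

for failed in range(5):
    print(failed, failure_severity(failed, parity=4))
```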
Sometimes nodes contain one or more drives that are either partially failed or operating in a degraded state. Symptoms of such drives may include increasing drive errors, SMART warnings, timeouts in AIStor logs, and so on. You can safely unmount an unhealthy drive if the cluster has sufficient remaining healthy drives to maintain read and write quorum. Missing drives are less disruptive to the deployment than drives that are consistently producing read and write errors.
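Before unmounting a degraded drive, you can apply the same quorum reasoning: the sketch below checks whether taking one more drive offline in an erasure set would still leave both read and write quorum intact. It reuses the quorum rule shown earlier; the function is a hypothetical helper, not an AIStor command.

```python
# Hypothetical pre-check before unmounting a degraded drive: would the
# erasure set still meet quorum with one more drive offline?

def safe_to_unmount(set_size: int, parity: int, offline: int) -> bool:
    """Return True if taking one additional drive offline preserves
    both read and write quorum for the erasure set."""
    read_quorum = set_size - parity
    write_quorum = read_quorum + (1 if parity == set_size // 2 else 0)
    remaining = set_size - (offline + 1)  # drives left after unmounting one more
    return remaining >= max(read_quorum, write_quorum)

# Example: 16-drive set, EC:4 parity, one drive already offline -> still safe.
print(safe_to_unmount(16, 4, offline=1))  # True
# With four drives already offline, unmounting a fifth would break quorum.
print(safe_to_unmount(16, 4, offline=4))  # False
```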
If multiple drives within the same erasure set require healing, AIStor heals them one at a time: the first drive to begin healing completes before the next drive in the set starts. This allows each healing drive to complete the process as quickly as possible.
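This sequential behavior can be pictured as a per-erasure-set FIFO queue, as in the simplified model below. The code is an illustration of the ordering described above, not AIStor's internal healing scheduler.

```python
from collections import deque

# Simplified model of the healing order described above: within one
# erasure set, drives heal strictly one at a time, in queue order.
# Illustrative only; not AIStor's actual implementation.

def heal_erasure_set(set_name: str, drives_to_heal: list[str]) -> None:
    queue = deque(drives_to_heal)
    while queue:
        drive = queue.popleft()                     # first queued drive heals first
        print(f"[{set_name}] healing {drive} ...")  # stand-in for the heal itself
        # ...healing runs to completion here before the next drive starts...
        print(f"[{set_name}] {drive} healed")

heal_erasure_set("set-1", ["drive-3", "drive-7", "drive-11"])
```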