Data Recovery
Distributed AIStor deployments rely on Erasure Coding to provide built-in tolerance for multiple drive or node failures. Depending on the deployment topology and the selected erasure code parity, AIStor can tolerate the loss of up to half the drives or nodes in the deployment while maintaining read access (“read quorum”) to objects.
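To make the quorum arithmetic concrete, the following sketch computes read and write quorum for a single hypothetical erasure set, using the common rule that a set of N drives with parity M retains read quorum with N - M healthy drives and write quorum with N - M drives (or N - M + 1 when M equals N/2). The function and its name are illustrative assumptions, not part of any AIStor API.

```python
# Illustrative quorum math for one erasure set of `set_size` drives with
# `parity` parity shards. Hypothetical helper, not an AIStor API.

def quorums(set_size: int, parity: int) -> tuple[int, int]:
    """Return (read_quorum, write_quorum) for a single erasure set.

    Read quorum equals the number of data shards (set_size - parity).
    Write quorum matches read quorum, except at maximum parity
    (parity == set_size // 2), where one extra drive is required to
    avoid split-brain writes.
    """
    if not 0 < parity <= set_size // 2:
        raise ValueError("parity must be between 1 and half the set size")
    read_quorum = set_size - parity
    write_quorum = read_quorum if parity < set_size // 2 else read_quorum + 1
    return read_quorum, write_quorum

# Example: a 16-drive erasure set with EC:4 parity tolerates 4 lost drives
# for both reads and writes.
print(quorums(16, 4))   # (12, 12)
print(quorums(16, 8))   # (8, 9) -- maximum parity needs one extra drive for writes
```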
The following table lists the typical types of failure in an AIStor deployment and summarizes how to recover from each:
| Failure Type | Description |
|---|---|
| Drive Failure | AIStor supports hot-swapping failed drives with new healthy drives. |
| Node Failure | AIStor detects when a node rejoins the deployment and shortly afterward begins proactively healing the data previously stored on that node. |
| Site Failure | AIStor Site Replication supports complete resynchronization of buckets, objects, and replication-eligible configuration settings after total site loss. |
Because AIStor can operate in a degraded state without significant performance loss, administrators can schedule hardware replacement in proportion to the rate of hardware failure. A “normal” failure rate (a single failed drive or node) may allow a relaxed replacement timeframe, while a “critical” failure rate (multiple failed drives or nodes) may require a faster response.
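One way to reason about that distinction is to compare the number of failed drives in an erasure set against its parity, as in the sketch below. The severity labels and thresholds are assumptions chosen for illustration, not AIStor-defined values.

```python
# Hypothetical severity classification for scheduling drive replacement.
# Labels and thresholds are illustrative assumptions, not AIStor values.

def failure_severity(failed_drives: int, parity: int) -> str:
    """Classify how urgently failed drives in one erasure set need replacing."""
    if failed_drives == 0:
        return "healthy"    # no action needed
    if failed_drives == 1:
        return "normal"     # schedule replacement at convenience
    if failed_drives < parity:
        return "elevated"   # replace soon; remaining tolerance is shrinking
    return "critical"       # at or past parity: one more loss risks read quorum

for failed in range(5):
    print(failed, failure_severity(failed, parity=4))
```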
Sometimes nodes contain one or more drives that are either partially failed or operating in a degraded state. Symptoms of such drives may include increasing drive errors, SMART warnings, timeouts in AIStor logs, and so on. You can safely unmount an unhealthy drive if the cluster has sufficient remaining healthy drives to maintain read and write quorum. Missing drives are less disruptive to the deployment than drives that are consistently producing read and write errors.
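Before unmounting a degraded drive, you can apply the same quorum reasoning: the sketch below checks whether taking one more drive offline in an erasure set would still leave both read and write quorum intact. It reuses the quorum rule shown earlier; the function is a hypothetical helper, not an AIStor command.

```python
# Hypothetical pre-check before unmounting a degraded drive: would the
# erasure set still meet quorum with one more drive offline?

def safe_to_unmount(set_size: int, parity: int, offline: int) -> bool:
    """Return True if taking one additional drive offline preserves
    both read and write quorum for the erasure set."""
    read_quorum = set_size - parity
    write_quorum = read_quorum + (1 if parity == set_size // 2 else 0)
    remaining = set_size - (offline + 1)  # drives left after unmounting one more
    return remaining >= max(read_quorum, write_quorum)

# Example: 16-drive set, EC:4 parity, one drive already offline -> still safe.
print(safe_to_unmount(16, 4, offline=1))  # True
# With four drives already offline, unmounting a fifth would break quorum.
print(safe_to_unmount(16, 4, offline=4))  # False
```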
If multiple drives within the same erasure set require healing, AIStor heals them one at a time: the first drive to begin healing completes before the next drive in the set starts. This allows each healing drive to complete the process as quickly as possible.
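This sequential behavior can be pictured as a per-erasure-set FIFO queue, as in the simplified model below. The code is an illustration of the ordering described above, not AIStor's internal healing scheduler.

```python
from collections import deque

# Simplified model of the healing order described above: within one
# erasure set, drives heal strictly one at a time, in queue order.
# Illustrative only; not AIStor's actual implementation.

def heal_erasure_set(set_name: str, drives_to_heal: list[str]) -> None:
    queue = deque(drives_to_heal)
    while queue:
        drive = queue.popleft()                     # first queued drive heals first
        print(f"[{set_name}] healing {drive} ...")  # stand-in for the heal itself
        # ...healing runs to completion here before the next drive starts...
        print(f"[{set_name}] {drive} healed")

heal_erasure_set("set-1", ["drive-3", "drive-7", "drive-11"])
```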