Key Manager recovery on Linux

AIStor Key Manager supports recovery from both single-node failures and total cluster failure.

Single-node failure and recovery

For single-node failures, Key Manager requires at least one healthy node remaining in the cluster. You can restore any number of failed nodes from a single healthy node, as long as that node remains accessible until the cluster fully recovers.

This procedure requires running commands as the root user. The following steps remove a failed node from the cluster and rejoin it after recovery:

  1. Shut down the failed node and delete all state

    If the node has completely failed with data loss, you can skip to the next step.

    Removing state requires deleting all data at the Key Manager storage path. You do not need to remove configuration files, certificates, or other resources used by the minkms process.
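
    For example, on a systemd-managed host the cleanup might look like the following sketch. The minkms service name and the /var/lib/minkms storage path are assumptions; substitute the service name and storage path from your deployment:

    # Stop the minkms service before deleting state (assumed unit name)
    systemctl stop minkms

    # Delete all data at the Key Manager storage path (assumed path)
    rm -rf /var/lib/minkms/*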

  2. Remove the failed node from the cluster

    Use the minkms edit command to force-remove the node from the cluster. You can run minkms ls against a healthy node in the cluster to retrieve the list of node IDs.
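
    For example, assuming minkms ls accepts a server endpoint like the other commands in this procedure, the following lists the node IDs by querying the healthy node keymanager1.example.net:

    # List cluster nodes and their IDs via a healthy node
    export MINIO_KMS_API_KEY=k1:ROOT_API_KEY
    minkms ls https://keymanager1.example.net:7373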

    The following command removes the node with ID NODE-ID from a cluster where the specified host keymanager1.example.net remains healthy and available to process operations.

    # Authenticate to the cluster with the root API key
    export MINIO_KMS_API_KEY=k1:ROOT_API_KEY
    
    # Force-remove the failed node, identified by its node ID
    minkms edit https://keymanager1.example.net:7373 --rm NODE-ID
    

    Do not remove more than one failed node at a time with the minkms edit command.

  3. Restart the failed node with fresh state

    Restart the minkms process using a copy of the configuration file from an existing healthy node.

    If the node lost all data, you can follow the installation procedure to reinstall minkms and prepare the process to run.
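
    A minimal sketch of this step on a systemd-managed host follows. The /etc/minkms/config.yml path and the minkms service name are assumptions; adjust them for your deployment:

    # Copy the configuration file from a healthy node (assumed path)
    scp root@keymanager1.example.net:/etc/minkms/config.yml /etc/minkms/config.yml

    # Restart the minkms process with fresh state (assumed unit name)
    systemctl start minkms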

  4. Add the node back to the cluster

    Use the minkms add command to rejoin the node to the cluster. The following example re-adds the node at keymanager2.example.net to the cluster using the healthy node keymanager1.example.net:

    minkms add https://keymanager1.example.net:7373 --api-key k1:ROOT_API_KEY https://keymanager2.example.net:7373
    
  5. Monitor the cluster state

    Use the minkms stat command to monitor the cluster state and ensure the node rejoins successfully.
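
    For example, assuming minkms stat accepts the same endpoint and --api-key arguments as the other commands in this procedure, the following queries the cluster state through the healthy node keymanager1.example.net:

    # Check cluster state via a healthy node
    minkms stat https://keymanager1.example.net:7373 --api-key k1:ROOT_API_KEY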

Total cluster failure and recovery

You can rebuild a Key Manager cluster from a backup in the event of hardware failure, disaster, or another business continuity event. Key Manager requires creating a new single-node cluster to which you restore the backup snapshot. Once the node successfully starts up and resumes operations, you can scale the cluster back up to the target size.

This procedure requires running commands as the root user. The following steps rebuild a cluster from a backup snapshot:

  1. Start up minkms on a new host

    Follow the installation procedure to install and start minkms on a new host. Ensure that you use the same configuration options as the original cluster, including the same HSM keys used when creating the backup.
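
    For example, on a systemd-managed host the startup might look like the following sketch. The /backup/minkms/config.yml source, the /etc/minkms/config.yml destination, and the minkms service name are all assumptions; use the paths and service name from your deployment:

    # Restore the original cluster configuration from your backup (assumed paths)
    cp /backup/minkms/config.yml /etc/minkms/config.yml

    # Start minkms on the new host (assumed unit name)
    systemctl enable --now minkms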

    Once the minkms node is online and available, proceed to the next step.

  2. Restore the backup snapshot

    Use the minkms restore command to restore the backup snapshot to the new node. Run minkms help restore for additional usage guidance.

    The following example targets a new host keymanager1.example.net and restores from a snapshot BACKUP-FILE:

    minkms restore https://keymanager1.example.net:7373 --api-key k1:ROOT_API_KEY BACKUP-FILE
    

    Once the restore completes, verify the state of the cluster with the minkms stat command.
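
    For example, assuming minkms stat accepts the same endpoint and --api-key arguments as the other commands in this guide:

    # Verify the restored single-node cluster is healthy
    minkms stat https://keymanager1.example.net:7373 --api-key k1:ROOT_API_KEY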

  3. Scale the cluster back to production size

    Follow the scaling procedure to add new nodes to the cluster. Ensure each new node has a clean state before adding it to the existing cluster.

    For example, the following command adds a node at keymanager2.example.net to a cluster using a healthy node keymanager1.example.net:

    minkms add https://keymanager1.example.net:7373 --api-key k1:ROOT_API_KEY https://keymanager2.example.net:7373
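
    Repeat the minkms add command for each additional node until the cluster reaches its target size, then use minkms stat to confirm that every node has joined and the cluster is healthy.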