Key Manager recovery on Kubernetes

AIStor Key Manager supports recovery from both single-node failures and total cluster failure.

Single node failure and recovery

For single-node failures, Key Manager requires at least one healthy node remaining in the cluster. You can restore any number of failed nodes from a single healthy node, as long as that node remains accessible until the cluster is fully recovered.

On Kubernetes, a failed node typically presents as a pod that does not start or that has lost state due to underlying issues with a Persistent Volume. To restore the pod, you must modify the Helm chart's replica configuration to remove the downed pods.
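
For example, a crash-looping Key Manager pod might present as follows. The pod names shown are hypothetical and depend on your release configuration:

    kubectl get pods -n KEY-MANAGER-NAMESPACE

    NAME                  READY   STATUS             RESTARTS   AGE
    aistor-keymanager-0   1/1     Running            0          12d
    aistor-keymanager-1   1/1     Running            0          12d
    aistor-keymanager-2   0/1     CrashLoopBackOff   8          12d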

  1. Validate the current Helm chart configuration

    Use the helm get values RELEASE command to retrieve the user-supplied values applied to the release. Alternatively, reference the original values.yaml file if it is saved or stored in an accessible location.
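
    For example, you can save the values to the file referenced by the helm upgrade commands later in this procedure (the -o yaml flag requests plain YAML output):

    helm get values RELEASE -n KEY-MANAGER-NAMESPACE -o yaml > aistor-keymanager-values.yaml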

    Check the keyManager.replicas field:

    keyManager:
      # Other configurations omitted
      replicas: 3
    
  2. Modify the Helm chart to scale down the replica set

    Modify the replicas value to match the number of pods that remain healthy and online in the replica set. Use kubectl get all -n KEY-MANAGER-NAMESPACE to validate pod status before proceeding.

    keyManager:
      replicas: 2
    
  3. Update the Helm chart

    Use the helm upgrade command to apply the modified configuration to the release.

    helm upgrade RELEASE minio/aistor-keymanager \
      -n KEY-MANAGER-NAMESPACE \
      -f aistor-keymanager-values.yaml
    

    Use kubectl get all -n KEY-MANAGER-NAMESPACE to validate the status of the pods after updating the chart. Only the healthy pods should remain online and accessible.

    Use minkms stat to ensure the cluster state reflects only the currently healthy nodes.

  4. Restore the unhealthy worker nodes

    Perform the necessary operations to repair the worker nodes and associated storage infrastructure such that Kubernetes can successfully schedule and run Key Manager pods on those nodes.

    Check and clean any Persistent Volumes previously used by the Key Manager pods such that they contain no data. Depending on your configured storage class and choice of CSI, you may need to take additional steps to clean and present the Persistent Volumes for use.
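
    If the chart provisions storage through PersistentVolumeClaims, deleting the claim bound to the failed pod allows Kubernetes to provision a fresh volume at the next scale-up. The claim name below is hypothetical; list the actual names first:

    kubectl get pvc -n KEY-MANAGER-NAMESPACE
    # hypothetical claim name for the failed pod
    kubectl delete pvc data-aistor-keymanager-2 -n KEY-MANAGER-NAMESPACE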

  5. Scale the replica set to normal size

    Restore values.yaml to the previous values and update the chart:

    helm upgrade RELEASE minio/aistor-keymanager \
      -n KEY-MANAGER-NAMESPACE \
      -f aistor-keymanager-values.yaml
    

    Use kubectl get all -n KEY-MANAGER-NAMESPACE to validate the status of the pods after updating the chart. All pods should come online and become accessible.

    Use minkms stat to ensure the cluster state reflects all nodes in the cluster.

Total cluster failure and recovery

You can rebuild a Key Manager cluster from a backup in the event of hardware failure, disaster, or other business continuity events. Key Manager requires creating a new single-node cluster to which you restore the backup snapshot. Once the node successfully starts up and resumes operations, you can scale the cluster back up to the target size.

In Kubernetes, you must first deploy a new Key Manager cluster with a single replica. You can then restore the cluster state and scale up to full size. Ensure that the Kubernetes cluster has available worker nodes and associated storage to schedule all required Key Manager pods.

Backup Required
This procedure wipes all existing state, including any Persistent Volumes previously used by the cluster. Ensure a valid backup exists before deleting the PVs, or open a case in SUBNET for further guidance.

  1. Validate the current Helm chart configuration

    Use the helm get values RELEASE command to retrieve the user-supplied values applied to the release. Alternatively, reference the original values.yaml file if it is saved or stored in an accessible location.

    Check the keyManager.replicas field:

    keyManager:
      # Other configurations omitted
      replicas: 3
    
  2. Modify the Helm chart to scale down the replica set to 0

    Modify the replicas value to 0 to delete all pods and their state. Use kubectl get all -n KEY-MANAGER-NAMESPACE to validate the status before proceeding.

    keyManager:
      replicas: 0
    
  3. Update the Helm chart

    Use the helm upgrade command to apply the modified configuration to the release.

    helm upgrade RELEASE minio/aistor-keymanager \
      -n KEY-MANAGER-NAMESPACE \
      -f aistor-keymanager-values.yaml
    

    Use kubectl get all -n KEY-MANAGER-NAMESPACE to validate the status of the pods after updating the chart. No pods should remain online.
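
    The exact wording varies by kubectl version, but listing pods should return no results:

    kubectl get pods -n KEY-MANAGER-NAMESPACE

    No resources found in KEY-MANAGER-NAMESPACE namespace.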

  4. Restore the unhealthy worker nodes

    Perform the necessary operations to repair the worker nodes and associated storage infrastructure such that Kubernetes can successfully schedule and run Key Manager pods on those nodes.

    Check and clean any Persistent Volumes previously used by the Key Manager pods such that they contain no data. Depending on your configured storage class and choice of CSI, you may need to take additional steps to clean and present the Persistent Volumes for use.
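
    Since this procedure wipes all state, every claim used by the cluster can be removed. The label selector below is hypothetical; verify the labels your chart applies before deleting:

    kubectl get pvc -n KEY-MANAGER-NAMESPACE --show-labels
    # hypothetical label selector; confirm against your chart's labels
    kubectl delete pvc -n KEY-MANAGER-NAMESPACE -l app.kubernetes.io/name=aistor-keymanager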

  5. Scale the replica set to 1

    Change the keyManager.replicas field to 1:
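
    keyManager:
      replicas: 1

    Then update the chart: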

    helm upgrade RELEASE minio/aistor-keymanager \
      -n KEY-MANAGER-NAMESPACE \
      -f aistor-keymanager-values.yaml
    

    Use kubectl get all -n KEY-MANAGER-NAMESPACE to validate the status of the pods after updating the chart. A single new pod should come online and become accessible.

    Use minkms stat to ensure the cluster state reflects a single-node cluster.

  6. Restore the backup snapshot

    Use the minkms restore command to restore the backup snapshot to the new node. Use the inline CLI help (minkms help restore) for additional usage guidance.

    The following example targets a new host keymanager1.example.net and restores from a snapshot BACKUP-FILE. The example assumes the Key Manager cluster includes an ingress, route, or similar configuration that exposes the node or service at the specified hostname:

    minkms restore https://keymanager1.example.net:7373 --api-key ROOT-API-KEY BACKUP-FILE
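
    If the deployment does not expose such a hostname, kubectl port-forward can provide temporary access from a local machine. The service name below is a placeholder; confirm the actual name with kubectl get svc -n KEY-MANAGER-NAMESPACE, and note that TLS verification against localhost depends on your certificate setup:

    # placeholder service name; confirm with: kubectl get svc -n KEY-MANAGER-NAMESPACE
    kubectl port-forward -n KEY-MANAGER-NAMESPACE svc/aistor-keymanager 7373:7373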
    

    Once the restore completes, verify the state of the cluster by running the following commands:
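
    kubectl get all -n KEY-MANAGER-NAMESPACE
    minkms stat https://keymanager1.example.net:7373 --api-key ROOT-API-KEY

    The minkms stat invocation is a sketch that assumes the same endpoint and --api-key conventions as the restore example above; see minkms help stat for exact usage.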

  7. Scale the replica set to normal size

    Restore values.yaml to the previous values and update the chart:

    helm upgrade RELEASE minio/aistor-keymanager \
      -n KEY-MANAGER-NAMESPACE \
      -f aistor-keymanager-values.yaml
    

    Use kubectl get all -n KEY-MANAGER-NAMESPACE to validate the status of the pods after updating the chart. All pods should come online and become accessible.

    Use minkms stat to ensure the cluster state reflects all nodes in the cluster.