Manage Unhealthy Instances

If you have followed the Disable Unsafe Operations for Your EKS Cluster guide, you have suspended the ReplaceUnhealthy process. As a result, even if the Auto Scaling group (ASG) marks an instance as unhealthy, it will remain in service until you take manual action. This guide walks you through the manual actions required, depending on whether the failure was temporary or permanent.
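Before proceeding, you can confirm that the ReplaceUnhealthy process is indeed suspended on your cluster's ASGs. A quick check, assuming EKS_CLUSTER is set as in the Procedure steps below:

```shell
# Sketch: list the suspended processes of the ASGs tagged with your cluster.
# ReplaceUnhealthy should appear in the output if you followed the
# Disable Unsafe Operations for Your EKS Cluster guide.
aws autoscaling describe-auto-scaling-groups \
    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    --query 'AutoScalingGroups[].SuspendedProcesses[].ProcessName' \
    --output text
```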

Procedure

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. Restore the required context from previous sections:

    root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
    root@rok-tools:~/ops/deployments# export EKS_CLUSTER
  3. Inspect the health status of your instances:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    >     --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    >     --query AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus] \
    >     --output text
    i-03696c6a5abe28646 InService Healthy
    i-07898559e258823c8 InService Unhealthy
  4. Specify the instance to operate on:

    root@rok-tools:~/ops/deployments# export INSTANCE=<INSTANCE>

    Replace <INSTANCE> with the instance ID. For example:

    root@rok-tools:~/ops/deployments# export INSTANCE=i-07898559e258823c8
  5. Choose one of the following options based on whether the failure was temporary or permanent.

    In case of a temporary failure, for example a system crash that froze the node until it eventually rebooted, the EC2 instance can be considered healthy again, and EC2 will report it as such.
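Before resetting the ASG health status, you can confirm that the EC2 status checks are passing again. A quick check, assuming INSTANCE is set as in the previous step:

```shell
# Sketch: inspect the EC2 system and instance status checks for the node.
# Both should report "ok" once the instance has recovered from the reboot.
aws ec2 describe-instance-status \
    --instance-ids ${INSTANCE?} \
    --query 'InstanceStatuses[].[InstanceId,SystemStatus.Status,InstanceStatus.Status]' \
    --output text
```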

    Reset the health status of your instance manually:

    root@rok-tools:~/ops/deployments# aws autoscaling set-instance-health \
    >     --health-status Healthy \
    >     --instance-id ${INSTANCE?}

    In case of a permanent failure, for example, a corrupted file system, the node must be replaced.

    Important

    In such cases, you will lose data. If you have set up snapshot policies for backups, you will be able to restore your volumes from the latest available snapshot.
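If you need to fall back to a snapshot, you can first list the snapshots available for the volumes attached to the failed instance. A sketch, assuming INSTANCE is set as in the previous step:

```shell
# Sketch: find the EBS volumes attached to the failed instance, then list
# the snapshots of each, so you can pick the latest one to restore from.
for VOLUME in $(aws ec2 describe-instances \
    --instance-ids ${INSTANCE?} \
    --query 'Reservations[].Instances[].BlockDeviceMappings[].Ebs.VolumeId' \
    --output text); do
    aws ec2 describe-snapshots \
        --filters Name=volume-id,Values=${VOLUME} \
        --query 'Snapshots[].[SnapshotId,StartTime,State]' \
        --output text
done
```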

    Terminate your instance:

    root@rok-tools:~/ops/deployments# aws autoscaling terminate-instance-in-auto-scaling-group \
    >     --no-should-decrement-desired-capacity \
    >     --instance-id ${INSTANCE?}

    The ASG will then detect that the desired capacity is greater than the actual size and will launch a new instance.
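You can follow the replacement as it happens by inspecting the ASG's recent scaling activities. A sketch, assuming EKS_CLUSTER is set as in the steps above and that the cluster tag matches a single ASG:

```shell
# Sketch: look up the ASG name via the cluster tag, then list its most
# recent scaling activities. After the termination you triggered, a
# "Launching a new EC2 instance" activity should appear.
ASG_NAME=$(aws autoscaling describe-auto-scaling-groups \
    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    --query 'AutoScalingGroups[0].AutoScalingGroupName' \
    --output text)

aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "${ASG_NAME}" \
    --max-items 5 \
    --query 'Activities[].[StartTime,Description,StatusCode]' \
    --output text
```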

Verify

  1. Ensure that all instances associated with your cluster are InService and Healthy:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    >     --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    >     --query AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus] \
    >     --output text
    i-03696c6a5abe28646 InService Healthy
    i-07898559e258823c8 InService Healthy
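Optionally, you can also confirm that the recovered or replacement node has rejoined the cluster. A quick check, assuming kubectl is configured against your EKS cluster:

```shell
# Sketch: all nodes should report a Ready status once they have joined
# the cluster and the kubelet is healthy.
kubectl get nodes
```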

Summary

You have successfully managed the unhealthy instances of your EKS cluster.

What’s Next

Check out the rest of the EKS maintenance operations that you can perform on your cluster.