Manage Unhealthy Instances

If you have followed the Disable Unsafe Operations for Your EKS Cluster guide, you have suspended the ReplaceUnhealthy process. As a result, even if the Auto Scaling group (ASG) marks an instance as unhealthy, it will remain in service until you take manual action. This guide walks you through the manual actions required, depending on whether the failure was temporary or permanent.
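Before proceeding, you can confirm that the ReplaceUnhealthy process is indeed suspended on your cluster's ASGs. A quick check, assuming EKS_CLUSTER is set as in the Procedure steps below:

```shell
# Sketch: list the suspended processes of the ASGs tagged with your cluster.
# ReplaceUnhealthy should appear in the output if you followed the
# Disable Unsafe Operations for Your EKS Cluster guide.
aws autoscaling describe-auto-scaling-groups \
    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    --query 'AutoScalingGroups[].SuspendedProcesses[].ProcessName' \
    --output text
```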

Procedure

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. Restore the required context from previous sections:

    root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
    root@rok-tools:~/ops/deployments# export EKS_CLUSTER
  3. Inspect the health status of your instances:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    >     --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    >     --query AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus] \
    >     --output text
    i-03696c6a5abe28646 InService Healthy
    i-07898559e258823c8 InService Unhealthy
  4. Specify the instance to operate on:

    root@rok-tools:~/ops/deployments# export INSTANCE=<INSTANCE>

    Replace <INSTANCE> with the instance ID. For example:

    root@rok-tools:~/ops/deployments# export INSTANCE=i-07898559e258823c8
  5. Choose one of the following options based on whether the failure was temporary or permanent.

    In case of a temporary failure, for example a system crash that froze the node until it eventually rebooted, the EC2 instance can be considered healthy again, and EC2 will report it as such.
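Before resetting the ASG health status, you can confirm that the EC2 status checks are passing again. A quick check, assuming INSTANCE is set as in the previous step:

```shell
# Sketch: inspect the EC2 system and instance status checks for the node.
# Both should report "ok" once the instance has recovered from the reboot.
aws ec2 describe-instance-status \
    --instance-ids ${INSTANCE?} \
    --query 'InstanceStatuses[].[InstanceId,SystemStatus.Status,InstanceStatus.Status]' \
    --output text
```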

    Reset the health status of your instance manually:

    root@rok-tools:~/ops/deployments# aws autoscaling set-instance-health \
    >     --health-status Healthy \
    >     --instance-id ${INSTANCE?}

    In case of a permanent failure, for example, a corrupted file system, the node must be replaced.

    Important

    In such cases, you will lose data. If you have set up snapshot policies for backups, you will be able to restore your volumes from the latest available snapshot.
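If you need to fall back to a snapshot, you can first list the snapshots available for the volumes attached to the failed instance. A sketch, assuming INSTANCE is set as in the previous step:

```shell
# Sketch: find the EBS volumes attached to the failed instance, then list
# the snapshots of each, so you can pick the latest one to restore from.
for VOLUME in $(aws ec2 describe-instances \
    --instance-ids ${INSTANCE?} \
    --query 'Reservations[].Instances[].BlockDeviceMappings[].Ebs.VolumeId' \
    --output text); do
    aws ec2 describe-snapshots \
        --filters Name=volume-id,Values=${VOLUME} \
        --query 'Snapshots[].[SnapshotId,StartTime,State]' \
        --output text
done
```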

    Terminate your instance:

    root@rok-tools:~/ops/deployments# aws autoscaling terminate-instance-in-auto-scaling-group \
    >     --no-should-decrement-desired-capacity \
    >     --instance-id ${INSTANCE?}

    The ASG will then detect that the desired capacity is greater than the actual size and will launch a new instance.
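You can follow the replacement as it happens by inspecting the ASG's recent scaling activities. A sketch, assuming EKS_CLUSTER is set as in the steps above and that the cluster tag matches a single ASG:

```shell
# Sketch: look up the ASG name via the cluster tag, then list its most
# recent scaling activities. After the termination you triggered, a
# "Launching a new EC2 instance" activity should appear.
ASG_NAME=$(aws autoscaling describe-auto-scaling-groups \
    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    --query 'AutoScalingGroups[0].AutoScalingGroupName' \
    --output text)

aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "${ASG_NAME}" \
    --max-items 5 \
    --query 'Activities[].[StartTime,Description,StatusCode]' \
    --output text
```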

Verify

  1. Ensure that all instances associated with your cluster are InService and Healthy:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    >     --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    >     --query AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus] \
    >     --output text
    i-03696c6a5abe28646 InService Healthy
    i-07898559e258823c8 InService Healthy
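Optionally, you can also confirm that the recovered or replacement node has rejoined the cluster. A quick check, assuming kubectl is configured against your EKS cluster:

```shell
# Sketch: all nodes should report a Ready status once they have joined
# the cluster and the kubelet is healthy.
kubectl get nodes
```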

Summary

You have successfully managed the unhealthy instances of your EKS cluster.

What’s Next

Check out the rest of the EKS maintenance operations that you can perform on your cluster.