Disable Unsafe Operations for Your EKS Cluster

Since Rok uses local NVMe disks to store user data, terminating or replacing a node before properly draining it results in data loss. For example, if you scale down the node group directly via the Auto Scaling group, AWS will eventually remove the instance after about 15 minutes, regardless of whether a drain operation has taken place.

This guide describes the actions that you, as the administrator, must take to protect your EKS cluster from data loss. Specifically, it will walk you through

  • enabling scale-in protection.
  • suspending Auto Scaling processes that would result in a node termination.

Warning

If an EC2 instance (EKS worker node) terminates unexpectedly, you will lose data. As such, you should avoid the following actions:

  • Decrementing the desired size of the Auto Scaling group.
  • Terminating an EC2 instance directly from the console.
  • Deleting a whole node group.
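If you do need to remove a node intentionally, drain it first, as the introduction notes, so that its pods are evicted gracefully before the instance goes away. A minimal sketch, assuming bash and kubectl; the helper name and the node name in the usage example are hypothetical:

```shell
# drain_node: cordon and drain a worker node before any intentional removal
# (hypothetical helper name). Draining first avoids losing data on the
# node's local NVMe disks.
drain_node() {
    local node=$1
    # Mark the node unschedulable so no new pods land on it.
    kubectl cordon "$node"
    # Evict the node's pods gracefully; DaemonSet pods cannot be evicted.
    kubectl drain "$node" --ignore-daemonsets
}
```

For example: `drain_node ip-10-0-1-23.ec2.internal` (hypothetical node name).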

Procedure

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. Restore the required context from previous sections:

    root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
    root@rok-tools:~/ops/deployments# export EKS_CLUSTER
  3. List the Auto Scaling groups of your EKS cluster:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    > --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    > --query AutoScalingGroups[].[AutoScalingGroupName] \
    > --output text
    eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b
    eks-aebc1fd1-3b78-8761-606e-ca8502549661
  4. Repeat the steps below for each of the Auto Scaling groups in the list shown above.

    1. Specify the ASG to operate on:

      root@rok-tools:~/ops/deployments# export ASG=<ASG>

      Replace <ASG> with the name of your Auto Scaling group. For example:

      root@rok-tools:~/ops/deployments# export ASG=eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b
    2. Enable scale-in protection at the ASG level for new instances:

      root@rok-tools:~/ops/deployments# aws autoscaling update-auto-scaling-group \
      > --auto-scaling-group-name ${ASG?} \
      > --new-instances-protected-from-scale-in
    3. Enable scale-in protection at the instance level for existing instances:

      root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
      > --auto-scaling-group-name ${ASG?} \
      > --query AutoScalingGroups[].Instances[].InstanceId \
      > --output text \
      > | xargs aws autoscaling set-instance-protection \
      > --auto-scaling-group-name ${ASG} \
      > --protected-from-scale-in \
      > --instance-ids
    4. Suspend any unsafe Auto Scaling processes:

      root@rok-tools:~/ops/deployments# aws autoscaling suspend-processes \
      > --auto-scaling-group-name ${ASG?} \
      > --scaling-processes ReplaceUnhealthy AZRebalance InstanceRefresh

      Note

      The Auto Scaling processes above are considered unsafe since they may cause ungraceful node termination. Specifically:

      • ReplaceUnhealthy automatically replaces EC2 instances whose status checks have failed. With this process suspended, unhealthy EC2 instances will remain in-service and will require manual action. See the Manage Unhealthy Instances guide for more information.
      • AZRebalance automatically rebalances your instances across existing Availability Zones. Since EKF uses EBS volumes, we recommend using node groups that span a single AZ, so suspending this process should not make a difference.
      • InstanceRefresh performs a rolling replacement of all or some instances in your Auto Scaling group. This might be useful when you want to update the launch template configuration of the node group. However, we do not recommend using this feature for upgrades; follow the Upgrade EKS Node Group guide instead.
    5. Go back to step 1 and repeat this process for the remaining Auto Scaling groups.
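If your cluster has many node groups, you can script the per-group steps above instead of repeating them by hand. A minimal sketch in bash; the helper name `protect_asg` is hypothetical, while the AWS CLI calls are exactly the ones shown in the steps above:

```shell
# protect_asg: apply scale-in protection and process suspension to a single
# Auto Scaling group (hypothetical helper name).
protect_asg() {
    local asg=$1
    # Protect instances the group launches from now on.
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "$asg" \
        --new-instances-protected-from-scale-in
    # Protect the instances that are already running.
    aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-name "$asg" \
        --query 'AutoScalingGroups[].Instances[].InstanceId' \
        --output text \
        | xargs aws autoscaling set-instance-protection \
            --auto-scaling-group-name "$asg" \
            --protected-from-scale-in \
            --instance-ids
    # Suspend the processes that can terminate nodes ungracefully.
    aws autoscaling suspend-processes \
        --auto-scaling-group-name "$asg" \
        --scaling-processes ReplaceUnhealthy AZRebalance InstanceRefresh
}
```

You could then call `protect_asg` once per group, feeding it the names produced by the `describe-auto-scaling-groups` listing in step 3.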

Verify

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. Restore the required context from previous sections:

    root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
    root@rok-tools:~/ops/deployments# export EKS_CLUSTER
  3. Ensure that all Auto Scaling groups associated with your cluster have scale-in protection enabled:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    > --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    > --query AutoScalingGroups[].[AutoScalingGroupName,NewInstancesProtectedFromScaleIn] \
    > --output text
    eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b    True
    eks-aebc1fd1-3b78-8761-606e-ca8502549661    True
  4. Ensure that all running Auto Scaling instances of your cluster have scale-in protection enabled:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    > --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    > --query AutoScalingGroups[].Instances[].[InstanceId,ProtectedFromScaleIn] \
    > --output text
    i-03696c6a5abe28646    True
    i-07898559e258823c8    True
  5. Ensure that you have suspended the ReplaceUnhealthy, AZRebalance, and InstanceRefresh Auto Scaling processes for all Auto Scaling groups of your EKS cluster:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    > --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    > --query AutoScalingGroups[].[AutoScalingGroupName,SuspendedProcesses[].ProcessName] \
    > --output text | paste - - | column -t
    eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b  ReplaceUnhealthy  AZRebalance  InstanceRefresh
    eks-aebc1fd1-3b78-8761-606e-ca8502549661  ReplaceUnhealthy  AZRebalance  InstanceRefresh
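If you want a check that fails loudly instead of eyeballing the output, the verification query for scale-in protection can be wrapped in a helper. A minimal sketch; the helper name `asg_is_protected` is hypothetical, and the query mirrors the verification command above:

```shell
# asg_is_protected: return non-zero if the given Auto Scaling group does not
# have scale-in protection enabled for new instances (hypothetical helper name).
asg_is_protected() {
    local asg protected
    asg=$1
    protected=$(aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-name "$asg" \
        --query 'AutoScalingGroups[].NewInstancesProtectedFromScaleIn' \
        --output text)
    [ "$protected" = "True" ]
}
```

For example: `asg_is_protected <ASG> || echo "<ASG> is NOT protected"`.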

Summary

You have successfully enabled scale-in protection and suspended any unsafe Auto Scaling processes for the Auto Scaling groups of your EKS cluster.

What’s Next

The next step is to install Rok.