Scale Out EKS Cluster¶
Overview¶
EKF supports automatic scaling operations on the Kubernetes cluster using a modified version of the Cluster Autoscaler and a custom Scheduler that supports storage capacity tracking for Rok volumes. If a Pod becomes unschedulable due to insufficient resources (CPU, RAM), the Cluster Autoscaler will automatically trigger a scale-out, that is, it will increase the desired size of the underlying Auto Scaling group (ASG) and, eventually, add a new Kubernetes node.
This guide will walk you through manually scaling out your EKS cluster by resizing the underlying node groups.
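For context, here is a rough sketch of how you could watch this happen from your management environment. It assumes the Cluster Autoscaler runs as a Deployment named cluster-autoscaler in the kube-system namespace, which may differ in your deployment:

   # Rough sketch, not part of the procedure below. The Deployment name
   # "cluster-autoscaler" and the kube-system namespace are assumptions.
   kubectl -n kube-system get deployment cluster-autoscaler
   kubectl -n kube-system logs deployment/cluster-autoscaler --tail=20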
What You’ll Need¶
- A configured management environment.
- An existing EKS cluster.
- A working Cluster Autoscaler.
- One or more managed or self-managed node groups. (A quick prerequisite check is sketched right after this list.)
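If you want to confirm these prerequisites from your management environment, the following rough sketch may help; it assumes EKS_CLUSTER is already exported, as in the previous sections:

   # Rough prerequisite check, assuming EKS_CLUSTER is already set.
   aws eks describe-cluster \
       --name ${EKS_CLUSTER?} \
       --query cluster.status \
       --output text                 # expect ACTIVE
   kubectl get nodes                 # the cluster is reachable from rok-tools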
Procedure¶
1. Go to your GitOps repository, inside your rok-tools management environment:

   root@rok-tools:~# cd ~/ops/deployments

2. Restore the required context from previous sections:

   root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
   root@rok-tools:~/ops/deployments# export EKS_CLUSTER

3. List the node groups of your EKS cluster. Choose one of the following options based on your node group type.

   Option 1: Managed node groups

   root@rok-tools:~/ops/deployments# aws eks list-nodegroups \
   >     --cluster-name ${EKS_CLUSTER?} \
   >     --query nodegroups[] \
   >     --output text \
   >     | xargs -n1
   general-workers
   gpu-workers

   Option 2: Self-managed node groups

   root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
   >     --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
   >     --query AutoScalingGroups[].[AutoScalingGroupName] \
   >     --output text
   arrikto-cluster-general-workers-NodeGroup-1R0C671TNUV2L
   arrikto-cluster-gpu-workers-NodeGroup-1VK2KMJZQK45T
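   If you are unsure which Auto Scaling group backs which node group, one optional way to check is to inspect the group's tags. This is a rough sketch, and <ASG> below is a placeholder for a group name from the list above:

   # Optional sketch: show the tags of an Auto Scaling group to confirm
   # it belongs to your cluster. <ASG> is a placeholder, not a real name.
   aws autoscaling describe-tags \
       --filters Name=auto-scaling-group,Values=<ASG> \
       --query "Tags[].[Key,Value]" \
       --output text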
4. Specify the node group you want to scale out. Choose one of the following options based on your node group type.

   Option 1: Managed node groups

   Select a node group from the list shown above:

   root@rok-tools:~/ops/deployments# export NODEGROUP=<NODEGROUP>

   Replace <NODEGROUP> with the node group name. For example:

   root@rok-tools:~/ops/deployments# export NODEGROUP=general-workers

   Option 2: Self-managed node groups

   Select a node group from the list shown above:

   root@rok-tools:~/ops/deployments# export ASG=<ASG>

   Replace <ASG> with the Auto Scaling group name. For example:

   root@rok-tools:~/ops/deployments# export ASG=arrikto-cluster-general-workers-NodeGroup-1R0C671TNUV2L
5. Inspect the current scaling configuration of your node group. Choose based on your node group type.

   Option 1: Managed node groups

   Inspect the scaling configuration details:

   root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
   >     --cluster-name ${EKS_CLUSTER?} \
   >     --nodegroup-name ${NODEGROUP?} \
   >     --query nodegroup.scalingConfig
   {
       "minSize": 0,
       "maxSize": 3,
       "desiredSize": 1
   }

   Obtain the current max size:

   root@rok-tools:~/ops/deployments# export MAX=$(aws eks describe-nodegroup \
   >     --cluster-name ${EKS_CLUSTER?} \
   >     --nodegroup-name ${NODEGROUP?} \
   >     --query nodegroup.scalingConfig.maxSize)

   Option 2: Self-managed node groups

   Inspect the scaling configuration details:

   root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
   >     --auto-scaling-group-name ${ASG?} \
   >     --query AutoScalingGroups[].[MinSize,MaxSize,DesiredCapacity] \
   >     --output text
   0       1       1

   Obtain the current max size:

   root@rok-tools:~/ops/deployments# export MAX=$(aws autoscaling describe-auto-scaling-groups \
   >     --auto-scaling-group-name ${ASG?} \
   >     --query AutoScalingGroups[].MaxSize \
   >     --output text)
6. Specify the new desired size:

   root@rok-tools:~/ops/deployments# export DESIRED=<SIZE>

   Replace <SIZE> with the desired number of nodes. For example:

   root@rok-tools:~/ops/deployments# export DESIRED=3

7. Specify the new max size so that it is greater than or equal to the new desired size:

   root@rok-tools:~/ops/deployments# MAX=$(( DESIRED > MAX ? DESIRED : MAX ))
8. Update the scaling config of your node group. Choose based on your node group type.

   Option 1: Managed node groups

   root@rok-tools:~/ops/deployments# aws eks update-nodegroup-config \
   >     --cluster-name ${EKS_CLUSTER?} \
   >     --nodegroup-name ${NODEGROUP?} \
   >     --scaling-config maxSize=${MAX?},desiredSize=${DESIRED?}

   Troubleshooting

   InvalidParameterException

   The command fails with:

      An error occurred (InvalidParameterException) when calling the
      UpdateNodegroupConfig operation: desired capacity 4 can't be greater
      than max size 3

   Make sure the desired size is less than or equal to the maximum size.
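   One way to recover, as a rough sketch that assumes EKS_CLUSTER, NODEGROUP, and DESIRED are still exported, is to re-derive MAX so that it is at least as large as DESIRED and then re-run the update command above:

   # Rough sketch: re-read the node group's current maxSize, make sure
   # MAX >= DESIRED, and then re-run the update command above.
   MAX=$(aws eks describe-nodegroup \
       --cluster-name ${EKS_CLUSTER?} \
       --nodegroup-name ${NODEGROUP?} \
       --query nodegroup.scalingConfig.maxSize)
   MAX=$(( DESIRED > MAX ? DESIRED : MAX ))
   echo "DESIRED=${DESIRED?} MAX=${MAX?}"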
   Option 2: Self-managed node groups

   root@rok-tools:~/ops/deployments# aws autoscaling update-auto-scaling-group \
   >     --auto-scaling-group-name ${ASG?} \
   >     --desired-capacity ${DESIRED?} \
   >     --max-size ${MAX?}

   Troubleshooting

   ValidationError

   The command fails with:

      An error occurred (ValidationError) when calling the
      UpdateAutoScalingGroup operation: Desired capacity:2 must be between
      the specified min size:0 and max size:1

   Make sure the desired size is less than or equal to the maximum size.
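   A possible recovery, as a rough sketch that assumes ASG and DESIRED are still exported, is to re-read the group's current bounds, confirm the desired size falls between them, and then re-run the update command above:

   # Rough sketch: re-read the group's min and max size, raise MAX if the
   # desired size exceeds it, and then re-run the update command above.
   read MIN MAX <<< "$(aws autoscaling describe-auto-scaling-groups \
       --auto-scaling-group-name ${ASG?} \
       --query 'AutoScalingGroups[].[MinSize,MaxSize]' \
       --output text)"
   MAX=$(( DESIRED > MAX ? DESIRED : MAX ))
   echo "MIN=${MIN?} DESIRED=${DESIRED?} MAX=${MAX?}"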
Verify¶
1. Ensure that your node group has scaled out. Choose based on your node group type.

   Option 1: Managed node groups

   Ensure that your node group is ACTIVE:

   root@rok-tools:~# aws eks describe-nodegroup \
   >     --cluster-name ${EKS_CLUSTER?} \
   >     --nodegroup-name ${NODEGROUP?} \
   >     --query nodegroup.status \
   >     --output text
   ACTIVE

   Option 2: Self-managed node groups

   Ensure that your Auto Scaling group reports all of its instances as InService and Healthy:

   root@rok-tools:~# aws autoscaling describe-auto-scaling-groups \
   >     --auto-scaling-group-name ${ASG?} \
   >     --query AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus] \
   >     --output text
   i-03696c6a5abe28646     InService       Healthy
   i-07898559e258823c8     InService       Healthy
   i-0f992f0b02d777900     InService       Healthy

2. Ensure that your node group has the expected size. Choose based on your node group type.

   Option 1: Managed node groups

   root@rok-tools:~# aws eks describe-nodegroup \
   >     --cluster-name ${EKS_CLUSTER?} \
   >     --nodegroup-name ${NODEGROUP?} \
   >     --query nodegroup.scalingConfig.desiredSize
   3

   Option 2: Self-managed node groups

   root@rok-tools:~# aws autoscaling describe-auto-scaling-groups \
   >     --auto-scaling-group-name ${ASG?} \
   >     --query AutoScalingGroups[].DesiredCapacity \
   >     --output text
   3

3. Ensure that your Kubernetes cluster has scaled out, and the new nodes have joined the cluster:

   root@rok-tools:~# kubectl get nodes
   NAME                                                STATUS   ROLES    AGE    VERSION
   ip-192-168-147-137.eu-central-1.compute.internal    Ready    <none>   113m   v1.21.5-eks-bc4871b
   ip-192-168-157-224.eu-central-1.compute.internal    Ready    <none>   113m   v1.21.5-eks-bc4871b
   ip-192-168-164-184.eu-central-1.compute.internal    Ready    <none>   1m     v1.21.5-eks-bc4871b
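Optionally, as a rough follow-up sketch, you can wait for every node to report Ready and confirm that no Pods are still stuck in Pending for lack of capacity:

   # Optional sketch: wait for all nodes to become Ready, then check that
   # no Pods remain unschedulable. Adjust the timeout to your environment.
   kubectl wait --for=condition=Ready nodes --all --timeout=300s
   kubectl get pods --all-namespaces --field-selector=status.phase=Pending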
What’s Next¶
Check out the rest of the EKS maintenance operations that you can perform on your cluster.