Scale Out EKS Cluster

If a Pod becomes unschedulable due to insufficient resources (CPU, RAM), the Cluster Autoscaler automatically triggers a scale-out: it increases the desired size of the Auto Scaling group (ASG) that backs the node group, and the ASG then launches a new instance that joins the cluster as a new Kubernetes node.
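To see which Pods the Cluster Autoscaler would react to, you can list the Pending Pods whose `PodScheduled` condition reports the reason `Unschedulable`. This is a minimal sketch, assuming bash and a kubectl context that points at your cluster; `list_unschedulable_pods` is a hypothetical helper name, not part of rok-tools:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print <namespace>/<name> for every Pod that the
# scheduler has marked as Unschedulable.
list_unschedulable_pods() {
  # Emit "<namespace>/<name><TAB><PodScheduled reason>" for each Pending Pod,
  # then keep only the rows whose reason is exactly "Unschedulable".
  kubectl get pods --all-namespaces \
    --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}{end}' \
    | awk -F'\t' '$2 == "Unschedulable" { print $1 }'
}
```

Pending Pods that are merely pulling images have already been scheduled, so the awk filter leaves them out.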

EKF supports automatic scaling operations on the Kubernetes cluster using a modified version of the Cluster Autoscaler and a custom Scheduler that supports storage capacity tracking for Rok volumes.

This guide will walk you through manually scaling out your EKS cluster by resizing its underlying node groups.

What You’ll Need

Procedure

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. Restore the required context from previous sections:

    root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
    root@rok-tools:~/ops/deployments# export EKS_CLUSTER
  3. List the node groups of your EKS cluster. Choose one of the following options based on your node group type.

    For managed node groups:

    root@rok-tools:~/ops/deployments# aws eks list-nodegroups \
    > --cluster-name ${EKS_CLUSTER?} \
    > --query "nodegroups[]" \
    > --output text \
    > | xargs -n1
    general-workers
    gpu-workers

    For self-managed node groups:

    root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
    > --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
    > --query "AutoScalingGroups[].[AutoScalingGroupName]" \
    > --output text
    arrikto-cluster-general-workers-NodeGroup-1R0C671TNUV2L
    arrikto-cluster-gpu-workers-NodeGroup-1VK2KMJZQK45T
  4. Specify the node group you want to scale out. Choose one of the following options based on your node group type.

    For managed node groups, select a node group from the list shown above:

    root@rok-tools:~/ops/deployments# export NODEGROUP=<NODEGROUP>

    Replace <NODEGROUP> with the node group name. For example:

    root@rok-tools:~/ops/deployments# export NODEGROUP=general-workers

    For self-managed node groups, select an Auto Scaling group from the list shown above:

    root@rok-tools:~/ops/deployments# export ASG=<ASG>

    Replace <ASG> with the Auto Scaling group name. For example:

    root@rok-tools:~/ops/deployments# export ASG=arrikto-cluster-general-workers-NodeGroup-1R0C671TNUV2L
  5. Inspect the current scaling configuration of your node group. Choose one of the following options based on your node group type.

    For managed node groups:

    1. Inspect the scaling configuration details:

      root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
      > --cluster-name ${EKS_CLUSTER?} \
      > --nodegroup-name ${NODEGROUP?} \
      > --query nodegroup.scalingConfig
      {
          "minSize": 0,
          "maxSize": 3,
          "desiredSize": 1
      }
    2. Obtain the current max size:

      root@rok-tools:~/ops/deployments# export MAX=$(aws eks describe-nodegroup \
      > --cluster-name ${EKS_CLUSTER?} \
      > --nodegroup-name ${NODEGROUP?} \
      > --query nodegroup.scalingConfig.maxSize)
    For self-managed node groups:

    1. Inspect the scaling configuration details:

      root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
      > --auto-scaling-group-names ${ASG?} \
      > --query "AutoScalingGroups[].[MinSize,MaxSize,DesiredCapacity]" \
      > --output text
      0 1 1
    2. Obtain the current max size:

      root@rok-tools:~/ops/deployments# export MAX=$(aws autoscaling describe-auto-scaling-groups \
      > --auto-scaling-group-names ${ASG?} \
      > --query "AutoScalingGroups[].MaxSize" \
      > --output text)
  6. Specify the new desired size:

    root@rok-tools:~/ops/deployments# export DESIRED=<SIZE>

    Replace <SIZE> with the desired number of nodes. For example:

    root@rok-tools:~/ops/deployments# export DESIRED=3
  7. Set the new max size to the larger of the current max size and the new desired size, so that the desired size never exceeds the max size:

    root@rok-tools:~/ops/deployments# MAX=$(( DESIRED > MAX ? DESIRED : MAX ))
  8. Update the scaling configuration of your node group. Choose one of the following options based on your node group type.

    For managed node groups:

    root@rok-tools:~/ops/deployments# aws eks update-nodegroup-config \
    > --cluster-name ${EKS_CLUSTER?} \
    > --nodegroup-name ${NODEGROUP?} \
    > --scaling-config maxSize=${MAX?},desiredSize=${DESIRED?}

    Troubleshooting

    InvalidParameterException

    The command fails with:

    An error occurred (InvalidParameterException) when calling the UpdateNodegroupConfig operation: desired capacity 4 can't be greater than max size 3

    Make sure the desired size is less than or equal to the maximum size.

    For self-managed node groups:

    root@rok-tools:~/ops/deployments# aws autoscaling update-auto-scaling-group \
    > --auto-scaling-group-name ${ASG?} \
    > --desired-capacity ${DESIRED?} \
    > --max-size ${MAX?}

    Troubleshooting

    ValidationError

    The command fails with:

    An error occurred (ValidationError) when calling the UpdateAutoScalingGroup operation: Desired capacity:2 must be between the specified min size:0 and max size:1

    Make sure the desired size is between the minimum and the maximum size of the Auto Scaling group.
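For managed node groups, steps 5 to 8 can be condensed into a single shell function. This is a minimal sketch, not part of rok-tools: `scale_out_nodegroup` is a hypothetical helper name, and it assumes bash and an AWS CLI already configured for the account and region of your cluster. Besides raising the max size as in step 7, it also refuses a desired size below the node group's min size, which is the other way the update can fail.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scale a managed node group to DESIRED nodes,
# raising maxSize first if DESIRED exceeds it.
# Usage: scale_out_nodegroup <cluster> <nodegroup> <desired>
scale_out_nodegroup() {
  local cluster="$1" nodegroup="$2" desired="$3"
  local min max
  # Read the current scaling configuration as "<minSize> <maxSize>".
  read -r min max < <(aws eks describe-nodegroup \
      --cluster-name "${cluster}" \
      --nodegroup-name "${nodegroup}" \
      --query 'nodegroup.scalingConfig.[minSize,maxSize]' \
      --output text)
  [[ -n "${max}" ]] || { echo "could not read scaling config" >&2; return 1; }
  # The desired size must not fall below the min size.
  if (( desired < min )); then
    echo "desired size ${desired} is below min size ${min}" >&2
    return 1
  fi
  # Same rule as step 7: the max size must be at least the desired size.
  max=$(( desired > max ? desired : max ))
  aws eks update-nodegroup-config \
      --cluster-name "${cluster}" \
      --nodegroup-name "${nodegroup}" \
      --scaling-config "maxSize=${max},desiredSize=${desired}"
}
```

For example, `scale_out_nodegroup "${EKS_CLUSTER?}" general-workers 3`. For self-managed node groups the same logic applies, using the `aws autoscaling` commands from steps 5 and 8 instead.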

Verify

  1. Ensure that your node group has scaled out. Choose one of the following options based on your node group type.

    For managed node groups, ensure that your node group is ACTIVE:

    root@rok-tools:~# aws eks describe-nodegroup \
    > --cluster-name ${EKS_CLUSTER?} \
    > --nodegroup-name ${NODEGROUP?} \
    > --query nodegroup.status \
    > --output text
    ACTIVE

    For self-managed node groups, ensure that your Auto Scaling group reports all of its instances as InService and Healthy:

    root@rok-tools:~# aws autoscaling describe-auto-scaling-groups \
    > --auto-scaling-group-names ${ASG?} \
    > --query "AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus]" \
    > --output text
    i-03696c6a5abe28646 InService Healthy
    i-07898559e258823c8 InService Healthy
    i-0f992f0b02d777900 InService Healthy
  2. Ensure that your node group has the expected size. Choose one of the following options based on your node group type.

    For managed node groups:

    root@rok-tools:~# aws eks describe-nodegroup \
    > --cluster-name ${EKS_CLUSTER?} \
    > --nodegroup-name ${NODEGROUP?} \
    > --query nodegroup.scalingConfig.desiredSize
    3

    For self-managed node groups:

    root@rok-tools:~# aws autoscaling describe-auto-scaling-groups \
    > --auto-scaling-group-names ${ASG?} \
    > --query "AutoScalingGroups[].DesiredCapacity" \
    > --output text
    3
  3. Ensure that your Kubernetes cluster has scaled out, and the new nodes have joined the cluster:

    root@rok-tools:~# kubectl get nodes
    NAME                                               STATUS   ROLES    AGE    VERSION
    ip-192-168-147-137.eu-central-1.compute.internal   Ready    <none>   113m   v1.22.12-eks-ba74326
    ip-192-168-157-224.eu-central-1.compute.internal   Ready    <none>   113m   v1.22.12-eks-ba74326
    ip-192-168-164-184.eu-central-1.compute.internal   Ready    <none>   1m     v1.22.12-eks-ba74326
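New instances take a few minutes to boot and register, so instead of rerunning `kubectl get nodes` by hand you can poll until the expected number of nodes report Ready. This is a minimal sketch, assuming bash and kubectl; `wait_for_ready_nodes` is a hypothetical helper, and its first argument is the total number of nodes you expect in the whole cluster (all node groups), not the new desired size of a single node group. Nodes whose STATUS is `Ready,SchedulingDisabled` (cordoned) are not counted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until at least EXPECTED nodes are Ready.
# Usage: wait_for_ready_nodes <expected> [timeout_seconds] [interval_seconds]
wait_for_ready_nodes() {
  local expected="$1" timeout="${2:-600}" interval="${3:-10}"
  local deadline=$(( SECONDS + timeout )) ready
  while :; do
    # --no-headers drops the header row; column 2 is STATUS.
    ready=$(kubectl get nodes --no-headers \
      | awk '$2 == "Ready" { n++ } END { print n + 0 }')
    if (( ready >= expected )); then
      echo "${ready} nodes Ready"
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "timed out: ${ready}/${expected} nodes Ready" >&2
      return 1
    fi
    sleep "${interval}"
  done
}
```

For the example cluster above, `wait_for_ready_nodes 3 600 10` waits up to ten minutes, checking every ten seconds.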

Summary

You have successfully scaled out your EKS cluster.

What’s Next

Check out the rest of the EKS maintenance operations that you can perform on your cluster.