Managing an EKS cluster

Assuming that you already have an EKS cluster running Rok, this section describes how to add additional components, e.g., the Cluster Autoscaler, and how to operate on the cluster, e.g., scale it in or perform a rolling upgrade.

Enable Logging

To enable Amazon EKS Control Plane Logging we follow the official EKS docs:

$ aws eks update-cluster-config \
>     --name ${CLUSTERNAME?} \
>     --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
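
To verify that all control plane log types are now enabled, you can inspect the cluster's logging configuration, for example:

$ aws eks describe-cluster \
>     --name ${CLUSTERNAME?} \
>     --query 'cluster.logging.clusterLogging' --output json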

To enable Container Insights we follow the official docs:

  1. Edit rok/amazon-cloudwatch/overlays/deploy/patches/configmap.yaml to set the cluster name and region:

    cluster.name: myclustername
    logs.region: us-east-1
    
  2. Commit the change:

    $ git commit -am "Configure fluentd cloudwatch agent"
    
  3. Apply the Fluentd manifests to Kubernetes:

    $ rok-deploy --apply rok/amazon-cloudwatch/overlays/deploy/
    

The logs will be available in CloudWatch.
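
As a quick check, and assuming the cluster.name you configured above matches ${CLUSTERNAME?}, you can list the log groups that Container Insights creates:

$ aws logs describe-log-groups \
>     --log-group-name-prefix /aws/containerinsights/${CLUSTERNAME?}/ \
>     --query 'logGroups[].logGroupName' --output text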

Autoscaling

This section describes all the actions that you, as the administrator, should take in order to scale an EKS cluster in and out gracefully, without losing any data.

Warning

If an EC2 instance (EKS worker node) terminates in an unexpected manner, data will be lost. As such, you should avoid the following actions:

  • Decrement the desired size of the ASG.
  • Terminate an EC2 instance directly from the console.
  • Delete a whole nodegroup.

Find ASG

To find the Auto Scaling groups associated with your EKS cluster, filter on an EKS-specific tag that is mandatory for both managed and self-managed node groups:

$ aws autoscaling describe-auto-scaling-groups | \
>     jq -r '.AutoScalingGroups[] | select(.Tags[] | .Key == "kubernetes.io/cluster/'${CLUSTERNAME?}'" and .Value == "owned") | .AutoScalingGroupName'
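
Optionally, you can store the resulting names in a shell variable to iterate over them in the following sections. The asgs variable below is just a local convenience for illustration; the rest of this guide sets ASG per group explicitly:

$ asgs=$(aws autoscaling describe-auto-scaling-groups | \
>     jq -r '.AutoScalingGroups[] | select(.Tags[] | .Key == "kubernetes.io/cluster/'${CLUSTERNAME?}'" and .Value == "owned") | .AutoScalingGroupName')
$ echo $asgs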

Scale-in protection

Scaling down the node group using the ASG can have catastrophic implications, since it does not allow Rok to properly drain the node (and migrate any volumes) before deleting the corresponding EC2 instance. This is described in more detail in the Amazon EC2 Auto Scaling instance lifecycle document, where we see that the ASG will remove the instance after about 15 minutes, even if the drain operation has not finished.

To prevent that from happening, you need to enable scale-in protection:

  • at the ASG level, i.e., for newly created instances, and
  • at the instance level, i.e., for existing instances.

Since setting the scale-in protection cannot be done via EKS, we will operate directly on the underlying ASG after creating the node group.

First find the Auto Scaling groups associated with your EKS cluster, and then repeat the following steps for each one of the Auto Scaling groups found:

$ export ASG=<asg>
  1. Check the current scale-in protection configuration at the ASG level:

    $ aws autoscaling describe-auto-scaling-groups \
    >     --auto-scaling-group-names $ASG | \
    >     jq -r '.AutoScalingGroups[] | .AutoScalingGroupName, .NewInstancesProtectedFromScaleIn' | \
    >         paste - -
    

    and at instance level:

    $ aws autoscaling describe-auto-scaling-groups \
    >     --auto-scaling-group-names $ASG | \
    >     jq -r '.AutoScalingGroups[].Instances[] | .InstanceId, .ProtectedFromScaleIn' | \
    >         paste - -
    
  2. Enable scale-in protection at ASG level:

    $ aws autoscaling update-auto-scaling-group \
    >    --auto-scaling-group-name $ASG \
    >    --new-instances-protected-from-scale-in
    
  3. Enable scale-in protection at instance level:

    $ aws autoscaling describe-auto-scaling-groups \
    >    --auto-scaling-group-names $ASG | \
    >       jq -r '.AutoScalingGroups[].Instances[].InstanceId' | \
    >          xargs aws autoscaling set-instance-protection \
    >             --auto-scaling-group-name $ASG \
    >             --protected-from-scale-in \
    >             --instance-ids
    

Suspend unsafe ASG scaling processes

Since Rok uses local NVMe disks to store user data, terminating or replacing a node before properly draining it would result in data loss. So, you have to suspend the scaling processes that would result in a node termination, i.e., ReplaceUnhealthy, AZRebalance, InstanceRefresh. Suspending the above processes means that:

  • Unhealthy instances, i.e., EC2 instances whose status checks have failed, will remain in-service and will require manual action. See Manage unhealthy instances for more details.
  • There will be no rebalancing across availability zones. Still, since you create single-AZ ASGs because you make use of EBS volumes, this should not affect you.
  • To refresh all instances, you should perform a rolling update similar to the one you do in case of an EKS Upgrade, i.e., increase the ASG size, drain the old nodes, and let the Cluster Autoscaler remove them.

For more information on the available scaling processes and how to suspend/resume them, see the official docs.

To disable the aforementioned dangerous operations, given that you have already created your EKS node group, first find the Auto Scaling groups associated with your EKS cluster, and for each ASG found run the following CLI command:

$ aws autoscaling suspend-processes \
>     --auto-scaling-group-name $ASG \
>     --scaling-processes AZRebalance InstanceRefresh ReplaceUnhealthy
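
To verify that the processes are indeed suspended, you can inspect the SuspendedProcesses field of the ASG, for example:

$ aws autoscaling describe-auto-scaling-groups \
>     --auto-scaling-group-names $ASG | \
>     jq -r '.AutoScalingGroups[].SuspendedProcesses[].ProcessName'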

Manage unhealthy instances

Since we have suspended the ReplaceUnhealthy operation, if an instance is marked as unhealthy by the ASG, it will remain in-service and will require a manual action.
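
To spot such instances, you can list the health status that the ASG reports for each instance, for example:

$ aws autoscaling describe-auto-scaling-groups \
>     --auto-scaling-group-names $ASG | \
>     jq -r '.AutoScalingGroups[].Instances[] | .InstanceId, .HealthStatus' | \
>         paste - -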

If there was a temporary failure, e.g., a system crash, that made the system freeze for a while but the node eventually got rebooted, the EC2 instance can be considered healthy again, i.e., EC2 will report it as such. To manually reset the health status of an instance, run:

$ aws autoscaling set-instance-health \
>     --health-status Healthy \
>     --instance-id i-123abc45d

Warning

In case the failure is permanent, e.g., a corrupted file system, the node must be replaced. In such cases, it helps if you have set up Snapshot policies for backup, so that you can restore your volumes from the latest available snapshot. To terminate such an instance, run:

$ aws autoscaling terminate-instance-in-auto-scaling-group \
>     --no-should-decrement-desired-capacity \
>     --instance-id i-123abc45d

Cluster Autoscaler

Important

Instead of assigning the policy to the existing Create EKS Node IAM Role, you will create a dedicated IAM role and assign it via a service account, similar to what the latest Cluster Autoscaler docs suggest.

Configure IAM

  1. Create the IAM policy that allows the Cluster Autoscaler to manage the necessary AWS resources:

    $ aws iam create-policy \
    >     --policy-name ClusterAutoScaler \
    >     --policy-document file://rok/cluster-autoscaler/iam-policy-ca.json
    

    Alternatively, save the JSON policy document provided below or download iam-policy-ca.json and use it locally.

  2. Set the necessary environment variables:

    $ export IAM_ROLE_NAME=cluster-autoscaler-${CLUSTERNAME?}
    $ export IAM_ROLE_DESCRIPTION=ClusterAutoscaler
    $ export IAM_POLICY_NAME=ClusterAutoScaler
    $ export SERVICE_ACCOUNT_NAMESPACE=kube-system
    $ export SERVICE_ACCOUNT_NAME=cluster-autoscaler
    

Associate the IAM Role and Policy with a Kubernetes Service Account, as described in the official IAM Roles for Service Accounts guide:

  1. Obtain the necessary info for the EKS cluster:

    $ export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
    $ export OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTERNAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
    
  2. Render the provided trust policy document template to substitute the missing variables:

    $ j2 rok/eks/iamsa-trust.json.j2 -o iam-$IAM_ROLE_NAME-trust.json
    
  3. Commit the formatted JSON file to the local GitOps repository:

    $ git add iam-$IAM_ROLE_NAME-trust.json
    $ git commit -m "Add JSON trust policy document for $IAM_ROLE_NAME"
    
  4. Create the role:

    $ aws iam create-role \
    >     --role-name $IAM_ROLE_NAME \
    >     --assume-role-policy-document file://iam-$IAM_ROLE_NAME-trust.json \
    >     --description "$IAM_ROLE_DESCRIPTION"
    
  5. Attach the desired policy to the created role:

    $ aws iam attach-role-policy \
    >     --role-name $IAM_ROLE_NAME \
    >     --policy-arn=arn:aws:iam::$AWS_ACCOUNT_ID:policy/$IAM_POLICY_NAME
    
  6. Verify:

    $ aws iam get-role --role-name $IAM_ROLE_NAME
    $ aws iam list-attached-role-policies --role-name $IAM_ROLE_NAME
    

Deployment

  1. Specify the IAM role to use by tweaking the ServiceAccount related patch to set the corresponding annotation, i.e., edit rok/cluster-autoscaler/overlays/deploy/patches/sa.yaml:

    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eks-cluster-autoscaler  # <-- Update this line
    
  2. Specify the cluster name to use by tweaking the corresponding Deployment related patch to add an extra argument accordingly, i.e., edit rok/cluster-autoscaler/overlays/deploy/patches/deploy.yaml:

    - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/CLUSTERNAME  # <-- Update this line.
    
  3. Commit the above changes:

    $ git commit -am "Configure Cluster Autoscaler"
    
  4. Apply the manifest:

    $ rok-deploy --apply rok/cluster-autoscaler/overlays/deploy
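
  5. Verify that the Cluster Autoscaler is up and running and skim its logs for errors. This is only a quick sanity check:

    $ kubectl rollout status deploy/cluster-autoscaler -n kube-system
    $ kubectl logs -n kube-system deploy/cluster-autoscaler --tail 20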
    

Scale-in

Currently, we do not support automatic scale-in, because Rok intentionally places workloads that cannot be migrated on each node where a Rok volume exists, to guard against scale-in operations. In the future, we will extend the Cluster Autoscaler to take that into account, and support automatic scale-in operations.

In order to manually scale-in the cluster, you, as the administrator, should:

  1. Select a Kubernetes node that you want to remove (see Find a scale-in candidate).

  2. Start a drain operation on the selected node:

    $ kubectl drain --ignore-daemonsets --delete-local-data NODE
    
  3. Rok will snapshot the volumes on that node and move them elsewhere, unguard that node and allow the drain operation to complete.

  4. When the drain has finished, the Cluster Autoscaler will see that the node is now empty and will consider it unneeded.

  5. After a period of time (scale-down-unneeded-time) the Cluster Autoscaler will terminate the EC2 instance and reduce the desired size of the ASG.
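
You can confirm the scale-in by watching the desired size of the corresponding ASG decrease, for example:

$ aws autoscaling describe-auto-scaling-groups \
>     --auto-scaling-group-names $ASG | \
>     jq -r '.AutoScalingGroups[] | .AutoScalingGroupName, .DesiredCapacity' | \
>         paste - -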

Find a scale-in candidate

Normally, Cluster Autoscaler finds a scale-in candidate automatically. In order to find a good candidate manually, you have to:

  1. Pick an underutilized node.
  2. Ensure that you don’t try to scale-in past the ASG’s minSize.
  3. Ensure that existing EBS volumes are reachable from other nodes in the cluster.

Note

If your nodegroups span a single AZ only, you can skip any EBS related checks. Note that using a single AZ per nodegroup is considered best practice (see the Cluster Autoscaler docs and this Amazon blog for more info).

To find a scale-in candidate that covers the above prerequisites, follow the steps below:

  1. Find nodes with low utilization, e.g., less than 0.5, by inspecting the Cluster Autoscaler logs:

    $ kubectl logs -n kube-system deploy/cluster-autoscaler -f --tail 100 | \
    >     grep "utilization 0.[0-4]"
    

    Note

    The Autoscaler does not report nodes that belong to an ASG that has already reached its minSize.

  2. Find out in which AZ your nodes are located:

    $ kubectl get nodes -o json | \
    >    jq -r '.items[] | .metadata.labels["failure-domain.beta.kubernetes.io/zone"], .metadata.name' | \
    >        paste - - | sort -k 1
    
  3. Find out in which AZ your EBS volumes are located:

    $ kubectl get pv -o json | \
    >    jq -r '.items[] | select(.spec.storageClassName == "gp2") | .metadata.labels["failure-domain.beta.kubernetes.io/zone"], .spec.claimRef.name' | \
    >       paste - - | sort -k 1
    
  4. Pick a node from the ones found in step 1 that satisfies any of the following conditions:

    • It is not the last node in an AZ.
    • It is the last node in an AZ where no EBS volumes exist.
  5. Go ahead, drain the node and let the Cluster Autoscaler eventually remove it.

Scale-out

Currently, we do not support automatic scale-out in case of insufficient Rok storage.

Important

If a Pod gets scheduled on a node with insufficient Rok storage, the PVC will be stuck in the Pending state. Reporting storage capacity and rescheduling Pods if storage fails to be provisioned is supported in Kubernetes 1.19 and is in alpha state (see https://kubernetes.io/docs/concepts/storage/storage-capacity/#rescheduling).

Still, if a Pod becomes unschedulable due to insufficient resources (CPU, RAM), the Cluster Autoscaler will trigger a scale-out, i.e., it will increase the desired size of the ASG, and eventually a new Kubernetes node will be added.

To scale out the cluster manually, you can do it directly via EKS:

$ aws eks update-nodegroup-config \
>     --cluster-name ${CLUSTERNAME?} \
>     --nodegroup-name general-workers \
>     --scaling-config minSize=2,maxSize=5,desiredSize=4

This will add a new node to the Kubernetes cluster and the Rok operator will scale the RokCluster members accordingly.
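
You can watch the new node join the cluster and the Rok cluster scale out with:

$ kubectl get nodes -w
$ kubectl get rokcluster -n rok rok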

EKS Upgrade

Rok supports Kubernetes rolling updates seamlessly, as long as the nodes are gracefully drained. Here is the procedure that you should follow in general:

  1. Create new nodes that can fit the workloads of the old ones.
  2. Cordon old nodes.
  3. Drain old nodes one by one and let Rok snapshot, unpin and migrate any volumes.
  4. Delete old nodes.

The above procedure is automated for EKS clusters with managed node groups using the rolling update strategy, and thus you can perform it with the click of a button. For more info on how to upgrade an EKS cluster with managed node groups see the official docs, and make sure you select rolling update as the update strategy.

For self-managed node groups, it is a bit more complicated, since you must do things manually. The official documentation for rolling updates suggests creating a new node group via CloudFormation. We follow a different approach here and update the existing underlying Auto Scaling group (ASG) directly, thus avoiding the need to:

  • Depend on CloudFormation.
  • Update security groups so that the two nodegroups can talk to each other.
  • Update aws-auth so that the new nodegroup can access the cluster.

Warning

Updates performed via CloudFormation, or by letting the Auto Scaling group scale down the cluster based on its termination policy, may cause data loss, since it is not guaranteed that the nodes will be properly drained.

In the following steps we document how to upgrade Kubernetes, e.g., from 1.16 to 1.17. Specifically:

  1. Ensure that you have Scale-in protection enabled at the ASG level and on each instance.

  2. Upgrade your control plane following the official docs.

  3. Scale the Cluster Autoscaler deployment down to 0 replicas to avoid conflicting scaling actions:

    $ kubectl scale deploy -n kube-system cluster-autoscaler --replicas=0
    
  4. Find the new AMI to use, e.g., for Kubernetes 1.17:

    $ aws ssm get-parameters \
    >     --names /aws/service/eks/optimized-ami/1.17/amazon-linux-2/recommended/image_id \
    >     --query 'Parameters[0].[Value]' --output text
    

    Note

    AMI IDs are unique to each AWS region. Please make sure that awscli is configured with the region where your EKS cluster resides as the default, or use the --region command-line argument.

  5. Update the ASG to use the new AMI. To do so:

    1. Go to https://console.aws.amazon.com/ec2autoscaling/.

    2. Find the ASG that the node group is associated with.

    3. Edit its Launch template.

      ../_images/asg-edit-lt1.png
    4. Create a new launch template version.

      ../_images/asg-new-lt-version.png
    5. Inspect the Storage (volumes) section and make a note of the disk configuration. You will need this later on.

      ../_images/lt-storage-volumes-old.png
    6. Set the new AMI to use.

      ../_images/lt-image-new.png
    7. This will reset the existing storage configuration.

      ../_images/lt-delete-volume-warning.png
    8. Modify the Storage (volumes) section and ensure that you maintain the same amount of storage for the root disk as before.

      ../_images/lt-storage-volumes-new.png
    9. Create the new version.

      ../_images/lt-success.png
    10. Go back to the Launch template section of the ASG, refresh the drop down menu with the versions and select the newly created one.

      ../_images/lt-update-version.png
  6. Edit the ASG configuration and double the Desired capacity so that instances with the new launch template version will be added. If necessary, adjust the Maximum capacity accordingly. We opt to double the current size so that existing workloads can safely fit on the new nodes.
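
    If you prefer the CLI over the console for this step, the equivalent would be something like the following, with example values that you should adjust to your cluster:

    $ aws autoscaling update-auto-scaling-group \
    >     --auto-scaling-group-name $ASG \
    >     --desired-capacity 8 \
    >     --max-size 10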

  7. Wait for all nodes to be added:

    $ kubectl get nodes
    
  8. Wait for the Rok cluster to scale out itself:

    $ kubectl get rokcluster -n rok rok
    
  9. Make sure all nodes have the desired ephemeral storage:

    $ kubectl get nodes -o json | \
    >    jq -r '.items[] | .metadata.name, .status.allocatable["ephemeral-storage"]' | paste - -
    

    This is done to ensure that you correctly copied the Storage (volumes) configuration from the old Launch template into the new one.

  10. Find the old nodes that you should drain, based on their Kubernetes version:

    1. Retrieve the Kubernetes versions running on your nodes currently:

      $ kubectl get nodes -o json | jq -r '.items[].status.nodeInfo.kubeletVersion' | sort -u
      
    2. Specify the old version:

      $ K8S_VERSION=v1.16.13-eks-ec92d4
      
    3. Find the nodes that run with this version:

      $ nodes=$(kubectl get nodes -o jsonpath="{.items[?(@.status.nodeInfo.kubeletVersion==\"$K8S_VERSION\")].metadata.name}")
      
  11. Cordon old nodes, i.e., disable scheduling on them:

    $ for node in $nodes; do kubectl cordon $node ; done
    
  12. Verify that the old nodes are unschedulable, while the new ones do not have any taints:

    $ kubectl get nodes --no-headers \
    >    -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
    
  13. Observe pod status in a separate window:

    $ kubectl get pods -A -w
    
  14. Drain the old nodes one-by-one, setting $node to each of the nodes found in step 10:

    $ kubectl drain --ignore-daemonsets --delete-local-data $node
    

    Important

    Wait for the above command to finish successfully and ensure that all Pods that got evicted have migrated correctly and are up-and-running again.

  15. After you have drained all the nodes, start the Cluster Autoscaler, so that it sees the drained nodes, marks them as unneeded, terminates them, and reduces the ASG’s desired size accordingly:

    $ kubectl scale deploy -n kube-system cluster-autoscaler --replicas=1
    

    Note

    The Cluster Autoscaler will not start deleting instances immediately, since after startup it considers the cluster to be in a cooldown state. In that state, it will not perform any scale-down operations. After the cooldown period has passed (10 minutes by default, configurable with the scale-down-delay-after-add argument), it will remove all drained nodes at once.
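
    To follow these decisions, you can tail the Cluster Autoscaler logs; treat the grep pattern below as a rough filter, since the exact log wording differs across versions:

    $ kubectl logs -n kube-system deploy/cluster-autoscaler -f --tail 100 | \
    >     grep -i "scale.down"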