Upgrade EKS Self-Managed Node Groups

The following steps describe how to upgrade self-managed node groups to a newer Kubernetes version. To do so, you will directly modify the underlying Auto Scaling group (ASG).

Note

Updates performed in other ways, such as through CloudFormation or by letting the Auto Scaling group scale down the cluster based on its termination policy, may cause data loss, since they are not guaranteed to properly drain the nodes.

What You'll Need

Check Your Environment

Before you start upgrading the EKS self-managed node group, ensure that you have enabled Scale-in protection.
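
If you want to confirm this from the command line, a minimal check, assuming the AWS CLI, jq, and the EKS_CLUSTER variable (the name of your EKS cluster) are already set up in your management environment, is to inspect the group-level scale-in protection setting of the Auto Scaling groups owned by your cluster:

    root@rok-tools:~# aws autoscaling describe-auto-scaling-groups | \
    > jq -r '.AutoScalingGroups[] |
    > select(.Tags[] | .Key == "kubernetes.io/cluster/'${EKS_CLUSTER?}'" and .Value == "owned") |
    > "\(.AutoScalingGroupName) \(.NewInstancesProtectedFromScaleIn)"'

The output should report true for each group. Note that this shows the setting applied to newly launched instances; per-instance protection is reported by aws autoscaling describe-auto-scaling-instances.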

Procedure

  1. Scale the Cluster Autoscaler deployment down to 0 replicas to avoid conflicting scaling actions:

    root@rok-tools:~# kubectl scale deploy -n kube-system cluster-autoscaler --replicas=0
    deployment.apps/cluster-autoscaler scaled
    
  2. Determine the AMI name that you should use depending on the Kubernetes version you are upgrading to. Make a note of the correct AMI name, as you are going to use it later:

    Supported AMIs

    Kubernetes version upgrading to   AMI name
    1.18                              amazon-eks-node-1.18-v20220112
    1.19                              amazon-eks-node-1.19-v20220112
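
    If you also need the AMI ID that corresponds to the AMI name, one way, sketched below, is to query EC2 for it. The example assumes the 1.19 AMI from the table above and account 602401143452, which publishes the Amazon EKS optimized AMIs in most commercial AWS regions; adjust both if needed:

    root@rok-tools:~# aws ec2 describe-images \
    >    --owners 602401143452 \
    >    --filters "Name=name,Values=amazon-eks-node-1.19-v20220112" \
    >    --query 'Images[0].ImageId' \
    >    --output text
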
  3. Find all the Auto Scaling groups associated with your EKS cluster:

    root@rok-tools:~# aws autoscaling describe-auto-scaling-groups | \
    > jq -r '.AutoScalingGroups[] |
    > select(.Tags[] | .Key == "kubernetes.io/cluster/'${EKS_CLUSTER?}'" and .Value == "owned") |
    > .AutoScalingGroupName'
    arrikto-cluster-workers-NodeGroup-1WW76ULXL3VOE
    ...
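
    The command above assumes that the EKS_CLUSTER environment variable holds the name of your EKS cluster. If it is not already set in your shell, export it first. The cluster name below is a placeholder; replace it with your own:

    root@rok-tools:~# export EKS_CLUSTER=arrikto-cluster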
    
  4. Repeat steps 1-10 below for each one of the Auto Scaling groups in the list shown in the previous step.

    1. Update the ASG to use the new AMI. To do so:

      1. Go to https://console.aws.amazon.com/ec2autoscaling/.

      2. Select the ASG from the list.

      3. Edit its Launch template.

        ../../../../_images/asg-edit-lt1.png
      4. Create a new launch template version.

        ../../../../_images/asg-new-lt-version.png
      5. Inspect the Storage (volumes) section and make a note of the disk configuration. You will need it later on.

        ../../../../_images/lt-storage-volumes-old.png
      6. Set the new AMI name to use. In the drop-down list, select the correct AMI, according to step 2.

        ../../../../_images/lt-image-new.png
      7. Setting a new AMI resets the existing storage configuration. Click Confirm Changes to proceed.

        ../../../../_images/lt-delete-volume-warning.png
      8. Modify the Storage (volumes) section and ensure that you maintain the same amount of storage for the root disk as before.

        ../../../../_images/lt-storage-volumes-new.png
      9. Create the new version.

        ../../../../_images/lt-create-template.png ../../../../_images/lt-success.png
      10. Go back to the Launch template section of the ASG, refresh the drop-down menu with the versions, select the newly created one and update the template.

        ../../../../_images/lt-update-version.png
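
      Alternatively, if you prefer to script this sub-step, the following sketch shows roughly equivalent AWS CLI calls. The launch template ID, version numbers, and AMI ID are placeholders, and ASG_NAME stands for the name of the Auto Scaling group you are working on, as listed in step 3. Because the new version is created from the source version, settings such as the storage configuration should carry over, but verify them as described above:

      root@rok-tools:~# aws ec2 create-launch-template-version \
      >    --launch-template-id lt-0123456789abcdef0 \
      >    --source-version 1 \
      >    --launch-template-data '{"ImageId": "ami-0123456789abcdef0"}'
      root@rok-tools:~# aws autoscaling update-auto-scaling-group \
      >    --auto-scaling-group-name ${ASG_NAME?} \
      >    --launch-template LaunchTemplateId=lt-0123456789abcdef0,Version=2
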
    2. Edit the ASG configuration and double the Desired capacity, so that instances with the new launch template version are added. If the new value for Desired capacity exceeds the Maximum capacity, adjust the Maximum capacity to match the new value. We opt to double the current size so that existing workloads can safely fit on the new nodes.
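
      For example, assuming a current Desired capacity of 2, a CLI sketch of this change, using the same ASG_NAME placeholder as above, is:

      root@rok-tools:~# aws autoscaling update-auto-scaling-group \
      >    --auto-scaling-group-name ${ASG_NAME?} \
      >    --desired-capacity 4 \
      >    --max-size 4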

    3. Wait for all nodes to be added:

      root@rok-tools:~# kubectl get nodes
      NAME                                             STATUS   ROLES    AGE   VERSION
      ip-172-31-32-188.eu-central-1.compute.internal   Ready    <none>   1h    v1.18.20-eks-8c579e
      ip-172-31-34-84.eu-central-1.compute.internal    Ready    <none>   1h    v1.18.20-eks-8c579e
      ip-172-31-44-254.eu-central-1.compute.internal   Ready    <none>   1m    v1.19.13-eks-8df270
      ip-172-31-47-215.eu-central-1.compute.internal   Ready    <none>   1m    v1.19.13-eks-8df270
      
    4. Wait for the Rok cluster to scale itself out:

      root@rok-tools:~# kubectl get rokcluster -n rok rok
      NAME   VERSION                          HEALTH   TOTAL MEMBERS   READY MEMBERS   PHASE     AGE
      rok    release-1.3-l0-release-1.3-rc7   OK       4               4               Running   1h
      
    5. Make sure all nodes have the same ephemeral storage:

      root@rok-tools:~# kubectl get nodes -o json | \
      > jq -r '.items[] | .metadata.name, .status.allocatable["ephemeral-storage"]' | paste - -
      ip-172-31-32-188.eu-central-1.compute.internal    192188443124
      ip-172-31-34-84.eu-central-1.compute.internal     192188443124
      ip-172-31-44-254.eu-central-1.compute.internal    192188443124
      ip-172-31-47-215.eu-central-1.compute.internal    192188443124
      

      Note

      This check ensures that you correctly copied the Storage (volumes) configuration from the old Launch template version into the new one.

    6. Find the old nodes that you should drain, based on their Kubernetes version:

      1. Retrieve the Kubernetes versions running on your nodes currently:

        root@rok-tools:~# kubectl get nodes -o json | \
        > jq -r '.items[].status.nodeInfo.kubeletVersion' | sort -u
        v1.18.20-eks-8c579e
        v1.19.13-eks-8df270
        
      2. Specify the old version:

        root@rok-tools:~# K8S_VERSION=v1.18.20-eks-8c579e
        
      3. Find the nodes that run with this version:

        root@rok-tools:~# nodes=$(kubectl get nodes -o jsonpath="{.items[?(@.status.nodeInfo.kubeletVersion==\"$K8S_VERSION\")].metadata.name}")
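
        Optionally, print the list to verify that it contains only the old nodes you expect to drain:

        root@rok-tools:~# echo $nodes
        ip-172-31-32-188.eu-central-1.compute.internal ip-172-31-34-84.eu-central-1.compute.internal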
        
    7. Cordon old nodes, that is, disable scheduling on them:

      root@rok-tools:~# for node in $nodes; do kubectl cordon $node ; done
      node/ip-172-31-32-188.eu-central-1.compute.internal cordoned
      node/ip-172-31-34-84.eu-central-1.compute.internal cordoned
      
    8. Verify that the old nodes are unschedulable, while the new ones do not have any taints:

      root@rok-tools:~# kubectl get nodes --no-headers \
      > -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
      ip-172-31-32-188.eu-central-1.compute.internal   [map[effect:NoSchedule key:node.kubernetes.io/unschedulable timeAdded:2021-08-20T11:47:34Z]]
      ip-172-31-34-84.eu-central-1.compute.internal    [map[effect:NoSchedule key:node.kubernetes.io/unschedulable timeAdded:2021-08-20T11:47:38Z]]
      ip-172-31-44-254.eu-central-1.compute.internal   <none>
      ip-172-31-47-215.eu-central-1.compute.internal   <none>
      
    9. Drain the old nodes one by one. Repeat steps 1-4 for each one of the old nodes:

      1. Find a node that runs with the old version:

        root@rok-tools:~# node=$(kubectl get nodes -o jsonpath="{.items[?(@.status.nodeInfo.kubeletVersion==\"$K8S_VERSION\")].metadata.name}" | sed 's/ /\n/g' | head -n 1)
        
      2. Drain the node:

        root@rok-tools:~# kubectl drain --ignore-daemonsets --delete-local-data $node
        node/ip-172-31-32-188.eu-central-1.compute.internal already cordoned
        evicting pod "rok-redis-0"
        evicting pod "ml-pipeline-scheduledworkflow-7bddd546b-4f4j5"
        ...
        
      3. Wait for the above command to finish successfully and ensure that all Pods that got evicted have migrated correctly and are up-and-running again. Verify that field STATUS is Running and field READY is n/n for all Pods:

        root@rok-tools:~# kubectl get pods -A
        NAMESPACE           NAME                    READY   STATUS    RESTARTS   AGE
        auth                dex-7747dff999-xqxp     2/2     Running   0          1h
        cert-manager        cert-manager-686bcc964d 1/1     Running   0          1h
        ...
        
      4. Go back to step 1, and repeat the steps for the remaining old nodes.

    10. Go back to step 1, and repeat the steps for the remaining ASGs.

  5. After you have drained all the nodes for all the ASGs, start the Cluster Autoscaler so that it sees the drained nodes, marks them as unneeded, terminates them, and reduces each ASG's desired size accordingly:

    root@rok-tools:~# kubectl scale deploy -n kube-system cluster-autoscaler --replicas=1
    deployment.apps/cluster-autoscaler scaled
    

    Note

    The Cluster Autoscaler will not start deleting instances immediately, since after startup it considers the cluster to be in a cooldown state. In that state, it does not perform any scale-down operations. After the cooldown period has passed (10 minutes by default, configurable with the --scale-down-delay-after-add argument), it will remove all drained nodes at once.
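
    If you want to check whether your Cluster Autoscaler deployment overrides this delay, you can inspect its container spec. Depending on how the deployment was created, the flag may appear under command or under args:

    root@rok-tools:~# kubectl get deploy -n kube-system cluster-autoscaler \
    >    -o jsonpath='{.spec.template.spec.containers[0].command}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'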

Verify

  1. Ensure that all nodes in the node group are ready and run the new Kubernetes version. Check that field STATUS is Ready and field VERSION shows the new Kubernetes version:

    root@rok-tools:~# kubectl get nodes
    NAME                                              STATUS   ROLES   AGE     VERSION
    ip-172-31-32-188.eu-central-1.compute.internal    Ready    <none>   1h     v1.19.13-eks-8df270
    ip-172-31-34-84.eu-central-1.compute.internal     Ready    <none>   1h     v1.19.13-eks-8df270
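
    You can also repeat the version listing from earlier to confirm that only the new Kubernetes version remains across the cluster:

    root@rok-tools:~# kubectl get nodes -o json | \
    > jq -r '.items[].status.nodeInfo.kubeletVersion' | sort -u
    v1.19.13-eks-8df270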
    

Summary

You have successfully upgraded your self-managed node group.

What's Next

The next step is to upgrade the Cluster Autoscaler of your EKS cluster.