Upgrade EKS Managed Node Groups¶
This section will guide you through upgrading the managed node groups of your EKS cluster to match the Kubernetes version of the control plane. To perform the upgrade, choose one of the following methods:
- Automatic update, where EKS creates a new launch template version and updates the underlying Auto Scaling group (ASG).
- Manual update, where you create a new node group, then drain and remove the old one.
What You’ll Need¶
- A configured management environment.
- An existing EKS cluster.
- An existing Rok deployment.
Procedure¶
Go to your GitOps repository, inside your rok-tools management environment:

root@rok-tools:~# cd ~/ops/deployments

Restore the required context:

root@rok-tools:~/ops/deployments# source <(cat deploy/env.eks-cluster)
root@rok-tools:~/ops/deployments# export EKS_CLUSTER

Ensure that Rok is up and running:

root@rok-tools:~/ops/deployments# kubectl get rokcluster -n rok rok \
>    -o jsonpath='{.status.health}{"\n"}'
OK

Ensure that the rest of the Pods are running. Verify that field STATUS is Running and field READY is n/n for all Pods:

root@rok-tools:~/ops/deployments# kubectl get pods -A
NAMESPACE      NAME                       READY   STATUS    RESTARTS   AGE
auth           dex-7747dff999-xqxp        2/2     Running   0          1h
cert-manager   cert-manager-686bcc964d    1/1     Running   0          1h
...

List the managed node groups of your cluster and the corresponding AMI version:
root@rok-tools:~/ops/deployments# aws eks list-nodegroups \
>    --cluster-name ${EKS_CLUSTER?} \
>    --query nodegroups[] \
>    --output text \
>    | xargs -n1 aws eks describe-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --query nodegroup.[nodegroupName,releaseVersion,status] \
>    --output text \
>    --nodegroup-name \
>    | column -t
general-workers  1.19.15-20220429  ACTIVE
gpu-workers      1.20.11-20220429  ACTIVE
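If you need to double-check which Kubernetes version the control plane is currently running, and therefore which version the node groups should be brought to, one way is to describe the cluster. This is a convenience sketch; the version shown in the output is illustrative:

root@rok-tools:~/ops/deployments# aws eks describe-cluster \
>    --name ${EKS_CLUSTER?} \
>    --query cluster.version \
>    --output text
1.21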
Specify the name of the node group to upgrade:

root@rok-tools:~/ops/deployments# export NODEGROUP=<NODEGROUP>

Replace <NODEGROUP> with the name of the node group running the old Kubernetes version. For example:

root@rok-tools:~/ops/deployments# export NODEGROUP=general-workers

Select the upgrade method. Choose one of the following options, based on whether you want an automatic update managed by EKS (option 1) or a manual update where you precisely control the whole process (option 2).
Option 1: Automatic update.

Specify the AMI version of the Amazon EKS optimized AMI to use. Choose one of the following options, based on the upgrade you need to make:
For an upgrade to Kubernetes 1.21:

root@rok-tools:~/ops/deployments# export EKS_NODEGROUP_AMI_VERSION=1.21.5-20220429

For an upgrade to Kubernetes 1.20:

root@rok-tools:~/ops/deployments# export EKS_NODEGROUP_AMI_VERSION=1.20.11-20220429
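If you are unsure which AMI release version to pick, one way to look up the latest recommended release for a given Kubernetes minor version is to query the public SSM parameter that AWS publishes for the EKS optimized Amazon Linux 2 AMI. This is a sketch; adjust the 1.21 path segment to your target version, and note that the output shown is illustrative:

root@rok-tools:~/ops/deployments# aws ssm get-parameter \
>    --name /aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/release_version \
>    --query Parameter.Value \
>    --output text
1.21.5-20220429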
Update the node group version. Choose one of the following options, based on the upgrade you need to make:

For an upgrade to Kubernetes 1.21:

root@rok-tools:~/ops/deployments# aws eks update-nodegroup-version \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --release-version ${EKS_NODEGROUP_AMI_VERSION?}
{
    "update": {
        "id": "10374590-ff71-348c-a247-f5dccd479359",
        "status": "InProgress",
        "type": "VersionUpdate",
        "params": [
            {
                "type": "Version",
                "value": "1.21"
            },
            {
                "type": "ReleaseVersion",
                "value": "1.21.5-20220429"
            }
        ],
        "createdAt": "2021-10-26T11:50:11.603000+03:00",
        "errors": []
    }
}

For an upgrade to Kubernetes 1.20:

root@rok-tools:~/ops/deployments# aws eks update-nodegroup-version \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --release-version ${EKS_NODEGROUP_AMI_VERSION?}
{
    "update": {
        "id": "10374590-ff71-348c-a247-f5dccd479359",
        "status": "InProgress",
        "type": "VersionUpdate",
        "params": [
            {
                "type": "Version",
                "value": "1.20"
            },
            {
                "type": "ReleaseVersion",
                "value": "1.20.11-20220429"
            }
        ],
        "createdAt": "2021-10-26T11:50:11.603000+03:00",
        "errors": []
    }
}
The node group status will become UPDATING. Wait for it to become ACTIVE again:

root@rok-tools:~/ops/deployments# watch -n5 aws eks describe-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --query nodegroup.status \
>    --output text
ACTIVE
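If the node group does not return to ACTIVE, you can inspect the update itself using the id that update-nodegroup-version returned. This is a sketch, with <UPDATE_ID> standing in for that id:

root@rok-tools:~/ops/deployments# aws eks describe-update \
>    --name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --update-id <UPDATE_ID> \
>    --query update.[status,errors]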
Option 2: Manual update.

Inspect the configuration of the old node group and note down the following settings, as you are going to use them later. (A combined query is sketched after the list.)

- The scaling config:
root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --query nodegroup.scalingConfig
{
    "minSize": 0,
    "maxSize": 3,
    "desiredSize": 1
}

- The instance type:
root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --query nodegroup.instanceTypes[] \
>    --output text
m5d.4xlarge

- The subnets:
root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --query nodegroup.subnets[] \
>    --output text
subnet-02119fd328325646c
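Alternatively, a single describe-nodegroup call with a JMESPath multiselect can capture all three settings at once. This is just a convenience sketch equivalent to the three queries above; the output mirrors the example values shown there:

root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --query 'nodegroup.{scalingConfig: scalingConfig, instanceTypes: instanceTypes, subnets: subnets}'
{
    "scalingConfig": {
        "minSize": 0,
        "maxSize": 3,
        "desiredSize": 1
    },
    "instanceTypes": [
        "m5d.4xlarge"
    ],
    "subnets": [
        "subnet-02119fd328325646c"
    ]
}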
Follow the Create EKS Managed Node Group guide to create a new node group with a new name (along with a new CF stack name) and the same scaling configuration, instance types, and subnets that you found in the previous step. Then, come back to this guide and continue with this procedure.
Scale the Cluster Autoscaler deployment down to zero replicas to avoid conflicting scaling actions:
root@rok-tools:~/ops/deployments# kubectl scale deploy \
>    -n kube-system cluster-autoscaler \
>    --replicas=0
deployment.apps/cluster-autoscaler scaled
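As a quick sanity check, not part of the original flow, you can confirm that the Cluster Autoscaler now has zero replicas (output illustrative):

root@rok-tools:~/ops/deployments# kubectl get deploy -n kube-system cluster-autoscaler
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler   0/0     0            0           1h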
Find the nodes of the old node group:

root@rok-tools:~/ops/deployments# nodes=$(kubectl get nodes \
>    -o jsonpath="{range .items[?(@.metadata.labels.eks\.amazonaws\.com/nodegroup==\"${NODEGROUP?}\")]}{.metadata.name}{\"\n\"}{end}") \
>    && echo "${nodes?}"
ip-172-31-32-188.eu-central-1.compute.internal
ip-172-31-34-84.eu-central-1.compute.internal
Cordon the old nodes, that is, disable scheduling on them:

root@rok-tools:~/ops/deployments# for node in $nodes; do kubectl cordon $node; done
node/ip-172-31-32-188.eu-central-1.compute.internal cordoned
node/ip-172-31-34-84.eu-central-1.compute.internal cordoned
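To confirm that the cordon took effect, one option is to list the nodes of the old node group by their node group label and check that they are marked unschedulable (output illustrative):

root@rok-tools:~/ops/deployments# kubectl get nodes \
>    -l eks.amazonaws.com/nodegroup=${NODEGROUP?} \
>    -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
NAME                                             UNSCHEDULABLE
ip-172-31-32-188.eu-central-1.compute.internal   true
ip-172-31-34-84.eu-central-1.compute.internal    true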
Drain the old nodes one by one. Repeat the following steps for each of the old nodes:

Pick a node from the old node group:

root@rok-tools:~/ops/deployments# export node=<node>

Replace <node> with the node you want to drain. For example:

root@rok-tools:~/ops/deployments# export node=ip-172-31-32-188.eu-central-1.compute.internal

Drain the node:
root@rok-tools:~# kubectl drain --ignore-daemonsets --delete-local-data $node
node/ip-172-31-32-188.eu-central-1.compute.internal already cordoned
evicting pod "rok-redis-0"
evicting pod "ml-pipeline-scheduledworkflow-7bddd546b-4f4j5"
...

Note: This may take a while, since Rok is unpinning all volumes on this node and will evict rok-csi-guard Pods last.

Warning: Do not delete rok-csi-guard Pods manually, since this might cause data loss.

Troubleshooting: The command does not complete.
Most likely the unpinning of a Rok PVC fails. Inspect the logs of the Rok CSI controller to debug further.
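A sketch of how you might inspect those logs; the rok namespace is taken from the earlier steps, while the exact name of the CSI controller Pod depends on your deployment, so list the CSI Pods first and substitute the name you see:

root@rok-tools:~/ops/deployments# kubectl get pods -n rok | grep csi
root@rok-tools:~/ops/deployments# kubectl logs -n rok <csi-controller-pod> --all-containers --tail=100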
Wait for the drain command to finish successfully.
Ensure that all Pods that got evicted have migrated correctly and are up and running again.
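One way to spot Pods that have not come back up is to filter out the ones that are Running or have completed; note that the rok-csi-guard Pods of drained nodes will legitimately show up here as Pending:

root@rok-tools:~/ops/deployments# kubectl get pods -A \
>    --field-selector=status.phase!=Running,status.phase!=Succeeded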
Ensure that Rok has scaled up and is up and running:
root@rok-tools:~/ops/deployments# kubectl get rokcluster -n rok rok \
>    -o jsonpath='{.status.health}{"\n"}'
OK

Ensure that the rest of the Pods are running. Verify that field STATUS is Running and field READY is n/n for all Pods:
root@rok-tools:~/ops/deployments# kubectl get pods -A
NAMESPACE      NAME                       READY   STATUS    RESTARTS   AGE
auth           dex-7747dff999-xqxp        2/2     Running   0          1h
cert-manager   cert-manager-686bcc964d    1/1     Running   0          1h
...

Note: rok-csi-guard Pods are expected to be in Pending status.
Go back and repeat the steps above for each of the remaining old nodes.
Update the node group configuration to allow scaling down to zero:
root@rok-tools:~/ops/deployments# aws eks update-nodegroup-config \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --scaling-config minSize=0

Start the Cluster Autoscaler so that it sees the drained nodes, marks them as unneeded, terminates them, and modifies the desiredSize of the old node group accordingly:
root@rok-tools:~/ops/deployments# kubectl scale deploy \
>    -n kube-system cluster-autoscaler \
>    --replicas=1
deployment.apps/cluster-autoscaler scaled
Note: The Cluster Autoscaler will not start deleting instances immediately, since after startup it considers the cluster to be in a cooldown state. In that state, it will not perform any scale-down operations. After the cooldown period has passed (10 minutes by default, configurable with the scale-down-delay-after-add argument), it will remove all drained nodes at once.
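If you want to follow the Cluster Autoscaler's scale-down decisions while you wait, one option is to tail its logs; the exact log messages vary between Cluster Autoscaler versions:

root@rok-tools:~/ops/deployments# kubectl logs -n kube-system deploy/cluster-autoscaler -f --tail=20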
Ensure that the node group has been scaled to zero:

root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?} \
>    --query nodegroup.scalingConfig.desiredSize
0

Ensure that the old nodes have been removed from the Kubernetes cluster:
root@rok-tools:~/ops/deployments# kubectl get nodes \
>    -o jsonpath="{.items[?(@.metadata.labels.eks\.amazonaws\.com/nodegroup==\"${NODEGROUP?}\")].metadata.name}" | \
>    wc -c
0

Delete the old node group:
root@rok-tools:~/ops/deployments# aws eks delete-nodegroup \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?}
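Deletion is asynchronous. If you want to block until the node group is actually gone, one option is the corresponding AWS CLI waiter:

root@rok-tools:~/ops/deployments# aws eks wait nodegroup-deleted \
>    --cluster-name ${EKS_CLUSTER?} \
>    --nodegroup-name ${NODEGROUP?}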
Verify¶
Ensure that all nodes in the node group are ready and run the new Kubernetes version. Check that field STATUS is Ready and field VERSION shows the new Kubernetes version. Choose one of the following options, based on the upgrade you’ve made:
If you upgraded to Kubernetes 1.21:

root@rok-tools:~# kubectl get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
ip-172-31-32-188.eu-central-1.compute.internal   Ready    <none>   1h    v1.21.5-eks-bc4871b
ip-172-31-34-84.eu-central-1.compute.internal    Ready    <none>   1h    v1.21.5-eks-bc4871b

If you upgraded to Kubernetes 1.20:

root@rok-tools:~# kubectl get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
ip-172-31-32-188.eu-central-1.compute.internal   Ready    <none>   1h    v1.20.11-eks-f17b81
ip-172-31-34-84.eu-central-1.compute.internal    Ready    <none>   1h    v1.20.11-eks-f17b81
What’s Next¶
The next step is to configure the Rok Scheduler for the Kubernetes version of your EKS cluster.