Upgrade GKE Node Pools¶
This section will guide you through upgrading the node pools of your GKE cluster to match the Kubernetes version of the control plane.
What You’ll Need¶
- A configured management environment.
- An existing GKE cluster.
- An existing Rok deployment.
Procedure¶
Ensure that Rok is up and running:
root@rok-tools:~# kubectl get rokcluster -n rok rok \
>     -o jsonpath='{.status.health}{"\n"}'
OK

Ensure that the rest of the Pods are running. Verify that field STATUS is Running and field READY is n/n for all Pods:
root@rok-tools:~# kubectl get pods -A
NAMESPACE      NAME                      READY   STATUS    RESTARTS   AGE
auth           dex-7747dff999-xqxp      2/2     Running   0          1h
cert-manager   cert-manager-686bcc964d  1/1     Running   0          1h
...

List the node pools of your cluster and the corresponding Kubernetes version:
root@rok-tools:~# gcloud container clusters describe ${GKE_CLUSTER?} \
>     --flatten="nodePools[]" \
>     --format="table(nodePools.name,nodePools.version,nodePools.status)"
NAME             VERSION           STATUS
default-workers  1.19.16-gke.6800  RUNNING

Specify the name of the node pool to upgrade:
root@rok-tools:~# export NODE_POOL_NAME=<NODE_POOL>

Replace <NODE_POOL> with the name of the node pool running the old Kubernetes version. For example:

root@rok-tools:~# export NODE_POOL_NAME=default-workers

Find the instance group that corresponds to this node pool:
root@rok-tools:~# export INSTANCE_GROUP=$(gcloud container node-pools \
>     describe ${NODE_POOL_NAME?} \
>     --cluster=${GKE_CLUSTER?} \
>     --format="value(instanceGroupUrls)")

Find the template of the instance group:
root@rok-tools:~# export TEMPLATE=$(gcloud compute instance-groups managed \
>     describe ${INSTANCE_GROUP?} \
>     --format="value(instanceTemplate)")

Inspect the configuration of the old node pool and note down the following configurations, as you are going to use them later:
- the machine type:

root@rok-tools:~# gcloud container node-pools describe ${NODE_POOL_NAME?} \
>     --cluster ${GKE_CLUSTER?} \
>     --format="value(config.machineType)"
n1-standard-8

- the number of nodes:

root@rok-tools:~# gcloud container node-pools describe ${NODE_POOL_NAME?} \
>     --cluster ${GKE_CLUSTER?} \
>     --format="value(initialNodeCount)"
3

- the number of local NVMe SSDs:

root@rok-tools:~# gcloud compute instance-templates describe ${TEMPLATE?} \
>     --format json \
>     | jq -r '[.properties.disks[] | select(.type == "SCRATCH" and .interface == "NVME")] | length'
3
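For convenience, you can optionally capture these values in environment variables so that you can reuse them when creating the new node pool. This is just a sketch that combines the exact commands shown above; the variable names MACHINE_TYPE, NUM_NODES, and NUM_SSDS are illustrative and not used elsewhere in the official procedure:

root@rok-tools:~# export MACHINE_TYPE=$(gcloud container node-pools describe ${NODE_POOL_NAME?} \
>     --cluster ${GKE_CLUSTER?} \
>     --format="value(config.machineType)")
root@rok-tools:~# export NUM_NODES=$(gcloud container node-pools describe ${NODE_POOL_NAME?} \
>     --cluster ${GKE_CLUSTER?} \
>     --format="value(initialNodeCount)")
root@rok-tools:~# export NUM_SSDS=$(gcloud compute instance-templates describe ${TEMPLATE?} \
>     --format json \
>     | jq -r '[.properties.disks[] | select(.type == "SCRATCH" and .interface == "NVME")] | length')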
Follow the Create Node Pool guide to create a new node pool with a new name, the same Kubernetes minor version as the control plane, and the same machine type, number of nodes, and number of local NVMe SSDs that you found in the previous step. Then, come back to this guide and continue with this procedure.
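For orientation only, such a node pool created with gcloud might look like the sketch below. It assumes the illustrative MACHINE_TYPE, NUM_NODES, and NUM_SSDS variables from the previous step, and that --local-ssd-count is the appropriate flag for local NVMe SSDs in your gcloud version; the Create Node Pool guide remains the authoritative procedure and may use additional flags (image type, scopes, labels, and so on):

root@rok-tools:~# gcloud container node-pools create <NEW_NODE_POOL> \
>     --cluster ${GKE_CLUSTER?} \
>     --node-version <NEW_K8S_VERSION> \
>     --machine-type ${MACHINE_TYPE?} \
>     --num-nodes ${NUM_NODES?} \
>     --local-ssd-count ${NUM_SSDS?}

Replace <NEW_NODE_POOL> with a name for the new node pool and <NEW_K8S_VERSION> with the Kubernetes version of the control plane.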
Find the nodes of the old node pool:
root@rok-tools:~# nodes=$(kubectl get nodes \
>     -o jsonpath="{range .items[?(@.metadata.labels.cloud\.google\.com\/gke-nodepool==\"${NODE_POOL_NAME?}\")]}{.metadata.name}{\"\n\"}{end}") \
>     && echo "${nodes?}"
gke-arrikto-cluster-default-workers-a089030c-32wn
gke-arrikto-cluster-default-workers-a089030c-7q2b
gke-arrikto-cluster-default-workers-a089030c-gfc9

Cordon the old nodes, that is, disable scheduling on them:
root@rok-tools:~# for node in $nodes; do kubectl cordon $node; done
node/gke-arrikto-cluster-default-workers-a089030c-32wn cordoned
node/gke-arrikto-cluster-default-workers-a089030c-7q2b cordoned
node/gke-arrikto-cluster-default-workers-a089030c-gfc9 cordoned

Drain the old nodes one-by-one. Repeat steps a-d for each one of the old nodes:
Pick a node from the old node pool:
root@rok-tools:~# export node=<NODE>

Replace <NODE> with the node you want to drain. For example:

root@rok-tools:~# export node=gke-arrikto-cluster-default-workers-a089030c-32wn

Drain the node:
root@rok-tools:~# kubectl drain --ignore-daemonsets --delete-local-data $node
node/gke-arrikto-cluster-default-workers-a089030c-32wn already cordoned
evicting pod "rok-redis-0"
evicting pod "ml-pipeline-scheduledworkflow-7bddd546b-4f4j5"
...

Note
This may take a while, since Rok is unpinning all volumes on this node and will evict rok-csi-guard Pods last.

Warning
Do not delete rok-csi-guard Pods manually, since this might cause data loss.

Troubleshooting
The command does not complete.
Most likely, the unpinning of a Rok PVC is failing. Inspect the logs of the Rok CSI Controller to debug further.
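For example, assuming the Rok CSI Controller runs in the rok namespace as a Pod named rok-csi-controller-0 (the actual Pod name in your deployment may differ), you could locate it and inspect its recent logs like this:

root@rok-tools:~# kubectl get pods -n rok | grep csi
root@rok-tools:~# kubectl logs -n rok rok-csi-controller-0 --all-containers | tail -n 100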
Wait for the drain command to finish successfully.
Ensure that all Pods that got evicted have migrated correctly and are up and running again.
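One way to spot Pods that have not come back up yet is to list everything that is not in the Running or Succeeded phase. This is a generic kubectl check, not part of the official procedure; note that rok-csi-guard Pods scheduled on cordoned nodes are expected to appear here as Pending:

root@rok-tools:~# kubectl get pods -A \
>     --field-selector=status.phase!=Running,status.phase!=Succeeded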
Ensure that Rok has scaled up and is up and running:
root@rok-tools:~# kubectl get rokcluster -n rok rok \
>     -o jsonpath='{.status.health}{"\n"}'
OK

Ensure that the rest of the Pods are running. Verify that field STATUS is Running and field READY is n/n for all Pods:
root@rok-tools:~# kubectl get pods -A
NAMESPACE      NAME                      READY   STATUS    RESTARTS   AGE
auth           dex-7747dff999-xqxp      2/2     Running   0          1h
cert-manager   cert-manager-686bcc964d  1/1     Running   0          1h
...

Note
rok-csi-guard Pods are expected to be in Pending status.
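If you want to confirm which Pods those are, a quick, illustrative filter over the Pod list is:

root@rok-tools:~# kubectl get pods -A | grep rok-csi-guard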
Go back to step a, and repeat the steps for the remaining old nodes.
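If you prefer to script this repetition, the following sketch drains each old node in turn and waits for Rok to report OK before moving on. It only combines commands already shown in this guide, but it skips the manual verification of evicted Pods between nodes, so prefer the step-by-step procedure if you are unsure. Nodes that are already drained have no Pods left to evict, so re-draining them is harmless:

root@rok-tools:~# for node in $nodes; do
>     kubectl drain --ignore-daemonsets --delete-local-data "$node"
>     # Wait until the Rok cluster reports healthy again before draining the next node.
>     until [ "$(kubectl get rokcluster -n rok rok -o jsonpath='{.status.health}')" = "OK" ]; do
>         sleep 10
>     done
> done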
Delete the old node pool:
root@rok-tools:~# gcloud container node-pools delete ${NODE_POOL_NAME?} \
>     --cluster ${GKE_CLUSTER?}
The following node pool will be deleted.
[default-workers] in cluster [arrikto-cluster] in [us-east1-b]

Do you want to continue (Y/n)?  Y

Deleting node pool default-workers...done.
Deleted [https://container.googleapis.com/v1/projects/myproject/zones/us-east1-b/clusters/arrikto-cluster/nodePools/default-workers].
Verify¶
Ensure that all nodes in the node pool are ready and run the new Kubernetes version. Verify that field STATUS is Ready and field VERSION shows the new Kubernetes version. Choose one of the following options, based on the upgrade you've made:
If you upgraded to Kubernetes 1.21:

root@rok-tools:~# kubectl get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
gke-test-upgrade-new-workers-f929841f-02q5   Ready    <none>   78m   v1.21.5-gke.1805
gke-test-upgrade-new-workers-f929841f-mpdc   Ready    <none>   78m   v1.21.5-gke.1805
gke-test-upgrade-new-workers-f929841f-gr4x   Ready    <none>   78m   v1.21.5-gke.1805

If you upgraded to Kubernetes 1.20:

root@rok-tools:~# kubectl get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
gke-test-upgrade-new-workers-f929841f-02q5   Ready    <none>   78m   v1.20.15-gke.300
gke-test-upgrade-new-workers-f929841f-mpdc   Ready    <none>   78m   v1.20.15-gke.300
gke-test-upgrade-new-workers-f929841f-gr4x   Ready    <none>   78m   v1.20.15-gke.300
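As an alternative to eyeballing the output, the following generic kubectl commands (not part of the official verification steps) wait for all nodes to become Ready and then print the kubelet version of each node:

root@rok-tools:~# kubectl wait --for=condition=Ready node --all --timeout=10m
root@rok-tools:~# kubectl get nodes \
>     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'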
What’s Next¶
The next step is to update the maintenance exclusion of your GKE cluster.