Upgrade GKE Node Pools

This section will guide you through upgrading the node pools of your GKE cluster to match the Kubernetes version of the control plane.

What You’ll Need

Procedure

  1. Ensure that Rok is up and running:

    root@rok-tools:~# kubectl get rokcluster -n rok rok \
    >     -o jsonpath='{.status.health}{"\n"}'
    OK
  2. Ensure that the rest of the Pods are running. Verify that the STATUS field is Running and the READY field is N/N for all Pods:

    root@rok-tools:~# kubectl get pods -A
    NAMESPACE      NAME                      READY   STATUS    RESTARTS   AGE
    auth           dex-0                     2/2     Running   0          1h
    cert-manager   cert-manager-686bcc964d   1/1     Running   0          1h
    ...
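
The two health checks above can also be combined into a single preflight function. This is a minimal sketch, assuming kubectl is already configured against your cluster; the namespace and resource names are the ones used in steps 1 and 2:

```shell
#!/usr/bin/env bash
# Minimal preflight sketch combining the two checks above.
preflight() {
    # The Rok cluster must report OK
    local health
    health=$(kubectl get rokcluster -n rok rok -o jsonpath='{.status.health}')
    if [ "$health" != "OK" ]; then
        echo "Rok is not healthy: ${health:-<empty>}" >&2
        return 1
    fi
    # No Pod may be in a phase other than Running or Completed
    local bad
    bad=$(kubectl get pods -A --no-headers \
        | awk '$4 != "Running" && $4 != "Completed"' | wc -l)
    if [ "$bad" -ne 0 ]; then
        echo "$bad Pod(s) are not Running" >&2
        return 1
    fi
    echo "preflight OK"
}
```

You can re-run this function at any point during the procedure, for example after draining each node.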
  3. List the node pools of your cluster and the corresponding Kubernetes version:

    root@rok-tools:~# gcloud container clusters describe ${GKE_CLUSTER?} \
    >     --flatten="nodePools[]" \
    >     --format="table(nodePools.name,nodePools.version,nodePools.status)"
    NAME             VERSION           STATUS
    default-workers  v1.21.5-gke.1805  RUNNING
  4. Specify the name of the node pool to upgrade:

    root@rok-tools:~# export NODE_POOL_NAME=<NODE_POOL>

    Replace <NODE_POOL> with the name of the node pool running the old Kubernetes version. For example:

    root@rok-tools:~# export NODE_POOL_NAME=default-workers
  5. Find the instance group that corresponds to this node pool:

    root@rok-tools:~# export INSTANCE_GROUP=$(gcloud container node-pools \
    >     describe ${NODE_POOL_NAME?} \
    >     --cluster=${GKE_CLUSTER?} \
    >     --format="value(instanceGroupUrls)")
  6. Find the template of the instance group:

    root@rok-tools:~# export TEMPLATE=$(gcloud compute instance-groups managed \
    >     describe ${INSTANCE_GROUP?} \
    >     --format="value(instanceTemplate)")
  7. Inspect the configuration of the old node pool and note down the following values, as you will use them later:

    • the machine type

      root@rok-tools:~# gcloud container node-pools describe ${NODE_POOL_NAME?} \
      >     --cluster ${GKE_CLUSTER?} \
      >     --format="value(config.machineType)"
      n1-standard-8
    • the number of nodes

      root@rok-tools:~# gcloud container node-pools describe ${NODE_POOL_NAME?} \
      >     --cluster ${GKE_CLUSTER?} \
      >     --format="value(initialNodeCount)"
      3
    • the number of local NVMe SSDs

      root@rok-tools:~# gcloud compute instance-templates describe ${TEMPLATE?} \
      >     --format json \
      >     | jq -r '[.properties.disks[] | select(.type == "SCRATCH" and .interface == "NVME")] | length'
      3
  8. Follow the Create Node Pool guide to create a new node pool with a new name, the same Kubernetes minor version as the control plane, and the same machine type, number of nodes, and number of local NVMe SSDs that you found in the previous step. Then, come back to this guide and continue with this procedure.
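As a sketch of what step 8 amounts to, the following assembles a node pool creation command from the values noted in step 7 and prints it for review instead of running it. The pool name and values here are hypothetical examples, and the flag names are assumptions based on `gcloud container node-pools create`; the Create Node Pool guide remains the authoritative reference:

```shell
#!/usr/bin/env bash
# Sketch with hypothetical example values: build the creation command
# from the values noted in step 7 and print it for review.
NEW_NODE_POOL_NAME=new-workers   # hypothetical name for the new pool
MACHINE_TYPE=n1-standard-8       # machine type from step 7
NUM_NODES=3                      # number of nodes from step 7
NUM_SSDS=3                       # number of local NVMe SSDs from step 7

cmd=(gcloud container node-pools create "$NEW_NODE_POOL_NAME"
     --cluster "${GKE_CLUSTER:-arrikto-cluster}"
     --machine-type "$MACHINE_TYPE"
     --num-nodes "$NUM_NODES"
     --local-ssd-count "$NUM_SSDS")
echo "${cmd[*]}"
```

Review the printed command against the Create Node Pool guide before executing it, since your cluster may need additional flags (for example, node labels or scopes).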

  9. Find the nodes of the old node pool:

    root@rok-tools:~# nodes=$(kubectl get nodes \
    >     -o jsonpath="{range .items[?(@.metadata.labels.cloud\.google\.com\/gke-nodepool==\"${NODE_POOL_NAME?}\")]}{.metadata.name}{\"\n\"}{end}") \
    >     && echo "${nodes?}"
    gke-arrikto-cluster-default-workers-a089030c-32wn
    gke-arrikto-cluster-default-workers-a089030c-7q2b
    gke-arrikto-cluster-default-workers-a089030c-gfc9
  10. Cordon the old nodes, that is, disable scheduling on them:

    root@rok-tools:~# for node in $nodes; do kubectl cordon $node; done
    node/gke-arrikto-cluster-default-workers-a089030c-32wn cordoned
    node/gke-arrikto-cluster-default-workers-a089030c-7q2b cordoned
    node/gke-arrikto-cluster-default-workers-a089030c-gfc9 cordoned
  11. Drain the old nodes one by one. Repeat substeps 1-4 for each one of the old nodes:

    1. Pick a node from the old node pool:

      root@rok-tools:~# export node=<NODE>

      Replace <NODE> with the node you want to drain. For example:

      root@rok-tools:~# export node=gke-arrikto-cluster-default-workers-a089030c-32wn
    2. Drain the node:

      root@rok-tools:~# kubectl drain --ignore-daemonsets --delete-local-data $node
      node/gke-arrikto-cluster-default-workers-a089030c-32wn already cordoned
      evicting pod "rok-redis-0"
      evicting pod "ml-pipeline-scheduledworkflow-7bddd546b-4f4j5"
      ...

      Note

      This may take a while, since Rok is unpinning all volumes on this node and will evict rok-csi-guard Pods last.

      Warning

      Do not delete rok-csi-guard Pods manually, since this might cause data loss.

      Troubleshooting

      The command does not complete.

      Most likely the unpinning of a Rok PVC fails. Inspect the logs of the Rok CSI Controller to debug further.

    3. Wait for the drain command to finish successfully.

    4. Ensure that all Pods that got evicted have migrated correctly and are up and running again.

      1. Ensure that Rok has scaled up and is up and running:

        root@rok-tools:~# kubectl get rokcluster -n rok rok \
        >     -o jsonpath='{.status.health}{"\n"}'
        OK
      2. Ensure that the rest of the Pods are running. Verify that the STATUS field is Running and the READY field is N/N for all Pods:

        root@rok-tools:~# kubectl get pods -A
        NAMESPACE      NAME                      READY   STATUS    RESTARTS   AGE
        auth           dex-0                     2/2     Running   0          1h
        cert-manager   cert-manager-686bcc964d   1/1     Running   0          1h
        ...

        Note

        rok-csi-guard Pods are expected to be in Pending status.

    5. Go back to substep 1, and repeat the steps for the remaining old nodes.
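
The per-node drain loop above can also be automated. This is a minimal sketch, assuming the nodes variable is set as in step 9; it waits for Rok to report healthy again before moving on to the next node:

```shell
#!/usr/bin/env bash
# Sketch: drain each old node in turn, waiting for the Rok cluster to
# report healthy before continuing. Assumes `nodes` is set as in step 9.
drain_and_wait() {
    local node="$1"
    kubectl drain --ignore-daemonsets --delete-local-data "$node"
    # Wait until the Rok cluster reports healthy again
    until [ "$(kubectl get rokcluster -n rok rok \
            -o jsonpath='{.status.health}')" = "OK" ]; do
        sleep 10
    done
    echo "drained $node"
}

for node in ${nodes:-}; do
    drain_and_wait "$node"
done
```

Even with the loop, verify between nodes that all evicted Pods have migrated correctly, as described above, before the next drain starts.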

  12. Delete the old node pool:

    root@rok-tools:~# gcloud container node-pools delete ${NODE_POOL_NAME?} \
    >     --cluster ${GKE_CLUSTER?}
    The following node pool will be deleted.
    [default-workers] in cluster [arrikto-cluster] in [us-east1-b]
    Do you want to continue (Y/n)? Y
    Deleting node pool default-workers...done.
    Deleted [https://container.googleapis.com/v1/projects/myproject/zones/us-east1-b/clusters/arrikto-cluster/nodePools/default-workers].

Verify

  1. Ensure that all nodes in the new node pool are ready and run the new Kubernetes version. Verify that the STATUS field is Ready and the VERSION field shows the new Kubernetes version:

    root@rok-tools:~# kubectl get nodes
    NAME                                         STATUS   ROLES    AGE   VERSION
    gke-test-upgrade-new-workers-f929841f-02q5   Ready    <none>   78m   v1.22.12-gke.500
    gke-test-upgrade-new-workers-f929841f-mpdc   Ready    <none>   78m   v1.22.12-gke.500
    gke-test-upgrade-new-workers-f929841f-gr4x   Ready    <none>   78m   v1.22.12-gke.500
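
This verification can also be scripted. The following sketch fails if any node is not Ready or does not run the expected version; the EXPECTED_VERSION value below is a hypothetical example and should be set to the version you upgraded to:

```shell
#!/usr/bin/env bash
# Sketch: check every node's STATUS and VERSION columns against the
# expected values. EXPECTED_VERSION is a hypothetical example value.
EXPECTED_VERSION=${EXPECTED_VERSION:-v1.22.12-gke.500}

check_nodes() {
    kubectl get nodes --no-headers | awk -v v="$EXPECTED_VERSION" '
        $2 != "Ready" || $5 != v { print "unexpected: " $0; bad = 1 }
        END { exit bad }'
}
```

A zero exit status means every node is Ready and on the expected version; otherwise the offending lines are printed.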

Summary

You have successfully upgraded your node pool.

What’s Next

The next step is to update the maintenance exclusion of your GKE cluster.