Rok 1.3¶
Important
This version is no longer supported. Please follow Rok 1.4 to upgrade Rok to the latest version.
This guide describes the necessary steps to upgrade an existing Rok cluster on Kubernetes from version 1.2 to the latest version, 1.4.4.
Check Kubernetes Version¶
Rok 1.3 only supports Kubernetes versions 1.17, 1.18, and 1.19. Follow the instructions below to verify the Kubernetes version of your cluster before continuing with the upgrade.
Check your cluster version by inspecting the value of Server Version in the output of the following command:
root@rok-tools:/# kubectl version --short
Client Version: v1.18.19
Server Version: v1.19.13-eks-8df270
If your Server Version is v1.17.*, v1.18.*, or v1.19.*, you may proceed with the upgrade. If not, please follow the upgrade your cluster guide to upgrade your cluster to Kubernetes 1.17, 1.18, or 1.19.
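The version gate above can also be scripted. The following is a minimal sketch, not part of the Rok tooling: the function name k8s_version_supported is hypothetical, and it only pattern-matches the string that kubectl prints on the Server Version line.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given server version string
# belongs to one of the Kubernetes minor versions supported by Rok 1.3.
k8s_version_supported() {
    case "$1" in
        v1.17.*|v1.18.*|v1.19.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with the sample value shown above. On a live cluster the value
# could be taken from the third field of the "Server Version" line.
if k8s_version_supported "v1.19.13-eks-8df270"; then
    echo "supported"
else
    echo "unsupported"
fi
```

The trailing dot in each pattern keeps a version such as v1.170.0 from matching by accident.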
Upgrade your management environment¶
We assume that you have followed the Deploy Rok Components guide, and have
successfully set up a full-fledged rok-tools
management environment either
in local Docker or in Kubernetes.
Before proceeding with the core upgrade steps you need to first upgrade your
management environment, in order to use CLI tools and utilities, such as
rok-deploy
that are compatible with the Rok version you are upgrading to.
Important
When you upgrade your management environment all previous data (GitOps
repository, files, user settings, etc.) are preserved either in a Docker
volume or Kubernetes PVC, depending on your environment. This volume or PVC
is mounted in the new rok-tools
container so that old data is adopted.
For Kubernetes simply apply the latest rok-tools
manifests:
$ kubectl apply -f <download_root>/rok-tools-eks.yaml
Note
In case you see the following error:
The StatefulSet "rok-tools" is invalid: spec: Forbidden: updates to
statefulset spec for fields other than 'replicas', 'template', and
'updateStrategy' are forbidden
make sure you first delete the existing rok-tools
StatefulSet with:
root@rok-tools:/# kubectl delete sts rok-tools
and then re-apply.
For Docker first delete the old container:
root@rok-tools:/# docker stop <OLD_ROK_TOOLS_CONTAINER_ID>
root@rok-tools:/# docker rm <OLD_ROK_TOOLS_CONTAINER_ID>
and then create a new one with previous data and the new image:
root@rok-tools:/# docker run -ti \
> --name rok-tools \
> --hostname rok-tools \
> -p 8080:8080 \
> --entrypoint /bin/bash \
> -v $(pwd)/rok-tools-data:/root \
> -v /var/run/docker.sock:/var/run/docker.sock \
> -w /root \
> gcr.io/arrikto/rok-tools:release-1.4-l0-release-1.4.4
Upgrade manifests¶
We assume that you have followed the Deploy Rok Components guide, and have a local GitOps repo with Arrikto-provided manifests. Once Arrikto releases a new Rok version and pushes updated deployment manifests, you have to follow the standard GitOps workflow:
- Fetch latest upstream changes, pushed by Arrikto.
- Rebase local changes on top of the latest upstream ones and resolve conflicts, if any.
- Tweak manifests based on Arrikto-provided instructions, if necessary.
- Commit everything.
- Re-apply manifests.
When one initially deploys Rok on Kubernetes, either automatically using
rok-deploy
or manually, they end up with a deploy
overlay in each Rok
component or external service that is to be applied to Kubernetes. In the GitOps
deployment repository, Arrikto provides manifests that include the deploy
overlay in each Kustomize app/package as scaffold, so that users can quickly
start and set their preferences.
As a result, fetch/rebase might lead to conflicts since both Arrikto and the end-user might modify the same files that are tracked by Git. In this scenario, the most common and obvious solution is to keep the user's changes since they are the ones that reflect the existing deployment.
In case of breaking changes, e.g., parts of YAML documents that are absolutely necessary to perform the upgrade, or parts that have been deprecated, Arrikto will inform users via version-specific upgrade notes about all actions that need to be taken.
Note
It is the user's responsibility to apply valid manifests and kustomizations after a rebase. In case of uncertainty do not hesitate to coordinate with Arrikto's Tech Team for support.
We will use git
to update local manifests. You are about to rebase your work
on top of the latest pre-release branch. To favor local changes upon conflicts, we
will use the corresponding merge strategy option (-Xtheirs).
Important
Make sure you mirror the GitOps repo to a private remote to be able to recover it in case of failure.
To upgrade the manifests:
1. Go to your GitOps repository, inside your rok-tools management environment:
root@rok-tools:/# cd ~/ops/deployments
2. Fetch latest upstream changes:
root@rok-tools:~/ops/deployments# git fetch --all -p
Fetching origin
3. Ensure the release channel you are currently following is release-1.2:
root@rok-tools:~/ops/deployments# git rev-parse --abbrev-ref --symbolic-full-name @{u}
origin/release-1.2
If you are already following the release-1.3 release channel, you can skip to step 5.
4. Follow the Switch release channel section to update to the release-1.3 release channel. You can skip this step if you are already in the release-1.3 release channel.
5. Rebase on top of the latest pre-release version:
root@rok-tools:~/ops/deployments# git rebase -Xtheirs
Troubleshooting
CONFLICT (modify/delete)
Rebasing your work may cause conflicts when you have modified a file that has been removed from the latest version of Arrikto manifests. In such a case the rebase process will fail with:
CONFLICT (modify/delete): path/to/file deleted in origin/release-1.4 and modified in HEAD~X. Version HEAD~X of path/to/file left in tree.
Delete those files:
root@rok-tools:~/ops/deployments# git status --porcelain | \
> awk '{if ($1=="DU") print $2}' | \
> xargs git rm
Continue the rebase process:
root@rok-tools:~/ops/deployments# git rebase --continue
Air Gapped
- Follow the Mirror Images to Internal Registry guide to mirror all new images to your internal Docker registry.
- Follow the Patch All Images for Your Deployment guide to patch all kustomizations to use the mirrored images from your internal Docker registry.
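The release channel check in the steps above can be expressed as a small script. This is only a sketch under stated assumptions: the helper names release_channel and on_channel are hypothetical, and the sample value stands in for the output of git rev-parse --abbrev-ref --symbolic-full-name @{u}.

```shell
#!/bin/sh
# Hypothetical helper: print the release channel of an upstream ref,
# e.g. "origin/release-1.2" -> "release-1.2".
release_channel() {
    printf '%s\n' "${1#*/}"
}

# Hypothetical helper: succeed if the upstream ref already follows
# the target channel.
on_channel() {
    [ "$(release_channel "$1")" = "$2" ]
}

# Sample value taken from the command output shown in the steps above.
upstream="origin/release-1.2"
if on_channel "$upstream" "release-1.3"; then
    echo "already on release-1.3, continue with the rebase"
else
    echo "switch the release channel first"
fi
```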
Drain rok-csi nodes¶
To minimize disruption of Rok services, follow the instructions below to drain the Rok CSI nodes and wait for any pending Rok CSI operations to complete before performing the upgrade.
During the upgrade, any pending Rok tasks will be canceled, so it is advisable to run the following steps in a period of inactivity, e.g., when no pipelines or snapshot policies run. Since pausing or queuing everything is currently not an option, you can monitor the Rok logs and wait until nothing has been logged for about 30 seconds:
root@rok-tools:/# kubectl -n rok logs -l app=rok-csi-controller -c csi-controller -f --tail=100
Note
Finding a period of inactivity is the ideal scenario but, depending on the deployment, it may not be feasible, e.g., when tens of recurring pipelines are running. In such a case, the end user will simply see some of them fail.
Scale down the rok-operator StatefulSet:
root@rok-tools:/# kubectl -n rok-system scale sts rok-operator --replicas=0
Ensure rok-operator has scaled down to zero:
root@rok-tools:/# kubectl get sts rok-operator -n rok-system
Scale down the rok-csi-controller StatefulSet:
root@rok-tools:/# kubectl -n rok scale sts rok-csi-controller --replicas=0
Ensure rok-csi-controller has scaled down to zero:
root@rok-tools:/# kubectl get sts rok-csi-controller -n rok
Watch the rok-csi-node logs and ensure that all pending operations have finished, i.e., nothing has been logged for the last 30 seconds:
root@rok-tools:/# kubectl -n rok logs -l app=rok-csi-node -c csi-node -f --tail=100
Continue with the Upgrade components section.
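The "scaled down to zero" checks above can be automated by reading the READY column of kubectl get sts. The helper below is a hypothetical sketch, not a Rok utility. It assumes the usual NAME, READY, AGE column layout, and the sample line (including its AGE value) is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get sts <name>` output on stdin and
# succeed only if at least one StatefulSet is listed and every listed
# StatefulSet reports READY 0/0.
sts_scaled_down() {
    awk 'NR > 1 && NF { seen = 1; if ($2 != "0/0") bad = 1 }
         END { if (seen && !bad) exit 0; exit 1 }'
}

# Illustrative input. On a live cluster one would pipe in, for example:
#   kubectl get sts rok-operator -n rok-system | sts_scaled_down
if printf 'NAME READY AGE\nrok-operator 0/0 5d\n' | sts_scaled_down; then
    echo "scaled down"
else
    echo "still running"
fi
```

An empty listing counts as a failure, so a mistyped name or namespace is not mistaken for a successful scale-down.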
Upgrade components¶
We assume that you are already running a 1.2 Rok cluster on Kubernetes and that you also have access to the 1.4.4 kustomization tree you are upgrading to.
Since a Rok cluster on Kubernetes consists of multiple components, you need to upgrade each one of them. Throughout the guide, we will keep track of these components, as listed in the table below:
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | ✔ | |
Rok kmod | ✔ | |
During the upgrade, Rok Operator will remove all members from the cluster and add a dedicated one to perform the upgrade. The cluster will be scaled down to zero and a Kubernetes Job will run to upgrade the cluster config on etcd and run any needed migrations. Finally, the cluster will be scaled back up to its initial size.
1. Increase observability (optional)¶
To gain insight into the status of the cluster upgrade execute the following commands in a separate window:
For live cluster status:
root@rok-tools:/# watch kubectl get rokcluster -n rok
For live cluster events:
root@rok-tools:/# watch 'kubectl describe rokcluster -n rok rok | tail -n 20'
2. Inspect current version (optional)¶
Get current images and version from the RokCluster CR:
root@rok-tools:/# kubectl describe rokcluster rok -n rok
...
Spec:
  Images:
    Rok: gcr.io/arrikto-deploy/roke:release-1.2-l0-release-1.2
    Rok CSI: gcr.io/arrikto-deploy/rok-csi:release-1.2-l0-release-1.2
Status:
  Version: release-1.2-l0-release-1.2
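To record the current version before upgrading, the Version field can be extracted from the describe output. This is a minimal sketch. The helper name rok_version is hypothetical, and it assumes the Version field appears under Status as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl describe rokcluster` output on stdin
# and print the first "Version:" value that follows a "Status:" line.
rok_version() {
    awk '$1 == "Status:" { in_status = 1; next }
         in_status && $1 == "Version:" { print $2; exit }'
}

# Illustrative input copied from the excerpt above. On a live cluster:
#   kubectl describe rokcluster rok -n rok | rok_version
rok_version <<'EOF'
Spec:
  Images:
    Rok: gcr.io/arrikto-deploy/roke:release-1.2-l0-release-1.2
    Rok CSI: gcr.io/arrikto-deploy/rok-csi:release-1.2-l0-release-1.2
Status:
  Version: release-1.2-l0-release-1.2
EOF
```

Matching on the first field avoids picking up lines such as "API Version:", whose first field is "API".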
3. Upgrade Rok Disk Manager¶
Apply the latest Rok Disk Manager manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-disk-manager/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | ✔ | |
4. Upgrade Rok kmod¶
Apply the latest Rok kmod manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-kmod/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
5. Upgrade Rok cluster¶
Apply the latest Rok cluster manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-cluster/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | | ✔ |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
6. Upgrade Rok Operator¶
Apply the latest Operator manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-operator/overlays/deploy
Note
The above command also updates the RokCluster CRD.
After the manifests have been applied, ensure Rok Operator has become ready by running the following command:
root@rok-tools:/# watch kubectl get pods -n rok-system -l app=rok-operator
Component | old | new |
---|---|---|
RokCluster CR | | ✔ |
RokCluster CRD | | ✔ |
Rok Operator | | ✔ |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
7. Verify successful upgrade for Rok¶
Check the status of the cluster upgrade Job:
root@rok-tools:/# kubectl get job -n rok rok-upgrade-release-1.4-l0-release-1.4.4
Ensure that Rok is up and running after the upgrade Job finishes:
root@rok-tools:/# kubectl get rokcluster -n rok rok
Ensure all pods in the rok-system namespace are up-and-running:
root@rok-tools:/# kubectl get pods -n rok-system
Ensure all pods in the rok namespace are up-and-running:
root@rok-tools:/# kubectl get pods -n rok
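Instead of reading the pod listings by eye, the check can be scripted. The following sketch is an assumption-laden illustration rather than a Rok tool. The helper name unready_pods is hypothetical, the pod names in the sample are made up, and Pods in Completed status (such as finished Job Pods) are treated as healthy.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name of every pod that is not fully ready. Exits non-zero if any
# such pod was found.
unready_pods() {
    awk 'NR > 1 && NF {
             if ($3 == "Completed") next      # finished Job pods are fine
             split($2, ready, "/")            # READY column, e.g. "1/2"
             if ($3 != "Running" || ready[1] != ready[2]) {
                 print $1
                 bad = 1
             }
         }
         END { if (bad) exit 1 }'
}

# Illustrative input with made-up pod names. On a live cluster:
#   kubectl get pods -n rok | unready_pods
printf '%s\n' \
    'NAME   READY  STATUS     RESTARTS  AGE' \
    'pod-a  1/1    Running    0         1m' \
    'pod-b  1/2    Running    0         1m' \
    'job-c  0/1    Completed  0         1m' | unready_pods ||
    echo "some pods are not ready yet"
```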
Upgrade Istio¶
Rok 1.4.4 uses Istio 1.9.6. To upgrade from Istio 1.9.5 follow the next steps:
Verify that Istio is up-and-running. Check that field READY is 1/1 and field UP-TO-DATE is 1:
root@rok-tools:~/ops/deployments# kubectl get deployments -n istio-system
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
cluster-local-gateway   1/1     1            1           1m
istio-ingressgateway    1/1     1            1           1m
istiod                  1/1     1            1           1m
Apply the new Istio control plane:
root@rok-tools:~/ops/deployments# rok-deploy --apply install/istio
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-external-services/istio/istio-1-9/cluster-local-gateway/overlays/deploy
Confirm that the knative-serving and kubeflow namespaces, as well as all of the Kubeflow user namespaces (namespaces that start with kubeflow-), have Istio sidecar injection enabled. To do this, run the following command and confirm that these namespaces show up in its output:
root@rok-tools:~/ops/deployments# kubectl get ns -l istio-injection=enabled
NAME              STATUS   AGE
knative-serving   Active   5d16h
kubeflow          Active   5d16h
kubeflow-user     Active   5d16h
...
Upgrade the Istio sidecars, by deleting all Pods in the namespaces you found above. Istio will inject the new version sidecar once the owning controllers recreate the deleted Pods:
root@rok-tools:~/ops/deployments# kubectl get ns -l istio-injection=enabled --no-headers | \
> awk '{print $1}' | \
> xargs -n1 -I {} kubectl delete pod --all -n {}
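Since deleting every Pod in these namespaces is disruptive, it may help to print the delete commands for review before running them. The sketch below only generates text and does not contact the cluster. The helper name sidecar_restart_plan is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get ns --no-headers` output on stdin
# and print, without running it, one delete command per namespace.
sidecar_restart_plan() {
    awk 'NF { print "kubectl delete pod --all -n " $1 }'
}

# Illustrative input copied from the namespace listing above. On a live
# cluster one would pipe in:
#   kubectl get ns -l istio-injection=enabled --no-headers
printf '%s\n' \
    'knative-serving Active 5d16h' \
    'kubeflow Active 5d16h' \
    'kubeflow-user Active 5d16h' | sidecar_restart_plan
```

Once the printed plan looks right, the original one-liner above performs the same deletions.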
Upgrade cert-manager¶
Rok 1.4.4 uses cert-manager 1.3.1. To upgrade from cert-manager 0.11.0 follow the next steps:
Apply the new cert-manager manifests:
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/cert-manager/cert-manager/overlays/deploy --force --force-kinds Deployment
Remove the deprecated resources left by the previous version of cert-manager:
root@rok-tools:~/ops/deployments# rok-kf-prune --app cert-manager
Verify that cert-manager is up-and-running. Check that field READY is 1/1 for the corresponding deployments:
root@rok-tools:~/ops/deployments# kubectl get deploy -n cert-manager
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
cert-manager              1/1     1            1           1m
cert-manager-cainjector   1/1     1            1           1m
cert-manager-webhook      1/1     1            1           1m
Upgrade external services to protect System Pods¶
Rok 1.4.4 modified the manifests of the Rok external services to protect critical System Pods from OOM kills, evictions and CPU starvation. To apply these changes:
Apply the new manifests for the Rok external services:
root@rok-tools:~/ops/deployments# rok-deploy --apply \
> rok/rok-external-services/{etcd,postgresql,redis}/overlays/deploy
Verify that Rok external services are up-and-running. Check that field READY is 1/1 for the corresponding resources:
root@rok-tools:~/ops/deployments# kubectl get sts -n rok
NAME             READY   AGE
rok-etcd         1/1     1m
rok-postgresql   1/1     1m
rok-redis        1/1     1m
Apply the new manifests for the Dex/authservice external services:
Warning
Skip this step if you have deployed Kubeflow in your cluster.
root@rok-tools:~/ops/deployments# rok-deploy --apply \
> rok/rok-external-services/{dex,authservice}/overlays/deploy
Verify that the Dex/authservice external services are up-and-running. Check that field READY is 1/1 for the corresponding resources:
Warning
Skip this step if you have deployed Kubeflow in your cluster.
root@rok-tools:~/ops/deployments# kubectl get sts -n istio-system authservice
NAME          READY   AVAILABLE   AGE
authservice   1/1     1           1m
root@rok-tools:~/ops/deployments# kubectl get deploy -n auth
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
dex    1/1     1            1           1m
(Azure only) Apply the new manifests for S3Proxy:
root@rok-tools:~/ops/deployments# rok-deploy --apply \
> rok/rok-external-services/s3proxy/overlays/deploy
(Azure only) Verify that the S3 Proxy is up-and-running. Check that field READY is n, where n is the number of nodes in your cluster:
root@rok-tools:~/ops/deployments# kubectl get ds -n rok rok-s3proxy
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
rok-s3proxy   3         3         3       3            3           <none>          1m
Apply the new manifests for the NGINX Ingress Controller:
Note
If you haven't deployed the NGINX Ingress Controller, you may proceed to the Upgrade Kubeflow manifests section.
root@rok-tools:~/ops/deployments# rok-deploy --apply \
> rok/nginx-ingress-controller/overlays/deploy
Verify that the NGINX Ingress Controller is up-and-running. Check that field READY is 1/1 for the corresponding resources:
root@rok-tools:~/ops/deployments# kubectl get deploy -n ingress-nginx
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
nginx-ingress-controller   1/1     1            1           1m
Upgrade Kubeflow manifests¶
This section describes how to upgrade Kubeflow. If you have not deployed Kubeflow in your cluster, you can safely skip this section.
Run the following command to update your Kubeflow installation:
root@rok-tools:~/ops/deployments# rok-deploy --apply install/kubeflow
Upgrade Notebooks for EKF 1.3¶
Modify all Notebooks that are using the deprecated kale-hp-tuning-image PodDefault, and upgrade them to the latest kale-python-image one, which uses a global Python image as a shim for Kale:
root@rok-tools:~/ops/deployments# rok-notebook-upgrade \
> --label-selector kale-hp-tuning-image=true \
> --image gcr.io/arrikto/jupyter-kale-py36:release-1.3-l0-release-1.3-rc3-2-gadb3c0c1a \
> --remove-pod-default kale-hp-tuning-image \
> --add-pod-default kale-python-image
Delete the deprecated PodDefault from all namespaces:
root@rok-tools:~/ops/deployments# NAMESPACES=$(ls kubeflow/manifests/common/namespace-resources/overlays)
root@rok-tools:~/ops/deployments# for ns in $NAMESPACES; do \
> kubectl -n $ns delete poddefaults.kubeflow.org kale-hp-tuning-image; \
> done
Upgrade JWA configuration¶
Update the Jupyter Web App configuration to use the new kale-python-image PodDefault instead of kale-hp-tuning-image, which has been removed.
Warning
Skip this section if you haven't previously enabled the
kale-hp-tuning-image
PodDefault.
Edit kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml and, in the configurations list, replace the deprecated kale-hp-tuning-image label with the kale-python-image one:
configurations:
  value:
  - access-rok
  - access-ml-pipeline
  - kale-hp-tuning-image  # <-- Remove this line
  - kale-python-image  # <-- Add this line
The final result should look like this:
configurations:
  value:
  - access-rok
  - access-ml-pipeline
  - kale-python-image
Commit changes:
root@rok-tools:~/ops/deployments# git commit -am "Update enabled JWA PodDefaults"
Apply changes:
root@rok-tools:~/ops/deployments# rok-deploy --apply \
> kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy
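The manual edit of config-map.yaml described in this section could also be scripted. The following is a rough sketch, assuming the labels appear as plain "- name" list items as in the excerpt above. The helper name swap_poddefault is hypothetical, and the YAML indentation in the sample is an assumption. The helper prints the result to standard output so that it can be reviewed before the file is overwritten.

```shell
#!/bin/sh
# Hypothetical helper: print FILE with the deprecated PodDefault label
# replaced. If the new label is already listed, the old line is dropped
# instead, so the new label never appears twice.
swap_poddefault() {
    if grep -q -e '- kale-python-image' "$1"; then
        sed '/- kale-hp-tuning-image/d' "$1"
    else
        sed 's/- kale-hp-tuning-image/- kale-python-image/' "$1"
    fi
}

# Illustrative run against a temporary copy of the configurations list.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
configurations:
  value:
  - access-rok
  - access-ml-pipeline
  - kale-hp-tuning-image
EOF
swap_poddefault "$tmp"
rm -f "$tmp"
```

Reviewing the result with git diff before committing keeps the change auditable in the GitOps repository.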
Delete stale Kubeflow resources¶
Run the following command to remove the deprecated resources left by the previous version of Kubeflow:
root@rok-tools:~/ops/deployments# rok-kf-prune --app kubeflow
Verify successful upgrade¶
Follow the Test Kubeflow section to validate the updated Rok + EKF deployment.