Rok 1.2

This guide describes the necessary steps to upgrade an existing Rok cluster on Kubernetes from version 1.1 to the latest version, 1.4.4.

Check Kubernetes Version

Rok 1.2 only supports Kubernetes versions 1.17 and 1.18. Follow the instructions below to verify the Kubernetes version of your cluster before continuing with the upgrade.

  1. Check your cluster version by inspecting the value of Server Version in the following command:

    root@rok-tools:/# kubectl version --short
    Client Version: v1.17.17
    Server Version: v1.17.17-eks-c5067d
    
  2. If your Server Version is v1.17.* or v1.18.*, you may proceed with the upgrade. If not, please follow the upgrade your cluster guide to upgrade your cluster to Kubernetes 1.17.

Upgrade your management environment

We assume that you have followed the Deploy Rok Components guide, and have successfully set up a full-fledged rok-tools management environment either in local Docker or in Kubernetes.

Before proceeding with the core upgrade steps, you first need to upgrade your management environment, so that you use CLI tools and utilities, such as rok-deploy, that are compatible with the Rok version you are upgrading to.

Important

When you upgrade your management environment, all previous data (GitOps repository, files, user settings, etc.) is preserved, either in a Docker volume or in a Kubernetes PVC, depending on your environment. This volume or PVC is mounted in the new rok-tools container, so the old data is carried over.
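
If you want to confirm this before upgrading, you can check that the data volume is still in place. A quick sketch, assuming the default names from the deployment guide (adjust the PVC name or host directory to match your setup):

$ kubectl get pvc -A | grep rok-tools    # Kubernetes: the rok-tools data PVC
$ ls rok-tools-data/                     # Docker: the host directory mounted at /root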

For Kubernetes simply apply the latest rok-tools manifests:

$ kubectl apply -f <download_root>/rok-tools-eks.yaml

Note

In case you see the following error:

The StatefulSet "rok-tools" is invalid: spec: Forbidden: updates to
statefulset spec for fields other than 'replicas', 'template', and
'updateStrategy' are forbidden

make sure you first delete the existing rok-tools StatefulSet with:

$ kubectl delete sts rok-tools

and then re-apply.
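
In either case, once the apply succeeds you can optionally wait for the new rok-tools Pod to roll out:

$ kubectl rollout status statefulset rok-tools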

For Docker first delete the old container:

$ docker stop <OLD_ROK_TOOLS_CONTAINER_ID>
$ docker rm <OLD_ROK_TOOLS_CONTAINER_ID>

and then create a new one with previous data and the new image:

$ docker run -ti \
>     --name rok-tools \
>     --hostname rok-tools \
>     -p 8080:8080 \
>     --entrypoint /bin/bash \
>     -v $(pwd)/rok-tools-data:/root \
>     -v /var/run/docker.sock:/var/run/docker.sock \
>     -w /root \
>     gcr.io/arrikto/rok-tools:release-1.4-l0-release-1.4.4
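
From a separate terminal on the Docker host, you can optionally verify that the new container is running the expected image (the name rok-tools matches the --name flag above):

$ docker ps --filter name=rok-tools --format '{{.Names}}: {{.Image}}'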

Upgrade manifests

We assume that you have followed the Deploy Rok Components guide, and have a local GitOps repo with Arrikto-provided manifests. Once Arrikto releases a new Rok version and pushes updated deployment manifests, you have to follow the standard GitOps workflow:

  1. Fetch latest upstream changes, pushed by Arrikto.
  2. Rebase local changes on top of the latest upstream ones and resolve conflicts, if any.
  3. Tweak manifests based on Arrikto-provided instructions, if necessary.
  4. Commit everything.
  5. Re-apply manifests.

When one initially deploys Rok on Kubernetes, either automatically using rok-deploy or manually, they end up with a deploy overlay in each Rok component or external service, which is what gets applied to Kubernetes. In the GitOps deployment repository, Arrikto provides manifests that include the deploy overlay in each Kustomize app/package as a scaffold, so that users can quickly get started and set their preferences.

As a result, fetch/rebase might lead to conflicts since both Arrikto and the end-user might modify the same files that are tracked by Git. In this scenario, the most common and obvious solution is to keep the user's changes since they are the ones that reflect the existing deployment.

In case of breaking changes, e.g., parts of YAML documents that are absolutely necessary to perform the upgrade, or others that have been deprecated, Arrikto will inform users via version-specific upgrade notes about all actions that need to be taken.

Note

It is the user's responsibility to apply valid manifests and kustomizations after a rebase. In case of uncertainty do not hesitate to coordinate with Arrikto's Tech Team for support.

We will use git to update the local manifests. You are about to rebase your work on top of the latest pre-release branch. To favor local changes upon conflicts, we will use the corresponding merge strategy option (-Xtheirs).

Important

Make sure you mirror the GitOps repo to a private remote to be able to recover it in case of failure.
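
For example, assuming you have already created an empty private repository to serve as the mirror (the remote name and URL below are placeholders):

root@rok-tools:~/ops/deployments# git remote add mirror <YOUR_PRIVATE_REPO_URL>
root@rok-tools:~/ops/deployments# git push mirror --all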

To upgrade the manifests:

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:/# cd ~/ops/deployments
    
  2. Save the current branch:

    root@rok-tools:~/ops/deployments# export OLD_BRANCH="$(git rev-parse --abbrev-ref HEAD)"
    
  3. Fetch latest upstream changes:

    root@rok-tools:~/ops/deployments# git fetch --all -p
    Fetching origin
    
  4. Ensure the release channel you are currently following is release-1.1:

    root@rok-tools:~/ops/deployments# git rev-parse --abbrev-ref --symbolic-full-name @{u}
    origin/release-1.1
    

    If you are following the release-1.2 release channel already, you can skip to step 6.

  5. Follow the Switch release channel section to update to the release-1.2 release channel. You can skip this step if you are already in the release-1.2 release channel.

  6. Rebase on top of the latest pre-release version (if the rebase stops on a conflict, see the note after this list):

    root@rok-tools:~/ops/deployments# git rebase -Xtheirs
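
If the rebase stops because -Xtheirs could not resolve a conflict automatically (for example, on a modify/delete conflict), inspect and resolve it with standard git commands and continue, or abort and return to your previous state:

root@rok-tools:~/ops/deployments# git status            # inspect the conflicting files
root@rok-tools:~/ops/deployments# git rebase --continue # after resolving and staging them
root@rok-tools:~/ops/deployments# git rebase --abort    # or abandon the rebase entirely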
    

Update Jupyter Web App Config for Kubeflow 1.3

In Kubeflow 1.3, Jupyter Web App's ConfigMap in the deploy overlay has changed, and a rebase will result in an invalid configuration. To upgrade the Jupyter Web App configuration:

  1. Reset the configuration to the default upstream one:

    root@rok-tools:~/ops/deployments# cp kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/ekf/patches/config-map.yaml kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
    
  2. Commit your changes:

    root@rok-tools:~/ops/deployments# git commit -am "kubeflow: Reset Jupyter Web App config to 1.3 upstream"
    
  3. View your previous changes, so that you can easily apply them again:

    root@rok-tools:~/ops/deployments# git diff origin/release-1.1...$OLD_BRANCH -- kubeflow/manifests/jupyter/jupyter-web-app/overlays/deploy/patches/config-map.yaml
    
  4. Edit the Jupyter Web App configuration and re-apply your old changes, as you saw them above:

    root@rok-tools:~/ops/deployments# vim kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
    

    Important

    In Kubeflow 1.3, the spawnerFormDefaults.image.readOnly field was renamed to spawnerFormDefaults.allowCustomImage. If you have changed the spawnerFormDefaults.image.readOnly field, make sure to modify spawnerFormDefaults.allowCustomImage accordingly.

  5. Commit the new configuration:

    root@rok-tools:~/ops/deployments# git commit -am "kubeflow: Update Jupyter Web App config"
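
Optionally, review what you just committed to make sure only your intended changes sit on top of the upstream default (plain git; the path is the one you edited above):

root@rok-tools:~/ops/deployments# git show HEAD -- \
>     kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml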
    

Drain rok-csi nodes

To ensure minimal disruption of Rok services, please follow the instructions below to drain the Rok CSI nodes and wait for any pending Rok CSI operations to complete before performing the upgrade.

During the upgrade, any pending Rok tasks will be canceled, so it is advisable to run the following steps in a period of inactivity, e.g., when no pipelines or snapshot policies run. Since pausing or queuing everything is currently not an option, you can monitor the Rok logs and wait until nothing has been logged for, say, 30 seconds:

root@rok-tools:~/ops/deployments# kubectl -n rok logs -l app=rok-csi-controller -c csi-controller -f --tail=100

Note

Finding a period of inactivity is an ideal scenario that, depending on the deployment, may not be feasible, e.g., when tens of recurring pipelines are running. In that case the end user will simply see some of them fail.
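
If you have Kubeflow Pipelines deployed and want a quick look at whether any runs are currently executing, you can list the underlying Argo Workflows (this assumes the workflows CRD shipped with Kubeflow Pipelines is installed):

root@rok-tools:~/ops/deployments# kubectl get workflows --all-namespaces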

  1. Scale down the rok-operator StatefulSet:

    root@rok-tools:~/ops/deployments# kubectl -n rok-system scale sts rok-operator --replicas=0
    
  2. Ensure that rok-operator has scaled down to zero:

    root@rok-tools:~/ops/deployments# kubectl get sts rok-operator -n rok-system
    
  3. Scale down the rok-csi-controller StatefulSet:

    root@rok-tools:~/ops/deployments# kubectl -n rok scale sts rok-csi-controller --replicas=0
    
  4. Ensure that rok-csi-controller has scaled down to zero:

    root@rok-tools:~/ops/deployments# kubectl get sts rok-csi-controller -n rok
    
  5. Watch the rok-csi-node logs and ensure that all pending operations have finished, i.e., nothing has been logged for the last 30 seconds:

    root@rok-tools:~/ops/deployments# kubectl -n rok logs -l app=rok-csi-node -c csi-node -f --tail=100
    
  6. Continue with the Upgrade components section.

Upgrade components

We assume that you are already running a 1.1 Rok cluster on Kubernetes and that you also have access to the 1.4.4 kustomization tree you are upgrading to.

Since a Rok cluster on Kubernetes consists of multiple components, you need to upgrade each one of them. The components you will upgrade are listed in the table below; the old and new columns track each component's version before and after the upgrade:

Component          old   new
----------------   ---   ---
RokCluster CR
RokCluster CRD
Rok Operator
Rok Disk Manager
Rok kmod

During the upgrade, Rok Operator will remove all members from the cluster and add a dedicated one to perform the upgrade. The cluster will be scaled down to zero and a Kubernetes Job will run to upgrade the cluster config on etcd and run any needed migrations. Finally, the cluster will be scaled back up to its initial size.

1. Increase observability (optional)

To gain insight into the status of the cluster upgrade, execute the following commands in a separate window:

  • For live cluster status:

    root@rok-tools:~/ops/deployments# watch kubectl get rokcluster -n rok
    
  • For live cluster events:

    root@rok-tools:~/ops/deployments# watch 'kubectl describe rokcluster -n rok rok | tail -n 20'
    

2. Inspect current version (optional)

Get current images and version from the RokCluster CR:

root@rok-tools:~/ops/deployments# kubectl describe rokcluster rok -n rok
...
Spec:
  Images:
    Rok:      gcr.io/arrikto-deploy/roke:l0-release-v1.1
    Rok CSI:  gcr.io/arrikto-deploy/rok-csi:l0-release-v1.1
Status:
  Version:        release-1.1-l0-release-1.1

3. Upgrade Rok Disk Manager

Apply the latest Rok Disk Manager manifests:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-disk-manager/overlays/deploy

4. Upgrade Rok kmod

Apply the latest Rok kmod manifests:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-kmod/overlays/deploy

5. Upgrade Rok cluster

Apply the latest Rok cluster manifests:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-cluster/overlays/deploy

6. Upgrade Rok Operator

Apply the latest Operator manifests:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-operator/overlays/deploy

Note

The above command also updates the RokCluster CRD.

After the manifests have been applied, ensure Rok Operator has become ready by running the following command:

root@rok-tools:~/ops/deployments# watch kubectl get pods -n rok-system -l app=rok-operator

7. Verify successful upgrade for Rok

  1. Check the status of the cluster upgrade Job:

    root@rok-tools:~/ops/deployments# kubectl get job -n rok rok-upgrade-release-1.4-l0-release-1.4.4
    
  2. Ensure that Rok is up and running after the upgrade Job finishes:

    root@rok-tools:~/ops/deployments# kubectl get rokcluster -n rok rok
    
  3. Ensure all pods in the rok-system namespace are up and running:

    root@rok-tools:~/ops/deployments# kubectl get pods -n rok-system
    
  4. Ensure all pods in the rok namespace are up and running:

    root@rok-tools:~/ops/deployments# kubectl get pods -n rok
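
Optionally, you can also print the version that the cluster now reports, similar to the Version field you inspected earlier. This assumes the status field is named version, matching the Status section of the kubectl describe output shown above:

root@rok-tools:~/ops/deployments# kubectl get rokcluster rok -n rok -o jsonpath='{.status.version}'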
    

Upgrade NGINX Ingress Controller

This section describes how to upgrade the NGINX Ingress Controller. Run the following command to upgrade it:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/nginx-ingress-controller/overlays/deploy/

Upgrade Istio

Rok 1.4.4 uses Istio 1.9.5. To upgrade from Istio 1.5.7, follow the steps below:

  1. Delete the previous Istio control plane installation:

    root@rok-tools:~/ops/deployments# rok-deploy --delete \
    > rok/rok-external-services/istio/istio-1-5-7/istio-install-1-5-7/overlays/deploy \
    > rok/rok-external-services/istio/istio-1-5-7/cluster-local-gateway-1-5-7/overlays/deploy
    
  2. Apply the new Istio control plane:

    root@rok-tools:~/ops/deployments# rok-deploy --apply \
    > rok/rok-external-services/istio/istio-1-9/istio-crds/overlays/deploy \
    > rok/rok-external-services/istio/istio-1-9/istio-namespace/overlays/deploy \
    > rok/rok-external-services/istio/istio-1-9/istio-install/overlays/deploy \
    > rok/rok-external-services/istio/istio-1-9/cluster-local-gateway/overlays/deploy
    
  3. Delete deprecated resources:

    root@rok-tools:~/ops/deployments# rok-kf-prune --app istio
    
  4. Confirm that the knative-serving and kubeflow namespaces, as well as all of the kubeflow user namespaces (namespaces that start with kubeflow-), have Istio sidecar injection enabled. To do this, run the following command and confirm that these namespaces show up in its output:

    root@rok-tools:~/ops/deployments# kubectl get ns -l istio-injection=enabled
    NAME                        STATUS   AGE
    knative-serving             Active   5d16h
    kubeflow                    Active   5d16h
    kubeflow-user               Active   5d16h
    ...
    
  5. Upgrade the Istio sidecars by deleting all Pods in the namespaces you found above. Istio will inject the new sidecar version once the owning controllers recreate the deleted Pods (see the verification sketch after this list):

    root@rok-tools:~/ops/deployments# kubectl get ns -l istio-injection=enabled --no-headers | \
    >     awk '{print $1}' | \
    >         xargs -n1 -I {} kubectl delete pod --all -n {}
    
  6. Follow the Expose Istio guide from scratch to reconfigure and re-apply the necessary resources. Choose based on your cloud provider and the load balancer type you use.

  7. Restart the AuthService:

    root@rok-tools:~/ops/deployments# kubectl rollout restart statefulset -n istio-system authservice
    

    Important

    At this point, AuthService will not be able to talk to Dex due to a missing AuthorizationPolicy. The Pod will not become ready until you upgrade Kubeflow.
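
To confirm that the Pods you restarted in step 5 picked up the new sidecar, you can list the container images per Pod in one of the injected namespaces; the sidecar image tag should now match the new Istio version (a sketch using kubectl's jsonpath output):

root@rok-tools:~/ops/deployments# kubectl get pods -n kubeflow \
>     -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].image}{"\n"}{end}'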

Upgrade Kubeflow manifests

Important

Kubeflow 1.3 includes a new version of Katib that is not backwards-compatible with previous Kubeflow versions. This means that you will lose all Experiment, Suggestion, and Trial CRs. If there are hyperparameter tuning jobs in progress, they will be deleted.

This section describes how to upgrade Kubeflow. If you have not deployed Kubeflow in your cluster, you can safely skip this section.

Run the following command to update your Kubeflow installation:

root@rok-tools:~/ops/deployments# rok-deploy --apply install/kubeflow --force --force-kinds CustomResourceDefinition Deployment StatefulSet

Restart Kubeflow Conversion Webhooks

During the upgrade, we update the KFServing and Knative CRs using conversion webhooks. Restart the corresponding Pods to allow Kubernetes to re-establish the connection with these webhooks.

  1. Delete the KFServing webhook Pod:

    root@rok-tools:~/ops/deployments# kubectl delete pods \
    >    -n kubeflow kfserving-controller-manager-0
    
  2. Delete the Knative webhook Pod:

    root@rok-tools:~/ops/deployments# kubectl delete pods \
    >    -n knative-serving -l role=webhook
    

Restart Kubeflow Admission Webhook

During the upgrade, we regenerate the Certificate for the admission webhook in order to change the Issuer. Restart the admission-webhook deployment so that it uses the new Certificate:

root@rok-tools:~/ops/deployments# kubectl rollout restart \
>     -n kubeflow deploy admission-webhook-deployment
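
You can optionally wait for the restart to complete before moving on:

root@rok-tools:~/ops/deployments# kubectl rollout status \
>     -n kubeflow deploy admission-webhook-deployment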

Delete stale Kubeflow resources

Run the following command to remove the deprecated resources left by the previous version of Kubeflow:

root@rok-tools:~/ops/deployments# rok-kf-prune --app kubeflow

Upgrade Notebooks for Kubeflow 1.3

Restart all Notebooks with access to Kubeflow Pipelines, in order to inject the new authentication token needed by the latest version of Kubeflow Pipelines:

root@rok-tools:~/ops/deployments# kubectl delete pods -l 'access-ml-pipeline=true, notebook-name' --all-namespaces
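
The deleted Pods are recreated automatically by their owning controllers. You can optionally watch them come back up, reusing the same label selector as above:

root@rok-tools:~/ops/deployments# kubectl get pods --all-namespaces -l 'access-ml-pipeline=true, notebook-name'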

Verify successful upgrade

Follow the Test Kubeflow section to validate the updated Rok + EKF deployment.