Rok 1.4 (unreleased)

This guide describes the necessary steps to upgrade an existing Rok cluster on Kubernetes from version 1.3 to the latest pre-release version, 1.4-rc8.

Check Kubernetes Version

Rok 1.4-rc8 only supports Kubernetes versions 1.18 and 1.19. Follow the instructions below to verify the Kubernetes version of your cluster before continuing with the upgrade.

  1. Check your cluster version by inspecting the value of Server Version in the output of the following command:

    root@rok-tools:/# kubectl version --short
    Client Version: v1.18.19
    Server Version: v1.19.13-eks-8df270
    
  2. If your Server Version is v1.18.* or v1.19.*, you may proceed with the upgrade. If not, please follow the upgrade your cluster guide to upgrade your cluster to Kubernetes 1.18 or 1.19.
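As a quick sanity check, the version test above can be scripted. The sketch below parses a sample Server Version line (standing in for the live kubectl output, which varies per cluster) and verifies that the minor version is 18 or 19:

```shell
# Extract the Kubernetes minor version from a "Server Version" line and check
# that it is one of the supported releases. The sample string stands in for
# the live output of `kubectl version --short`.
server_version="Server Version: v1.19.13-eks-8df270"
minor=$(echo "$server_version" | sed 's/.*v1\.\([0-9]*\)\..*/\1/')
if [ "$minor" = "18" ] || [ "$minor" = "19" ]; then
    echo "OK: Kubernetes 1.$minor is supported"
else
    echo "Unsupported Kubernetes version: 1.$minor" >&2
fi
```

On a live cluster you would set server_version from the actual command output, e.g. `kubectl version --short | grep 'Server Version'`.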

Upgrade your management environment

We assume that you have followed the Deploy Rok Components guide, and have successfully set up a full-fledged rok-tools management environment either in local Docker or in Kubernetes.

Before proceeding with the core upgrade steps, you first need to upgrade your management environment, so that you use CLI tools and utilities, such as rok-deploy, that are compatible with the Rok version you are upgrading to.

Important

When you upgrade your management environment, all previous data (GitOps repository, files, user settings, etc.) is preserved in either a Docker volume or a Kubernetes PVC, depending on your environment. This volume or PVC is mounted in the new rok-tools container, so the old data carries over.

For Kubernetes, simply apply the latest rok-tools manifests:

$ kubectl apply -f <download_root>/rok-tools-eks.yaml

Note

In case you see the following error:

The StatefulSet "rok-tools" is invalid: spec: Forbidden: updates to
statefulset spec for fields other than 'replicas', 'template', and
'updateStrategy' are forbidden

make sure you first delete the existing rok-tools StatefulSet with:

root@rok-tools:/# kubectl delete sts rok-tools

and then re-apply.

For Docker, first delete the old container:

$ docker stop <OLD_ROK_TOOLS_CONTAINER_ID>
$ docker rm <OLD_ROK_TOOLS_CONTAINER_ID>

and then create a new one with the previous data and the new image:

$ docker run -ti \
>     --name rok-tools \
>     --hostname rok-tools \
>     -p 8080:8080 \
>     --entrypoint /bin/bash \
>     -v $(pwd)/rok-tools-data:/root \
>     -v /var/run/docker.sock:/var/run/docker.sock \
>     -w /root \
>     gcr.io/arrikto/rok-tools:release-1.4-l0-release-1.4-rc8

Upgrade manifests

We assume that you have followed the Deploy Rok Components guide, and have a local GitOps repo with Arrikto-provided manifests. Once Arrikto releases a new Rok version and pushes updated deployment manifests, you have to follow the standard GitOps workflow:

  1. Fetch latest upstream changes, pushed by Arrikto.
  2. Rebase local changes on top of the latest upstream ones and resolve conflicts, if any.
  3. Edit manifests based on Arrikto-provided instructions, if necessary.
  4. Commit everything.
  5. Re-apply manifests.
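The conflict-resolution behavior of this workflow can be reproduced in a throwaway repository. In the self-contained sketch below (all repository and file names are made up for the demo), both "upstream" and the local clone edit the same file, and rebasing with the -Xtheirs strategy used later in this guide keeps the local change:

```shell
# Throwaway demonstration of the fetch/rebase workflow. "upstream" stands in
# for the Arrikto-provided manifests; "local" is the user's GitOps clone.
set -e
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com \
       GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q upstream
echo "replicas: 1" > upstream/deploy.yaml
git -C upstream add deploy.yaml
git -C upstream commit -q -m "initial manifests"

git clone -q upstream local                    # the user's GitOps clone
echo "replicas: 3" > local/deploy.yaml         # local preference
git -C local commit -q -am "local: set replicas"

echo "replicas: 2" > upstream/deploy.yaml      # conflicting upstream release
git -C upstream commit -q -am "upstream: new default"

git -C local fetch -q                          # step 1: fetch
git -C local rebase -q -Xtheirs                # step 2: rebase, favoring local
cat local/deploy.yaml                          # → replicas: 3
```

Note that during a rebase the "theirs" side is the commit being replayed, i.e., your local change, which is why -Xtheirs favors the existing deployment's settings.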

When one initially deploys Rok on Kubernetes, either automatically using rok-deploy or manually, they end up with a deploy overlay in each Rok component or external service that is to be applied to Kubernetes. In the GitOps deployment repository, Arrikto provides manifests that include the deploy overlay in each Kustomize app/package as scaffolding, so that users can quickly get started and set their preferences.

As a result, fetch/rebase might lead to conflicts, since both Arrikto and the end user might modify the same files tracked by Git. In this scenario, the most common and obvious solution is to keep the user's changes, since they reflect the existing deployment.

In case of breaking changes, e.g., parts of YAML documents that are absolutely necessary to perform the upgrade, or others that have been deprecated, Arrikto will inform users via version-specific upgrade notes of all actions that need to be taken.

Note

It is the user's responsibility to apply valid manifests and kustomizations after a rebase. In case of uncertainty, do not hesitate to coordinate with Arrikto's Tech Team for support.

We will use git to update the local manifests. You are about to rebase your work on top of the latest pre-release branch. To favor local changes upon conflicts, we will use the corresponding merge strategy option (-Xtheirs).

Important

Make sure you mirror the GitOps repo to a private remote to be able to recover it in case of failure.
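One way to do this is to add a second remote and mirror all refs to it. The sketch below is self-contained: a local bare repository and the remote name "backup" are placeholders — in practice you would point the remote at your actual private Git server URL:

```shell
# Mirror a GitOps repo to a second remote for disaster recovery. A local bare
# repository stands in for the private remote; replace its path with the URL
# of your real private mirror.
set -e
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com \
       GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q deployments
git -C deployments commit -q --allow-empty -m "manifests"

git init -q --bare mirror.git                  # stand-in for the private remote
git -C deployments remote add backup "$work/mirror.git"
git -C deployments push -q --mirror backup     # copies every ref, so the repo
                                               # can be fully recovered later
```

`push --mirror` copies all branches and tags, so the mirror can restore the repository even if the rebase goes wrong.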

To upgrade the manifests:

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:/# cd ~/ops/deployments
    
  2. Fetch latest upstream changes:

    root@rok-tools:~/ops/deployments# git fetch --all -p
    Fetching origin
    
  3. Ensure the release channel you are currently following is release-1.3:

    root@rok-tools:~/ops/deployments# git rev-parse --abbrev-ref --symbolic-full-name @{u}
    origin/release-1.3
    

    If you are following the release-1.4 release channel already, you can skip to step 5.

  4. Follow the Switch release channel section to update to the release-1.4 release channel.

  5. Rebase on top of the latest pre-release version:

    root@rok-tools:~/ops/deployments# git rebase -Xtheirs
    

    Troubleshooting

    CONFLICT (modify/delete)

    Rebasing your work may cause conflicts when you have modified a file that has been removed from the latest version of the Arrikto manifests. In such a case, the rebase process will fail with:

    CONFLICT (modify/delete): path/to/file deleted in origin/release-1.4 and modified in HEAD~X. Version HEAD~X of path/to/file left in tree.
    
    1. Go ahead and delete those files:

      root@rok-tools:~/ops/deployments# git status --porcelain | \
      >     awk '{if ($1=="DU") print $2}' | \
      >     xargs git rm
      
    2. Continue the rebase process:

      root@rok-tools:~/ops/deployments# git rebase --continue
      
  6. (Optional) Edit deploy overlays based on version-specific upgrade notes.
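The modify/delete troubleshooting in step 5 can be reproduced end to end in a scratch repository. In this sketch (file names invented for the demo), upstream deletes a manifest that the local branch had modified, the rebase stops with CONFLICT (modify/delete), and the git rm / git rebase --continue sequence resolves it:

```shell
set -e
work=$(mktemp -d); cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com \
       GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com \
       GIT_EDITOR=true

# Upstream ships two manifests.
git init -q upstream
echo old > upstream/kept.yaml
echo old > upstream/removed.yaml
git -C upstream add . && git -C upstream commit -q -m "initial manifests"

# The local clone modifies both files.
git clone -q upstream local
echo local > local/kept.yaml
echo local > local/removed.yaml
git -C local commit -q -am "local changes"

# A new upstream release drops one of them.
git -C upstream rm -q removed.yaml
git -C upstream commit -q -m "drop a manifest"

# The rebase stops on the modify/delete conflict...
git -C local fetch -q
git -C local rebase -Xtheirs || true

# ...which is resolved by deleting the conflicting files and continuing,
# exactly as in the troubleshooting steps above.
cd local
git status --porcelain | awk '{if ($1=="DU") print $2}' | xargs git rm
git rebase --continue
```

After the rebase completes, the deleted manifest is gone while the local change to the surviving file is kept.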

Migrate from namespace-resources to skel-resources

In Arrikto EKF 1.4, we have introduced skel-resources, an automated way of managing resources available to all user namespaces. To migrate from namespace-resources to skel-resources:

  1. Switch to your GitOps repository:

    root@rok-tools:/# cd ~/ops/deployments
    
  2. Run the migration tool and follow the on-screen instructions:

    root@rok-tools:~/ops/deployments# rok-skel-migrate
    

    Troubleshooting

    Modified default resources under namespace-resources/base

    In Arrikto EKF 1.4, all resources previously shipped under kubeflow/manifests/common/namespace-resources/base are now shipped under kubeflow/manifests/common/skel-resources/base. The skel-resources kustomization package provides more flexibility, as it supports a deploy overlay. However, to ensure that you receive future updates to skel-resources, you should not modify its base; make any modifications in the deploy overlay instead.

    In case you have made modifications to those resources, rok-skel-migrate will detect them and warn you. It also provides a diff, available under ~/.rok/skel-migrate, which you can view with the following command:

    root@rok-tools:/# cat <diff-file> | colordiff
    

    Replace <diff-file> with the actual name of the output file, for example:

    root@rok-tools:/# cat ae231133-a1b1fa26.diff | colordiff
    

    For each modified resource you have to:

    1. Create a Kustomize patch under skel-resources/overlays/deploy.
    2. Delete it from kubeflow/manifests/common/namespace-resources/base.
  3. Review the changes the migration tool made:

    1. Ensure the kubeflow/manifests/common/namespace-resources directory has the following structure:

      root@rok-tools:~/ops/deployments# tree kubeflow/manifests/common/namespace-resources
      kubeflow/manifests/common/namespace-resources
      |-- base
      |   `-- kustomization.yaml
      |-- kustomization.yaml.j2
      |-- profile.yaml.j2
      |-- rok-task-runner-role-binding.yaml.j2
      `-- snapshot-policy.yaml.j2
      

      Note

      If you have manually created any Profile CRs in the past, you should also have a profiles directory under kubeflow/manifests/common/namespace-resources, containing your Profile CRs.

    2. Ensure the kubeflow/manifests/common/namespace-resources/base/kustomization.yaml file is a no-op:

      root@rok-tools:~/ops/deployments# cat kubeflow/manifests/common/namespace-resources/base/kustomization.yaml
      apiVersion: kustomize.config.k8s.io/v1beta1
      kind: Kustomization
      
    3. Ensure the kubeflow/manifests/common/skel-resources directory has the following structure:

      root@rok-tools:~/ops/deployments# tree kubeflow/manifests/common/skel-resources
      kubeflow/manifests/common/skel-resources
      |-- base
      |   |-- kustomization.yaml
      |   |-- kustomizeconfig
      |   |   `-- images.yaml
      |   |-- namespace.yaml
      |   |-- notebook-backup.yaml
      |   |-- params.env
      |   |-- params.yaml
      |   |-- pipeline-runner.yaml
      |   |-- poddefault-kale-python-image.yaml
      |   |-- poddefault-ml-pipeline.yaml
      |   |-- poddefault-rok.yaml
      |   |-- rok-task-runner.yaml
      |   `-- secret-mlpipeline-minio.yaml
      `-- overlays
          `-- deploy
              |-- kustomization.yaml
              |-- ...  <-- Lines with your extra resources
              `-- patches
                  `-- notebook-backup.yaml
      
    4. Ensure the kubeflow/manifests/common/skel-resources/overlays/deploy/kustomization.yaml file includes the modifications you had previously done to kubeflow/manifests/common/namespace-resources/base/kustomization.yaml:

      root@rok-tools:~/ops/deployments# cat kubeflow/manifests/common/skel-resources/overlays/deploy/kustomization.yaml
      ...
      resources:
      - ../../base
      - ...  <-- Lines with your extra resources
      ...
      
    5. Ensure there is no change under kubeflow/manifests/common/skel-resources/base, that is, the following command produces no output:

      root@rok-tools:~/ops/deployments# git status | grep skel-resources/base
      
    6. Ensure there are no commits in your Git history modifying kubeflow/manifests/common/skel-resources/base:

      root@rok-tools:~/ops/deployments# git log --oneline \
      >   origin/release-1.4..HEAD -- kubeflow/manifests/common/skel-resources/base | \
      >   wc -l
      0
      
  4. Stage your changes:

    root@rok-tools:~/ops/deployments# git add \
    >   kubeflow/manifests/common/namespace-resources \
    >   kubeflow/manifests/common/skel-resources
    
  5. Commit your changes:

    root@rok-tools:~/ops/deployments# git commit -m "Migrate to skel-resources"
    

Update Jupyter Web App Config for Kubeflow 1.4

In Kubeflow 1.4, Jupyter Web App's ConfigMap in the deploy overlay has changed, and a rebase will result in an invalid configuration. To upgrade the Jupyter Web App configuration:

  1. Reset the configuration to the default upstream one:

    root@rok-tools:~/ops/deployments# git filter-branch --prune-empty --index-filter \
    >   'git checkout origin/release-1.4... -- kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml' \
    >   origin/release-1.4..HEAD
    Ref 'refs/heads/release-1.4' was rewritten
    
  2. Ensure there are no custom commits in your Git history modifying kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml:

    root@rok-tools:~/ops/deployments# git log --oneline origin/release-1.4..HEAD -- \
    >   kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml | wc -l
    0
    
  3. Remove the backup reference:

    root@rok-tools:~/ops/deployments# git update-ref -d refs/original/refs/heads/release-1.4
    
  4. View your previous changes, so that you can easily apply them again:

    root@rok-tools:~/ops/deployments# git diff origin/release-1.3...release-1.3 -- \
    >   kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
    
  5. Edit the Jupyter Web App configuration and re-apply your old changes, as you saw them above:

    root@rok-tools:~/ops/deployments# vim kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
    
  6. Commit the new configuration:

    root@rok-tools:~/ops/deployments# git commit -am "kubeflow: Update Jupyter Web App config"
    

Upgrade Rok

We assume that you are already running a 1.3 Rok cluster on Kubernetes and that you also have access to the 1.4-rc8 kustomization tree you are upgrading to.

Since a Rok cluster on Kubernetes consists of multiple components, you need to upgrade each one of them. Throughout the guide, we will keep track of these components, as listed in the table below:

Component          old   new
RokCluster CR       ✓
RokCluster CRD      ✓
Rok Operator        ✓
Rok Disk Manager    ✓
Rok kmod            ✓

During the upgrade, Rok Operator will remove all members from the cluster and add a dedicated one to perform the upgrade. The cluster will be scaled down to zero and a Kubernetes Job will run to upgrade the cluster config on etcd and run any needed migrations. Finally, the cluster will be scaled back up to its initial size.

1. Increase observability (optional)

To gain insight into the status of the cluster upgrade, execute the following commands in a separate window:

  • For live cluster status:

    root@rok-tools:/# watch kubectl get rokcluster -n rok
    
  • For live cluster events:

    root@rok-tools:/# watch 'kubectl describe rokcluster -n rok rok | tail -n 20'
    

2. Inspect current version (optional)

Get current images and version from the RokCluster CR:

root@rok-tools:/# kubectl describe rokcluster rok -n rok
...
Spec:
  Images:
    Rok:      gcr.io/arrikto-deploy/roke:release-1.3-l0-release-1.3
    Rok CSI:  gcr.io/arrikto-deploy/rok-csi:release-1.3-l0-release-1.3
Status:
  Version:        release-1.3-l0-release-1.3

3. Upgrade Rok Disk Manager

Apply the latest Rok Disk Manager manifests:

root@rok-tools:/# rok-deploy --apply rok/rok-disk-manager/overlays/deploy

After you have applied the manifests, ensure Rok Disk Manager has become ready by running the following command:

root@rok-tools:/# watch kubectl get pods -n rok-system -l name=rok-disk-manager

Component          old   new
RokCluster CR       ✓
RokCluster CRD      ✓
Rok Operator        ✓
Rok Disk Manager          ✓
Rok kmod            ✓

4. Upgrade Rok kmod

Apply the latest Rok kmod manifests:

root@rok-tools:/# rok-deploy --apply rok/rok-kmod/overlays/deploy

After you have applied the manifests, ensure Rok kmod has become ready by running the following command:

root@rok-tools:/# watch kubectl get pods -n rok-system -l app=rok-kmod

Component          old   new
RokCluster CR       ✓
RokCluster CRD      ✓
Rok Operator        ✓
Rok Disk Manager          ✓
Rok kmod                  ✓

5. Upgrade Rok Operator

Apply the latest Operator manifests:

root@rok-tools:/# rok-deploy --apply rok/rok-operator/overlays/deploy

Note

The above command also updates the RokCluster CRD.

After you have applied the manifests, ensure Rok Operator has become ready by running the following command:

root@rok-tools:/# watch kubectl get pods -n rok-system -l app=rok-operator

Component          old   new
RokCluster CR       ✓
RokCluster CRD            ✓
Rok Operator              ✓
Rok Disk Manager          ✓
Rok kmod                  ✓

6. Upgrade Rok cluster

Apply the latest Rok cluster manifests:

root@rok-tools:/# rok-deploy --apply rok/rok-cluster/overlays/deploy

After you have applied the manifests, ensure Rok cluster has been upgraded:

  1. Check the status of the cluster upgrade Job:

    root@rok-tools:/# kubectl get job -n rok rok-upgrade-release-1.4-l0-release-1.4-rc8
    
  2. Ensure that Rok is up and running after the upgrade Job finishes:

    root@rok-tools:/# kubectl get rokcluster -n rok rok
    
Component          old   new
RokCluster CR             ✓
RokCluster CRD            ✓
Rok Operator              ✓
Rok Disk Manager          ✓
Rok kmod                  ✓

7. Upgrade the rest of the Rok installation components

Apply the latest Rok manifests:

root@rok-tools:~/ops/deployments# rok-deploy --apply install/rok

8. Verify successful upgrade for Rok

  1. Ensure all pods in the rok-system namespace are up-and-running:

    root@rok-tools:/# kubectl get pods -n rok-system
    
  2. Ensure all pods in the rok namespace are up-and-running:

    root@rok-tools:/# kubectl get pods -n rok
    
  3. Ensure that Dex is up-and-running. Check that field READY is 1/1:

    root@rok-tools:~/ops/deployments# kubectl get deploy -n auth
    NAME   READY   UP-TO-DATE   AVAILABLE   AGE
    dex    1/1     1            1           1m
    
  4. Ensure that AuthService is up-and-running. Check that field READY is 1/1:

    root@rok-tools:~/ops/deployments# kubectl get sts -n istio-system authservice
    NAME             READY   AVAILABLE   AGE
    authservice      1/1     1           1m
    
  5. Ensure that Reception is up-and-running. Check that field READY is 1/1:

    root@rok-tools:~/ops/deployments# kubectl get deploy -n kubeflow kubeflow-reception
    NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
    kubeflow-reception   1/1     1            1           1m
    
  6. Ensure that the Profiles Controller is up-and-running. Check that field READY is 1/1:

    root@rok-tools:~/ops/deployments# kubectl get deploy -n kubeflow profiles-deployment
    NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
    profiles-deployment   1/1     1            1           1m
    

Upgrade Knative Serving CRs

EKF 1.4 uses Knative Serving 0.22.1, which no longer supports deprecated CR versions. To be able to apply the new manifests, you have to upgrade the storage version of existing Knative Serving CRs to the latest version of the CRD. To do so, follow the steps below:

  1. Create the Knative post-install Job:

    root@rok-tools:~/ops/deployments# rok-deploy --apply \
    >     kubeflow/manifests/common/knative/knative-serving-post-install-jobs/overlays/deploy
    
  2. Wait for the Job to complete:

    root@rok-tools:~/ops/deployments# kubectl wait \
    >     --timeout=10m --for=condition=complete \
    >     -n knative-serving job/storage-version-migration-serving
    job.batch/storage-version-migration-serving condition met
    
  3. Delete the completed job:

    root@rok-tools:~/ops/deployments# rok-deploy --delete \
    >     kubeflow/manifests/common/knative/knative-serving-post-install-jobs/overlays/deploy
    

Upgrade Kubeflow

This section describes how to upgrade Kubeflow. If you have not deployed Kubeflow in your cluster, you can safely skip this section.

Run the following command to update your Kubeflow installation:

root@rok-tools:~/ops/deployments# rok-deploy --apply install/kubeflow \
>     --force --force-kinds Deployment StatefulSet

Upgrade NGINX Ingress Controller

This section describes how to upgrade the NGINX Ingress Controller. Run the following command to upgrade it:

root@rok-tools:~/ops/deployments# rok-deploy --apply \
> rok/nginx-ingress-controller/overlays/deploy/

Upgrade Cluster Autoscaler

In EKF 1.4 we have restructured the kustomization for the Cluster Autoscaler and made it easier to configure. In addition, we now use a custom image, supporting automatic scale-in operations on clusters running Rok. This section describes how to upgrade your Cluster Autoscaler deployment.

Warning

If you have not deployed the Cluster Autoscaler using Arrikto's manifests, skip this section and proceed with Delete stale Kubeflow resources.

To upgrade your Cluster Autoscaler deployment follow the steps below:

  1. Ensure that, after the rebase, local changes exist only in the deploy overlay:

    root@rok-tools:~/ops/deployments# git diff --name-only \
    >     origin/release-1.4 -- rok/cluster-autoscaler/ | \
    >     grep -q -v -w deploy || echo OK
    OK
    
  2. Undo any local changes in Cluster Autoscaler kustomization.yaml:

    root@rok-tools:~/ops/deployments# git checkout origin/release-1.4 -- \
    >     rok/cluster-autoscaler/overlays/deploy/kustomization.yaml
    
  3. Set the necessary environment variables.

    1. Specify the name of your EKS cluster. Replace <EKS_CLUSTER> with your EKS cluster name:

      root@rok-tools:~/ops/deployments# export EKS_CLUSTER=<EKS_CLUSTER>
      
    2. Obtain the AWS account ID:

      root@rok-tools:~/ops/deployments# export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
      
    3. Obtain the name for the IAM Role that Cluster Autoscaler is currently using:

      root@rok-tools:~/ops/deployments# export AUTOSCALER_EKS_IAM_ROLE=$(kubectl get sa \
      >     -n kube-system cluster-autoscaler \
      >     -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}' | \
      >     cut -d/ -f2) && echo ${AUTOSCALER_EKS_IAM_ROLE?}
      cluster-autoscaler-arrikto-cluster
      
  4. Follow the Deploy Cluster Autoscaler guide from scratch to reconfigure and re-apply the necessary resources.
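As a side note, the jsonpath/cut pipeline in step 3 simply splits the role ARN on / and keeps the second field. You can see the mechanics on a sample ARN (the account ID and role name below are made up):

```shell
# Split a sample IAM role ARN on "/" and keep the role name, exactly as the
# `cut -d/ -f2` in step 3 does with the live ServiceAccount annotation.
role_arn="arn:aws:iam::123456789012:role/cluster-autoscaler-arrikto-cluster"
role_name=$(echo "$role_arn" | cut -d/ -f2)
echo "$role_name"    # → cluster-autoscaler-arrikto-cluster
```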

Delete stale Kubeflow resources

Run the following command to remove the deprecated resources left by the previous version of Kubeflow:

root@rok-tools:~/ops/deployments# rok-kf-prune --app kubeflow

Upgrade Notebooks for EKF 1.4

In Kubeflow 1.4, we have updated the access-ml-pipeline PodDefault to work with the latest KFP SDK that the new Jupyter Kale image ships with, and the rok-auth PodDefault to work with the latest changes introduced in Rok 1.4. This means that existing Notebooks running with the old Jupyter Kale image:

  • will not be able to use Rok to create snapshots.
  • will be unable to access Kubeflow Pipelines.
  • will fail to use InferenceServices for predictions.
  • will produce pipeline runs with broken visualizations.

This section will guide you through upgrading all existing Notebooks that use an old Jupyter Kale image. To do so:

  1. Specify the old notebook image:

    root@rok-tools:~/ops/deployments# export IMAGE_FILTER="*jupyter-kale-py36:*"
    
  2. Specify the new notebook image:

    root@rok-tools:~/ops/deployments# export IMAGE=gcr.io/arrikto/jupyter-kale-py36:release-1.4-l0-release-1.4-rc2-14-g394bad1e6
    
  3. Upgrade the existing Notebooks to use the new image:

    root@rok-tools:~/ops/deployments# rok-notebook-upgrade \
    >    --filter-image "${IMAGE_FILTER?}" \
    >    --image "${IMAGE?}"
    

Important

The above process updates only Notebooks using the jupyter-kale-py36 images. Notebooks with custom images should be upgraded manually to use an image with KFP SDK >= 1.7.0 and the latest version of Kale.
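To check whether a custom image meets the KFP SDK requirement, one option is a version-sort comparison, sketched below. The installed version here is a sample string; inside a running notebook you would take it from the actual pip metadata:

```shell
# Compare an installed KFP SDK version against the >= 1.7.0 requirement using
# version sort. "installed" is a sample value; inside a notebook you could set
# it with:  installed=$(pip show kfp | awk '/^Version:/{print $2}')
installed="1.7.2"
required="1.7.0"
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
    echo "KFP SDK $installed satisfies >= $required"
else
    echo "KFP SDK $installed is too old" >&2
fi
```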

Upgrade Notebook snapshot policies

Snapshot policies in Rok 1.4 directly snapshot Notebook CRs instead of Pods belonging to Notebooks. This means that the filters of snapshot policies for Notebooks should now use Notebook names rather than Pod names. Since Notebooks run as StatefulSets in Kubernetes, the difference between a Pod and a Notebook name is typically that the Pod name includes a -0 suffix.

Upgrading to Rok 1.4 will automatically update any existing equal or not_equal filters of Notebook snapshot policies to remove the -0 suffix. While this should cover the most common cases, if you have existing policies that define other filters, such as starts_with or ends_with, that include the name of a Pod, you should manually update them to remove the -0 suffix from the filter's value.
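The renaming described above boils down to stripping a trailing -0. For example (notebook name invented):

```shell
# A Notebook's single StatefulSet Pod is named <notebook-name>-0, so dropping
# a trailing "-0" turns a Pod-based filter value into a Notebook-based one.
pod_name="my-notebook-0"
notebook_name=$(echo "$pod_name" | sed 's/-0$//')
echo "$notebook_name"    # → my-notebook
```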

Verify successful upgrade

Follow the Test Kubeflow section to validate the updated Rok + EKF deployment.