Rok 1.4¶
This guide describes the necessary steps to upgrade an existing Rok cluster on Kubernetes from version 1.3 to the latest version, 1.4.4.
Check Kubernetes Version¶
Rok 1.4.4 only supports Kubernetes versions 1.18 and 1.19. Follow the instructions below to verify the Kubernetes version of your cluster before continuing with the upgrade.
Check your cluster version by inspecting the value of Server Version in the output of the following command:

root@rok-tools:/# kubectl version --short
Client Version: v1.18.19
Server Version: v1.19.13-eks-8df270

If your Server Version is v1.18.* or v1.19.*, you may proceed with the upgrade. If not, please follow the upgrade your cluster guide to upgrade your cluster to Kubernetes 1.18 or 1.19.
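If you want to script this check, the following one-liner is a minimal sketch; it assumes kubectl is configured against the cluster you are about to upgrade and simply greps the Server Version for a supported minor release:

root@rok-tools:/# kubectl version --short | grep '^Server Version' | \
>     grep -E 'v1\.(18|19)\.' && echo "Kubernetes version is supported"
Server Version: v1.19.13-eks-8df270
Kubernetes version is supported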
Upgrade your management environment¶
We assume that you have followed the Deploy Rok Components guide, and have successfully set up a full-fledged rok-tools management environment either in local Docker or in Kubernetes.

Before proceeding with the core upgrade steps, you first need to upgrade your management environment, so that you use CLI tools and utilities, such as rok-deploy, that are compatible with the Rok version you are upgrading to.
Important
When you upgrade your management environment, all previous data (GitOps repository, files, user settings, etc.) is preserved in either a Docker volume or a Kubernetes PVC, depending on your environment. This volume or PVC is mounted in the new rok-tools container, so the old data is carried over.
For Kubernetes, simply apply the latest rok-tools manifests:
$ kubectl apply -f <download_root>/rok-tools-eks.yaml
Note
In case you see the following error:
The StatefulSet "rok-tools" is invalid: spec: Forbidden: updates to
statefulset spec for fields other than 'replicas', 'template', and
'updateStrategy' are forbidden
make sure you first delete the existing rok-tools StatefulSet with:
root@rok-tools:/# kubectl delete sts rok-tools
and then re-apply.
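After applying the manifests, you can optionally verify that the rollout finished and that the new image is in use. This is a sketch: it assumes rok-tools runs in your current namespace (add -n <namespace> if you deployed it elsewhere) and that the Pod is named rok-tools-0, the default for a single-replica StatefulSet:

$ kubectl rollout status statefulset/rok-tools
$ kubectl get pod rok-tools-0 -o jsonpath='{.spec.containers[0].image}{"\n"}'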
For Docker, first delete the old container:
root@rok-tools:/# docker stop <OLD_ROK_TOOLS_CONTAINER_ID>
root@rok-tools:/# docker rm <OLD_ROK_TOOLS_CONTAINER_ID>
and then create a new one with previous data and the new image:
root@rok-tools:/# docker run -ti \
> --name rok-tools \
> --hostname rok-tools \
> -p 8080:8080 \
> --entrypoint /bin/bash \
> -v $(pwd)/rok-tools-data:/root \
> -v /var/run/docker.sock:/var/run/docker.sock \
> -w /root \
> gcr.io/arrikto/rok-tools:release-1.4-l0-release-1.4.4
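Note that exiting this shell stops the container, since the shell is its entrypoint. Assuming you kept the container name rok-tools, you can restart and re-attach to it later with:

$ docker start -ai rok-tools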
Upgrade manifests¶
We assume that you have followed the Deploy Rok Components guide, and have a local GitOps repo with Arrikto-provided manifests. Once Arrikto releases a new Rok version and pushes updated deployment manifests, you have to follow the standard GitOps workflow:
- Fetch latest upstream changes, pushed by Arrikto.
- Rebase local changes on top of the latest upstream ones and resolve conflicts, if any.
- Edit manifests based on Arrikto-provided instructions, if necessary.
- Commit everything.
- Re-apply manifests.
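In shell terms, the workflow boils down to something like the following condensed sketch; the placeholder <kustomization> stands for whichever overlays you need to re-apply, and the sections below give the exact, release-specific commands:

$ cd ~/ops/deployments                 # enter the GitOps repo
$ git fetch --all -p                   # fetch the latest Arrikto changes
$ git rebase -Xtheirs                  # rebase local work, favoring local changes on conflicts
$ # edit manifests per the version-specific upgrade notes, if needed
$ git commit -am "Upgrade manifests"   # commit the result
$ rok-deploy --apply <kustomization>   # re-apply the affected kustomizations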
When one initially deploys Rok on Kubernetes, either automatically using rok-deploy or manually, they end up with a deploy overlay in each Rok component or external service that is to be applied to Kubernetes. In the GitOps deployment repository, Arrikto provides manifests that include the deploy overlay in each Kustomize app/package as a scaffold, so that users can quickly get started and set their preferences.
As a result, fetch/rebase might lead to conflicts since both Arrikto and the end-user might modify the same files that are tracked by Git. In this scenario, the most common and obvious solution is to keep the user's changes since they are the ones that reflect the existing deployment.
In case of breaking changes, e.g., parts of YAML documents that are absolutely necessary to perform the upgrade, or others that might be deprecated, Arrikto will inform users via version-specific upgrade notes about all actions that need to be taken.
Note
It is the user's responsibility to apply valid manifests and kustomizations after a rebase. In case of uncertainty do not hesitate to coordinate with Arrikto's Tech Team for support.
We will use git to update local manifests. You are about to rebase your work on top of the latest pre-release branch. To favor local changes upon conflicts, we will use the corresponding merge strategy option.
Important
Make sure you mirror the GitOps repo to a private remote to be able to recover it in case of failure.
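If you have not set up such a mirror yet, the following is a minimal sketch; the remote name backup and the URL are placeholders for your own private remote:

root@rok-tools:~/ops/deployments# git remote add backup <PRIVATE_REMOTE_URL>
root@rok-tools:~/ops/deployments# git push backup --all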
To upgrade the manifests:
Go to your GitOps repository, inside your rok-tools management environment:

root@rok-tools:/# cd ~/ops/deployments
Fetch latest upstream changes:
root@rok-tools:~/ops/deployments# git fetch --all -p
Fetching origin
Ensure the release channel you are currently following is release-1.3:

root@rok-tools:~/ops/deployments# git rev-parse --abbrev-ref --symbolic-full-name @{u}
origin/release-1.3
If you are already following the release-1.4 release channel, you can skip to step 5.

Follow the Switch release channel section to update to the release-1.4 release channel. You can skip this step if you are already on the release-1.4 release channel.

Rebase on top of the latest pre-release version:
root@rok-tools:~/ops/deployments# git rebase -Xtheirs
Troubleshooting
CONFLICT (modify/delete)
Rebasing your work may cause conflicts when you have modified a file that has been removed from the latest version of Arrikto manifests. In such a case the rebase process will fail with:
CONFLICT (modify/delete): path/to/file deleted in origin/release-1.4 and modified in HEAD~X. Version HEAD~X of path/to/file left in tree.
Go ahead and delete those files:
root@rok-tools:~/ops/deployments# git status --porcelain | \
>     awk '{if ($1=="DU") print $2}' | \
>     xargs git rm
Continue the rebase process:
root@rok-tools:~/ops/deployments# git rebase --continue
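Once the rebase finishes, you can optionally confirm that your local branch now contains the latest upstream release commits. This sketch only checks ancestry and prints OK on success:

root@rok-tools:~/ops/deployments# git merge-base --is-ancestor origin/release-1.4 HEAD && echo OK
OK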
Migrate from namespace-resources to skel-resources¶
In Arrikto EKF 1.4, we have introduced skel-resources, an automated way of managing resources available to all user namespaces. To migrate from namespace-resources to skel-resources:
Switch to your GitOps repository:
root@rok-tools:/# cd ~/ops/deployments
Run the migration tool and follow the on-screen instructions:
root@rok-tools:~/ops/deployments# rok-skel-migrate
Troubleshooting
Modified default resources under namespace-resources/base

In Arrikto EKF 1.4, all resources previously shipped under kubeflow/manifests/common/namespace-resources/base are now shipped under kubeflow/manifests/common/skel-resources/base. The skel-resources kustomization package provides more flexibility, as it supports a deploy overlay. However, to ensure that you get future updates to skel-resources, you should not modify its base; make any modifications in the deploy overlay instead.

In case you have made modifications to those resources, rok-skel-migrate will detect it and warn you. It also provides a diff, available under ~/.rok/skel-migrate, which you can view with the following command:

root@rok-tools:/# cat <diff-file> | colordiff
Replace <diff-file> with the actual name of the output file, for example:
root@rok-tools:/# cat ae231133-a1b1fa26.diff | colordiff
For each modified resource you have to:

- Create a Kustomize patch under skel-resources/overlays/deploy.
- Delete it from kubeflow/manifests/common/namespace-resources/base.
Review the changes the migration tool made:
Ensure the kubeflow/manifests/common/namespace-resources directory has the following structure:

root@rok-tools:~/ops/deployments# tree kubeflow/manifests/common/namespace-resources
kubeflow/manifests/common/namespace-resources
|-- base
|   `-- kustomization.yaml
|-- kustomization.yaml.j2
|-- profile.yaml.j2
|-- rok-task-runner-role-binding.yaml.j2
`-- snapshot-policy.yaml.j2
Note
If you have manually created any Profile CRs in the past, you should also have a profiles directory under kubeflow/manifests/common/namespace-resources, containing your Profile CRs.

Ensure the kubeflow/manifests/common/namespace-resources/base/kustomization.yaml file is a no-op:

root@rok-tools:~/ops/deployments# cat kubeflow/manifests/common/namespace-resources/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
Ensure the kubeflow/manifests/common/skel-resources directory has the following structure:

root@rok-tools:~/ops/deployments# tree kubeflow/manifests/common/skel-resources
kubeflow/manifests/common/skel-resources
|-- base
|   |-- kustomization.yaml
|   |-- kustomizeconfig
|   |   `-- images.yaml
|   |-- namespace.yaml
|   |-- notebook-backup.yaml
|   |-- params.env
|   |-- params.yaml
|   |-- pipeline-runner.yaml
|   |-- poddefault-kale-python-image.yaml
|   |-- poddefault-ml-pipeline.yaml
|   |-- poddefault-rok.yaml
|   |-- rok-task-runner.yaml
|   `-- secret-mlpipeline-minio.yaml
`-- overlays
    `-- deploy
        |-- kustomization.yaml
        |-- ... <-- Lines with your extra resources
        `-- patches
            `-- notebook-backup.yaml
Ensure the kubeflow/manifests/common/skel-resources/overlays/deploy/kustomization.yaml file includes the modifications you had previously made to kubeflow/manifests/common/namespace-resources/base/kustomization.yaml:

root@rok-tools:~/ops/deployments# cat kubeflow/manifests/common/skel-resources/overlays/deploy/kustomization.yaml
...
resources:
- ../../base
- ... <-- Lines with your extra resources
...
Ensure there is no change under kubeflow/manifests/common/skel-resources/base, that is, the following command produces no output:

root@rok-tools:~/ops/deployments# git status | grep skel-resources/base
Ensure there are no commits in your Git history modifying kubeflow/manifests/common/skel-resources/base:

root@rok-tools:~/ops/deployments# git log --oneline \
>     origin/release-1.4..HEAD -- kubeflow/manifests/common/skel-resources/base | \
>     wc -l
0
Stage your changes:
root@rok-tools:~/ops/deployments# git add \
>     kubeflow/manifests/common/namespace-resources \
>     kubeflow/manifests/common/skel-resources
Commit your changes:
root@rok-tools:~/ops/deployments# git commit -m "Migrate to skel-resources"
Update Jupyter Web App Config for Kubeflow 1.4¶
In Kubeflow 1.4, Jupyter Web App's ConfigMap in the deploy overlay has changed, and a rebase will result in an invalid configuration. To upgrade the Jupyter Web App configuration:
Reset the configuration to the default upstream one:
root@rok-tools:~/ops/deployments# git filter-branch --prune-empty --index-filter \
>     'git checkout origin/release-1.4... -- kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml' \
>     origin/release-1.4..HEAD
Ref 'refs/heads/release-1.4' was rewritten
Ensure there are no custom commits in your Git history modifying kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml:

root@rok-tools:~/ops/deployments# git log --oneline origin/release-1.4..HEAD -- \
>     kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml | wc -l
0
Remove the backup reference:
root@rok-tools:~/ops/deployments# git update-ref -d refs/original/refs/heads/release-1.4
View your previous changes, so that you can easily apply them again:
root@rok-tools:~/ops/deployments# git diff origin/release-1.3...release-1.3 -- \ > kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
Edit the Jupyter Web App configuration and re-apply your old changes, as you saw them above:
root@rok-tools:~/ops/deployments# vim kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
Commit the new configuration:
root@rok-tools:~/ops/deployments# git commit -am "kubeflow: Update Jupyter Web App config"
Air Gapped
- Follow the Mirror Images to Internal Registry guide to mirror all new images to your internal Docker registry.
- Follow the Patch All Images for Your Deployment guide to patch all kustomizations to use the mirrored images from your internal Docker registry.
Upgrade Rok¶
We assume that you are already running a 1.3 Rok cluster on Kubernetes and that you also have access to the 1.4.4 kustomization tree you are upgrading to.
Since a Rok cluster on Kubernetes consists of multiple components, you need to upgrade each one of them. Throughout the guide, we will keep track of these components, as listed in the table below:
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | ✔ | |
Rok kmod | ✔ | |
During the upgrade, Rok Operator will remove all members from the cluster and add a dedicated one to perform the upgrade. The cluster will be scaled down to zero and a Kubernetes Job will run to upgrade the cluster config on etcd and run any needed migrations. Finally, the cluster will be scaled back up to its initial size.
1. Increase observability (optional)¶
To gain insight into the status of the cluster upgrade, execute the following commands in a separate window:
For live cluster status:
root@rok-tools:/# watch kubectl get rokcluster -n rok
For live cluster events:
root@rok-tools:/# watch 'kubectl describe rokcluster -n rok rok | tail -n 20'
2. Inspect current version (optional)¶
Get current images and version from the RokCluster CR:
root@rok-tools:/# kubectl describe rokcluster rok -n rok
...
Spec:
Images:
Rok: gcr.io/arrikto-deploy/roke:release-1.3-l0-release-1.3
Rok CSI: gcr.io/arrikto-deploy/rok-csi:release-1.3-l0-release-1.3
Status:
Version: release-1.3-l0-release-1.3
3. Upgrade Rok Disk Manager¶
Apply the latest Rok Disk Manager manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-disk-manager/overlays/deploy
After you have applied the manifests, ensure Rok Disk Manager has become ready by running the following command:
root@rok-tools:/# watch kubectl get pods -n rok-system -l name=rok-disk-manager
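The Rok Disk Manager Pods should eventually report READY 1/1 and STATUS Running on every node. The output below is only indicative; Pod names and counts will differ in your cluster:

NAME                     READY   STATUS    RESTARTS   AGE
rok-disk-manager-7x2kp   1/1     Running   0          2m
rok-disk-manager-9fjq4   1/1     Running   0          2m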
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | ✔ | |
4. Upgrade Rok kmod¶
Apply the latest Rok kmod manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-kmod/overlays/deploy
After you have applied the manifests, ensure Rok kmod has become ready by running the following command:
root@rok-tools:/# watch kubectl get pods -n rok-system -l app=rok-kmod
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
5. Upgrade Rok Operator¶
Apply the latest Operator manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-operator/overlays/deploy
Note
The above command also updates the RokCluster CRD.
After you have applied the manifests, ensure Rok Operator has become ready by running the following command:
root@rok-tools:/# watch kubectl get pods -n rok-system -l app=rok-operator
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | | ✔ |
Rok Operator | | ✔ |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
6. Upgrade Rok cluster¶
Apply the latest Rok cluster manifests:
root@rok-tools:/# rok-deploy --apply rok/rok-cluster/overlays/deploy
After you have applied the manifests, ensure Rok cluster has been upgraded:
Check the status of the cluster upgrade Job:
root@rok-tools:/# kubectl get job -n rok rok-upgrade-release-1.4-l0-release-1.4.4

Ensure that Rok is up and running after the upgrade Job finishes:
root@rok-tools:/# kubectl get rokcluster -n rok rok
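If you prefer to block until the upgrade Job completes instead of polling it, a kubectl wait invocation along these lines can be used; the 30-minute timeout is only an assumption, so adjust it to your environment:

root@rok-tools:/# kubectl wait --timeout=30m --for=condition=complete \
>     -n rok job/rok-upgrade-release-1.4-l0-release-1.4.4
job.batch/rok-upgrade-release-1.4-l0-release-1.4.4 condition met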
Component | old | new |
---|---|---|
RokCluster CR | | ✔ |
RokCluster CRD | | ✔ |
Rok Operator | | ✔ |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
7. Upgrade the rest of the Rok installation components¶
Apply the latest Rok manifests:
root@rok-tools:~/ops/deployments# rok-deploy --apply install/rok
8. Verify successful upgrade for Rok¶
Ensure all pods in the rok-system namespace are up-and-running:

root@rok-tools:/# kubectl get pods -n rok-system
Ensure all pods in the rok namespace are up-and-running:

root@rok-tools:/# kubectl get pods -n rok
Ensure that Dex is up-and-running. Check that field READY is 1/1:
root@rok-tools:~/ops/deployments# kubectl get deploy -n auth
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
dex    1/1     1            1           1m
Ensure that AuthService is up-and-running. Check that field READY is 1/1:
root@rok-tools:~/ops/deployments# kubectl get sts -n istio-system authservice
NAME          READY   AVAILABLE   AGE
authservice   1/1     1           1m
Ensure that Reception is up-and-running. Check that field READY is 1/1:
root@rok-tools:~/ops/deployments# kubectl get deploy -n kubeflow kubeflow-reception
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
kubeflow-reception   1/1     1            1           1m
Ensure that the Profiles Controller is up-and-running. Check that field READY is 1/1:
root@rok-tools:~/ops/deployments# kubectl get deploy -n kubeflow profiles-deployment
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
profiles-deployment   1/1     1            1           1m
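As an extra sanity check, you can list any Pods that are not in the Running or Succeeded phase. This is a sketch; run it against each namespace of interest (for example rok, rok-system, kubeflow) and expect it to find no resources:

root@rok-tools:~/ops/deployments# kubectl get pods -n kubeflow \
>     --field-selector=status.phase!=Running,status.phase!=Succeeded
No resources found in kubeflow namespace.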
Upgrade Knative Serving CRs¶
EKF 1.4 uses Knative Serving 0.22.1, which no longer supports deprecated CR versions. To be able to apply the new manifests, you have to upgrade the storage version of existing Knative Serving CRs to the latest version of the CRD. To do so, follow the steps below:
Create the Knative post-install Job:
root@rok-tools:~/ops/deployments# rok-deploy --apply \
>     kubeflow/manifests/common/knative/knative-serving-post-install-jobs/overlays/deploy
Wait for the Job to complete:
root@rok-tools:~/ops/deployments# kubectl wait \
>     --timeout=10m --for=condition=complete \
>     -n knative-serving job/storage-version-migration-serving
job.batch/storage-version-migration-serving condition met
Delete the completed job:
root@rok-tools:~/ops/deployments# rok-deploy --delete \
>     kubeflow/manifests/common/knative/knative-serving-post-install-jobs/overlays/deploy
Upgrade Rok Monitoring Stack¶
Apply the new manifests:
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/monitoring/overlays/deploy \
>     --force --force-kinds Deployment DaemonSet RoleBinding
Remove a stale RBAC resource that is left behind by the previous version of Rok:
root@rok-tools:~/ops/deployments# kubectl delete role -n monitoring kube-state-metrics \
>     --ignore-not-found
role.rbac.authorization.k8s.io "kube-state-metrics" deleted
Upgrade Kubeflow¶
This section describes how to upgrade Kubeflow. If you have not deployed Kubeflow in your cluster, you can safely skip this section.
Run the following command to update your Kubeflow installation:
root@rok-tools:~/ops/deployments# rok-deploy --apply install/kubeflow \
> --force --force-kinds Deployment StatefulSet
Upgrade NGINX Ingress Controller¶
This section describes how to upgrade the NGINX Ingress Controller. Run the following command to upgrade it:
root@rok-tools:~/ops/deployments# rok-deploy --apply \
> rok/nginx-ingress-controller/overlays/deploy/
Upgrade Cluster Autoscaler¶
In EKF 1.4 we have restructured the kustomization for the Cluster Autoscaler and made it easier to configure. In addition, we now use a custom image, supporting automatic scale-in operations on clusters running Rok. This section describes how to upgrade your Cluster Autoscaler deployment.
Warning
If you have not deployed the Cluster Autoscaler using Arrikto's manifests, skip this section and proceed with Delete stale Kubeflow resources.
To upgrade your Cluster Autoscaler deployment follow the steps below:
Ensure that, after the rebase, local changes exist only in the deploy overlay:

root@rok-tools:~/ops/deployments# git diff --name-only \
>     origin/release-1.4 -- rok/cluster-autoscaler/ | \
>     grep -q -v -w deploy || echo OK
OK
Undo any local changes in the Cluster Autoscaler kustomization.yaml:

root@rok-tools:~/ops/deployments# git checkout origin/release-1.4 -- \
>     rok/cluster-autoscaler/overlays/deploy/kustomization.yaml
Set the necessary environment.
Specify the name of your EKS cluster. Replace <EKS_CLUSTER> with your EKS cluster name:

root@rok-tools:~/ops/deployments# export EKS_CLUSTER=<EKS_CLUSTER>
Obtain the AWS account ID:
root@rok-tools:~/ops/deployments# export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
Obtain the name for the IAM Role that Cluster Autoscaler is currently using:
root@rok-tools:~/ops/deployments# export AUTOSCALER_EKS_IAM_ROLE=$(kubectl get sa \
>     -n kube-system cluster-autoscaler \
>     -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}' | \
>     cut -d/ -f2) && echo ${AUTOSCALER_EKS_IAM_ROLE?}
cluster-autoscaler-arrikto-cluster
Follow the Deploy Cluster Autoscaler guide from scratch to reconfigure and re-apply the necessary resources.
Delete stale Kubeflow resources¶
Run the following command to remove the deprecated resources left by the previous version of Kubeflow:
root@rok-tools:~/ops/deployments# rok-kf-prune --app kubeflow
Upgrade Notebooks for EKF 1.4¶
In Kubeflow 1.4, we have updated the access-ml-pipeline
PodDefault to work
with the latest KFP SDK that the new Jupyter Kale image ships with, and the
rok-auth
PodDefault to work with the latest changes introduced in Rok 1.4.
This means that existing Notebooks running with the old Jupyter Kale image:
- will not be able to use Rok to create snapshots.
- will be unable to access Kubeflow Pipelines.
- will fail to use InferenceServices for predictions.
- will produce pipeline runs with broken visualizations.
This section will guide you through upgrading all existing Notebooks that use an old Jupyter Kale image. To do so:
Specify the old notebook image:
root@rok-tools:~/ops/deployments# export IMAGE_FILTER="*jupyter-kale-py36:*"
Specify the new notebook image:
root@rok-tools:~/ops/deployments# export IMAGE=gcr.io/arrikto/jupyter-kale-py36:release-1.4-l0-release-1.4-rc2-14-g394bad1e6
Air Gapped
Use the mirrored jupyter-kale-py36 image from your internal registry. For example:

root@rok-tools:~/ops/deployments# export IMAGE=${INTERNAL_REGISTRY?}/gcr.io/arrikto/jupyter-kale-py36:release-1.4-l0-release-1.4-rc2-14-g394bad1e6
Upgrade the existing Notebooks to use the new image:
root@rok-tools:~/ops/deployments# rok-notebook-upgrade \
>     --filter-image "${IMAGE_FILTER?}" \
>     --image "${IMAGE?}"
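To verify which image each Notebook now uses, you can list the Notebook CRs along with their container images. This is a sketch that assumes the standard Kubeflow Notebook CRD layout:

root@rok-tools:~/ops/deployments# kubectl get notebooks --all-namespaces \
>     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'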
Important
The above process updates only Notebooks using the jupyter-kale-py36
images. Notebooks with custom images should be upgraded manually to use an
image with KFP SDK >= 1.7.0 and the latest version of Kale.
Upgrade Notebook snapshot policies¶
Snapshot policies in Rok 1.4 directly snapshot Notebook CRs instead of Pods
belonging to Notebooks. This means that the filters of snapshot policies for
Notebooks should now use Notebook names rather than Pod names. Since Notebooks
run as StatefulSets in Kubernetes, the difference between a Pod and a Notebook
name is typically that the Pod name includes a -0
suffix.
Upgrading to Rok 1.4 will automatically update any existing equal
or
not_equal
filters of Notebook snapshot policies to remove the -0
suffix.
While this should cover most common cases, if you have existing policies that
define other filters such as starts_with
or ends_with
that include the
name of a Pod, you should manually update them to remove the -0
suffix from
the filter's value.
Verify successful upgrade¶
Follow the Test Kubeflow section to validate the updated Rok + EKF deployment.