Rok 1.1¶
This guide describes the necessary steps to upgrade an existing Rok cluster on Kubernetes from version 1.0 to the latest version, 1.4.4.
Check Kubernetes Version¶
Rok 1.1 only supports Kubernetes version 1.17. Follow the instructions below to verify the Kubernetes version of your cluster before continuing with the upgrade.
Check your cluster version by inspecting the value of Server Version in the output of the following command:
root@rok-tools:/# kubectl version --short
Client Version: v1.16.9
Server Version: v1.17.17-eks-c5067d
If your Server Version is v1.17.*, you may proceed with the upgrade. If not, please follow the upgrade your cluster guide to upgrade your cluster to Kubernetes 1.17.
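If you prefer to check this programmatically, the following one-liner is a minimal sketch that simply filters the kubectl output for the Server Version line; it assumes the --short output format shown above:
$ kubectl version --short | grep 'Server Version'
Server Version: v1.17.17-eks-c5067d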
Upgrade your management environment¶
We assume that you have followed the Deploy Rok Components guide, and have successfully set up a full-fledged rok-tools management environment, either in local Docker or in Kubernetes.
Before proceeding with the core upgrade steps, you first need to upgrade your management environment, so that you use CLI tools and utilities, such as rok-deploy, that are compatible with the Rok version you are upgrading to.
Important
When you upgrade your management environment, all previous data (GitOps repository, files, user settings, etc.) is preserved, either in a Docker volume or in a Kubernetes PVC, depending on your environment. This volume or PVC is mounted in the new rok-tools container, so that the old data is adopted.
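If you want to double-check that this data actually exists before upgrading, the commands below are a minimal sketch; the grep pattern for the PVC name is an assumption, and the rok-tools-data directory is the one used in the Docker example later in this section:
# Kubernetes: look for the PVC that backs the rok-tools StatefulSet
$ kubectl get pvc | grep rok-tools
# Docker: the data lives in the directory bind-mounted into the container
$ ls $(pwd)/rok-tools-data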
For Kubernetes, simply apply the latest rok-tools manifests:
$ kubectl apply -f <download_root>/rok-tools-eks.yaml
Note
In case you see the following error:
The StatefulSet "rok-tools" is invalid: spec: Forbidden: updates to
statefulset spec for fields other than 'replicas', 'template', and
'updateStrategy' are forbidden
make sure you first delete the existing rok-tools StatefulSet with:
$ kubectl delete sts rok-tools
and then re-apply.
For Docker, first delete the old container:
$ docker stop <OLD_ROK_TOOLS_CONTAINER_ID>
$ docker rm <OLD_ROK_TOOLS_CONTAINER_ID>
and then create a new one with previous data and the new image:
$ docker run -ti \
> -p 8080:8080 \
> --entrypoint /bin/bash \
> -v $(pwd)/rok-tools-data:/root \
> gcr.io/arrikto/rok-tools:release-1.4-l0-release-1.4.4
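To confirm that the new container is running the upgraded image, you can list the running containers filtered by that image, e.g.:
$ docker ps --filter ancestor=gcr.io/arrikto/rok-tools:release-1.4-l0-release-1.4.4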
Upgrade manifests¶
We assume that you have followed the Deploy Rok Components guide, and have a local GitOps repo with Arrikto-provided manifests. Once Arrikto releases a new Rok version and pushes updated deployment manifests, you have to follow the standard GitOps workflow:
- Fetch latest upstream changes, pushed by Arrikto.
- Rebase local changes on top of the latest upstream ones and resolve conflicts, if any.
- Tweak manifests based on Arrikto-provided instructions, if necessary.
- Commit everything.
- Re-apply manifests.
When one initially deploys Rok on Kubernetes, either automatically using rok-deploy or manually, they end up with a deploy overlay in each Rok component or external service that is to be applied to Kubernetes. In the GitOps deployment repository, Arrikto provides manifests that include the deploy overlay in each Kustomize app/package as scaffolding, so that users can quickly get started and set their preferences.
As a result, fetch/rebase might lead to conflicts since both Arrikto and the end-user might modify the same files that are tracked by Git. In this scenario, the most common and obvious solution is to keep the user's changes since they are the ones that reflect the existing deployment.
In case of breaking changes, e.g., parts of YAML documents that are absolutely necessary to perform the upgrade, or others that might be deprecated, Arrikto will inform users via version-specific upgrade notes about all actions that need to be taken.
Note
It is the user's responsibility to apply valid manifests and kustomizations after a rebase. In case of uncertainty, do not hesitate to coordinate with Arrikto's Tech Team for support.
We will use git to update the local manifests. You are about to rebase your work on top of the latest pre-release branch. To favor local changes upon conflicts, we will use the corresponding merge strategy option.
Important
During a rebase the sides are swapped, i.e., ours is the so-far rebased series, and theirs is the working branch. For more information on the merge strategy, read the official git-scm docs.
To upgrade the manifests:
Fetch latest upstream changes:
$ git fetch --all -p
Retrieve the release channel you are currently following. This is actually the upstream branch of your GitOps repo. To retrieve it, run:
$ git rev-parse --abbrev-ref --symbolic-full-name @{u}
origin/release-1.0
Important
If you are currently on the release-1.0 channel (i.e., your branch is based on origin/release-1.0) and want to update to origin/release-1.1, follow the Switch release channel section before proceeding.
Rebase on top of the latest pre-release version:
$ git rebase -Xtheirs
Rebasing your work may cause conflicts, e.g., when a file was modified locally but removed upstream:
CONFLICT (modify/delete): kubeflow/kfctl_config.yaml deleted in origin/develop and modified in HEAD~61. Version HEAD~61 of kubeflow/kfctl_config.yaml left in tree.
We suggest you go ahead and delete those files, e.g.:
$ git status --porcelain | awk '{if ($1=="DU") print $2}' | xargs git rm
And proceed with the rebase:
$ git rebase --continue
(Optional) Edit the deploy overlays based on the version-specific upgrade notes.
Commit changes, if any.
Important
Make sure you mirror the GitOps repo to a private remote to be able to recover it in any case.
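As a minimal sketch, assuming you have already created an empty private repository and substituting its URL below, mirroring boils down to adding a second remote and pushing all branches and tags to it:
$ git remote add mirror <PRIVATE_REMOTE_URL>
$ git push mirror --all
$ git push mirror --tags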
Drain rok-csi nodes¶
To ensure minimal disruption of Rok services, please follow the instructions below to drain the Rok CSI nodes, and wait for any pending Rok CSI operations to complete, before performing the upgrade.
During the upgrade, any pending Rok tasks will be canceled, so it is advisable to run the following steps in a period of inactivity, e.g., when no pipelines or snapshot policies run. Since pausing/queuing everything is currently not an option, one can monitor Rok logs and wait until nothing has been logged for, let's say, 30 secs:
$ kubectl -n rok logs -l app=rok-csi-controller -c csi-controller -f --tail=100
Note
Finding a period of inactivity is an ideal scenario that, depending on the deployment, may not be feasible, e.g., when tens of recurring pipelines are running. In such a case, the end-user will simply see some of them fail.
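If you prefer to wait for such a quiet period from the command line instead of eyeballing the logs, the loop below is a minimal sketch: it keeps polling until the rok-csi-controller has logged nothing in the last 30 seconds. It assumes the same label selector and container name used above:
$ while [ -n "$(kubectl -n rok logs -l app=rok-csi-controller -c csi-controller --since=30s --tail=100)" ]; do
>     echo "Rok CSI is still active, waiting..."
>     sleep 30
> done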
Scale down the rok-operator StatefulSet:
$ kubectl -n rok-system scale sts rok-operator --replicas=0
Ensure rok-operator has scaled down to zero:
$ kubectl get sts rok-operator -n rok-system
Scale down the rok-csi-controller StatefulSet:
$ kubectl -n rok scale sts rok-csi-controller --replicas=0
Ensure rok-csi-controller has scaled down to zero:
$ kubectl get sts rok-csi-controller -n rok
Watch the rok-csi-node logs and ensure that all pending operations have finished, i.e., nothing has been logged for the last 30 secs:
$ kubectl -n rok logs -l app=rok-csi-node -c csi-node -f --tail=100
Continue with the Upgrade components section.
Upgrade components¶
We assume that you are already running a v1.0 Rok cluster on Kubernetes and that you also have access to the 1.4.4 kustomization tree you are upgrading to.
Since a Rok cluster on Kubernetes consists of multiple components, you need to upgrade each one of them. Throughout the guide, we will keep track of these components, as listed in the table below:
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | ✔ | |
Rok kmod | ✔ | |
During the upgrade, Rok Operator will remove all members from the cluster and add a dedicated one to perform the upgrade. The cluster will be scaled down to zero and a Kubernetes Job will run to upgrade the cluster config on etcd and run any needed migrations. Finally, the cluster will be scaled back up to its initial size.
1. Increase observability (optional)¶
To gain insight into the status of the cluster upgrade, execute the following commands in a separate window:
For live cluster status:
$ watch kubectl get rokcluster -n rok
For live cluster events:
$ watch 'kubectl describe rokcluster -n rok rok | tail -n 20'
2. Inspect current version (optional)¶
Get current images and version from the RokCluster CR:
$ kubectl describe rokcluster rok -n rok
...
Spec:
Images:
Rok: gcr.io/arrikto-deploy/roke:l0-release-v1.0
Rok CSI: gcr.io/arrikto-deploy/rok-csi:l0-release-v1.0
Status:
Version: l0-release-v1.0
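If you only need the version string, e.g., for scripting, the one-liner below is a minimal sketch that reads it straight from the CR status; the .status.version field path is an assumption based on the output above:
$ kubectl get rokcluster rok -n rok -o jsonpath='{.status.version}'
l0-release-v1.0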
3. Upgrade Rok Disk Manager¶
Apply the latest Rok Disk Manager manifests:
$ kubectl apply -k rok/rok-disk-manager/overlays/deploy
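Optionally, verify that the Rok Disk Manager pods restart successfully with the new manifests; the label selector below is an assumption and may differ in your deployment:
$ watch kubectl get pods -n rok-system -l app=rok-disk-manager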
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | ✔ | |
4. Upgrade Rok kmod¶
Apply the latest Rok kmod manifests:
$ kubectl apply -k rok/rok-kmod/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
Important
Rok 1.4.4 ships with a new version of the dm-era module. As rok-kmod tries to load this new version, it is expected to fail (enter a CrashLoopBackOff), since the old version of the module is in use and cannot be unloaded. To fix this, you will have to perform a rolling reboot of all the nodes in the Kubernetes cluster after the upgrade has finished.
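If you want to observe this behavior, the command below is a minimal sketch for listing the rok-kmod pods and their status; the label selector is an assumption:
$ kubectl get pods -n rok-system -l app=rok-kmod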
5. Upgrade Rok cluster¶
Apply the latest Rok cluster manifests:
$ kubectl apply -k rok/rok-cluster/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | | ✔ |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
6. Upgrade Rok Operator¶
Apply the latest Operator manifests:
$ kubectl apply -k rok/rok-operator/overlays/deploy
Note
The above command also updates the RokCluster CRD.
After the manifests have been applied, ensure Rok Operator has become ready by running the following command:
$ watch kubectl get pods -n rok-system -l app=rok-operator
Component | old | new |
---|---|---|
RokCluster CR | | ✔ |
RokCluster CRD | | ✔ |
Rok Operator | | ✔ |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
7. Verify successful upgrade¶
Check the status of the cluster upgrade Job:
$ kubectl get job -n rok rok-upgrade-release-1.4-l0-release-1.4.4
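Instead of polling the Job manually, you can block until it completes; this is a minimal sketch assuming the Job name shown above and an indicative timeout:
$ kubectl wait --for=condition=complete -n rok \
>     job/rok-upgrade-release-1.4-l0-release-1.4.4 --timeout=30m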
Ensure that Rok is up and running after the upgrade Job finishes:
$ kubectl get rokcluster -n rok rok
Ensure all pods in the rok-system namespace are up-and-running:
$ kubectl get pods -n rok-system
Ensure all pods in the rok namespace are up-and-running:
$ kubectl get pods -n rok
Upgrade Kubeflow manifests¶
This section describes how to upgrade Kubeflow. If you have not deployed Kubeflow in your cluster, you can safely skip this section.
Kubeflow manifests are upgraded in a fetch/rebase/apply manner. Assuming that you have already gone through the Upgrade manifests section to update the Kubeflow manifests, all you need to do is re-apply them:
$ rok-deploy --apply install/kubeflow
Afterwards, make sure to validate the updated Kubeflow deployment by following the Test Kubeflow section.
Rolling reboot of Kubernetes cluster¶
Rok 1.4.4 ships with a new version of the dm-era module. In order for this module to be loaded, you will have to reboot all nodes in the Kubernetes cluster. Use the rok-k8s-reboot tool to perform a rolling reboot of all the nodes in the Kubernetes cluster, by spawning a Job on each node and rebooting them:
$ rok-k8s-reboot
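After the rolling reboot has completed, you can verify that all nodes have rejoined the cluster and report a Ready status:
$ kubectl get nodes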
Reset the CBT data of all Rok PVCs¶
Due to an upstream kernel bug in the dm-era module, the CBT (Changed Block Tracking) data that Rok maintains for each PVC might be incomplete, leading to corrupted snapshots.
To work around this you need to reset the CBT data of all Rok PVCs:
$ rok-reset-cbt
This ensures that the next snapshot of all Rok PVCs will copy the whole volume, thus avoiding data loss, and will reset the CBT data.
Upgrade Kubeflow resources¶
This section describes how to upgrade Kubeflow resources (Notebooks, Pipelines) so that they are compatible with Rok 1.4.4. If you have not deployed Kubeflow in your cluster, you can safely skip this section.
Rok 1.4.4 introduces a number of authentication and authorization improvements across its components that break backwards compatibility with the v1.0 Rok client. As a result, several Kubeflow workloads that rely on the Rok client, such as notebooks and pipelines, are expected to fail after the upgrade. To restore their functionality, you need to upgrade them to use the 1.4.4 Rok client, using the instructions in this section.
Upgrade notebooks¶
Jupyter notebooks with Kale support use the Rok client to create snapshots of their volumes when submitting pipelines. This feature is expected to break after upgrading to Rok 1.4.4. You can verify that this is the case by enabling the Kale panel from within the notebook, and checking that the Use this notebook's volumes and Take Rok snapshots during each step switches are disabled. Note that depending on which version of Kale you are using, these switches may be under the Advanced Settings section. Clicking on More info... in the switches' tooltip should reveal that the rok.check_rok_availability() method failed with either a 401 (Unauthorized) HTTP error, or a ValueError about not being able to retrieve the Rok API URL, depending on whether the notebook's pods have been restarted after the upgrade.
To fix the above errors, the Docker image of these notebooks needs to be updated to gcr.io/arrikto/jupyter-kale:v1.1-pre-207-g6c6282e72, which includes an up-to-date version of the Rok client. You can achieve this for all notebooks by executing the rok-notebook-upgrade script, which is available within the rok-tools management environment, and following the on-screen instructions:
$ rok-notebook-upgrade --image gcr.io/arrikto/jupyter-kale:v1.1-pre-207-g6c6282e72
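If you want to check which notebooks are still running an old image before or after executing the script, the following one-liner is a minimal sketch; it assumes the standard Kubeflow Notebook CRD, where the image is found under .spec.template.spec.containers[0].image:
$ kubectl get notebooks --all-namespaces \
>     -o 'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image'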
Alternatively, you can update the image of each notebook manually by snapshotting it and restoring it with the updated image. You can achieve this via the following steps:
- Visit the Snapshots tab of the Kubeflow dashboard.
- Create a new bucket or choose one of the existing buckets to store the created snapshot.
- Click the Snapshot button from within the bucket.
- Select the JupyterLab service.
- Provide the namespace, the name of the notebook and the name of the snapshot to create.
- Click Snapshot. Once the created task completes, a new snapshot will appear in the bucket with the name you selected.
- Visit the Notebooks tab of the Kubeflow dashboard.
- Delete the old notebook, if you intend to recreate it with the same name.
- Click on + NEW SERVER.
- Click on the file chooser in the Rok URL field and select the snapshot you just created.
- Provide a name for the notebook.
- Make sure Custom image is ticked, and set the image name to gcr.io/arrikto/jupyter-kale:v1.1-pre-207-g6c6282e72.
- Click LAUNCH to create the updated notebook.
Upgrade pipelines¶
Kubeflow pipelines that were created using Kale use the Rok client to create snapshots of their steps. These pipelines are also expected to stop working after the upgrade, and need to be resubmitted. You can confirm that this is the case by selecting the failed step of a failed pipeline run, visiting the Logs tab, and verifying that the step failed with the
ValueError: A Rok API URL must be provided either via the `url' argument or the `ROK_GW_URL' environment variable
error, due to not being able to retrieve the Rok API URL from the pod's environment.
You can fix this by re-creating the pipelines using the updated notebooks, via the following steps:
- Restore the notebook that was used to create the pipeline, if it no longer exists, changing its image to gcr.io/arrikto/jupyter-kale:v1.1-pre-207-g6c6282e72, by following the manual instructions in the Upgrade notebooks section. Note that, alternatively, if the notebook no longer exists but you still have the snapshots created by one of the pipeline's runs, you can also restore the notebook by using the Rok URL of one of these snapshots.
- Submit a new version of the pipeline using Kale. In order for the new pipeline to use the updated image, you need to change it manually from the Advanced Settings section of the Kale panel before submitting it. Note that leaving the image empty will also make Kale fall back to using the current notebook's image, which should suffice in this case.
Upgrade recurring runs¶
Any existing recurring runs on Kubeflow that use one of the failing pipelines described in section Upgrade pipelines need to also be recreated. You can check for the existence of such a run via the following steps:
- Visit the Jobs tab of the Kubeflow dashboard to see a summary of existing recurring runs.
- Click on the experiment each job belongs to and see if it contains any failed runs that belong to that job. Note that you can see the recurring run a run belongs to (if any) in the Recurring run column of this list.
- Click on the failed run to view its steps, then click on a step that failed, and verify that the error that caused the failure is the one mentioned in the Upgrade pipelines section.
To update such a failing recurring run, you can follow these steps:
- Update the pipeline the recurring run uses via the instructions of section Upgrade pipelines.
- Visit the Jobs tab of the Kubeflow dashboard and click on the recurring run you want to recreate.
- Click on Clone recurring run.
- Select the updated pipeline in the Pipeline field. Note that this should also select the latest version of that pipeline in the Pipeline Version field.
- Click Start to create the updated recurring run.
- Return to the Jobs tab of the Kubeflow dashboard, select the old failing recurring run and click Delete to delete it.
Upgrade ALB Ingress controller¶
Until now we were using the ALB Ingress Controller v1.1.4. In v2.0, it was renamed to AWS Load Balancer Controller.
To upgrade it, follow the official migration guide.
This boils down to:
Install the new controller from scratch. To do so please follow the corresponding guide.
Purge the old controller:
$ kubectl delete -f rok/alb-ingress-controller/deployment.yaml
$ kubectl delete -f rok/alb-ingress-controller/rbac.yaml
The existing Ingress resources will continue to work as expected.
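You can confirm that the existing Ingress resources are still present and keep their assigned addresses after the migration, e.g.:
$ kubectl get ingress --all-namespaces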
Upgrade NGINX Ingress Controller manifests¶
The rok/nginx-ingress-controller/service.yaml and rok/nginx-ingress-controller/ingress.yaml manifests are now deprecated and have been converted into a kustomization package. Please follow the installation guide for NGINX Ingress Controller to configure and re-apply the necessary resources.
Upgrade Istio manifests¶
The rok/istio/istio-ingress-nginx.yaml manifests are now deprecated and have been converted into a kustomization package. Please follow the installation guide for exposing Istio from scratch to configure and re-apply the necessary resources.