Back Up EKF cluster¶
In this guide you will use our rok-backup tool to snapshot all the EKF resources of an Arrikto EKF cluster, add them to Rok buckets, and publish the buckets to a Rok Registry. This backs up your EKF cluster so that you can later restore and migrate it to a new destination cluster.
What You’ll Need¶
- An existing Arrikto EKF and Rok Registry deployment.
- A Rok cluster registered to the Rok Registry.
- A Rok cluster configured for syncing.
- An issued token for a Rok Registry user.
- An EKF user to act as the admin EKF user.
- A privileged notebook server in the namespace of the admin EKF user.
- The latest Arrikto wheels installed in the notebook.
Procedure¶
Connect to the privileged notebook server and open a new terminal.
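The troubleshooting notes in this guide all come down to an executable that is missing from the notebook. A minimal sketch of an up-front check follows. The helper name `require_tools` is hypothetical and not part of the Arrikto tooling; it relies only on the shell built-in `command -v`.

```shell
# Hypothetical helper: report which of the given executables are missing
# from PATH. Returns non-zero if at least one is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_tools rok rok-backup kubectl jq j2 dialog` would list any of the commands used in this guide that are not installed.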
Set the Rok Registry token:
Read a line from the standard input:
jovyan@mynotebook-0:~$ read -s ROK_REGISTRY_TOKEN
Paste the Rok Registry token you issued by following the relevant guide.
Export the Rok Registry token:
jovyan@mynotebook-0:~$ export ROK_REGISTRY_TOKEN
Note
You can also provide the Rok Registry token in a file:
jovyan@mynotebook-0:~$ export ROK_REGISTRY_TOKEN="file:<PATH_TO_FILE>"
Replace <PATH_TO_FILE> with the path of your Rok Registry token, for example:
jovyan@mynotebook-0:~$ export ROK_REGISTRY_TOKEN="file:/home/jovyan/registry.token"
Set the Rok Registry URL:
jovyan@mynotebook-0:~$ export ROK_REGISTRY_URL=<URL>
Replace <URL> with the base URL of your Rok Registry installation. For example:
jovyan@mynotebook-0:~$ export ROK_REGISTRY_URL=https://arr-cluster.example.com/registry
Choose whether to start notebooks, depending on the version of your source cluster:
If your Rok cluster version is older than 1.4, start the notebooks before snapshotting them:
jovyan@mynotebook-0:~$ export START_NOTEBOOKS=true
jovyan@mynotebook-0:~$ export STOP_NOTEBOOKS=true
If your Rok cluster version is 1.4 or greater, do not start the notebooks before snapshotting them:
jovyan@mynotebook-0:~$ export START_NOTEBOOKS=false
jovyan@mynotebook-0:~$ export STOP_NOTEBOOKS=false
Set an identifier for the bucket prefix that rok-backup will use when creating the Rok and Rok Registry buckets:
jovyan@mynotebook-0:~$ export ROK_BUCKET_PREFIX=<MIGRATION_ID>
jovyan@mynotebook-0:~$ export ROK_REGISTRY_BUCKET_PREFIX=${ROK_BUCKET_PREFIX?}
Replace <MIGRATION_ID> with a custom unique name for the backup. For example, to include the date and a UID, run:
jovyan@mynotebook-0:~$ export ROK_BUCKET_PREFIX=$(python3 -c \
> 'import uuid, datetime; \
> print("cluster-migration-%s-%s" \
> % (datetime.date.today(), uuid.uuid4().hex[:5]))')
jovyan@mynotebook-0:~$ export ROK_REGISTRY_BUCKET_PREFIX=${ROK_BUCKET_PREFIX?}
Important
Use a unique name for ROK_BUCKET_PREFIX and ROK_REGISTRY_BUCKET_PREFIX. This prefix distinguishes this backup run from others. If you have already used the same identifier for ROK_BUCKET_PREFIX or ROK_REGISTRY_BUCKET_PREFIX in a previous backup run, you will overwrite the previous backup.
Run the backup script to snapshot the EKF resources and publish them to the Rok Registry. Choose one of the following options, depending on whether you want the script to get its configuration options through environment variables or through a preseed file.
Environment variables
Choose one of the following options, depending on whether you want to run the script interactively or non-interactively.
Note
In an interactive run you will be prompted for input, while in a non-interactive run you will not. If you have not explicitly specified an answer in a non-interactive run, rok-backup will assume the default answer. The log output is redirected to stdout.
Interactive
jovyan@mynotebook-0:~$ rok-backup
Troubleshooting
dialog.ExecutableNotFound
If the above command fails with an error message similar to the following:
dialog.ExecutableNotFound: Executable not found: can't find the executable for the dialog-like program
it means your notebook does not have the dialog package installed. You can install it with:
jovyan@mynotebook-0:~$ sudo apt install dialog
and retry the command.
Non-interactive
jovyan@mynotebook-0:~$ rok-backup \
> --frontend non-interactive
Preseed file
Copy the backup-preseed.py.j2 Jinja2 template into your privileged notebook:
backup-preseed.py.j2
# Copyright © 2022 Arrikto Inc. All Rights Reserved.

"""EKF Migration Backup Preseed File."""

SEEDS = {
    # Resources to back up
    'question/resources': ['bucket',
                           'katib',
                           'mlmd',
                           'model',
                           'notebook',
                           'pipeline',
                           'profile',
                           'pvc'],
    # The token to connect to Rok
    # 'question/rok_token': <protected>,
    # The URL of the Rok cluster
    'question/rok_url': 'http://rok.rok.svc.cluster.local',
    # The token to connect to Rok Registry
    'question/rok_registry_token': '{{ROK_REGISTRY_TOKEN}}',
    # The URL of the Rok Registry cluster
    'question/rok_registry_url': '{{ROK_REGISTRY_URL}}',
    # The prefix for the local Rok buckets
    'question/rok_bucket_prefix': 'cluster-migration',
    # The prefix for the Registry buckets
    'question/rok_registry_bucket_prefix': '{{ROK_REGISTRY_BUCKET_PREFIX}}',
    # Namespaces to back up / exclude per resource
    'question/buckets/exclude_namespaces': [],
    'question/buckets/namespaces': ['ALL'],
    'question/katib/exclude_namespaces': [],
    'question/katib/namespaces': ['ALL'],
    'question/models/exclude_namespaces': [],
    'question/models/namespaces': ['ALL'],
    'question/notebooks/exclude_namespaces': [],
    'question/notebooks/namespaces': ['ALL'],
    'question/pvcs/exclude_namespaces': [],
    'question/pvcs/namespaces': ['ALL'],
    # Skip notebooks for which a snapshot exists
    'question/skip_existing_notebooks': False,
    'question/skip_existing_profiles': False,
    # Start stopped notebooks so that a snapshot can be taken
    'question/start_notebooks': '{{START_NOTEBOOKS}}',
    # Stop notebooks started by the script
    'question/stop_notebooks': '{{STOP_NOTEBOOKS}}'
}
Render the preseed file:
jovyan@mynotebook-0:~$ j2 backup-preseed.py.j2 \
> -o backup-preseed.py
Troubleshooting
bash: j2: command not found
If the above command fails with an error message similar to the following:
bash: j2: command not found
it means your notebook does not have the j2 Python package installed. You can install it with:
jovyan@mynotebook-0:~$ pip3 install j2
and retry the command.
Note
After rendering the preseed file, you can edit it to change the default value for any question and specify a custom answer.
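Before unsetting the variables, it may help to confirm that the rendered file contains a value for every templated answer. The following is a minimal sketch, not part of rok-backup. The helper name `check_rendered` is hypothetical, and it assumes that a leftover `{{` or an empty `''` answer means a variable was unset or empty at render time.

```shell
# Hypothetical helper: flag lines of a rendered preseed file that still
# contain a Jinja2 placeholder or an empty string answer.
check_rendered() {
    file="$1"
    # -F treats both patterns as fixed strings; -n prints line numbers.
    if grep -n -F -e '{{' -e ": ''" "$file"; then
        echo "FAIL: unresolved or empty answers in $file" >&2
        return 1
    fi
    echo "OK: $file looks fully rendered"
}
```

Running `check_rendered backup-preseed.py` prints the offending lines, if any, so you can fix them by editing the file or by exporting the variable and rendering again.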
Unset all exported environment variables:
jovyan@mynotebook-0:~$ unset ROK_REGISTRY_TOKEN ROK_REGISTRY_URL \
> ROK_BUCKET_PREFIX ROK_REGISTRY_BUCKET_PREFIX START_NOTEBOOKS \
> STOP_NOTEBOOKS
Run the backup script. Choose one of the following options, depending on whether you want to run the script interactively or non-interactively.
Note
In an interactive run you will be prompted for input, while in a non-interactive run you will not. If you have not explicitly specified an answer in a non-interactive run, rok-backup will assume the default answer. The log output is redirected to stdout.
Interactive
jovyan@mynotebook-0:~$ rok-backup \
> --preseed-load backup-preseed.py
Troubleshooting
dialog.ExecutableNotFound
If the above command fails with an error message similar to the following:
dialog.ExecutableNotFound: Executable not found: can't find the executable for the dialog-like program
it means your notebook does not have the dialog package installed. You can install it with:
jovyan@mynotebook-0:~$ sudo apt install dialog
and retry the command.
Non-interactive
jovyan@mynotebook-0:~$ rok-backup \
> --frontend non-interactive \
> --preseed-load backup-preseed.py
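For the environment-variable workflow, a pre-flight check run just before launching rok-backup could confirm that steps 2 through 5 left the expected variables exported. This is a sketch under assumptions: the function name `preflight` is hypothetical, and the check only verifies that each variable is exported and non-empty (and that the two notebook flags are `true` or `false`). It does not validate the token or the URL, and it never prints the token. It does not apply to the preseed workflow, where the variables are unset after rendering.

```shell
# Hypothetical pre-flight check for the environment-variable workflow.
# It inspects the environment only and never prints the token value.
preflight() {
    status=0
    for name in ROK_REGISTRY_TOKEN ROK_REGISTRY_URL \
                ROK_BUCKET_PREFIX ROK_REGISTRY_BUCKET_PREFIX; do
        # printenv only sees exported variables, so a variable that was
        # set with `read` but never exported is reported as missing.
        if [ -z "$(printenv "$name")" ]; then
            echo "MISSING: $name is not exported or is empty"
            status=1
        fi
    done
    for name in START_NOTEBOOKS STOP_NOTEBOOKS; do
        case "$(printenv "$name")" in
            true|false) ;;
            *) echo "INVALID: $name must be true or false"; status=1 ;;
        esac
    done
    [ "$status" -eq 0 ] && echo "OK: environment is ready"
    return "$status"
}
```

Calling `preflight && rok-backup --frontend non-interactive` would then start the backup only when every variable is in place.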
Verify¶
Connect to the privileged notebook server and open a new terminal.
Export the Rok bucket prefix:
jovyan@mynotebook-0:~$ export ROK_BUCKET_PREFIX=<ROK_BUCKET_PREFIX>
Replace <ROK_BUCKET_PREFIX> with the Rok bucket prefix you specified in step 5 of the Procedure (setting the bucket prefix). For example:
jovyan@mynotebook-0:~$ export ROK_BUCKET_PREFIX="cluster-migration-2022-07-07-d3674"
Format the names of the migration buckets:
jovyan@mynotebook-0:~$ export MLMD_BUCKET="${ROK_BUCKET_PREFIX?}-mlmd" \
> PIPELINES_BUCKET="${ROK_BUCKET_PREFIX?}-pipeline" \
> PROFILES_BUCKET="${ROK_BUCKET_PREFIX?}-profile" \
> NOTEBOOKS_BUCKET="${ROK_BUCKET_PREFIX?}-notebook" \
> MODELS_BUCKET="${ROK_BUCKET_PREFIX?}-model" \
> KATIB_BUCKET="${ROK_BUCKET_PREFIX?}-katib" \
> PVC_BUCKET="${ROK_BUCKET_PREFIX?}-pvc"
Ensure that you have snapshotted and published the MLMD.
Make sure that the metadata-mysql object exists in the MLMD migration bucket:
jovyan@mynotebook-0:~$ rok --account kubeflow -o json \
> object-list ${MLMD_BUCKET?} \
> | jq -r '.[].object_name'
metadata-mysql
Troubleshooting
bash: jq: command not found
If the above command fails with an error message similar to the following:
bash: jq: command not found
it means your notebook does not have the jq package installed. You can install it with:
jovyan@mynotebook-0:~$ sudo apt install jq
and retry the command.
Make sure that the MLMD bucket has been successfully published to the Rok Registry:
jovyan@mynotebook-0:~$ rok --account kubeflow -o json \
> bucket-show ${MLMD_BUCKET?} \
> | jq -r '.throw_type'
published
Ensure that you have snapshotted and published all pipelines.
Make sure that the minio and mysql PVCs exist in the pipelines migration bucket:
jovyan@mynotebook-0:~$ rok --account kubeflow -o json \
> object-list ${PIPELINES_BUCKET?} \
> | jq -r '.[].object_name'
minio-pv-claim
mysql-pv-claim
Make sure that the pipelines bucket has been successfully published to the Rok Registry:
jovyan@mynotebook-0:~$ rok --account kubeflow -o json \
> bucket-show ${PIPELINES_BUCKET?} \
> | jq -r '.throw_type'
published
Ensure that you have snapshotted and published the Kubeflow profiles.
List all the Kubeflow profiles of the cluster and make sure that they also exist in the profiles migration bucket:
jovyan@mynotebook-0:~$ diff <(kubectl get profiles -n kubeflow -o json \
> | jq -r '.items[].metadata.name' | sort) \
> <(rok --account kubeflow -o json object-list ${PROFILES_BUCKET?} \
> | jq -r '.[].object_name' | sort) \
> && echo "OK" || echo "FAIL"
OK
Make sure that the profiles bucket has been successfully published to the Rok Registry:
jovyan@mynotebook-0:~$ rok --account kubeflow -o json \
> bucket-show ${PROFILES_BUCKET?} \
> | jq -r '.throw_type'
published
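The diff checks in this section compare a sorted list of names from the cluster against a sorted list of object names from the bucket and reduce the result to OK or FAIL. When a check fails, it can be more useful to see which names are missing from the backup. A minimal sketch using `comm` follows; the helper name `missing_from_backup` is hypothetical, and it assumes both inputs are files with one name per line.

```shell
# Hypothetical helper: print the names that appear in the cluster list ($1)
# but not in the backup list ($2). Each argument is a file with one name
# per line; the inputs do not need to be sorted.
missing_from_backup() {
    cluster_sorted=$(mktemp)
    backup_sorted=$(mktemp)
    sort "$1" > "$cluster_sorted"
    sort "$2" > "$backup_sorted"
    # comm -23 suppresses lines unique to the second file and lines common
    # to both, leaving only the lines unique to the first file.
    comm -23 "$cluster_sorted" "$backup_sorted"
    rm -f "$cluster_sorted" "$backup_sorted"
}
```

In bash, the two arguments can be process substitutions wrapping the same `kubectl ... | jq ...` and `rok ... | jq ...` pipelines used in the diff commands. Empty output means nothing is missing.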
Choose a user namespace and verify that all EKF resources in that namespace have been snapshotted and published.
Export the user namespace:
jovyan@mynotebook-0:~$ export NAMESPACE=<NAMESPACE>
Replace <NAMESPACE> with the namespace of the user, for example:
jovyan@mynotebook-0:~$ export NAMESPACE=kubeflow-user1
Ensure that you have snapshotted and published all notebooks.
List all the notebooks of the source cluster and make sure that they also exist in the notebooks migration bucket:
jovyan@mynotebook-0:~$ diff <(kubectl get notebooks -n ${NAMESPACE} -o json \
> | jq -r '.items[].metadata.name' | sort) \
> <(rok --account ${NAMESPACE?} -o json object-list ${NOTEBOOKS_BUCKET?} \
> | jq -r '.[].object_name' | sort) \
> && echo "OK" || echo "FAIL"
OK
Make sure that the notebooks bucket has been successfully published to the Rok Registry:
jovyan@mynotebook-0:~$ rok --account ${NAMESPACE?} -o json \
> bucket-show ${NOTEBOOKS_BUCKET?} \
> | jq -r '.throw_type'
published
Ensure that you have snapshotted and published all models.
List all the Inference Services of the source cluster and make sure that they also exist in the models migration bucket:
jovyan@mynotebook-0:~$ MODELS=$(for object in \
> $(rok --account ${NAMESPACE?} -o json object-list ${MODELS_BUCKET?} \
> | jq -r '.[].object_name') ; do \
> if [[ "$(rok --account ${NAMESPACE?} -o json object-meta-show ${MODELS_BUCKET?} $object \
> | jq -r '.type')" == "CR" ]]
> then echo $object ; fi ; done)
jovyan@mynotebook-0:~$ diff <(kubectl get inferenceservices -n ${NAMESPACE} -o json \
> | jq -r '.items[].metadata.name' | sort) \
> <(echo "${MODELS?}" | sort) \
> && echo "OK" || echo "FAIL"
OK
Make sure that the models bucket has been successfully published to the Rok Registry:
jovyan@mynotebook-0:~$ rok --account ${NAMESPACE?} -o json \
> bucket-show ${MODELS_BUCKET?} \
> | jq -r '.throw_type'
published
Ensure that you have snapshotted and published all Katib experiments.
List all the Katib experiments of the source cluster and make sure that they also exist in the Katib migration bucket:
jovyan@mynotebook-0:~$ EXPERIMENTS=$(for object in \
> $(rok --account ${NAMESPACE?} -o json object-list ${KATIB_BUCKET?} \
> | jq -r '.[].object_name') ; do \
> if [[ "$(rok --account ${NAMESPACE?} -o json object-meta-show ${KATIB_BUCKET?} $object \
> | jq -r '.type')" == "experiment" ]]
> then echo $object ; fi ; done)
jovyan@mynotebook-0:~$ diff <(kubectl get experiments -n ${NAMESPACE} -o json \
> | jq -r '.items[].metadata.name' | sort | xargs -I {} echo "experiment-{}") \
> <(echo "${EXPERIMENTS?}" | sort) \
> && echo "OK" || echo "FAIL"
OK
Make sure that the Katib bucket has been successfully published to the Rok Registry:
jovyan@mynotebook-0:~$ rok --account ${NAMESPACE?} -o json \
> bucket-show ${KATIB_BUCKET?} \
> | jq -r '.throw_type'
published
Ensure that you have snapshotted and published all PVCs backed by the Rok storage class.
List all the PVCs of the source cluster backed by the Rok storage class. Make sure they also exist in the Rok PVCs migration bucket:
jovyan@mynotebook-0:~$ diff <(kubectl get pvc -n ${NAMESPACE} -o json \
> | jq -r '.items[].metadata.name' | sort) \
> <(rok --account ${NAMESPACE?} -o json object-list ${PVC_BUCKET?} \
> | jq -r '.[].object_name' | sort) \
> && echo "OK" || echo "FAIL"
OK
Make sure that the Rok PVCs bucket has been successfully published to the Rok Registry:
jovyan@mynotebook-0:~$ rok --account ${NAMESPACE?} -o json \
> bucket-show ${PVC_BUCKET?} \
> | jq -r '.throw_type'
published
Navigate to the Rok UI and make sure that all the Rok buckets you chose to back up have been successfully published to the Rok Registry.
Go back to step 7 (choosing a user namespace) and repeat its steps for every user namespace whose notebooks, models, Katib experiments, PVCs, and buckets you want to verify.
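Repeating step 7 by hand for many namespaces is tedious, so a small driver loop may help. The sketch below is hypothetical: `verify_namespaces` is not part of the Arrikto tooling, and it expects you to supply your own check command (a script or shell function wrapping the step 7 commands) that takes the namespace as its first argument and exits non-zero on failure.

```shell
# Hypothetical driver: run a user-supplied check command once per namespace
# and report the result for each. Returns non-zero if any namespace failed.
verify_namespaces() {
    check="$1"
    shift
    failed=0
    for ns in "$@"; do
        if "$check" "$ns"; then
            echo "OK: $ns"
        else
            echo "FAIL: $ns"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `verify_namespaces check_namespace kubeflow-user1 kubeflow-user2` would run a `check_namespace` function of your own against both namespaces and print one result line for each.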
Summary¶
You have snapshotted all the EKF resources of the cluster, added them into Rok buckets, and published the buckets to a Rok Registry.
What’s Next¶
The next step is to subscribe to the buckets you just published, and present all EKF resources to the destination cluster.