Restore EKF cluster

In this guide you will use our rok-restore tool to subscribe to Rok Registry buckets, download the snapshots of all EKF resources of a source Arrikto EKF cluster, and present them to the destination Arrikto EKF cluster. This way you will complete the migration of an Arrikto EKF cluster.

What You’ll Need

Check Your Environment

  1. Get the version of the Katib mysql database in the source cluster:

    root@rok-tools-src:~# kubectl exec -n kubeflow svc/katib-mysql -c katib-mysql \
    >     -- mysql --version \
    >     | tr -s ' ' \
    >     | cut -d ' ' -f 3
    8.0.23
  2. Get the version of the Katib mysql database in the destination cluster:

    root@rok-tools-dst:~# kubectl exec -n kubeflow svc/katib-mysql -c katib-mysql \
    >     -- mysql --version \
    >     | tr -s ' ' \
    >     | cut -d ' ' -f 3
    8.0.26
  3. Ensure that the Katib mysql version in the destination cluster is equal to or greater than the version in the source cluster. If it is lower, update the Katib mysql in the destination cluster to the mysql version of the source cluster:

    1. Export the mysql version of the source cluster:

      root@rok-tools-dst:~# export MYSQL_VERSION_SOURCE=<MYSQL_VERSION_SOURCE>

      Replace <MYSQL_VERSION_SOURCE> with the mysql version that you found in step 1. For example:

      root@rok-tools-dst:~# export MYSQL_VERSION_SOURCE=8.0.23
    2. Update the Katib mysql in the destination cluster:

      root@rok-tools-dst:~# kubectl patch -n kubeflow deploy katib-mysql \
      >     -p "{\"spec\": {\"template\": {\"spec\":{\"containers\":[{\"name\":\"katib-mysql\",\"image\":\"mysql:${MYSQL_VERSION_SOURCE?}\"}]}}}}"
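
The version check in step 3 can also be scripted. The following is a minimal Python sketch, not part of the EKF tooling. It compares the two version strings numerically, so that 8.0.9 sorts before 8.0.23, and builds the same patch body that the kubectl patch command above passes with -p. The function names are hypothetical and chosen for illustration only, and the sketch assumes plain dotted numeric versions such as the ones printed in steps 1 and 2.

```python
import json


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '8.0.23' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def destination_needs_update(source: str, destination: str) -> bool:
    """Return True when the destination version is lower than the source."""
    return parse_version(destination) < parse_version(source)


def build_patch(version: str) -> str:
    """Build the JSON patch body that sets the katib-mysql container image."""
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "katib-mysql", "image": f"mysql:{version}"}
                    ]
                }
            }
        }
    }
    return json.dumps(patch)


if __name__ == "__main__":
    # Example values taken from steps 1 and 2 above.
    source, destination = "8.0.23", "8.0.26"
    if destination_needs_update(source, destination):
        print(build_patch(source))
    else:
        print("No update needed")
```

If an update is needed, the string returned by build_patch could be passed to kubectl patch -p in place of the hand-escaped JSON.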

Procedure

  1. Connect to a privileged notebook server and open a new terminal.

  2. Set the Rok Registry token:

    1. Read a line from the standard input:

      jovyan@mynotebook-0:~$ read -s ROK_REGISTRY_TOKEN
    2. Paste the Rok Registry token you issued by following the relevant guide.

    3. Export the Rok Registry token:

      jovyan@mynotebook-0:~$ export ROK_REGISTRY_TOKEN

    Note

    You can also provide the Rok Registry token in a file:

    jovyan@mynotebook-0:~$ export ROK_REGISTRY_TOKEN="file:<PATH_TO_FILE>"

    Replace <PATH_TO_FILE> with the path of your Rok Registry token, for example:

    jovyan@mynotebook-0:~$ export ROK_REGISTRY_TOKEN="file:/home/jovyan/registry.token"
  3. Set the Rok Registry URL:

    jovyan@mynotebook-0:~$ export ROK_REGISTRY_URL=<URL>

    Replace <URL> with the base URL of your Rok Registry installation. For example:

    jovyan@mynotebook-0:~$ export ROK_REGISTRY_URL=https://arr-cluster.example.com/registry
  4. Set the Rok and Rok Registry bucket prefix you used when running the backup script in the source cluster, to distinguish which backup run to restore (step 5 of the backup guide):

    jovyan@mynotebook-0:~$ export ROK_BUCKET_PREFIX=<MIGRATION_ID>
    jovyan@mynotebook-0:~$ export ROK_REGISTRY_BUCKET_PREFIX=${ROK_BUCKET_PREFIX?}

    Replace <MIGRATION_ID> with the identifier you specified in the corresponding invocation of rok-backup in the source cluster, for example:

    jovyan@mynotebook-0:~$ export ROK_BUCKET_PREFIX="cluster-migration-2022-07-07-d3674"
    jovyan@mynotebook-0:~$ export ROK_REGISTRY_BUCKET_PREFIX=${ROK_BUCKET_PREFIX?}

    Important

    Use exactly the same Rok and Rok Registry bucket prefix you used in the corresponding backup run you want to restore.

  5. Run the restore script to subscribe to Rok Registry and present the EKF resources to the destination cluster. The script can get its configuration options either through environment variables or through a preseed file. Follow the option that applies to you.

    Environment variables

    Choose one of the following options depending on whether you want to run the script interactively or non-interactively.

    Note

    In a non-interactive run you will not be prompted for input, while in an interactive run you will. If you have not explicitly specified an answer, rok-restore will assume the default answer. The log output is redirected to stdout.

    Interactive

    jovyan@mynotebook-0:~$ rok-restore

    Troubleshooting

    dialog.ExecutableNotFound

    If the above command fails with an error message similar to the following:

    dialog.ExecutableNotFound: Executable not found: can't find the executable for the dialog-like program

    it means your notebook does not have the dialog package installed. You can install it with:

    jovyan@mynotebook-0:~$ sudo apt install dialog

    and retry the command.

    Non-interactive

    jovyan@mynotebook-0:~$ rok-restore \
    >     --frontend non-interactive

    Preseed file

    1. Copy the restore-preseed.py.j2 Jinja2 template to your privileged notebook:

      restore-preseed.py.j2

      # Copyright © 2022 Arrikto Inc. All Rights Reserved.

      """EKF Migration Restoration Preseed File."""

      SEEDS = {
          # Resources to restore
          'question/resources': ['bucket',
                                 'katib',
                                 'mlmd',
                                 'model',
                                 'notebook',
                                 'pipeline',
                                 'profile',
                                 'pvc'],
          # The token to connect to Rok
          # 'question/rok_token': <protected>,
          # The URL of the Rok cluster
          'question/rok_url': 'http://rok.rok.svc.cluster.local',
          # The token to connect to Rok Registry
          'question/rok_registry_token': '{{ROK_REGISTRY_TOKEN}}',
          # The URL of the Rok Registry cluster
          'question/rok_registry_url': '{{ROK_REGISTRY_URL}}',
          # The prefix for the local Rok buckets
          'question/rok_bucket_prefix': 'cluster-migration',
          # The prefix for the Registry buckets
          # This MUST be the same as the one provided when running the backup
          'question/rok_registry_bucket_prefix': '{{ROK_REGISTRY_BUCKET_PREFIX}}',
          # Namespaces to restore / exclude per resource
          'question/buckets/exclude_namespaces': [],
          'question/buckets/namespaces': ['ALL'],
          'question/katib/exclude_namespaces': [],
          'question/katib/namespaces': ['ALL'],
          'question/models/exclude_namespaces': [],
          'question/models/namespaces': ['ALL'],
          'question/notebooks/exclude_namespaces': [],
          'question/notebooks/namespaces': ['ALL'],
          'question/pvcs/exclude_namespaces': [],
          'question/pvcs/namespaces': ['ALL'],
          # Delete Kubernetes resources for which a copy has been found on the
          # Registry, in order to restore the new version
          'question/overwrite_all_buckets': True,
          'question/overwrite_all_experiments': True,
          'question/overwrite_all_models': True,
          'question/overwrite_all_notebooks': True,
          'question/overwrite_all_profiles': True,
          'question/overwrite_all_pvcs': True,
          # Restore migrated resources in a stopped state
          'question/stop_notebooks': True,
          'question/stop_models': True,
          'question/stop_recurring_runs': True,
          # Low priority question, applying default notebook configurations
          # If no configurations are provided, this will default to whatever is
          # included in the Notebook CR
          'question/notebook_configurations': []
      }
    2. Render the preseed file:

      jovyan@mynotebook-0:~$ j2 restore-preseed.py.j2 \
      >     -o restore-preseed.py

      Troubleshooting

      bash: j2: command not found

      If the above command fails with an error message similar to the following:

      bash: j2: command not found

      it means your notebook does not have the j2 Python package installed. You can install it with:

      jovyan@mynotebook-0:~$ pip3 install j2

      and retry the command.

      Note

      After rendering the preseed file, you can edit it to change the default value for any question and specify a custom answer.

    3. Unset all exported environment variables:

      jovyan@mynotebook-0:~$ unset ROK_REGISTRY_TOKEN ROK_REGISTRY_URL \
      >     ROK_BUCKET_PREFIX ROK_REGISTRY_BUCKET_PREFIX
    4. Run the restore script. Choose one of the following options depending on whether you want to run the script interactively or non-interactively.

      Note

      In a non-interactive run you will not be prompted for input, while in an interactive run you will. If you have not explicitly specified an answer, rok-restore will assume the default answer. The log output is redirected to stdout.

      Interactive

      jovyan@mynotebook-0:~$ rok-restore \
      >     --preseed-load restore-preseed.py

      Troubleshooting

      dialog.ExecutableNotFound

      If the above command fails with an error message similar to the following:

      dialog.ExecutableNotFound: Executable not found: can't find the executable for the dialog-like program

      it means your notebook does not have the dialog package installed. You can install it with:

      jovyan@mynotebook-0:~$ sudo apt install dialog

      and retry the command.

      Non-interactive

      jovyan@mynotebook-0:~$ rok-restore \
      >     --frontend non-interactive \
      >     --preseed-load restore-preseed.py
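
Before you pass a rendered preseed file to --preseed-load, it may help to check that the values filled in from the environment actually made it into the file. The following is a minimal Python sketch under stated assumptions: it reads the SEEDS dictionary without executing the file, and it treats the three Rok Registry keys from the template above as required and non-empty. That requirement, and the function names, are assumptions made for illustration and are not documented behavior of rok-restore.

```python
import ast

# Keys from the preseed template that are filled in from the environment.
# Treating them as required is an assumption of this sketch.
REQUIRED_KEYS = (
    "question/rok_registry_token",
    "question/rok_registry_url",
    "question/rok_registry_bucket_prefix",
)


def load_seeds(source: str) -> dict:
    """Extract the SEEDS dictionary from preseed source without executing it."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "SEEDS":
                    # literal_eval only accepts literals (dicts, lists,
                    # strings, booleans), which is all the template contains.
                    return ast.literal_eval(node.value)
    raise ValueError("no SEEDS assignment found")


def find_problems(seeds: dict) -> list:
    """Report required keys that are empty or still hold a Jinja2 placeholder."""
    problems = []
    for key in REQUIRED_KEYS:
        value = seeds.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: missing or empty")
        elif "{{" in value or "}}" in value:
            problems.append(f"{key}: unrendered placeholder")
    return problems


if __name__ == "__main__":
    # Assumes restore-preseed.py exists in the current directory.
    with open("restore-preseed.py", encoding="utf-8") as handle:
        for problem in find_problems(load_seeds(handle.read())):
            print(problem)
```

An empty result means that none of the three keys is blank or still contains a template placeholder. It does not confirm that the token or URL is valid.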

    Note

    Notebooks, models, and recurring pipelines are restored in a stopped state by default to avoid a cluster scale-out. Add the CLI arguments --no-stop-notebooks, --no-stop-models, and --no-stop-recurring-runs to restore the corresponding resources in the state they were in on the source cluster.
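
The combinations of options described in step 5 can be summarized in a short Python sketch. It only assembles the argument list and does not run rok-restore. It uses no flags beyond the ones shown in this guide, and the helper names are hypothetical. The check for missing environment variables assumes that all four variables exported in steps 2 to 4 are needed when you do not use a preseed file.

```python
# Variables exported in steps 2 to 4 of the procedure.
REQUIRED_ENV = (
    "ROK_REGISTRY_TOKEN",
    "ROK_REGISTRY_URL",
    "ROK_BUCKET_PREFIX",
    "ROK_REGISTRY_BUCKET_PREFIX",
)


def missing_env(env: dict) -> list:
    """Return the names of the required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def build_restore_command(
    interactive: bool = True,
    preseed: str = None,
    stop_notebooks: bool = True,
    stop_models: bool = True,
    stop_recurring_runs: bool = True,
) -> list:
    """Assemble the rok-restore argument list for the chosen options."""
    command = ["rok-restore"]
    if not interactive:
        command += ["--frontend", "non-interactive"]
    if preseed:
        command += ["--preseed-load", preseed]
    # Resources are restored stopped by default; each flag opts one kind out.
    if not stop_notebooks:
        command.append("--no-stop-notebooks")
    if not stop_models:
        command.append("--no-stop-models")
    if not stop_recurring_runs:
        command.append("--no-stop-recurring-runs")
    return command


if __name__ == "__main__":
    import os

    unset = missing_env(dict(os.environ))
    if unset:
        print("Unset variables:", ", ".join(unset))
    print(" ".join(build_restore_command(interactive=False)))
```

For example, build_restore_command(interactive=False, preseed="restore-preseed.py") yields the same arguments as the non-interactive preseed command above.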

Verify

  1. Connect to the privileged notebook servers in the source and destination clusters, and open a new terminal in each one of them.

  2. List all Kubeflow profiles in the source and destination clusters. Ensure that all Kubeflow profiles are the same, that is, the following command produces the same output in both clusters:

    jovyan@notebook-0:~$ kubectl get profiles -A -o json \
    >     | jq -r '.items[].metadata.name'
    kubeflow-user1
    kubeflow-user2

    Troubleshooting

    bash: jq: command not found

    If the above command fails with an error message similar to the following:

    bash: jq: command not found

    it means your notebook does not have the jq package installed. You can install it with:

    jovyan@mynotebook-0:~$ sudo apt install jq

    and retry the command.

  3. List all notebooks in the source and destination clusters. Ensure that all notebooks are the same, that is, the following command produces the same output in both clusters:

    jovyan@notebook-0:~$ kubectl get notebooks -A -o json \
    >     | jq -r '.items[].metadata.name'
    notebook1
    notebook2
  4. List all pipelines in the source and destination clusters. Ensure that all pipelines are the same, that is, the following command produces the same output in both clusters:

    jovyan@notebook-0:~$ python3 -c \
    >     "import kfp; print([p.name for p in kfp.Client().list_pipelines().pipelines])"
    ['pipeline1', 'pipeline2']
  5. List all models in the source and destination clusters. Ensure that all models are the same, that is, the following command produces the same output in both clusters:

    jovyan@notebook-0:~$ kubectl get inferenceservices -A -o json \
    >     | jq -r '.items[].metadata.name'
    model1
    model2
  6. List all Katib experiments in the source and destination clusters. Ensure that all Katib experiments are the same, that is, the following command produces the same output in both clusters:

    jovyan@notebook-0:~$ kubectl get experiments -A -o json \
    >     | jq -r '.items[].metadata.name'
    experiment1
    experiment2
  7. List all the PVCs backed by the Rok storage class in the source and destination clusters. Ensure that all PVCs backed by the Rok storage class are the same, that is, the following command produces the same output in both clusters:

    jovyan@notebook-0:~$ kubectl get pvc -A -o json \
    >     | jq -r '.items[] | select(.spec.storageClassName=="rok") | .metadata.name'
    pvc1
    pvc2
  8. Navigate to the Rok UI and make sure that all of the Rok buckets you chose to back up from the source cluster exist in the destination cluster.
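
Comparing the output of steps 2 to 7 by eye becomes error-prone as the number of resources grows. The following is a minimal Python sketch of one way to automate the comparison. It assumes you have saved the output of kubectl get <resource> -A -o json from each cluster, and it extracts the same names as the jq filters above. The function names are hypothetical. Like the jq filters, it matches on resource names only and does not distinguish namespaces.

```python
import json


def resource_names(kubectl_json: str, storage_class: str = None) -> set:
    """Collect .items[].metadata.name from `kubectl get ... -o json` output.

    When storage_class is given, keep only items whose
    .spec.storageClassName matches it (as in the PVC step).
    """
    names = set()
    for item in json.loads(kubectl_json).get("items", []):
        spec = item.get("spec") or {}
        if storage_class is not None and spec.get("storageClassName") != storage_class:
            continue
        names.add(item["metadata"]["name"])
    return names


def compare_clusters(source: set, destination: set) -> dict:
    """Report names that appear in only one of the two clusters."""
    return {
        "missing_in_destination": sorted(source - destination),
        "only_in_destination": sorted(destination - source),
    }


if __name__ == "__main__":
    # Hypothetical file names holding the saved kubectl output.
    with open("pvc-source.json", encoding="utf-8") as handle:
        source = resource_names(handle.read(), storage_class="rok")
    with open("pvc-destination.json", encoding="utf-8") as handle:
        destination = resource_names(handle.read(), storage_class="rok")
    print(compare_clusters(source, destination))
```

Two empty lists in the result mean that both clusters report the same set of names for that resource type.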

Summary

You have subscribed to the Rok Registry buckets that contain the snapshots of the EKF resources of a source Arrikto EKF cluster, and presented them to the destination Arrikto EKF cluster.

What’s Next

Check out the rest of the maintenance operations that you can perform on your cluster.