Rok Snapshotting

This section will discuss the Rok cluster setup, how and when Rok takes snapshots, and how to restore snapshots.

Kale Kubeflow Pipeline and Rok Snapshots

Rok is present on every node in a MiniKF or Arrikto Enterprise Kubeflow cluster. Rok’s presence enables you to create snapshots of local PersistentVolumes to capture point in time representation of volume storage. Rok is deployed via its own Rok Operator, which runs as a StatefulSet, and which creates a number of resources: the most important is a Rok DaemonSet, a Rok CSI DaemonSet, which run on all nodes, and a Rok CSI Controller StatefulSet. The user just applies the Operator YAML and creates a new RokCluster resource. The Rok Operator communicates and synchronizes the Rok Pods across the cluster.

No message

Kale generates Kubeflow pipelines from tagged cells in the JupyterLab notebook, which has a workspace volume and zero or more data volumes. Arrikto does recommend defining at least one data volume during notebook server creation. When you create a Kubeflow Pipeline run using Kale, Rok will snapshot all notebook volumes and then clone them for the pipeline steps to use them. By seamlessly cloning the workspace and data volumes, your environment is versioned and replicated to the new pipeline, which is then executed.

Kubeflow Pipeline and Initial Rok Snapshot

In the Kubeflow Pipelines run you will see a step called create-volume-1. This is the step responsible for provisioning a clone of the notebook workspace volume. If there are data volumes attached to the notebook then you will see one step per data volume labelled as create-volume-#. You can select the pipeline run from the Kubeflow Pipelines UI to see the volume creation steps.

../_images/create-volumes.png

There are two important reasons for this cloning process.

  1. Marshaling: a mechanism to seamlessly pass data between steps. Kale uses a hidden folder within the workspace volume as the shared marshaling location to serialize and deserialize data.

  2. Reproducibility, experimentation: when working on a notebook, it is often the case that you install new libraries, write new modules, create or download assets required by your code. By seamlessly cloning the workspace and data volumes, all your environment is versioned and replicated to the new pipeline.

The newly cloned volumes are mounted by the rok-csi driver to the Pods in which the pipeline steps run. You can select the pipeline step cell and select Volumes to see where the dependent volumes are mounted. The Arrikto provided Docker images default to using the /home/ jovyan/* directories.

../_images/pipeline-step-volume-reference.png

Kubeflow Pipeline Steps and Rok Snapshots

Rok takes snapshots just before and just after execution of individual pipeline steps. This functionality allows you to restore not just an environment but the execution state of a Kubeflow pipeline initiated by Kale from within a JupyterLab notebook. You can select Visualizations on an individual step to see the snapshot details.

../_images/rok-snapshot-start-step.png

Scroll down to view he snapshot taken after the step completes, which will be tagged with final:

../_images/rok-snapshot-after-step.png

Rok Snapshot Creation and Rok Buckets

To create a snapshot, Rok hashes, de-duplicates, and versions snapshots and then stores the de-duplicated, content-addressable parts on an Object Storage service (e.g., Amazon S3) that is close to the specific Kubernetes cluster. Rok creates buckets of snapshots, which are associated with a specific KFP experiment, which is a set of Kubeflow Pipelines runs.

../_images/rok-ui-buckets.png

Expanding a bucket will show the list of snapshots that have been taken during Kubeflow pipeline execution. Each Kubeflow Pipelines run is associated with a file inside a Rok bucket. These associations can be identified through the Rok UI.

../_images/rok-ui-buckets-snapshots.png

The bucket name suffix will be reflected in the name of the pipeline runs and the name of the KFP experiment. Notice that in the image below the Rok bucket name contains the reference text 3mis2:

../_images/rok-identifier.png

Navigating over to the Runs page you can now relate the Rok bucket and snapshots with the associated pipeline runs. Notice in the image below each run contains the reference text 3mis2:

../_images/rok-identifier-pipeline.png

The runs are tagged with a pipeline version so they can be associated with a Kubeflow pipeline definition. In the image above the pipeline version is 19u9b, which can be found in the Pipelines page.

../_images/rok-identifier-pipeline-version.png

Rok Snapshots Outside of I/O Path

Snapshots are taken at the start of execution and before and after each step. However taking snapshots does not impact application performance. Snapshotting happens outside the application I/O path using Arrikto’s Linux kernel enhancements contained in the Arrikto provided Docker images. These enhancements are loaded as modules into the kernel during the creation of the Rok cluster. As a result, snapshotting will not impact performance of applications in the Pods.

No message

Rok Snapshots & Environment Restoration

Rok snapshots provide the ability to version entire environments much in the same way that you can version code with Git. Rok snapshots enable you to iterate on your environment as you transform datasets, develop models, and develop other aspects of an ML workflow. The pipeline is always reproducible, thanks to the immutable snapshots, and you do not have to build new Docker images for each pipeline run. To view the snapshots select a specific bucket from the Rok UI. Rok enables restoring snapshots in their entirety on new notebook servers. This enables you to recreate a complete environment exactly the way it was at any point in time. To do this you need to copy the URL from the Rok snapshot.

../_images/rok-snapshot-url.png

With the URL copied you can provide the Rok snapshot as reference when creating a new notebook server. If the URL is accepted you will see a notification as shown in the image below:

../_images/rok-url-in-notebook.png

You can confirm that the Rok snapshot was successfully loaded by scrolling down to the workspace volume and data volume, which will be pre-populated based on the data in the snapshot.

../_images/rok-url-in-notebook-volume-load.png

Note

If you are creating a notebook server and realize you do not have the Rok URL readily available, you can select the Rok search icon in the Rok URL box. This will open a pop-out window with the available buckets and snapshots from which you can select a notebook snapshot.

Rok Snapshots and Volume Restoration

Rok permits loading specific volumes from snapshots during creation of the notebook server. This can be done with either the workspace volume or the data volume. As with the entire snapshot each of the volumes has a specific URL with the workspace volume always first in the list of snapshot volumes.

../_images/rok-snapshot-volume-links.png

To populate either volume during notebook server creation you will need to change the option from New to Existing and paste the URL.

../_images/notebook-server-workspace-volume.png

Note

If you are creating a notebook server and realize you do not have the URL for the volume readily available, you can select the Rok search icon in the Rok URL box. This will open a pop-out window with the available buckets and snapshots from which you can select a volume. Once you have launched the notebook server you will see preloaded assets based on your selections.

Rok Snapshots and Pipeline Restoration

Rok permits loading the snapshot taken before or after the execution of pipeline steps into a notebook server to recreate the captured execution state. The snapshot and the associated URL can be accessed from the snapshot in the Rok UI link under the Visualizations option in the KFP UI.

../_images/step-snapshot-link.png

The Rok UI will open to the appropriate snapshot and you can select the Rok snapshot URL.

../_images/step-rok-snapshot-url.png

Once you have copied the URL you can use it to recreate the execution state of the pipeline step when creating a notebook server.