Data Passing with Volumes

This guide will walk you through understanding how Kale uses snapshots and volumes for data passing.

What You’ll Need

  • An EKF or MiniKF deployment with the default Kale Docker image.
  • An understanding of how the Kale SDK works.

Procedure

  1. Create a new Notebook server using the default Kale Docker image. The image will have the following naming scheme:

    gcr.io/arrikto/jupyter-kale-py38:<IMAGE_TAG>

    Note

    The <IMAGE_TAG> varies based on the MiniKF or EKF release.

  2. Connect to the server, open a terminal, and install scikit-learn:

    $ pip3 install --user scikit-learn==0.23.0
  3. Create a new Python file and name it kale_data_passing.py:

    $ touch kale_data_passing.py
  4. Copy and paste the following code inside kale_data_passing.py:

    sdk.py
    # Copyright © 2021-2022 Arrikto Inc. All Rights Reserved.

    """Kale SDK.

    This script trains an ML pipeline to solve a binary classification task.
    """

    from kale.sdk import pipeline, step
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split


    @step(name="data_loading")
    def load(random_state):
        """Create a random dataset for binary classification."""
        rs = int(random_state)
        x, y = make_classification(random_state=rs)
        return x, y


    @step(name="data_split")
    def split(x, y):
        """Split the data into train and test sets."""
        x, x_test, y, y_test = train_test_split(x, y, test_size=0.1)
        return x, x_test, y, y_test


    @step(name="model_training")
    def train(x, x_test, y, training_iterations):
        """Train a Logistic Regression model."""
        iters = int(training_iterations)
        model = LogisticRegression(max_iter=iters)
        model.fit(x, y)
        print(model.predict(x_test))


    @pipeline(name="binary-classification", experiment="kale-tutorial")
    def ml_pipeline(rs=42, iters=100):
        """Run the ML pipeline."""
        x, y = load(rs)
        x, x_test, y, y_test = split(x=x, y=y)
        train(x, x_test, y, training_iterations=iters)


    if __name__ == "__main__":
        ml_pipeline(rs=42, iters=100)

    In this code sample, you start with a standard Python script that trains a Logistic Regression model, and you decorate its functions using the Kale SDK. To read more about how to create this file, head to the corresponding Kale SDK user guide.

  5. Compile and run the pipeline. Note that Kale takes a snapshot of the Notebook’s volumes:

    $ python3 -m kale kale_data_passing.py --kfp
    [...]
    2021-09-21 09:07:50 Kale rokutils:281 [INFO] Creating Rok bucket 'notebooks'...
    2021-09-21 09:07:50 Kale rokutils:291 [INFO] Rok bucket 'notebooks' already exists
    2021-09-21 09:07:51 Kale rokutils:157 [INFO] Monitoring Rok snapshot with task id: 1556b315b8b54aeaa6a5a241cf75e42a
    Rok Task: |********************************| 100/100
    [...]

    Navigate to the run page in the KFP UI. Note that the new pipeline includes a step called create-volume-1. This step is responsible for provisioning a clone of the Notebook’s workspace volume. If the Notebook also had data volumes, you would see additional pipeline steps called create-volume-*, each one provisioning a clone of a data volume.

    ../../../_images/volume.png

Volumes and Snapshots

When you create your Notebook server using the Jupyter Web App, by default a new PVC is mounted under the user’s home directory. We call this volume the workspace volume. Moreover, you can decide to provision more volumes to be mounted at locations of your choice. We call these volumes data volumes.

Whenever you submit a pipeline to KFP, Kale clones the Notebook’s volumes for two important reasons:

  1. Marshalling: a mechanism to seamlessly pass data between steps. This mechanism requires a shared folder where Kale can serialize and de-serialize data (see the sketch after this list).

    Kale uses a hidden folder within the workspace volume as the shared marshalling location.

  2. Reproducibility and experimentation: when you work on a Notebook, you often install new libraries, write new modules, and create or download assets required by your code. By seamlessly cloning the workspace and data volumes, your entire environment is versioned and replicated to the new pipeline. Thus, the pipeline is always reproducible, thanks to the immutable snapshots, and you do not have to build a new Docker image for each pipeline run.
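
Conceptually, marshalling works along the following lines. The snippet below is a simplified, illustrative sketch using pickle and a hypothetical shared directory; it is not Kale's actual implementation or API.

# Simplified illustration of step-to-step marshalling (not Kale's actual code).
# It assumes a shared folder, such as a hidden directory on the workspace volume.
import pickle
from pathlib import Path

MARSHAL_DIR = Path(".kale-marshal")  # hypothetical shared marshalling location


def save_artifact(name, obj):
    """Serialize a step's output so a downstream step can load it."""
    MARSHAL_DIR.mkdir(exist_ok=True)
    with open(MARSHAL_DIR / f"{name}.pkl", "wb") as f:
        pickle.dump(obj, f)


def load_artifact(name):
    """De-serialize an upstream step's output inside a downstream step."""
    with open(MARSHAL_DIR / f"{name}.pkl", "rb") as f:
        return pickle.load(f)


# A producer step would call save_artifact("x", x); the consumer step, running
# in a different pod but sharing the cloned volume, would call load_artifact("x").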

Important

Currently, Kale needs to take a snapshot of the Notebook volumes, because every step must be sourced from its original Python module or Notebook. In the near future we will provide an option to compile and run pipelines without taking volume snapshots, while still ensuring the immutability and reproducibility of the pipeline steps.

Marshal Volume

Using the workspace volume to marshal data can become problematic when your pipeline steps need to pass large assets between them. Imagine that the load step of the pipeline above produced a dataset too large for the workspace volume to hold. You can ask Kale to provision a volume dedicated to data passing, choosing its size and mount location:

marshal.py
# Copyright © 2021-2022 Arrikto Inc. All Rights Reserved.

"""Kale SDK.

This script trains an ML pipeline to solve a binary classification task.
"""

from kale.sdk import pipeline, step
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


@step(name="data_loading")
def load(random_state):
    """Create a random dataset for binary classification."""
    rs = int(random_state)
    x, y = make_classification(random_state=rs)
    return x, y


@step(name="data_split")
def split(x, y):
    """Split the data into train and test sets."""
    x, x_test, y, y_test = train_test_split(x, y, test_size=0.1)
    return x, x_test, y, y_test


@step(name="model_training")
def train(x, x_test, y, training_iterations):
    """Train a Logistic Regression model."""
    iters = int(training_iterations)
    model = LogisticRegression(max_iter=iters)
    model.fit(x, y)
    print(model.predict(x_test))


@pipeline(name="binary-classification", experiment="kale-tutorial",
          marshal_volume=True, marshal_volume_size="10Gi",
          marshal_path="/data/marshal")
def ml_pipeline(rs=42, iters=100):
    """Run the ML pipeline."""
    x, y = load(rs)
    x, x_test, y, y_test = split(x=x, y=y)
    train(x, x_test, y, training_iterations=iters)


if __name__ == "__main__":
    ml_pipeline(rs=42, iters=100)

When you enable the marshal volume, Kale uses it to marshal data between steps, but still clones the Notebook’s volumes, for the reasons explained above:

../../../_images/volumemarshal.png

Note

You can also use the marshal volume to perform data passing yourself, when you need to write files to a shared location and have downstream steps consume them.
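
For example, one step can write a file under the marshal volume's mount path and a downstream step can read it back. The following is a minimal sketch reusing the @step decorator from this guide; the /data/marshal path matches the marshal_path set above, while the step names and the report.txt file name are purely illustrative.

from kale.sdk import step


@step(name="produce_report")
def produce_report():
    """Write an asset to the shared marshal volume."""
    with open("/data/marshal/report.txt", "w") as f:
        f.write("results produced by this step\n")


@step(name="consume_report")
def consume_report():
    """Read the asset written by the upstream step."""
    with open("/data/marshal/report.txt") as f:
        print(f.read())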

Summary

You have learned how Kale uses volumes and snapshots to perform data passing and how to create a marshal volume.

What’s Next

The next step is to learn how to use type-hints for the step parameters of your pipelines.