Kale Notebook Cell Types¶
To create a Kubeflow Pipeline (KFP) from a Jupyter Notebook using Kale, annotate the cells of your notebook with one of six Kale cell types. Some of the cell types require a small number of parameters.
Kale uses the annotations you supply to define a Kubeflow pipeline. Each step of the pipeline will run in its own container in a Kubernetes deployment. The annotations you apply to cells in your notebook enable Kale to manage dependencies for each step and marshal data correctly as inputs and outputs for each step of a pipeline. See below for the list of cell types and a brief summary of each.
| Cell type | Cell should contain |
|---|---|
| Imports | Blocks of code that import other modules that your machine learning pipeline requires and that more than one step may need. |
| Functions | Functions used later in your machine learning pipeline; global variable definitions (other than pipeline parameters); and code that initializes lists, dictionaries, objects, and other values used throughout your pipeline. |
| Pipeline Parameters | Definitions for global variables used to parameterize your machine learning workflow. These are often training hyperparameters. |
| Pipeline Metrics | Lines of code that log or print values used to measure the success of your model. |
| Pipeline Step | Code that implements the core logic of a discrete step in your workflow. |
| Skip Cell | Any code that you want Kale to ignore. |
Imports Cells¶
Annotate notebook cells with the label Imports to identify blocks of code that import other modules your machine learning pipeline requires.
Imports cells help Kale identify all dependencies for pipeline steps. Kale prepends the code in Imports cells to the code specific to a pipeline step in the execution environment it creates for that step. See How Kale Creates a Pipeline Step for more detail.
Functions Cells¶
Annotate notebook cells with the label Functions to identify blocks of code containing:
- Functions used later in your machine learning pipeline.
- Global variable definitions (other than pipeline parameters) and code that initializes lists, dictionaries, objects, and other values used throughout your pipeline.
Though pipeline parameters are often written as global variables, you should annotate pipeline parameters using the Pipeline Parameters label. This will enable Kale to configure the Kubeflow pipeline it defines with the appropriate input parameters.
Functions cells help Kale identify all dependencies for pipeline steps. Kale creates pipeline steps by prepending Imports cells followed by Functions cells to the code specific to a pipeline step in the execution environment it creates for that step. See How Kale Creates a Pipeline Step for more detail.
Annotate Functions Cells¶
To identify functions, global variable declarations, and initialization code, edit the first cell in a block containing this code: click the pencil icon in the upper-right corner, then select Cell type > Functions.
Pipeline Parameters Cells¶
Annotate notebook cells as Pipeline Parameters to identify blocks of code that define global variables used as inputs that control how a machine learning pipeline operates. These should be values that you might experiment with as you compare the performance of pipeline runs with different settings.
Kale uses the values in Pipeline Parameters cells to define Kubeflow Pipelines (KFP) PipelineParam objects and initializes the pipeline with these parameters. KFP includes pipeline parameter values in the artifacts it creates for pipeline runs, which makes it easier to review results from experiments that compare multiple runs of a pipeline.
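As a rough illustration, a Pipeline Parameters cell usually holds nothing more than simple global assignments. The sketch below shows hypothetical cell contents (the names `LEARNING_RATE`, `EPOCHS`, and `OPTIMIZER` are invented for this example) and one way a tool could read such a cell into name/value pairs using Python's standard `ast` module. It is not Kale's actual implementation, only a minimal sketch that assumes each parameter is a plain `NAME = literal` assignment.

```python
import ast

# Hypothetical contents of a cell annotated as Pipeline Parameters.
PARAMS_CELL = """
LEARNING_RATE = 0.001
EPOCHS = 10
OPTIMIZER = "adam"
"""


def read_parameters(cell_source):
    """Return {name: value} for simple `NAME = literal` assignments."""
    params = {}
    for node in ast.parse(cell_source).body:
        is_simple_assignment = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if is_simple_assignment:
            # literal_eval only accepts literals, so arbitrary code is rejected.
            params[node.targets[0].id] = ast.literal_eval(node.value)
    return params


parameters = read_parameters(PARAMS_CELL)
print(parameters)
```

Keeping parameter cells limited to literal values like these makes it straightforward to override them for each pipeline run.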
Annotate Pipeline Parameters Cells¶
To annotate pipeline parameters, edit the first cell containing pipeline parameters: click the pencil icon in the upper-right corner, then select Cell type > Pipeline Parameters.
Pipeline Step Cells¶
Annotate notebook cells with the label Pipeline Step to identify code that implements one of the main components or tasks of a machine learning workflow. A pipeline step typically represents a milestone in data preparation, training, evaluation, tuning, prediction or other phases of a workflow.
Kale creates pipeline steps by prepending Imports cells followed by Functions cells to cells annotated for a particular Pipeline Step. These cells together comprise the code Kale uses in the execution environment it creates for a pipeline step. See How Kale Creates a Pipeline Step for more detail.
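The ordering described above can be sketched in a few lines of Python. This is a simplified model, not Kale's real code generation: the function `build_step_source` and the sample cell contents are hypothetical, and running the result with `exec` in a fresh namespace merely stands in for the separate execution environment of a step.

```python
def build_step_source(imports_cells, functions_cells, step_cells):
    """Concatenate cell sources in order: Imports, then Functions, then the step's cells."""
    return "\n\n".join([*imports_cells, *functions_cells, *step_cells])


# Hypothetical notebook cells, grouped by their annotation.
imports_cells = ["import math"]
functions_cells = ["def circle_area(radius):\n    return math.pi * radius ** 2"]
step_cells = ["area = circle_area(2.0)"]

step_source = build_step_source(imports_cells, functions_cells, step_cells)

# A fresh namespace stands in for the step's isolated environment:
# the step code works only because the imports and functions were prepended.
namespace = {}
exec(step_source, namespace)
print(namespace["area"])
```

Because every step receives its own copy of the Imports and Functions code, a step does not rely on state left behind by cells that belong to other steps.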
Annotate Pipeline Step Cells¶
To identify code that implements a step in a machine learning workflow:
1. Edit the first cell containing this code: click the pencil icon in the upper-right corner, then select Cell type > Pipeline Step.
2. Specify a unique step name.
3. (Optional) Select one or more steps that the step depends on.
4. (Optional) Specify that this step should run on a GPU node.
Step name Parameter¶
Step name is the label by which you reference a step in a pipeline. As the step name, create a label that is unique and descriptive. You will use this name as a reference as you define dependency relationships between steps in your pipeline.
The step name must consist of only lowercase alphanumeric characters or '_'. The first character must be a lowercase letter.
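The naming rule above maps directly onto a regular expression. The helper below, `is_valid_step_name`, is a hypothetical name used only for this sketch; it checks a candidate name against the stated rule and does not reproduce whatever validation Kale itself performs.

```python
import re

# First character: a lowercase letter. Remaining characters: lowercase letters, digits, or '_'.
STEP_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_step_name(name):
    """Return True if `name` satisfies the step naming rule."""
    return STEP_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["eval_custom", "vgg16_classifier", "Eval", "1step", "_step", "load-data"]:
    print(candidate, is_valid_step_name(candidate))
```

Names such as `eval_custom` and `vgg16_classifier` pass, while names that contain uppercase letters or hyphens, or that start with a digit or underscore, do not.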
Depends on Parameter¶
The values you select for Depends on list the other steps that must execute before the step you are annotating.
To add dependencies, use the Depends on pull-down menu to select each step whose output will serve as input for the step you are annotating.
For example, because the step eval_custom evaluates the model created in the step custom_classifier, select custom_classifier from the Depends on pull-down menu when annotating eval_custom.
When selecting steps using the Depends on pull-down menu, identify only steps that are immediate dependencies. Do not include all dependencies back through the machine learning pipeline.
Together, the dependencies for all steps in a pipeline define the execution graph for that pipeline. This helps Kale determine, for example, whether there are branches of your pipeline that can run in parallel.
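To see why listing only immediate dependencies is sufficient, consider the sketch below. It records just the direct dependencies of each step in a plain dictionary (a hypothetical representation chosen for this example, not Kale's internal format) and recovers the full set of upstream steps by walking the graph.

```python
def all_upstream(step, depends_on):
    """Return every step that must run before `step`, derived from immediate dependencies."""
    seen = set()
    stack = list(depends_on.get(step, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, ()))
    return seen


# Only immediate dependencies are recorded for each step.
depends_on = {
    "process_data": [],
    "custom_classifier": ["process_data"],
    "eval_custom": ["custom_classifier"],
}

print(sorted(all_upstream("eval_custom", depends_on)))
# → ['custom_classifier', 'process_data']
```

Even though eval_custom lists only custom_classifier, the graph still implies that process_data runs first.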
Specify Multiple Dependencies: A given step may depend on the outputs from more than one other step. The Depends on pull-down menu enables you to select as many other steps as necessary. Select each dependency one at a time.
Remove Dependencies: To remove a dependency already selected, select the name of that step again from the Depends on pull-down menu. The items in this menu function as toggles for specifying other steps as dependencies.
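The toggle behavior of the menu can be modeled as a set operation. The function `toggle_dependency` below is a hypothetical illustration of that behavior, not part of Kale.

```python
def toggle_dependency(selected, step):
    """Return a new set with `step` added if it was absent, or removed if it was present."""
    return selected ^ {step}


selected = set()
selected = toggle_dependency(selected, "process_data")       # first click selects the step
selected = toggle_dependency(selected, "custom_classifier")  # a second dependency is added
selected = toggle_dependency(selected, "process_data")       # clicking again removes it
print(selected)
# → {'custom_classifier'}
```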
To require that a step run on a GPU, click the GPU button when annotating that step. In the modal that appears, enable the requirement using the toggle, then specify the number and type of GPUs to request.
Parallel Pipeline Steps¶
Pipeline steps that are independent of one another can run in parallel. For example, suppose two steps each depend on a step named process_data but are otherwise independent. Kale uses the dependency graph implied by your pipeline step definitions to orchestrate pipeline runs, taking advantage of your Kubernetes infrastructure to run a pipeline as efficiently as possible.
The step vgg16_classifier can run in parallel with the step custom_classifier.
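One way to see which steps could run at the same time is to group them into batches with the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later). This is only a sketch of the scheduling idea, using a hypothetical dictionary of dependencies; the actual orchestration is handled by Kale and Kubeflow Pipelines.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it immediately depends on.
depends_on = {
    "process_data": set(),
    "vgg16_classifier": {"process_data"},
    "custom_classifier": {"process_data"},
}


def parallel_batches(depends_on):
    """Group steps into batches; all steps within a batch are mutually independent."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies have all finished
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(depends_on)
print(batches)
# → [['process_data'], ['custom_classifier', 'vgg16_classifier']]
```

The second batch contains both classifier steps, which matches the statement that they can run in parallel once process_data has completed.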
Skip Cells¶
Use Skip to annotate notebook cells that you want Kale to ignore as it defines a Kubeflow pipeline.
Common uses of the Skip annotation include marking console logging and other diagnostic code that is useful while developing a step of a pipeline but is not part of your machine learning workflow.
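The effect of the Skip annotation can be pictured as a filter over the notebook's cells. In the sketch below, the cell dictionaries and the label strings `"pipeline_step"` and `"skip"` are hypothetical stand-ins for this example and do not reflect how Kale stores annotations.

```python
# Hypothetical notebook cells with their annotations.
cells = [
    {"label": "pipeline_step", "source": "clean = [v for v in raw if v is not None]"},
    {"label": "skip", "source": "print(clean[:5])  # diagnostic output only"},
    {"label": "pipeline_step", "source": "total = sum(clean)"},
]

# Cells labeled as skip are left out of the code that makes up the pipeline.
pipeline_sources = [cell["source"] for cell in cells if cell["label"] != "skip"]

for source in pipeline_sources:
    print(source)
```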
Annotate Skip Cells¶
To annotate skip cells, edit the first cell containing code you want Kale to ignore: click the pencil icon in the upper-right corner, then select Cell type > Skip Cell.
Pipeline Metrics Cells¶
Annotate a notebook cell with the label Pipeline Metrics to identify code that outputs the results you want to evaluate for a pipeline run.
Based on the variables referenced in a Pipeline Metrics cell, Kale defines pipeline metrics that the Kubeflow Pipelines (KFP) system produces for every pipeline run. In addition, Kale associates each of these metrics with the step that produced it. Tracking pipeline metrics is essential to evaluating performance across multiple runs of a pipeline that have been parameterized differently or modified while still in the experimental phase of developing a model.
Pipeline metrics are also key to the AutoML capabilities of Kubeflow and Kale. For example, you will need to choose a single pipeline metric as the search objective metric for hyperparameter tuning experiments.
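A Pipeline Metrics cell is typically very short: it prints the variables whose values you want to track. The sketch below uses invented sample data and a hypothetical variable name, `accuracy`, which in a real notebook would be computed in an earlier pipeline step rather than next to the print statement.

```python
# Stand-in for values produced by earlier pipeline steps (hypothetical sample data).
labels = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Contents of the Pipeline Metrics cell: print each value to track.
print(accuracy)
# → 0.75
```

A single scalar such as this one is the kind of value you might later select as the search objective metric for a hyperparameter tuning experiment.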
Annotate Pipeline Metrics Cell¶
Pipeline metrics should be considered the result of pipeline execution, not the result of an individual step. Annotate only one cell with Pipeline Metrics, and make that cell the last cell in your notebook.
To identify pipeline metrics, edit the cell containing pipeline metrics statements: click the pencil icon in the upper-right corner, then select Cell type > Pipeline Metrics.