Serve Scikit Learn Models¶

This section will guide you through serving a Scikit Learn model, using the Kale serve API.

Overview

What You’ll Need
Procedure
Summary
What’s Next

What You’ll Need ¶

An Arrikto EKF or MiniKF deployment with the default Kale Docker image.
An understanding of how the Kale SDK works.
An understanding of how the Kale serve API works.

Procedure ¶

This guide comprises three sections. In the first section, you will explore and process the dataset. In the second section, you will use the Kale SDK to build a Machine Learning (ML) pipeline that trains and serves a Scikit Learn model. In the third section, you will invoke the model service to get predictions on a holdout test subset.

Explore Dataset ¶

In this section, you will work with the 20newsgroups dataset. The 20newsgroups dataset consists of around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training and one for testing. The end goal is to classify each post into one of the 20 topics.

Create a new notebook server using the default Kale Docker image. The image will have the following naming scheme:

gcr.io/arrikto/jupyter-kale-py38:<IMAGE_TAG>

Note

The <IMAGE_TAG> varies based on the MiniKF or Arrikto EKF release.

Connect to the server and create a new Jupyter notebook (that is, an IPYNB file).

Copy and paste the import statements in the first code cell, and run it:

import json

from sklearn.feature_extraction import text
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from kale.serve import Endpoint

This is what your notebook cell will look like:
In a different code cell, fetch the dataset and print the topic names. Copy and paste the following code, and run it:

# download dataset
newsgroups_dataset = fetch_20newsgroups(random_state=42)

# dataset target groups
class_names = newsgroups_dataset.target_names
print(*class_names, sep="\n")

This is what your notebook cell will look like:

The output of the cell prints the 20 targets. You can see that posts in this dataset are classified into a diverse set of topics, including religion, politics, and sports.
Load the features and targets of the dataset, and split it into train and test subsets. In a new cell, copy and paste the following code, and run it:

# create the dataset
x = newsgroups_dataset.data
y = newsgroups_dataset.target

# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=42)

This is what your notebook cell will look like:
Run the following code in a new cell to visualize an example from the training subset:

# print a random example and its topic
index = 30
print(x_train[index])
print("Topic:", class_names[y_train[index]])

This is what your notebook cell will look like:

The output of the cell prints the text of a random example and its topic. You can see that the post asks a question about image enhancement and it is classified under the topic of comp.graphics.
Use the TF-IDF vectorizer to transform the raw training and test subsets into a form that you can use to train a machine learning model:

# calculate TF-IDF vectors
stop_words = text.ENGLISH_STOP_WORDS
vectorizer = TfidfVectorizer(stop_words=stop_words)
x_train_transformed = vectorizer.fit_transform(x_train)
x_test_transformed = vectorizer.transform(x_test)

This is what your notebook cell will look like:

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a statistical measure of how important a word is to a document relative to a corpus, computed as the product of two terms:
- Term Frequency (TF): computes the frequency of words appearing in a document.
- Inverse Document Frequency (IDF): provides you with the importance of each word by weighting down the frequent words and scaling up the rare ones.
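
To make the product concrete, the following is a minimal sketch of the textbook formulation, tf(t, d) * log(N / df(t)), on a hypothetical three-document corpus. The function name `tf_idf` and the toy corpus are illustrative assumptions. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its values differ from the ones this sketch produces.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # document frequency: in how many documents each term appears
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency times inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# hypothetical toy corpus, not taken from 20newsgroups
corpus = [
    "the game was a great game",
    "the engine of the car",
    "the goalie saved the game",
]
weights = tf_idf(corpus)
```

In this toy corpus, "the" appears in every document, so its weight is zero, while a word that appears in a single document, such as "engine", receives a positive weight.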

Serve Scikit Learn Model ¶

In this section, you will build a pipeline that trains a Naive Bayes classifier to categorize the posts into different topics.

In the same notebook server, open a terminal, create a new Python file, and name it serve_sklearn_model.py:

$ touch serve_sklearn_model.py
Create a new folder where you will place the transformer assets:

$ mkdir transformer_package
Inside the transformer_package folder, create a new Python file, and name it transformer.py:

$ cd transformer_package && touch transformer.py

Copy and paste the following code inside transformer.py:

sklearn_transformer.py

# Copyright © 2022 Arrikto Inc.  All Rights Reserved.

"""Transformer.

This script defines a serving transformer which can preprocess raw data
and postprocess the predictions.
"""

import joblib
import kserve

from kale.serve import utils
from typing import Dict


class_names = ['alt.atheism', 'comp.graphics',
               'comp.os.ms-windows.misc',
               'comp.sys.ibm.pc.hardware',
               'comp.sys.mac.hardware',
               'comp.windows.x', 'misc.forsale',
               'rec.autos', 'rec.motorcycles',
               'rec.sport.baseball', 'rec.sport.hockey',
               'sci.crypt', 'sci.electronics', 'sci.med',
               'sci.space', 'soc.religion.christian',
               'talk.politics.guns', 'talk.politics.mideast',
               'talk.politics.misc', 'talk.religion.misc']


class Transformer(kserve.Model):
    """Transform the data.

    Vectorize the input data before passing it to the
    model and return human-readable predictions.

    Args:
        model_name (str): The name of the Transformer
        predictor_host (str): The host address of the Predictor
        protocol (str): The inference protocol version
    """

    def __init__(self, model_name: str, predictor_host: str,
                 protocol: str = "v1"):
        super().__init__(model_name)
        self.predictor_host = predictor_host
        self.protocol = protocol

        # load the vectorizer object
        path = utils.get_transformer_asset("vectorizer.joblib")
        with open(path, "rb") as f:
            self.vectorizer = joblib.load(f)

    def preprocess(self, inputs: Dict):
        """Preprocess the dataset."""
        transformed_data = self.vectorizer.transform(inputs["instances"])
        return {'instances': transformed_data.toarray().tolist()}

    def postprocess(self, inputs: Dict):
        """Postprocess the predictions."""
        return {"predictions": [class_names[i] for i in inputs["predictions"]]}

The Transformer class you defined extends the kserve.Model class, and overrides the preprocess and postprocess methods.

KServe calls the preprocess method before the server feeds the data to the model, to transform it into a form that the model understands.
KServe calls the postprocess method on the model’s predictions, to return a human-readable result.

The preprocess method has a global dependency: a TF-IDF vectorizer. To load this dependency, use the get_transformer_asset function, which knows how to find the file. More on this later, as you build the training pipeline.
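
The request flow described above can be sketched in plain Python. This is an illustration only: `StubVectorizer`, `stub_predictor`, and `handle_request` are hypothetical stand-ins for the fitted TF-IDF vectorizer, the predictor, and the model server, and they do not use the KServe API.

```python
from typing import Dict, List

# hypothetical, shortened label list; the real transformer has 20 topics
CLASS_NAMES = ["alt.atheism", "comp.graphics", "rec.autos"]


class StubVectorizer:
    """Stand-in for the fitted TF-IDF vectorizer (counts two keywords)."""

    def transform(self, texts: List[str]) -> List[List[float]]:
        return [[float(t.lower().count("image")),
                 float(t.lower().count("car"))] for t in texts]


def preprocess(inputs: Dict, vectorizer: StubVectorizer) -> Dict:
    """Turn raw text into numeric features, as the transformer does."""
    return {"instances": vectorizer.transform(inputs["instances"])}


def stub_predictor(inputs: Dict) -> Dict:
    """Stand-in for the model: pick class index 1 or 2 per feature row."""
    return {"predictions": [1 if image >= car else 2
                            for image, car in inputs["instances"]]}


def postprocess(outputs: Dict) -> Dict:
    """Map class indices back to human-readable topic names."""
    return {"predictions": [CLASS_NAMES[i] for i in outputs["predictions"]]}


def handle_request(request: Dict) -> Dict:
    """Chain the three stages in the order the server applies them."""
    features = preprocess(request, StubVectorizer())
    raw_predictions = stub_predictor(features)
    return postprocess(raw_predictions)


response = handle_request({"instances": ["How do I sharpen this image?",
                                         "My car will not start."]})
print(response)
```

The caller sends raw text and receives topic names; the numeric features and class indices exist only between the preprocess and postprocess stages.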

Return to your home directory:

$ cd

Copy and paste the following code inside serve_sklearn_model.py:

sklearn_starter.py

# Copyright © 2022 Arrikto Inc.  All Rights Reserved.

"""Kale SDK.

This script uses an ML pipeline to train and serve an SKLearn Model.
"""

import os
import joblib

from typing import Tuple, NamedTuple

from sklearn.feature_extraction import text
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from kale.types import MarshalData
from kale.sdk import pipeline, step
from kale.common import mlmdutils, artifacts


ASSETS_PATH = "/home/jovyan/transformer_package/"


@step(name="data_loading")
def load_split_dataset() -> Tuple[MarshalData, MarshalData]:
    """Fetch the 20newsgroups dataset."""
    # load the data
    newsgroups_dataset = fetch_20newsgroups(random_state=42)
    x = newsgroups_dataset.data
    y = newsgroups_dataset.target

    x, _, y, _ = train_test_split(x, y, test_size=.2, random_state=42)

    return x, y


@step(name="data_preprocess")
def preprocess(x: MarshalData) -> Tuple[MarshalData, int]:
    """Preprocess the input data."""
    # get stopwords
    stop_words = text.ENGLISH_STOP_WORDS
    # TF-IDF vectors
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    x_processed = vectorizer.fit_transform(x)

    with open(os.path.join(ASSETS_PATH, "vectorizer.joblib"), "wb") as f:
        joblib.dump(vectorizer, f)

    # create and submit a Transformer artifact
    mlmd = mlmdutils.get_mlmd_instance()

    transformer_artifact = artifacts.Transformer(
        name="Vectorizer",
        transformer_dir=ASSETS_PATH,
        module_name="transformer",
        class_name="Transformer",
        is_stateful=True
    ).submit_artifact()

    mlmd.link_artifact_as_output(transformer_artifact.id)

    return x_processed, transformer_artifact.id


@step(name="model_training")
def train(x: MarshalData,
          y: MarshalData) -> NamedTuple("outs", [("model", MarshalData)]):
    """Train a MultinomialNB model."""
    classifier = MultinomialNB(alpha=.01)
    model = classifier.fit(x, y)
    return model


@pipeline(name="classification", experiment="sklearn-tutorial")
def ml_pipeline():
    """Run the ML pipeline."""
    x, y = load_split_dataset()
    x_processed, _ = preprocess(x)
    train(x_processed, y)


if __name__ == "__main__":
    ml_pipeline()

This script defines a KFP run using the Kale SDK. Specifically, it defines a pipeline with three steps:

The first step (data_loading) loads and splits the 20newsgroups dataset.
The second step (data_preprocess) transforms the raw datasets using the TF-IDF vectorizer and creates a Transformer artifact.
The third step (model_training) trains a Naive Bayes classifier.

Pay closer attention to the preprocess step. This step uses a TF-IDF vectorizer to transform the raw datasets into a form that the model can understand. Then, it saves the vectorizer variable inside the transformer_package folder you created previously. Finally, it creates a Transformer artifact by passing the directory of the transformer assets, the name of the transformer module, and the name of the transformer class.

Kale will

move the transformer_package folder to a location it controls (that’s how the get_transformer_asset function knows how to retrieve the assets), and
create and submit a kale.Transformer artifact to MLMD.
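
The save-then-load handoff between the pipeline step and the transformer can be sketched with the standard library. This is a minimal illustration, not Kale's implementation: `pickle` stands in for `joblib`, a temporary directory stands in for transformer_package, and `find_asset` is a hypothetical helper that plays the role of get_transformer_asset.

```python
import os
import pickle
import tempfile


def save_asset(assets_dir: str, name: str, obj) -> str:
    """Pipeline side: persist a fitted object next to the transformer."""
    path = os.path.join(assets_dir, name)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def find_asset(assets_dir: str, name: str) -> str:
    """Serving side: resolve an asset by file name (hypothetical helper)."""
    path = os.path.join(assets_dir, name)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Asset not found: {path}")
    return path


with tempfile.TemporaryDirectory() as assets_dir:
    # stand-in for the fitted vectorizer: a term-to-column vocabulary
    vocabulary = {"image": 0, "car": 1}
    save_asset(assets_dir, "vectorizer.pkl", vocabulary)

    with open(find_asset(assets_dir, "vectorizer.pkl"), "rb") as f:
        restored = pickle.load(f)
```

The point of the sketch is that the serving side only needs the file name of the asset; where the directory lives at serving time is resolved for it.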

See also

How to submit a Kale Transformer artifact to MLMD.

Create a new step function which logs an SKLearnModel artifact, using the Kale API. The following snippet summarizes the changes in code:

Important

From this point on, running the pipeline locally will not work. After you introduce the register_model step, you must run the pipeline as a KFP pipeline, because this step creates a Kubeflow artifact.

sklearn_log_model_artifact.py

# Copyright © 2022 Arrikto Inc.  All Rights Reserved.

"""Kale SDK.

This script uses an ML pipeline to train and serve an SKLearn Model.
"""

import os
import joblib

from typing import Tuple, NamedTuple

from sklearn.feature_extraction import text
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

+from kale.ml import Signature
from kale.types import MarshalData
from kale.sdk import pipeline, step
from kale.common import mlmdutils, artifacts


ASSETS_PATH = "/home/jovyan/transformer_package/"


@step(name="data_loading")
def load_split_dataset() -> Tuple[MarshalData, MarshalData]:
    """Fetch the 20newsgroups dataset."""
    # load the data
    newsgroups_dataset = fetch_20newsgroups(random_state=42)
    x = newsgroups_dataset.data
    y = newsgroups_dataset.target

    x, _, y, _ = train_test_split(x, y, test_size=.2, random_state=42)

    return x, y


@step(name="data_preprocess")
def preprocess(x: MarshalData) -> Tuple[MarshalData, int]:
    """Preprocess the input data."""
    # get stopwords
    stop_words = text.ENGLISH_STOP_WORDS
    # TF-IDF vectors
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    x_processed = vectorizer.fit_transform(x)

    with open(os.path.join(ASSETS_PATH, "vectorizer.joblib"), "wb") as f:
        joblib.dump(vectorizer, f)

    # create and submit a Transformer artifact
    mlmd = mlmdutils.get_mlmd_instance()

    transformer_artifact = artifacts.Transformer(
        name="Vectorizer",
        transformer_dir=ASSETS_PATH,
        module_name="transformer",
        class_name="Transformer",
        is_stateful=True
    ).submit_artifact()

    mlmd.link_artifact_as_output(transformer_artifact.id)

    return x_processed, transformer_artifact.id


@step(name="model_training")
def train(x: MarshalData,
          y: MarshalData) -> NamedTuple("outs", [("model", MarshalData)]):
    """Train a MultinomialNB model."""
    classifier = MultinomialNB(alpha=.01)
    model = classifier.fit(x, y)
    return model


+@step(name="register_model")
+def register_model(model: MarshalData, x: MarshalData, y: MarshalData) -> int:
+    mlmd = mlmdutils.get_mlmd_instance()
+
+    signature = Signature(
+        input_size=[1] + list(x[0].shape),
+        output_size=[1] + list(y[0].shape),
+        input_dtype=x.dtype,
+        output_dtype=y.dtype)
+
+    model_artifact = artifacts.SklearnModel(
+        model=model,
+        description="A simple MultinomialNB model",
+        version="1.0.0",
+        author="Kale",
+        signature=signature,
+        tags={"app": "sklearn-tutorial"}).submit_artifact()
+
+    mlmd.link_artifact_as_output(model_artifact.id)
+    return model_artifact.id
+
+
@pipeline(name="classification", experiment="sklearn-tutorial")
def ml_pipeline():
    """Run the ML pipeline."""
    x, y = load_split_dataset()
    x_processed, _ = preprocess(x)
-    train(x_processed, y)
+    model = train(x_processed, y)
+    register_model(model, x_processed, y)


if __name__ == "__main__":
    ml_pipeline()

Create a new step function which serves the SKLearnModel artifact you created in the previous step, using the Kale serve API. The following snippet summarizes the changes in code:

sklearn_serve.py

# Copyright © 2022 Arrikto Inc.  All Rights Reserved.

"""Kale SDK.

This script uses an ML pipeline to train and serve an SKLearn Model.
"""

import os
import joblib

from typing import Tuple, NamedTuple

from sklearn.feature_extraction import text
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

+from kale.serve import serve
from kale.ml import Signature
from kale.types import MarshalData
from kale.sdk import pipeline, step
from kale.common import mlmdutils, artifacts


ASSETS_PATH = "/home/jovyan/transformer_package/"


@step(name="data_loading")
def load_split_dataset() -> Tuple[MarshalData, MarshalData]:
    """Fetch the 20newsgroups dataset."""
    # load the data
    newsgroups_dataset = fetch_20newsgroups(random_state=42)
    x = newsgroups_dataset.data
    y = newsgroups_dataset.target

    x, _, y, _ = train_test_split(x, y, test_size=.2, random_state=42)

    return x, y


@step(name="data_preprocess")
def preprocess(x: MarshalData) -> Tuple[MarshalData, int]:
    """Preprocess the input data."""
    # get stopwords
    stop_words = text.ENGLISH_STOP_WORDS
    # TF-IDF vectors
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    x_processed = vectorizer.fit_transform(x)

    with open(os.path.join(ASSETS_PATH, "vectorizer.joblib"), "wb") as f:
        joblib.dump(vectorizer, f)

    # create and submit a Transformer artifact
    mlmd = mlmdutils.get_mlmd_instance()

    transformer_artifact = artifacts.Transformer(
        name="Vectorizer",
        transformer_dir=ASSETS_PATH,
        module_name="transformer",
        class_name="Transformer",
        is_stateful=True
    ).submit_artifact()

    mlmd.link_artifact_as_output(transformer_artifact.id)

    return x_processed, transformer_artifact.id


@step(name="model_training")
def train(x: MarshalData,
          y: MarshalData) -> NamedTuple("outs", [("model", MarshalData)]):
    """Train a MultinomialNB model."""
    classifier = MultinomialNB(alpha=.01)
    model = classifier.fit(x, y)
    return model


@step(name="register_model")
def register_model(model: MarshalData, x: MarshalData, y: MarshalData) -> int:
    mlmd = mlmdutils.get_mlmd_instance()

    signature = Signature(
        input_size=[1] + list(x[0].shape),
        output_size=[1] + list(y[0].shape),
        input_dtype=x.dtype,
        output_dtype=y.dtype)

    model_artifact = artifacts.SklearnModel(
        model=model,
        description="A simple MultinomialNB model",
        version="1.0.0",
        author="Kale",
        signature=signature,
        tags={"app": "sklearn-tutorial"}).submit_artifact()

    mlmd.link_artifact_as_output(model_artifact.id)
    return model_artifact.id


+@step(name="serve_model")
+def serve_model(model_artifact_id: int, transformer_artifact_id: int):
+    serve_config = {"limits": {"memory": "4Gi"},
+                    "annotations": {"sidecar.istio.io/inject": "false"}}
+    serve(name="sklearn-tutorial",
+          model_id=model_artifact_id,
+          transformer_id=transformer_artifact_id,
+          serve_config=serve_config)
+
+
@pipeline(name="classification", experiment="sklearn-tutorial")
def ml_pipeline():
    """Run the ML pipeline."""
    x, y = load_split_dataset()
-    x_processed, _ = preprocess(x)
+    x_processed, transformer_artifact_id = preprocess(x)
    model = train(x_processed, y)
-    register_model(model, x_processed, y)
+    artifact_id = register_model(model, x_processed, y)
+    serve_model(artifact_id, transformer_artifact_id)


if __name__ == "__main__":
    ml_pipeline()

Deploy and run your code as a KFP pipeline:

$ python3 -m kale serve_sklearn_model.py --kfp
Select Runs to view the KFP run you just created. This is what it looks like when the pipeline completes successfully:
When the register_model step completes, you can view the model artifact through the KFP UI:
Wait until the pipeline completes. Check the Logs tab of the serve_model step to see whether the InferenceService is running.
Select Models and click on the endpoint you created:

Get Predictions ¶

In this section, you will query the model endpoint to get predictions for the posts in the test subset.
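
The exchange with the endpoint in this section is a JSON request with an "instances" list and a JSON response with a "predictions" list, matching the keys that the transformer's preprocess and postprocess methods use. The sketch below only builds and parses such bodies with the standard library. It makes no network call, `build_request` and `parse_response` are hypothetical helpers, and the response string is made up for illustration.

```python
import json
from typing import List


def build_request(posts: List[str]) -> str:
    """Serialize raw posts into a request body with an "instances" key."""
    if not all(isinstance(post, str) for post in posts):
        raise TypeError("Every instance must be a string")
    return json.dumps({"instances": posts})


def parse_response(body: str) -> List[str]:
    """Extract the list of predicted topics from a response body."""
    return json.loads(body)["predictions"]


request_body = build_request(["Which graphics card should I buy?"])

# made-up response body, shaped like the transformer's postprocess output
response_body = '{"predictions": ["comp.graphics"]}'
topics = parse_response(response_body)
```

Each entry in "predictions" corresponds, in order, to one entry in "instances", so a request with a single post yields a single topic.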

Navigate to the Models UI to retrieve the name of the InferenceService. In this example, it is sklearn-tutorial.
In the existing notebook, in a different code cell, initialize a Kale Endpoint object using the name of the InferenceService you retrieved in the previous step. Then, run the cell:

endpoint = Endpoint(name="sklearn-tutorial")

Note

When initializing an Endpoint, you can also pass the namespace of the InferenceService. For example, if your namespace is my-namespace:

endpoint = Endpoint(name="sklearn-tutorial", namespace="my-namespace")

If you do not provide one, Kale assumes the namespace of the notebook server. In this example, it is kubeflow-user.

This is what your notebook cell will look like:
Visualize the test sample you will use for the prediction request. Copy and paste the following code in a new cell, and run it:

# visualize the test sample you will use
index_test = 2
print(x_test[index_test])
print("Topic:", class_names[y_test[index_test]])

This is what your notebook cell will look like:
Prepare the data payload for the prediction request. Copy and paste the following code in a new cell, and run it:

# convert the test sample into JSON format
data = {"instances": [x_test[index_test]]}

This is what your notebook cell will look like:
Invoke the server to get predictions. Copy and paste the following snippet in a different code cell, and run it:

# get and print the prediction
res = endpoint.predict(json.dumps(data))
print(f"The prediction is {res['predictions']}")

This is what your notebook cell will look like:

Summary ¶

You have successfully created a Kubeflow pipeline that trains a Scikit Learn model, logs it in MLMD, and creates a model endpoint using the Kale serve API.

What’s Next ¶

Check out how you can serve a TensorFlow model.
