Serving Testbed

This section includes example manifests similar to the ones we use in our performance evaluation, so that you can spawn an inference service, benchmark it, and reproduce our findings.

What You’ll Need

Note

Follow Along: Use tensorflow==2.6.2, follow the Keras MNIST convnet code example, and save the model locally. Triton expects the following directory structure:

tf-triton-mnist-repo/
└── tf-triton-mnist
    ├── 1
    │   └── model.savedmodel
    │       ├── assets
    │       ├── keras_metadata.pb
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt

For the above example, config.pbtxt should look like:

name: "tf-triton-mnist"
platform: "tensorflow_savedmodel"
max_batch_size: 100
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 28, 28, 1 ]
  }
]
output [
  {
    name: "dense"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
instance_group [
  {
    count: 10
    kind: KIND_CPU
  }
]
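If you prefer to script the repository layout, the following sketch creates the directories and writes config.pbtxt. The final model.save call is shown as a comment, since it requires TensorFlow and your trained Keras model object:

```python
from pathlib import Path

# Create the model repository layout that Triton expects.
repo = Path("tf-triton-mnist-repo") / "tf-triton-mnist"
(repo / "1").mkdir(parents=True, exist_ok=True)

# Write the config.pbtxt shown above.
config = """name: "tf-triton-mnist"
platform: "tensorflow_savedmodel"
max_batch_size: 100
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 28, 28, 1 ]
  }
]
output [
  {
    name: "dense"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
instance_group [
  {
    count: 10
    kind: KIND_CPU
  }
]
"""
(repo / "config.pbtxt").write_text(config)

# With TensorFlow installed and the trained convnet in `model`, save it
# into version 1 of the repository:
# model.save(str(repo / "1" / "model.savedmodel"))
```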

Infrastructure

Our experiment ran on an EKS cluster with two node groups:

  • general-workers: m5d.4xlarge (16 CPUs, 64 GiB RAM) for EKF components and clients.
  • huge-workers: m5d.16xlarge (64 CPUs, 256 GiB RAM) for serving Pods and IGW.

Inference Service

The testbed uses a simple inference service with a Triton server as predictor, serving a TensorFlow model trained with the MNIST dataset. This guide assumes that the model name is tf-triton-mnist. The internal URL of the inference service is http://tf-triton-mnist.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer.

Note

Follow Along: Deploy each of the introduced manifests in the kubeflow-user namespace. You can do this by copying the file contents in each of the following steps and pasting them as input to the following command:

$ kubectl apply -n kubeflow-user -f -

You can do this from an environment with access to your EKF cluster and kubectl installed, which could be your management environment or a notebook server in your Arrikto EKF deployment, for example.

  1. Create a PVC:

    pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: tf-triton-mnist
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: rok
      volumeMode: Filesystem
    ---
    apiVersion: kubeflow.org/v1alpha1
    kind: PVCViewer
    metadata:
      name: tf-triton-mnist
    spec:
      pvc: tf-triton-mnist
  2. Copy your trained model inside the PVC:

    $ kubectl cp tf-triton-mnist-repo kubeflow-user/pvc-viewer-tf-triton-mnist:/tf-triton-mnist/
  3. Create the inference service using the PVC:

    isvc.yaml
    apiVersion: "serving.kubeflow.org/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "tf-triton-mnist"
      annotations:
        sidecar.istio.io/inject: "true"
        proxy.istio.io/config: |
          concurrency: 0
        # If this setting is 0, then Activator will be in the request path only
        # when the revision is scaled to 0.
        autoscaling.knative.dev/targetBurstCapacity: "0"
        autoscaling.knative.dev/target: "10"
        # This does not work as expected
        #queue.sidecar.serving.knative.dev/resourcePercentage: "30"
        sidecar.istio.io/proxyCPULimit: 4000m
        sidecar.istio.io/proxyCPU: 4000m
    spec:
      predictor:
        containerConcurrency: 0
        minReplicas: 0
        maxReplicas: 10
        triton:
          storageUri: "pvc://tf-triton-mnist/tf-triton-mnist-repo"
          runtimeVersion: 21.10-py3
          env:
          - name: OMP_NUM_THREADS
            value: "1"
          resources:
            limits:
              cpu: 20
              memory: 8Gi
            requests:
              cpu: 20
              memory: 8Gi

Clients

This experiment uses jobs running the hey client to hit the inference service and make predictions. The client runs in the same cluster and the same namespace as the inference service.

Note

Follow Along: Deploy each of the introduced manifests in your namespace. You can do this by copying the file contents in each of the following steps and pasting them as input to the following command:

$ kubectl apply -n kubeflow-user -f -

You can do this from an environment with access to your EKF cluster and kubectl installed, which could be your management environment or a notebook server in your Arrikto EKF deployment, for example.

  1. Create a Service for direct communication with the Triton server:

    service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: triton
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: 8080
      - name: http-queue
        port: 8012
        protocol: TCP
        targetPort: 8012
      - name: http-usermetric
        port: 9091
        protocol: TCP
        targetPort: 9091
      selector:
        serving.kubeflow.org/inferenceservice: tf-triton-mnist
        component: predictor
      type: ClusterIP
  2. Add an explicit AuthorizationPolicy for direct communication with the Triton server:

    ap.yaml
    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: triton
    spec:
      action: ALLOW
      rules:
      - to:
        - operation:
            hosts:
            - triton
            - triton.kubeflow-user.svc.cluster.local
            - tf-triton-mnist
            - tf-triton-mnist.kubeflow-user.svc.cluster.local

    Note

    This is primarily because we use clients without an Istio sidecar.

  3. Create a ConfigMap that uses the MNIST dataset as inference input:

    configmap.yaml
    1apiVersion: v1
    2kind: ConfigMap
    3metadata:
    4 name: mnist-input
    5data:
    6 mnist_input.json: '{"model_name": "tf-triton-mnist", "model_version": "1", "inputs": [{"name": "input_1", "datatype": "FP32", "shape": [1, 28, 28, 1], "data": [[[[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.3294117748737335], [0.7254902124404907], [0.6235294342041016], [0.5921568870544434], [0.23529411852359772], [0.1411764770746231], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.8705882430076599], [0.9960784316062927], [0.9960784316062927], [0.9960784316062927], [0.9960784316062927], 
[0.9450980424880981], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.6666666865348816], [0.20392157137393951], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.26274511218070984], [0.4470588266849518], [0.2823529541492462], [0.4470588266849518], [0.6392157077789307], [0.8901960849761963], [0.9960784316062927], [0.8823529481887817], [0.9960784316062927], [0.9960784316062927], [0.9960784316062927], [0.9803921580314636], [0.8980392217636108], [0.9960784316062927], [0.9960784316062927], [0.5490196347236633], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.06666667014360428], [0.25882354378700256], [0.054901961237192154], [0.26274511218070984], [0.26274511218070984], [0.26274511218070984], [0.23137255012989044], [0.08235294371843338], [0.9254902005195618], [0.9960784316062927], [0.4156862795352936], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.32549020648002625], [0.9921568632125854], [0.8196078538894653], [0.07058823853731155], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.08627451211214066], [0.9137254953384399], [1.0], [0.32549020648002625], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.5058823823928833], [0.9960784316062927], [0.9333333373069763], [0.1725490242242813], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.23137255012989044], 
[0.9764705896377563], [0.9960784316062927], [0.24313725531101227], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.5215686559677124], [0.9960784316062927], [0.7333333492279053], [0.019607843831181526], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.03529411926865578], [0.8039215803146362], [0.9725490212440491], [0.22745098173618317], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.4941176474094391], [0.9960784316062927], [0.7137255072593689], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.29411765933036804], [0.9843137264251709], [0.9411764740943909], [0.2235294133424759], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.07450980693101883], [0.8666666746139526], [0.9960784316062927], [0.6509804129600525], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0117647061124444], [0.7960784435272217], [0.9960784316062927], [0.8588235378265381], [0.13725490868091583], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.14901961386203766], [0.9960784316062927], [0.9960784316062927], [0.3019607961177826], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], 
[0.0], [0.0], [0.0], [0.12156862765550613], [0.8784313797950745], [0.9960784316062927], [0.45098039507865906], [0.003921568859368563], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.5215686559677124], [0.9960784316062927], [0.9960784316062927], [0.20392157137393951], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.239215686917305], [0.9490196108818054], [0.9960784316062927], [0.9960784316062927], [0.20392157137393951], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.4745098054409027], [0.9960784316062927], [0.9960784316062927], [0.8588235378265381], [0.1568627506494522], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.4745098054409027], [0.9960784316062927], [0.8117647171020508], [0.07058823853731155], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]]]]}]}'
  4. Spawn a single job that hits the KServe internal endpoint:

    job.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-triton-load-test
    spec:
      parallelism: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          tolerations:
          - effect: NoSchedule
            operator: Exists
            key: node.kubernetes.io/unschedulable
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: serving.kubeflow.org/inferenceservice
                    operator: In
                    values:
                    - tf-triton-mnist
                topologyKey: "kubernetes.io/hostname"
          nodeSelector:
            role: general-workers
          containers:
          - name: hey
            image: gcr.io/arrikto-playground/hey:omri1
            args:
            - -t
            - "0"
            - -z
            - 60s
            - -c
            - "50" # concurrency
            - -m
            - POST
            - -D
            - /app/mnist_input.json
            # KServe internal URL
            - http://tf-triton-mnist.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer
            # Triton
            #- http://triton.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer
            # Knative queue-proxy
            #- http://triton.kubeflow-user.svc.cluster.local:8012/v2/models/tf-triton-mnist/infer
            # KServe external URL
            #- -H
            #- "Authorization: Bearer TOKEN"
            #- -disable-redirects
            #- https://tf-triton-mnist-kubeflow-user.serving.example.com/v2/models/tf-triton-mnist/infer
            imagePullPolicy: Always
            volumeMounts:
            - name: input
              mountPath: /app/mnist_input.json
              subPath: mnist_input.json
            # 1x50 uses 210% CPU
            # 1x75 uses 283% CPU
            # 1x100 uses 341% CPU
            # 1x150 uses 445% CPU
            resources:
              limits:
                cpu: 3
                memory: 1Gi
              requests:
                cpu: 3
                memory: 1Gi
          restartPolicy: Never
          volumes:
          - name: input
            configMap:
              name: mnist-input
              defaultMode: 511
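If you want to benchmark with a different input image, you can regenerate the ConfigMap payload programmatically. The following sketch (an illustrative helper, not part of the testbed) builds a KServe v2 inference request body for a single 28x28x1 image of zeros, matching the shape declared in config.pbtxt:

```python
import json

def build_v2_request(image, model_name="tf-triton-mnist", model_version="1"):
    """Build a KServe v2 inference request body for one 28x28x1 image.

    `image` is a nested list of floats with shape [28][28][1].
    """
    return {
        "model_name": model_name,
        "model_version": model_version,
        "inputs": [
            {
                "name": "input_1",
                "datatype": "FP32",
                "shape": [1, 28, 28, 1],
                "data": [image],  # leading batch dimension of 1
            }
        ],
    }

# All-zero image, the same shape as the ConfigMap example above.
image = [[[0.0] for _ in range(28)] for _ in range(28)]
payload = json.dumps(build_v2_request(image))
```

Write the resulting string into the mnist_input.json key of the ConfigMap.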

Hey Usage

The job above uses the hey client with the following options:

- -t
- "0"
- -z
- 60s
- -c
- "50"
- -m
- POST
- -D
- /app/mnist_input.json

Specifically:

  • -t 0 sets an infinite request timeout
  • -z 60s sets how long the job sends requests
  • -c 50 sets the number of workers to run concurrently

The above manifest also uses the following parameters that you can adjust according to your needs.

Concurrency

The job above uses concurrency 50. You can change that using the following:

- -c
- "25"

If you want more concurrency, make sure you also adjust the resource requests and limits of the Job, since the client will need more CPU. 3 CPUs should be enough for concurrency up to 75.
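The CPU measurements noted as comments in the job manifest (concurrency 50 → 210%, 75 → 283%, 100 → 341%, 150 → 445% of one core) grow roughly linearly with concurrency, so you can estimate CPU needs for other values with a simple least-squares fit. This sketch is illustrative, not part of the testbed:

```python
# CPU usage (% of one core) measured for a single hey job at different
# concurrency levels, taken from the job manifest comments.
samples = [(50, 210), (75, 283), (100, 341), (150, 445)]

# Ordinary least-squares fit: cpu ≈ a * concurrency + b
n = len(samples)
sx = sum(c for c, _ in samples)
sy = sum(u for _, u in samples)
sxx = sum(c * c for c, _ in samples)
sxy = sum(c * u for c, u in samples)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def estimate_cpu(concurrency):
    """Rough CPU estimate (% of one core) for a given -c value."""
    return a * concurrency + b
```

For example, estimate_cpu(125) lands between the measured values for concurrency 100 and 150; round up when setting the Job's CPU request.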

Parallelism

Change the parallelism field to use more concurrent jobs:

spec:
  parallelism: 3

Note

To aggregate results, make sure that all of the Pods start at the same time, that is, none remains Pending due to lack of resources.

Prediction URL

To compare the performance overhead of each component, you can change the inference URL to hit specific components directly. Here are some examples:

Hit the Triton server directly:

- http://triton.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer

Hit the Knative queue-proxy directly:

- http://triton.kubeflow-user.svc.cluster.local:8012/v2/models/tf-triton-mnist/infer

Hit the KServe internal URL:

- http://tf-triton-mnist.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer

Hit the KServe external URL:

- -H
- "Authorization: Bearer <TOKEN>"
- -disable-redirects
- https://tf-triton-mnist-kubeflow-user.serving.example.com/v2/models/tf-triton-mnist/infer

Replace <TOKEN> with a short-lived token that you obtain following the Access Services with External Clients guide.

Output

After the job completes, inspect each Pod's logs to see what hey reports:

$ kubectl get pods -n kubeflow-user -l job-name=mnist-triton-load-test -o name | \
>     xargs -n1 kubectl logs -n kubeflow-user

Summary:
  Total:        60.0067 secs
  Slowest:      0.1503 secs
  Fastest:      0.0023 secs
  Average:      0.0048 secs
  Requests/sec: 10479.6404

Response time histogram:
  0.002 [1]      |
  0.017 [628705] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.032 [91]     |
  0.047 [0]      |
  0.061 [2]      |
  0.076 [0]      |
  0.091 [0]      |
  0.106 [0]      |
  0.121 [4]      |
  0.135 [37]     |
  0.150 [9]      |

Latency distribution:
  10% in 0.0035 secs
  25% in 0.0038 secs
  50% in 0.0044 secs
  75% in 0.0053 secs
  90% in 0.0067 secs
  95% in 0.0076 secs
  99% in 0.0093 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0000 secs, 0.0061 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0052 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0036 secs
  resp wait:  0.0047 secs, 0.0022 secs, 0.1439 secs
  resp read:  0.0000 secs, 0.0000 secs, 0.0089 secs

Status code distribution:
  [200] 628849 responses

From the above output, note:

  • RPS (Requests/sec in the Summary section)
  • p95 latency (95% in the Latency distribution section)
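When running many jobs, extracting these two numbers by hand gets tedious. A small parsing sketch (illustrative, not part of the testbed) over hey's output:

```python
import re

def parse_hey(output):
    """Extract requests/sec and p95 latency (secs) from hey's summary output."""
    rps = float(re.search(r"Requests/sec:\s*([\d.]+)", output).group(1))
    p95 = float(re.search(r"95% in ([\d.]+) secs", output).group(1))
    return rps, p95

# Sample trimmed from the output shown above.
sample = """
Summary:
  Requests/sec: 10479.6404
Latency distribution:
  95% in 0.0076 secs
"""
rps, p95 = parse_hey(sample)
```

Pipe the kubectl logs output of each Pod through this function to tabulate RPS and p95 across runs.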

What’s Next

To learn more about Serving on EKF, check out the rest of our user guides for Serving.