Serving Testbed¶
This section includes example manifests like the ones we use in our performance evaluation so that you can spawn an inference service and benchmark it to reproduce our findings.
What You’ll Need¶
- A trained TensorFlow model.
- An existing EKF deployment.
- An exposed Serving service.
- An environment with access to your EKF cluster.
Note
Follow Along: Use tensorflow==2.6.2, follow the MNIST convnet code example, and save the model locally. Triton expects the following directory structure:
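A minimal sketch of that layout, assuming the model repository directory is named tf-triton-mnist-repo (the name used in the rest of this guide) and the model is exported as a TensorFlow SavedModel:

```
tf-triton-mnist-repo/
└── tf-triton-mnist/
    ├── config.pbtxt
    └── 1/
        └── model.savedmodel/
            ├── saved_model.pb
            └── variables/
```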
For the above example, config.pbtxt
should look like:
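A minimal sketch, assuming the tensor names of the Keras MNIST convnet: input_1 for the input, and an output shown here as dense. The output name depends on your saved model, so verify it, for example with saved_model_cli show --dir <model_dir> --all.

config.pbtxt:

```
name: "tf-triton-mnist"
platform: "tensorflow_savedmodel"
max_batch_size: 0
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ -1, 28, 28, 1 ]
  }
]
output [
  {
    name: "dense"  # verify against your SavedModel's serving signature
    data_type: TYPE_FP32
    dims: [ -1, 10 ]
  }
]
```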
Infrastructure¶
The infrastructure of our experiment consisted of an EKS cluster with two node groups:
- general-workers: m5d.4xlarge (16 vCPUs, 64 GiB RAM) for EKF and clients.
- huge-workers: m5d.16xlarge (64 vCPUs, 256 GiB RAM) for serving Pods and IGW.
Inference Service¶
The testbed uses a simple inference service with a Triton server as the predictor, serving a TensorFlow model trained on the MNIST dataset. This guide assumes that the model name is tf-triton-mnist. The internal URL of the inference service is http://tf-triton-mnist.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer.
Note
Follow Along: Deploy each of the introduced manifests in the kubeflow-user namespace. You can do this by copying the file contents in each of the following steps and pasting them as input to the following command:
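For example, assuming your kubectl context points at the EKF cluster, you can pipe each manifest to kubectl apply (paste the manifest, then press Ctrl+D):

```console
$ kubectl apply -n kubeflow-user -f -
```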
You can do this from an environment with access to your EKF cluster and
kubectl
installed, which could be your management environment or a notebook server in your Arrikto EKF deployment, for example.
Create a PVC:
pvc.yaml:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-triton-mnist
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: rok
  volumeMode: Filesystem
---
apiVersion: kubeflow.org/v1alpha1
kind: PVCViewer
metadata:
  name: tf-triton-mnist
spec:
  pvc: tf-triton-mnist
```

Copy your trained model inside the PVC:
```console
$ kubectl cp tf-triton-mnist-repo kubeflow-user/pvc-viewer-tf-triton-mnist:/tf-triton-mnist/
```

Create the inference service using the PVC:
isvc.yaml:

```yaml
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "tf-triton-mnist"
  annotations:
    sidecar.istio.io/inject: "true"
    proxy.istio.io/config: |
      concurrency: 0
    # If this setting is 0, then Activator will be in the request path only
    # when the revision is scaled to 0.
    autoscaling.knative.dev/targetBurstCapacity: "0"
    autoscaling.knative.dev/target: "10"
    # This does not work as expected
    #queue.sidecar.serving.knative.dev/resourcePercentage: "30"
    sidecar.istio.io/proxyCPULimit: 4000m
    sidecar.istio.io/proxyCPU: 4000m
spec:
  predictor:
    containerConcurrency: 0
    minReplicas: 0
    maxReplicas: 10
    triton:
      storageUri: "pvc://tf-triton-mnist/tf-triton-mnist-repo"
      runtimeVersion: 21.10-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
      resources:
        limits:
          cpu: 20
          memory: 8Gi
        requests:
          cpu: 20
          memory: 8Gi
```
Clients¶
This experiment uses jobs running the hey
client to hit the inference
service and make predictions. The client runs in the same cluster and the same
namespace as the inference service.
Note
Follow Along: Deploy each of the introduced manifests in your namespace. You can do this by copying the file contents in each of the following steps and pasting them as input to the following command:
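As before, a command like the following should work (adjust the namespace if yours is not kubeflow-user):

```console
$ kubectl apply -n kubeflow-user -f -
```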
You can do this from an environment with access to your EKF cluster and
kubectl
installed, which could be your management environment or a notebook server in your Arrikto EKF deployment, for example.
Create a Service for direct communication with the Triton server:
service.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: triton
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  - name: http-queue
    port: 8012
    protocol: TCP
    targetPort: 8012
  - name: http-usermetric
    port: 9091
    protocol: TCP
    targetPort: 9091
  selector:
    serving.kubeflow.org/inferenceservice: tf-triton-mnist
    component: predictor
  type: ClusterIP
```

Add an explicit AuthorizationPolicy for direct communication with the Triton server:
ap.yaml:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: triton
spec:
  action: ALLOW
  rules:
  - to:
    - operation:
        hosts:
        - triton
        - triton.kubeflow-user.svc.cluster.local
        - tf-triton-mnist
        - tf-triton-mnist.kubeflow-user.svc.cluster.local
```

Note
This is primarily because we use clients without an Istio sidecar.
Create a ConfigMap with a sample MNIST image to use as inference input:
configmap.yaml1 apiVersion: v1 2 kind: ConfigMap 3 metadata: 4 name: mnist-input 5 data: 6 mnist_input.json: '{"model_name": "tf-triton-mnist", "model_version": "1", "inputs": [{"name": "input_1", "datatype": "FP32", "shape": [1, 28, 28, 1], "data": [[[[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.3294117748737335], [0.7254902124404907], [0.6235294342041016], [0.5921568870544434], [0.23529411852359772], [0.1411764770746231], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.8705882430076599], [0.9960784316062927], [0.9960784316062927], [0.9960784316062927], [0.9960784316062927], [0.9450980424880981], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.7764706015586853], [0.6666666865348816], [0.20392157137393951], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.26274511218070984], [0.4470588266849518], [0.2823529541492462], [0.4470588266849518], [0.6392157077789307], [0.8901960849761963], [0.9960784316062927], [0.8823529481887817], [0.9960784316062927], [0.9960784316062927], [0.9960784316062927], [0.9803921580314636], [0.8980392217636108], [0.9960784316062927], [0.9960784316062927], [0.5490196347236633], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.06666667014360428], [0.25882354378700256], [0.054901961237192154], [0.26274511218070984], [0.26274511218070984], [0.26274511218070984], [0.23137255012989044], [0.08235294371843338], [0.9254902005195618], [0.9960784316062927], [0.4156862795352936], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.32549020648002625], [0.9921568632125854], [0.8196078538894653], [0.07058823853731155], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], 
[0.0], [0.08627451211214066], [0.9137254953384399], [1.0], [0.32549020648002625], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.5058823823928833], [0.9960784316062927], [0.9333333373069763], [0.1725490242242813], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.23137255012989044], [0.9764705896377563], [0.9960784316062927], [0.24313725531101227], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.5215686559677124], [0.9960784316062927], [0.7333333492279053], [0.019607843831181526], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.03529411926865578], [0.8039215803146362], [0.9725490212440491], [0.22745098173618317], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.4941176474094391], [0.9960784316062927], [0.7137255072593689], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.29411765933036804], [0.9843137264251709], [0.9411764740943909], [0.2235294133424759], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.07450980693101883], [0.8666666746139526], [0.9960784316062927], [0.6509804129600525], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0117647061124444], [0.7960784435272217], [0.9960784316062927], [0.8588235378265381], [0.13725490868091583], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.14901961386203766], [0.9960784316062927], [0.9960784316062927], [0.3019607961177826], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.12156862765550613], [0.8784313797950745], [0.9960784316062927], [0.45098039507865906], [0.003921568859368563], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.5215686559677124], [0.9960784316062927], [0.9960784316062927], [0.20392157137393951], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.239215686917305], [0.9490196108818054], [0.9960784316062927], [0.9960784316062927], [0.20392157137393951], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.4745098054409027], [0.9960784316062927], [0.9960784316062927], [0.8588235378265381], [0.1568627506494522], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], 
[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.4745098054409027], [0.9960784316062927], [0.8117647171020508], [0.07058823853731155], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]]]]}]}' Spawn a single job that hits the KServe internal endpoint:
job.yaml:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-triton-load-test
spec:
  parallelism: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node.kubernetes.io/unschedulable
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: serving.kubeflow.org/inferenceservice
                operator: In
                values:
                - tf-triton-mnist
            topologyKey: "kubernetes.io/hostname"
      nodeSelector:
        role: general-workers
      containers:
      - name: hey
        image: gcr.io/arrikto-playground/hey:omri1
        args:
        - -t
        - "0"
        - -z
        - 30s
        - -c
        - "50" # concurrency
        - -m
        - POST
        - -D
        - /app/mnist_input.json
        # Kserve internal URL
        - http://tf-triton-mnist.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer
        # Triton
        #- http://triton.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer
        # Knative queue-proxy
        #- http://triton.kubeflow-user.svc.cluster.local:8012/v2/models/tf-triton-mnist/infer
        # Kserve external URL
        #- -H
        #- "Authorization: Bearer TOKEN"
        #- -disable-redirects
        #- https://tf-triton-mnist-kubeflow-user.serving.example.com/v2/models/tf-triton-mnist/infer
        imagePullPolicy: Always
        volumeMounts:
        - name: input
          mountPath: /app/mnist_input.json
          subPath: mnist_input.json
        # 1x50 uses 210% CPU
        # 1x75 uses 283% CPU
        # 1x100 uses 341% CPU
        # 1x150 uses 445% CPU
        resources:
          limits:
            cpu: 3
            memory: 1Gi
          requests:
            cpu: 3
            memory: 1Gi
      restartPolicy: Never
      volumes:
      - name: input
        configMap:
          name: mnist-input
          defaultMode: 511
```
Hey Usage¶
The job above uses the hey
client with the following options:
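For reference, the equivalent standalone invocation would look roughly like this, assuming mnist_input.json is saved in the current directory:

```console
$ hey -t 0 -z 30s -c 50 -m POST -D mnist_input.json \
    http://tf-triton-mnist.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer
```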
Specifically:
- -t 0 for infinite request timeout
- -z 30s for the duration of the job to send requests
- -c 50 for the number of workers to run concurrently
The above manifest also uses the following parameters that you can adjust according to your needs.
Concurrency¶
The job above uses concurrency 50. You can change that using the following:
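For example, to use 75 concurrent workers, edit the -c argument in job.yaml:

```yaml
args:
# ...
- -c
- "75" # concurrency
```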
If you want more concurrency, ensure that you also adjust the resource requests and limits of the Job, since the client will need more CPU. 3 CPUs should be enough for concurrency up to 75.
Parallelism¶
Change the parallelism field of the Job to run more client Pods concurrently:
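For example, to run three client Pods concurrently:

```yaml
spec:
  parallelism: 3
```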
Note
To aggregate results, make sure that all of the Pods start at the same time, that is, none remains Pending due to lack of resources.
Prediction URL¶
To compare the performance overhead of each component, change the inference URL so that requests hit a specific component directly. Here are some examples:
Hit the Triton server directly:
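For example, switch the URL argument in job.yaml to the Service you created above (this is one of the commented-out alternatives in the manifest):

```yaml
- http://triton.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer
```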
Hit the Knative queue-proxy directly:
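Use the queue-proxy port (8012) of the same Service:

```yaml
- http://triton.kubeflow-user.svc.cluster.local:8012/v2/models/tf-triton-mnist/infer
```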
Hit the KServe internal URL:
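This is the default URL in job.yaml:

```yaml
- http://tf-triton-mnist.kubeflow-user.svc.cluster.local/v2/models/tf-triton-mnist/infer
```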
Hit the KServe external URL:
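Uncomment the corresponding arguments in job.yaml; the hostname below assumes the serving.example.com domain used in the manifest, so adjust it to your deployment:

```yaml
- -H
- "Authorization: Bearer <TOKEN>"
- -disable-redirects
- https://tf-triton-mnist-kubeflow-user.serving.example.com/v2/models/tf-triton-mnist/infer
```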
Replace <TOKEN>
with a short-lived token that you obtain following the
Access Services with External Clients guide.
See also
- Manage Networking to allow traffic originating from inside the cluster.
Output¶
After the job completes, inspect each Pod's logs to see what hey reports:
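For example, assuming the Job name from the manifest above (output abbreviated; the full report contains more fields):

```console
$ kubectl logs -n kubeflow-user job/mnist-triton-load-test

Summary:
  Total:        ...
  Requests/sec: ...
  ...

  Latency distribution:
  ...
  95% in ... secs
  ...
```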
From the above output note:
- RPS (Requests/sec in Summary)
- p95 LAT (95% in Latency distribution)
What’s Next¶
To learn more about Serving on EKF, check out the rest of our user guides for Serving.