Configure Serving for Better Performance

This guide provides insight on how to change the default setup and configure serving components for better performance.


When you specify a Pod, you can also specify how much of each resource a container needs. When you specify the resource request for the containers in a Pod, the scheduler uses it to decide which node to place the Pod on. When you specify a resource limit for a container, the running container is not allowed to use more of that resource than the limit you set. For CPU-bound workloads, you should ideally ensure guaranteed Quality of Service (QoS), that is, set the resource request equal to its limit. Low CPU request values will result in

  • overcommitted nodes due to bad scheduling, and
  • CPU starvation because Pods with more requests will get more CPU time.
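For a guaranteed QoS class on the CPU resource, set the container's request equal to its limit. A minimal sketch of such a resources section (the values are illustrative):

```yaml
# Guaranteed QoS for CPU: request equals limit, so the scheduler
# reserves exactly what the container is allowed to use.
resources:
  requests:
    cpu: "4"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 4Gi
```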


The numbers used below are just examples. You may need to adjust them to suit your needs. See our Performance Evaluation to find out how these numbers affect the performance.

istio-proxy Sidecar

By default, istio-proxy will run with the following resources:

```yaml
resources:
  limits:
    cpu: "2"
    memory: 1Gi
  requests:
    cpu: 10m
    memory: 40Mi
```

Use the following annotations in your InferenceService resource to override the default configuration:

```yaml
sidecar.istio.io/proxyCPU: "2000m"
sidecar.istio.io/proxyCPULimit: "2000m"
sidecar.istio.io/proxyMemory: "40Mi"
sidecar.istio.io/proxyMemoryLimit: "1Gi"
```

Even if you set a higher CPU limit, the upper limit of the CPU usage will be 200% because Istio runs envoy with --concurrency 2. To override this, use the following annotation in your InferenceService resource:

```yaml
proxy.istio.io/config: |
  concurrency: 0
```

By using the above annotation, along with sidecar.istio.io/proxyCPULimit: "4000m", for example, the istio-proxy sidecar starts without an explicit --concurrency. However, the envoy process inside the istio-proxy sidecar ends up running with --concurrency 4, that is, one worker thread per CPU of the limit.
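Putting the above together, the metadata of an InferenceService might look as follows. This is a sketch: the annotations are the standard Istio sidecar resource annotations, and the service name is hypothetical.

```yaml
metadata:
  name: my-model                             # hypothetical name
  annotations:
    sidecar.istio.io/proxyCPU: "4000m"       # istio-proxy CPU request
    sidecar.istio.io/proxyCPULimit: "4000m"  # istio-proxy CPU limit
    proxy.istio.io/config: |                 # lift the --concurrency 2 cap
      concurrency: 0
```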


The envoy proxy inside the Istio sidecar will end up spawning 2 x concurrency + 10 threads. The envoy proxy inside the Istio Ingress Gateway (IGW), which runs without --concurrency, will end up taking all the available CPUs and spawning 2 x CPUs + 10 threads.

queue-proxy Sidecar

By default, the queue-proxy sidecar will use the following resources:

```yaml
resources:
  requests:
    cpu: 25m
```

Based on the above findings, besides being misleading for scheduling, the queue-proxy may also starve on a node with heavy CPU contention.

If you use the queue.sidecar.serving.knative.dev/resourcePercentage annotation in your InferenceService resource, you will end up with a 500m CPU limit, which turns out to be a hardcoded maximum:

```yaml
resources:
  limits:
    cpu: 500m
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```
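For reference, the annotation is set in the InferenceService metadata like this. This is a sketch; the value 20 is an arbitrary example percentage of the user container's resources.

```yaml
metadata:
  annotations:
    # Size the queue-proxy sidecar as a percentage of the
    # user container's resources (illustrative value).
    queue.sidecar.serving.knative.dev/resourcePercentage: "20"
```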

To override this, configure this resource globally in the config-deployment ConfigMap by setting:

```yaml
queueSidecarCPULimit: 6000m
queueSidecarCPURequest: 6000m
```
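In ConfigMap form, the override looks like this (a sketch showing only the relevant data keys):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Give queue-proxy guaranteed CPU: request equals limit.
  queueSidecarCPULimit: 6000m
  queueSidecarCPURequest: 6000m
```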


Istio Ingress Gateway

There are two Istio IGWs involved for serving:

  • knative-serving-ingressgateway for the external URL
  • knative-serving-cluster-ingressgateway for the internal URL

Both are Deployments in the knative-serving namespace.

By default both come with the following resources:

```yaml
resources:
  limits:
    cpu: 2000m
    memory: 1024Mi
  requests:
    cpu: 100m
    memory: 128Mi
```

First, make sure that these deployments have proper resources and ideally have guaranteed QoS, at least for CPU:

```yaml
resources:
  limits:
    cpu: "8"
    memory: 1Gi
  requests:
    cpu: "8"
    memory: 128Mi
```

Install Kubernetes Metrics Server and then create HorizontalPodAutoscaler resources for the two deployments to autoscale based on their CPU load:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: knative-serving-cluster-ingressgateway
    install.operator.istio.io/owning-resource: unknown
    istio: knative-serving-cluster-ingressgateway
    istio.io/rev: default
    operator.istio.io/component: IngressGateways
    release: istio
  name: knative-serving-cluster-ingressgateway
  namespace: knative-serving
spec:
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      targetAverageUtilization: 90
    type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: knative-serving-cluster-ingressgateway
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: knative-serving-ingressgateway
    install.operator.istio.io/owning-resource: unknown
    istio: knative-serving-ingressgateway
    istio.io/rev: default
    operator.istio.io/component: IngressGateways
    release: istio
  name: knative-serving-ingressgateway
  namespace: knative-serving
spec:
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      targetAverageUtilization: 90
    type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: knative-serving-ingressgateway
```

Update the deployments with the following podAntiAffinity (shown for knative-serving-cluster-ingressgateway) so that no more than one replica runs on each node:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - knative-serving-cluster-ingressgateway
              topologyKey: ""
```


This section provides insight on how to configure the activator to be in the request path only when your model is scaled to zero, and how to configure autoscaling so that your model replicas scale up and down based on the load.


The activator is responsible for receiving and buffering requests for inactive revisions, and reporting metrics to the Autoscaler. It also retries requests to a revision after the Autoscaler scales the revision based on the reported metrics.

The activator may be in the path, depending on the revision scale and load, and on the given target burst capacity. By default this is set to 200 in the config-autoscaler ConfigMap:

```yaml
target-burst-capacity: "200"
container-concurrency-target-percentage: "70"
```

The following comments in the ConfigMap provide information on whether the activator is in the request path based on the target-burst-capacity value:

```
# If target-burst-capacity is 0, then Activator will be in the request path only
# when the revision is scaled to 0.
# If target-burst-capacity is > 0 and container-concurrency-target-percentage is
# 100%, then the activator will always be in the request path.
# -1 denotes unlimited target-burst-capacity and activator will always be in the
# request path.
```

Based on the above, in order to have the activator in the request path only when the revision is scaled to zero, override the default with a per-InferenceService annotation:

```yaml
autoscaling.knative.dev/target-burst-capacity: "0"
```


If you remove the activator from the request path, you cannot impose a hard concurrency limit. This is the default behavior, that is, there is no limit on the number of requests that are allowed to flow into the revision.
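If you do need a hard limit, KServe exposes it through the containerConcurrency field of the component spec. A sketch:

```yaml
spec:
  predictor:
    # Hard limit on in-flight requests per replica; requests beyond
    # this are queued rather than forwarded to the container.
    containerConcurrency: 10
```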

Check if the activator is in the path by inspecting the endpoints that Knative creates. For example:

```
$ kubectl get endpoints -n kubeflow-user
NAME                                              ENDPOINTS        AGE
...
tf-triton-mnist-predictor-default-00001                            77s
tf-triton-mnist-predictor-default-00001-private   ,, + 7 more...   77s
triton                                            ,, + 3 more...   54d
```

In the example above, you can see that the public predictor endpoint points to a different address than the private one.

Check if the public predictor endpoint points to the activator:

```
$ kubectl get endpoints -n kubeflow-user tf-triton-mnist-predictor-default-00001 -o yaml
...
subsets:
- addresses:
  - ip:
    nodeName:
    targetRef:
      kind: Pod
      name: activator-d4f945ff4-7qzd7
      namespace: knative-serving
      resourceVersion: "33458254"
      uid: eca4343b-72f2-4598-ae3e-c9c437255666
  ports:
  - name: http
    port: 8012
    protocol: TCP
```

Check if the private predictor endpoint points to the predictor Pods:

```
$ kubectl get endpoints -n kubeflow-user tf-triton-mnist-predictor-default-00001-private -o yaml
...
subsets:
- addresses:
  - ip:
    nodeName:
    targetRef:
      kind: Pod
      name: tf-triton-mnist-predictor-default-00001-deployment-65b9bccrsr9j
      namespace: kubeflow-user
      resourceVersion: "33554913"
      uid: 327b8291-07b2-4a0c-b084-c3a9eef79a31
```


To configure autoscaling for your inference service, you must set the soft limit for concurrency, that is, how many concurrent requests a serving replica can take. These metrics are reported by the queue-proxy sidecar, and the Autoscaler acts on them regardless of whether the activator is in the path.

For example, if you want to get a new replica for every 10 concurrent requests, set the following per-InferenceService annotation:

```yaml
autoscaling.knative.dev/target: "10"
```

By default, Knative will scale down "unneeded" replicas quickly because of the following defaults in the config-autoscaler ConfigMap:

```yaml
scale-to-zero-grace-period: "30s"
scale-down-delay: "0s"
```
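To keep replicas around longer, you can raise these values in the same ConfigMap. A sketch with illustrative values:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Wait longer before removing the last replica.
  scale-to-zero-grace-period: "60s"
  # Keep "unneeded" replicas for 15 minutes before scaling down.
  scale-down-delay: "15m"
```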

AuthService Caching

As shown in Serving Architecture, in the case of the external URL, AuthService is in the path and checks whether every request is authenticated. In the case of service account token authentication, it hits the Kubernetes API, which imposes significant latency. EKF 1.5 supports authentication caching, which you can enable by following the Enable AuthService Caching Mechanism guide.


In this guide you gained insight on how to configure serving for better performance.

What’s Next

The next step is to measure Serving performance.