Configure Serving for Better Performance¶

This guide provides insight on how to change the default setup and configure serving components for better performance.

Overview

Resources
Knative
- Activator
- Autoscaling
AuthService Caching
Summary
What’s Next

Resources ¶

When you specify a Pod, you can also specify how much of each resource a container needs. When you specify the resource request for containers in a Pod, the scheduler uses it to decide which node to place the Pod on. When you specify a resource limit for a container, the running container is not allowed to use more of that resource than the limit you set. In case of CPU bound workloads, you ideally need to ensure guaranteed Quality of Service (QoS), that is, have the resource request equal to its limit. Low request values of the CPU resource will result in

overcommitted nodes due to bad scheduling, and
CPU starvation because Pods with more requests will get more CPU time.

See also

How Kubernetes applies resource requests and limits.

Important

The numbers used below are just examples. You may need to adjust them to suit your needs. See our Performance Evaluation to find out how these numbers affect the performance.

istio-proxy Sidecar ¶

By default, istio-proxy will run with the following resources:

resources:
  limits:
    cpu: "2"
    memory: 1Gi
  requests:
    cpu: 10m
    memory: 40Mi

Use the following annotations in your InferenceService resource to override the default configuration:

sidecar.istio.io/proxyCPU: 2000m
sidecar.istio.io/proxyCPULimit: 2000m
sidecar.istio.io/proxyMemory: 40Mi
sidecar.istio.io/proxyMemoryLimit: 1Gi

Even if you set a higher CPU limit, the upper limit of the CPU usage will be 200% because Istio will run envoy with --concurrency 2. To override this, use the following annotation in your InferenceService resource:

proxy.istio.io/config: |
  concurrency: 0

By using the above annotation, along with sidecar.istio.io/proxyCPULimit: 4000m, for example, the istio-proxy sidecar ends up without --concurrency. However, the envoy process inside the istio-proxy sidecar ends up running with --concurrency 4.

Note

The envoy proxy inside the Istio sidecar will end up spawning 2 x concurrency + 10 threads. The envoy proxy inside the IGW that runs without --concurrency will end up taking all the available CPUs and spawn 2 x CPU + 10 threads.

See also

GitHub issue regarding self-contradicting documentation for concurrency.

queue-proxy Sidecar ¶

By default, the queue-proxy sidecar will use the following resources:

resources:
  requests:
    cpu: 25m

Based on the above findings, beside the fact that this is misleading for scheduling, it may also starve on a node with much CPU contention.

If you use the queue.sidecar.serving.knative.dev/resourcePercentage annotation in your InferenceService resource, you will end up with an 500m CPU limit, which turns out to be a hardcoded maximum:

resources:
   limits:
     cpu: 500m
     memory: 500Mi
   requests:
     cpu: 100m
     memory: 200Mi

To override this, configure this resource globally in the config-deployment ConfigMap by setting:

queueSidecarCPULimit: 6000m
queueSidecarCPURequest: 6000m

See also

queue.sidecar.serving.knative.dev/resourcePercentage annotation doesn’t work as expected.

Istio Ingress Gateway ¶

There are two Istio IGWs involved for serving:

knative-serving-ingressgateway for the external URL
knative-serving-cluster-ingressgateway for the internal URL

Both are Deployments in the knative-serving namespace.

By default both come with the following resources:

resources:
  limits:
    cpu: 2000m
    memory: 1024Mi
  requests:
    cpu: 100m
    memory: 128Mi

First, make sure that these deployments have proper resources and ideally are guaranteed:

resources:
  limits:
    cpu: "8"
    memory: 1Gi
  requests:
    cpu: "8"
    memory: 128Mi

See also

Performance Evaluation of the IGW.

Install Kubernetes Metrics Server and then create HorizontalPodAutoscaler resources for the two deployments to autoscale based on their CPU load:

hpa.yaml

1apiVersion: autoscaling/v2beta1
2kind: HorizontalPodAutoscaler
3metadata:
4-46
labels:
  app: knative-serving-cluster-ingressgateway
  install.operator.istio.io/owning-resource: unknown
  istio: knative-serving-cluster-ingressgateway
  istio.io/rev: default
  operator.istio.io/component: IngressGateways
  release: istio
name: knative-serving-cluster-ingressgateway
namespace: knative-serving
13spec:
maxReplicas: 10
metrics:
- resource:
    name: cpu
    targetAverageUtilization: 90
  type: Resource
minReplicas: 1
scaleTargetRef:
  apiVersion: apps/v1
  kind: Deployment
  name: knative-serving-cluster-ingressgateway
25---
26apiVersion: autoscaling/v2beta1
27kind: HorizontalPodAutoscaler
28metadata:
labels:
  app: knative-serving-ingressgateway
  install.operator.istio.io/owning-resource: unknown
  istio: knative-serving-ingressgateway
  istio.io/rev: default
  operator.istio.io/component: IngressGateways
  release: istio
name: knative-serving-ingressgateway
namespace: knative-serving
38spec:
maxReplicas: 10
metrics:
- resource:
    name: cpu
    targetAverageUtilization: 90
  type: Resource
minReplicas: 1
scaleTargetRef:
  apiVersion: apps/v1
  kind: Deployment
  name: knative-serving-ingressgateway

Update the deployment with the following podAntiAffinity so that you won’t run more than one replica on each node:

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - knative-serving-cluster-ingressgateway
              topologyKey: "kubernetes.io/hostname"

See also

Knative ¶

This section provides insight on how to configure the activator to be in the request path only when your model is scaled to zero and how you can configure autoscaling so that your model replicas will scale up and down based on the load.

Activator ¶

The activator is responsible for receiving and buffering requests for inactive revisions, and reporting metrics to the Autoscaler. It also retries requests to a revision after the Autoscaler scales the revision based on the reported metrics.

The activator may be in the path, depending on the revision scale and load, and on the given target burst capacity. By default this is set to 200 in the config-autoscaler ConfigMap:

target-burst-capacity: "200"
container-concurrency-target-percentage: "70"

The following comments in the ConfigMap provide information on whether the activator is in the request path based on the target-burst-capacity value:

# If target-burst-capacity is 0, then Activator will be in the request path only
# when the revision is scaled to 0.
# If target-burst-capacity is > 0 and container-concurrency-target-percentage is
# 100%, then the activator will always be in the request path.
# -1 denotes unlimited target-burst-capacity and activator will always be in the
# request path.

Based on the above, in order to have the activator in the request path only when scaled to zero, override the default with a per-inference service annotation:

autoscaling.knative.dev/targetBurstCapacity: "0"

Note

Removing activator from the path, you cannot impose a hard concurrency. This is the default behavior, that is, there is no limit on the number of requests that are allowed to flow into the revision.

Check if the activator is in the path by inspecting the endpoints that Knative creates. For example:

$ kubectl get endpoints -n kubeflow-user
...
tf-triton-mnist-predictor-default-00001           192.168.166.11:8012                                                          77s
tf-triton-mnist-predictor-default-00001-private   192.168.141.243:9091,192.168.186.159:9091,192.168.141.243:8012 + 7 more...   77s
triton                                            192.168.141.243:9091,192.168.186.159:9091,192.168.141.243:8012 + 3 more...   54d

In the example above, you can see that the public predictor endpoint points to a different address than the private one.

Check if the public predictor endpoint points to the activator:

$ kubectl get endpoints -n kubeflow-user tf-triton-mnist-predictor-default-00001 -o yaml
...
subsets:
- addresses:
  - ip: 192.168.166.11
    nodeName: ip-192-168-163-133.eu-central-1.compute.internal
    targetRef:
      kind: Pod
      name: activator-d4f945ff4-7qzd7
      namespace: knative-serving
      resourceVersion: "33458254"
      uid: eca4343b-72f2-4598-ae3e-c9c437255666
  ports:
  - name: http
    port: 8012
    protocol: TCP

Check if the private predictor endpoint points the predictor Pods:

$ kubectl get endpoints -n kubeflow-user tf-triton-mnist-predictor-default-00001-private -o yaml
...
subsets:
- addresses:
  - ip: 192.168.141.243
    nodeName: ip-192-168-172-121.eu-central-1.compute.internal
    targetRef:
      kind: Pod
      name: tf-triton-mnist-predictor-default-00001-deployment-65b9bccrsr9j
      namespace: kubeflow-user
      resourceVersion: "33554913"
      uid: 327b8291-07b2-4a0c-b084-c3a9eef79a31

See also

Activator component.

Autoscaling ¶

To configure autoscaling for your inference service, you must set the soft limit for the concurrency, that is, how many concurrent requests a serving replica can take. These metrics are reported by the queue-proxy sidecar, and the Autoscaler will act accordingly regardless if the activator is in the path.

For example, if you want to get a new replica for every 10 concurrent requests, set the following per-inference service annotation:

autoscaling.knative.dev/target: 10

By default Knative will scale down “unneeded” replicas pretty fast because of the following defaults in config-autoscaler ConfigMap:

scale-to-zero-grace-period: "30s"
scale-down-delay: "0s"

See also

AuthService Caching ¶

As shown in Serving Architecture, in case of the External URL, AuthService will be in the path and check if every request is authenticated. In the case of service account token authentication, it will hit the Kubernetes API which imposes significant latency. EKF 1.5 supports authentication caching that you can enable by following the Enable AuthService Caching Mechanism guide.

Summary ¶

In this guide you gained insight on how to configure serving for better performance.

What’s Next ¶

The next step is to measure Serving performance.

Performance Evaluation