Configure Serving for Better Performance
This guide provides insight on how to change the default setup and configure serving components for better performance.
When you specify a Pod, you can also specify how much of each resource a container needs. When you specify the resource request for containers in a Pod, the scheduler uses it to decide which node to place the Pod on. When you specify a resource limit for a container, the running container is not allowed to use more of that resource than the limit you set. For CPU-bound workloads, you ideally need to ensure guaranteed Quality of Service (QoS), that is, have the resource request equal to its limit. Diverging request and limit values for the CPU resource will result in:
- overcommitted nodes due to bad scheduling, and
- CPU starvation, because Pods with higher requests will get more CPU time.
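For example, here is a minimal sketch of an InferenceService whose predictor container falls into the Guaranteed QoS class; the name, storage URI, and resource values are placeholders, and older KFServing releases use the serving.kubeflow.org/v1beta1 apiVersion:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    sklearn:
      storageUri: gs://my-bucket/my-model
      resources:
        requests:
          cpu: "1"      # request == limit -> Guaranteed QoS
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
```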
The numbers used below are just examples; you may need to adjust them to suit your needs. See our Performance Evaluation to find out how these numbers affect performance.
By default, istio-proxy will run with the following resources:
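In a stock Istio installation, these defaults typically look like the following; the exact values depend on your Istio version and mesh configuration:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 2000m
    memory: 1024Mi
```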
Use the following annotations in your InferenceService resource to override the default configuration:
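A sketch using Istio's sidecar resource annotations; the values are examples:

```yaml
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "1"           # CPU request of the istio-proxy sidecar
    sidecar.istio.io/proxyCPULimit: "1"      # CPU limit of the istio-proxy sidecar
    sidecar.istio.io/proxyMemory: 512Mi      # memory request of the istio-proxy sidecar
    sidecar.istio.io/proxyMemoryLimit: 512Mi # memory limit of the istio-proxy sidecar
```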
Even if you set a higher CPU limit, the upper limit of the CPU usage will be 200%, because Istio runs Envoy with --concurrency 2. To override this, use the following annotation in your InferenceService resource:
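One way to do this is the proxy.istio.io/config annotation; setting concurrency to 0 makes Envoy size itself based on all the cores available to it:

```yaml
metadata:
  annotations:
    proxy.istio.io/config: |
      concurrency: 0
```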
By using the above annotation along with a CPU limit of 4000m, for example, the istio-proxy sidecar ends up without the --concurrency flag. As a result, the Envoy process inside the istio-proxy sidecar sizes its worker threads based on all the CPUs available on the node.

The Envoy proxy inside the Istio sidecar will end up spawning 2 x concurrency + 10 threads. The Envoy proxy inside the IGW (ingress gateway), which runs without --concurrency, will end up taking all the available CPUs into account and spawning 2 x CPUs + 10 threads.
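For example, with the default --concurrency 2 the sidecar's Envoy spawns 2 x 2 + 10 = 14 threads, while an Envoy running without --concurrency on a 16-CPU node spawns 2 x 16 + 10 = 42 threads.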
By default, the queue-proxy sidecar will use the following resources:
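In a default Knative Serving installation this typically amounts to the following; note the very low request and the absence of limits:

```yaml
resources:
  requests:
    cpu: 25m
```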
Based on the above findings, besides being misleading for scheduling, the queue-proxy may also starve on a node with high CPU contention.
If you use the queue.sidecar.serving.knative.dev/resourcePercentage annotation in your InferenceService resource, you will end up with at most a 500m CPU limit, which turns out to be a hardcoded maximum in the Knative Serving code.
To override this, configure this resource globally in the config-deployment ConfigMap by setting:
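A sketch, assuming the queue-sidecar-* keys supported by recent Knative Serving releases; key names may differ in older versions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  queue-sidecar-cpu-request: "1"
  queue-sidecar-cpu-limit: "1"
  queue-sidecar-memory-request: 512Mi
  queue-sidecar-memory-limit: 512Mi
```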
This hardcoded maximum is also why the queue.sidecar.serving.knative.dev/resourcePercentage annotation doesn't work as expected.
There are two Istio IGWs involved for serving:
- knative-serving-ingressgateway for the external URL
- knative-serving-cluster-ingressgateway for the internal URL
Both are Deployments in the istio-system namespace.
By default both come with the following resources:
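As with the sidecar, a stock installation typically gives each gateway the following; the exact values depend on your Istio version:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 2000m
    memory: 1024Mi
```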
First, make sure that these deployments have proper resources and ideally are guaranteed:
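For example, set requests equal to limits in the proxy container of each gateway Deployment; the values are illustrative:

```yaml
resources:
  requests:
    cpu: "2"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 1Gi
```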
Install Kubernetes Metrics Server and then create HorizontalPodAutoscaler resources for the two deployments to autoscale based on their CPU load:
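A sketch for one of the two Deployments; repeat for the other, and adjust the thresholds to your workload (older clusters may need apiVersion autoscaling/v2beta2):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: knative-serving-ingressgateway
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: knative-serving-ingressgateway
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```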
Update the deployments with the following podAntiAffinity so that you won't run more than one replica on each node:
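A sketch of the relevant Deployment fragment; adjust the matchLabels to the Pod labels of each gateway Deployment:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: knative-serving-ingressgateway  # placeholder label
            topologyKey: kubernetes.io/hostname
```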
This section provides insight on how to configure the activator to be in the request path only when your model is scaled to zero, and how to configure autoscaling so that your model replicas scale up and down based on load.
The activator is responsible for receiving and buffering requests for inactive revisions, and reporting metrics to the Autoscaler. It also retries requests to a revision after the Autoscaler scales the revision based on the reported metrics.
The activator may be in the path, depending on the revision scale and load, and on the given target burst capacity. By default, this is set to 200 in the config-autoscaler ConfigMap. The comments in this ConfigMap describe when the activator is in the request path, based on the target-burst-capacity value:
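Paraphrasing the upstream comments (the exact text lives in the config-autoscaler ConfigMap in the knative-serving namespace):

```yaml
data:
  # target-burst-capacity controls the size of traffic bursts the activator
  # should absorb. Roughly:
  #   "0"  -- the activator is in the path only when a revision is scaled to zero
  #   "-1" -- the activator is always in the path
  #   ">0" -- the activator stays in the path until the revision has enough
  #           spare capacity to absorb a burst of this size
  target-burst-capacity: "200"
```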
Based on the above, in order to have the activator in the request path only when scaled to zero, override the default with a per-inference service annotation:
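A sketch, assuming Knative's targetBurstCapacity autoscaling annotation; newer Knative releases also accept the kebab-case form autoscaling.knative.dev/target-burst-capacity:

```yaml
metadata:
  annotations:
    autoscaling.knative.dev/targetBurstCapacity: "0"
```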
With the activator removed from the path, you cannot impose a hard concurrency limit. This is the default behavior; that is, there is no limit on the number of requests that are allowed to flow into the revision.
Check if the activator is in the path by inspecting the endpoints that Knative creates. For example:
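For example, for an InferenceService named my-model in namespace my-ns; all names and IPs below are illustrative:

```bash
$ kubectl get endpoints -n my-ns
NAME                                        ENDPOINTS          AGE
my-model-predictor-default-00001            10.1.0.7:8012      5m
my-model-predictor-default-00001-private    10.0.3.21:8012     5m
```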
In the example above, you can see that the public predictor endpoint points to a different address than the private one.
Check if the public predictor endpoint points to the activator:
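The public endpoint's addresses should match the activator Pod IPs; compare them against (namespace and labels as in upstream Knative):

```bash
kubectl get pods -n knative-serving -l app=activator -o wide
```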
Check if the private predictor endpoint points to the predictor Pods:
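The private endpoint's addresses should match the predictor Pod IPs; compare them against (names are placeholders):

```bash
kubectl get pods -n my-ns -o wide | grep my-model-predictor
```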
To configure autoscaling for your inference service, you must set the soft limit for concurrency, that is, how many concurrent requests a serving replica can take. These metrics are reported by the queue-proxy sidecar, and the Autoscaler will act accordingly, regardless of whether the activator is in the path.
For example, if you want to get a new replica for every 10 concurrent requests, set the following per-inference service annotation:
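For instance, using Knative's standard soft-concurrency target annotation:

```yaml
metadata:
  annotations:
    autoscaling.knative.dev/target: "10"
```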
By default, Knative will scale down "unneeded" replicas pretty fast because of the following defaults in the config-autoscaler ConfigMap:
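These are the upstream defaults in recent Knative Serving releases; the values may differ in your installation:

```yaml
data:
  stable-window: "60s"              # window over which request metrics are averaged
  scale-to-zero-grace-period: "30s" # how long the last replica is kept around
  scale-down-delay: "0s"            # replicas are removed as soon as the decision is made
```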
As shown in Serving Architecture, in the case of the external URL, AuthService will be in the path and check that every request is authenticated. In the case of service account token authentication, it will hit the Kubernetes API, which adds significant latency. EKF 1.5 supports authentication caching, which you can enable by following the Enable AuthService Caching Mechanism guide.