Scaling

Running EKF on hundreds of nodes enables your team to efficiently run distributed ML workflows at scale. However, scaling an EKF cluster to hundreds of nodes comes with multiple challenges, such as optimizing resource usage, eliminating bottlenecks, and parallelizing operations where possible.

This document provides useful information, metrics, and conclusions about scaling an EKF cluster to hundreds of nodes on AWS.

Upper Bound

Arrikto officially supports scaling a cluster up to a maximum of 100 nodes. Scaling up to 200 nodes is supported upon special request, by submitting a support ticket at https://support.arrikto.com. Scaling beyond 200 nodes is not supported and violates Arrikto’s support terms; such a cluster is considered “out of support” and support will be offered on a “best effort” basis.

Contact Arrikto

Scaling a cluster to more than 100 nodes can pose a number of challenges. Before you attempt to scale your cluster to more than 100 nodes, please contact the Arrikto Tech Team so that we can provide all the required information and support.

At the time of writing we have tested scaling an EKF cluster up to 300 nodes using m5d.4xlarge instances on EKS. We performed the test in the Arrikto Labs by creating a machine learning pipeline. This pipeline forced the cluster to start scaling up from 3 nodes to 300. When the ML pipeline completed, the cluster scaled back down to 3 nodes.

Important

Scaling a cluster up to 300 nodes was an internal test performed by Arrikto for experimentation purposes. Arrikto officially supports scaling a cluster up to 100 nodes, or up to 200 nodes upon special request.

Time to Scale

Below you can view a chart that shows the time needed to scale an EKF cluster to 100, 200, and 300 nodes. For our measurements, in addition to Rok, we deployed the following:

  • A lightweight DaemonSet whose Pods become Ready almost instantly after they start (see the sketch after this list). We use this as a baseline for a side-by-side comparison with the time Rok and Rok CSI need to become ready, since these also run as DaemonSets.
  • A Jupyter Kale Deployment with hundreds of replicas that all mount and write to a single RWX Rok volume. We use this as the main workload that triggers a cluster scale up: each Jupyter Kale replica requests enough resources so that Cluster Autoscaler adds a new node for it.
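
For reference, here is a minimal sketch of what such a baseline DaemonSet could look like. The name, namespace, image, and resource requests are illustrative, not the exact manifest used in our test.

# A minimal baseline DaemonSet sketch; name, namespace, image, and
# resource requests are illustrative.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: scale-baseline
  namespace: default
spec:
  selector:
    matchLabels:
      app: scale-baseline
  template:
    metadata:
      labels:
        app: scale-baseline
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # tiny image, becomes Ready almost instantly
        resources:
          requests:
            cpu: 10m
            memory: 16Mi
EOF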

Note

Depending on your cloud environment, networking setup, container registry, and type of workload, you might observe different times from the ones shown above. For example, pulling a large container image for your workload from a distant container registry will add extra overhead.

From the chart above we draw the following conclusions:

  • The baseline becomes ready soon after all EKS cluster nodes become ready.
  • Rok needs between 2 and 10 additional minutes to become ready, depending on cluster size.
  • Rok CSI becomes ready less than 5 minutes after Rok is ready. This is expected, since Rok CSI depends on Rok.
  • The workload becomes ready approximately 5 minutes after Rok CSI. Again, this is expected, since Rok CSI needs to mount the RWX Rok volume on each of the nodes.
  • The workload becomes ready approximately 7 minutes after the baseline for 100 nodes, 13 minutes after the baseline for 200 nodes, and 16 minutes after the baseline for 300 nodes. This represents the initialization overhead due to all workload replicas using a single Rok RWX volume.

Challenges

Scaling to hundreds of nodes comes with multiple challenges, such as optimizing resource usage, eliminating bottlenecks, and parallelizing operations. Below are some of the most important issues that you should be aware of.

Rok etcd Volume Type

Inevitably, on a large-scale EKF cluster that runs hundreds of Rok instances, the traffic from and to Rok etcd becomes significantly higher. In this context, for clusters larger than 100 nodes, we highly recommend that you switch from a gp2 to an io1 volume for Rok etcd, since io1 volumes are designed for I/O-intensive workloads that are sensitive to storage consistency.

This is to maintain a high quality of service and improve overall stability.
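
The exact procedure for changing the Rok etcd volume depends on your EKF deployment, so treat the following only as a rough sketch: assuming the etcd PersistentVolumeClaim is provisioned through a dedicated StorageClass backed by the EBS CSI driver, an io1 class could look like this (the class name and IOPS figure are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rok-etcd-io1            # placeholder name
provisioner: ebs.csi.aws.com
parameters:
  type: io1
  iopsPerGB: "50"               # tune to your etcd volume size and traffic
volumeBindingMode: WaitForFirstConsumer
EOF

How you point the Rok etcd PersistentVolumeClaim at such a StorageClass depends on how you deploy EKF.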

Rok Readiness

When scaling your EKF cluster to hundreds of nodes, Rok Pods will be created automatically on all eligible nodes. For each eligible node that becomes ready, Rok Operator adds exactly one new member to the Rok cluster. Since multiple cluster nodes become ready simultaneously, many Rok Pods start their initialization process and join the cluster in parallel.

This is known to cause the readiness probe of existing Rok Pods to temporarily fail. Therefore, it is expected that you see Rok Pods transition from Ready to NotReady during scale-up, until Rok and Rok CSI eventually become ready.
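
You can observe this behavior while the cluster scales by watching the Rok Pods; the namespace and label selector below are assumptions, so adjust them to match your installation:

# Watch Rok Pods flip between Ready and NotReady during the scale-up.
# The namespace and label selector are assumptions; adjust to your installation.
kubectl get pods -n rok -l app=rok -o wide --watch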

Resource Quotas

Depending on your AWS billing account, different limits on cloud resources might apply to your cluster. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that your account limits exceed the total amount of resources needed to spawn the new nodes.
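
For example, on AWS you can inspect your current On-Demand vCPU limit with the Service Quotas API. The quota code below corresponds to "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances" at the time of writing; verify it in your account before relying on it:

# Current vCPU limit for standard On-Demand instances in the active region.
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --query 'Quota.Value'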

Troubleshooting
vCPU Limit Exceeded

After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might be in Degraded state due to the following error:

Could not launch On-Demand Instances. VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of N allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.

In this case, contact AWS support and request an increase of the instance limits for your account. For more information on how to calculate the correct instance limits for your needs and request them from AWS, see the official user guide for On-Demand Instance limits.
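
You can also file the increase programmatically; the quota code and target value below are examples (300 m5d.4xlarge nodes at 16 vCPUs each require 4,800 vCPUs):

# Example: request a standard On-Demand vCPU limit of 4800 vCPUs.
aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --desired-value 4800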

Cloud Provider Capacity

Depending on your region and availability zone, the capacity of your cloud provider might vary. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that your cloud provider has enough available resources to spawn new Kubernetes nodes for your cluster.

Troubleshooting
Insufficient Instance Capacity

After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might be in Degraded state due to the following error:

Could not launch On-Demand Instances. InsufficientInstanceCapacity - We currently do not have sufficient m5d.4xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get m5d.4xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f. Launching EC2 instance failed.

In this case you can either:

  • Wait until AWS provisions additional capacity for your EKS cluster.
  • Contact AWS support to serve your request.
  • Add an additional node group to your cluster in a different Availability Zone (see the sketch below).
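
For the last option, the following eksctl sketch adds a node group in another Availability Zone; the cluster name, node group name, and sizes are placeholders:

# Add a node group in a different Availability Zone (names and sizes are placeholders).
eksctl create nodegroup \
    --cluster my-ekf-cluster \
    --name workers-us-east-1b \
    --node-type m5d.4xlarge \
    --node-zones us-east-1b \
    --nodes 1 --nodes-min 1 --nodes-max 100
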
Insufficient Free Addresses

After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might be in Degraded state due to the following error:

Amazon AutoScaling was unable to launch instances because there are not enough free addresses in the subnet associated with your AutoScaling group(s).

In this case you need to free up enough addresses in the specified subnet, for example, by deleting EKS cluster node groups or EC2 instances you no longer need.
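
To see how many free addresses remain in the subnets of your node group, you can query EC2 directly; the subnet ID below is a placeholder:

# Remaining free IP addresses in a subnet (the subnet ID is a placeholder).
aws ec2 describe-subnets \
    --subnet-ids subnet-0123456789abcdef0 \
    --query 'Subnets[].AvailableIpAddressCount'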

Docker Hub Rate Limiting

As stated in the official rate limiting documentation, Docker Hub limits the number of Docker image pulls based on the account type of the user pulling the image. Pull rate limits are based on the individual IP address:

  • For anonymous users, the rate limit is set to 100 pulls per 6 hours per IP address.
  • For authenticated users, the rate limit is set to 200 pulls per 6 hours.
  • For users with a paid Docker subscription no rate limit applies.

By default, Istio (the service mesh of EKF) specifies container images hosted on Docker Hub for its proxy sidecar. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that the pull limits of the Docker Hub account you are using cover your needs.
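
Following Docker's rate limiting documentation, you can check the limit and the remaining pulls that Docker Hub currently grants your cluster's IP address. The snippet below uses an anonymous token, so it reflects the anonymous, IP-based limit, and it requires curl and jq:

# Fetch an anonymous pull token and read the rate limit headers.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -s --head -H "Authorization: Bearer $TOKEN" \
    https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
    | grep -i ratelimit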

Troubleshooting
Image Pull BackOff

When running workloads in an Istio-protected Kubernetes namespace on a multi-hundred-node EKF cluster, you might exceed the maximum number of allowed pulls, which results in Pods entering the ImagePullBackOff phase due to the following error:

image "docker.io/istio/proxyv2:1.9.6": rpc error: code = Unknown desc = Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

In this case, you can either:

  • Use a private container registry to host the Istio proxy container image and set up cluster-wide access to it (see the sketch below).
  • Upgrade to a paid Docker subscription so that you are not rate limited.
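
As a rough sketch of the first option, you can mirror the Istio proxy image to a private registry such as Amazon ECR; the registry URL is a placeholder, and how you then point Istio at the mirrored image (its image hub setting) depends on how EKF deploys Istio:

# Mirror the Istio proxy image to a private registry (the ECR URL is a placeholder).
docker pull docker.io/istio/proxyv2:1.9.6
docker tag docker.io/istio/proxyv2:1.9.6 \
    123456789012.dkr.ecr.us-east-1.amazonaws.com/istio/proxyv2:1.9.6
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/istio/proxyv2:1.9.6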