Deploy Kiwi Components¶

This section will walk you through deploying the Kiwi components. More specifically, you will deploy the following:

Kiwi Device Plugin: Exposes Arrikto’s virtual GPU (vGPU) resources to users of the cluster.
Kiwi Scheduler: Manages a Kiwi-enabled NVIDIA GPU. The cluster runs a copy of the Kiwi Scheduler for each NVIDIA GPU that Kiwi manages.
Kiwi Admission Webhook: Adds a toleration to each Pod that requests a vGPU, ensuring it can run on GPU nodes.

Fast Forward

If you have already deployed the Kiwi components, proceed to the Verify section.

Overview

What You’ll Need
Procedure
Verify
Summary
What’s Next

What You’ll Need ¶

A configured management environment.
An existing Kubernetes cluster with the NVIDIA device plugin DaemonSet deployed.
Access for the Kiwi components to Arrikto’s private registry.

Procedure ¶

Go to your GitOps repository, inside your rok-tools management environment:

root@rok-tools:~# cd ~/ops/deployments
Edit rok/kiwi/overlays/deploy/patches/device-plugin.yaml and set the desired value for KIWI_SLOTS_PER_GPU:

env:
- name: KIWI_SLOTS_PER_GPU
  value: "10" # <-- Edit this value

Note

The environment variable KIWI_SLOTS_PER_GPU defines how many arrikto.com/gpu resources the device plugin will create for each physical GPU it has consumed from its node. In simpler terms, this dictates how many Kubernetes Pods can concurrently run on the same physical GPU.

Important

Kiwi Device Plugin can currently only use one physical GPU per node.
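
As a rough capacity check, the total number of arrikto.com/gpu resources in the cluster follows from the two points above: each Kiwi-enabled node contributes one physical GPU, and each GPU is split into KIWI_SLOTS_PER_GPU slots. The sketch below is illustrative only; the function name kiwi_total_vgpus is hypothetical and is not part of Kiwi or rok-tools.

```shell
# Hypothetical helper (not part of Kiwi): estimate the total number of
# arrikto.com/gpu resources, assuming one physical GPU per Kiwi-enabled node.
kiwi_total_vgpus() {
    nodes=$1            # number of Kiwi-enabled GPU nodes
    slots_per_gpu=$2    # value of KIWI_SLOTS_PER_GPU
    echo $(( nodes * slots_per_gpu ))
}

# Example: 10 GPU nodes with KIWI_SLOTS_PER_GPU set to "10".
kiwi_total_vgpus 10 10
```

With these example values, the cluster would advertise 100 arrikto.com/gpu resources in total.
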
Edit rok/kiwi/overlays/deploy/patches/scheduler.yaml, and set the desired values for KIWI_SCHEDULER_ON and KIWI_TQ:

env:
- name: KIWI_SCHEDULER_ON
  value: "1" # <-- Edit this value
- name: KIWI_TQ
  value: "30" # <-- Edit this value

Note

KIWI_SCHEDULER_ON defines the initial status for all Kiwi Scheduler instances. This variable accepts the values "1" and "0". A value of "1" means that the scheduler is initially on for all Kiwi-enabled nodes. A value of "0" means it is disabled. We recommend you keep the scheduler enabled, that is, use a value of "1". Check out how you can configure the Kiwi Scheduler.

KIWI_TQ defines the initial time quantum for all Kiwi Scheduler instances. This value is the number of seconds for which the scheduler grants each client exclusive access to the GPU, in a round-robin manner. It matters only while the Kiwi Scheduler is on. Check out how you can configure the time quantum for the Kiwi Scheduler.

Important

Turning the Kiwi Scheduler off may cause thrashing and extreme performance degradation when the working set sizes of the collocated GPU applications do not fit in GPU memory.

Setting KIWI_TQ to a value smaller than five seconds may reduce performance significantly.
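
To get a feel for how KIWI_TQ interacts with the number of Pods sharing a GPU, the following sketch estimates how long a client may wait for its next turn. It assumes a plain round-robin in which every other client is active and uses its full time quantum; the actual behavior of the Kiwi Scheduler may differ, and the function name kiwi_max_wait_seconds is hypothetical.

```shell
# Hypothetical helper (not part of Kiwi): upper bound on the wait between two
# turns of the same client, assuming strict round-robin with full quanta.
kiwi_max_wait_seconds() {
    clients=$1   # number of Pods actively sharing the GPU
    tq=$2        # value of KIWI_TQ, in seconds
    echo $(( (clients - 1) * tq ))
}

# Example: 10 active clients with KIWI_TQ set to "30".
kiwi_max_wait_seconds 10 30
```

Under these assumptions, each of ten active clients could wait up to 270 seconds between turns, which is worth keeping in mind when you choose both KIWI_TQ and KIWI_SLOTS_PER_GPU.
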

Note

To enable the debug logs for the Kiwi Scheduler, add the following snippet as well:

- name: KIWI_DEBUG
  value: "1"
Commit your changes:

root@rok-tools:~/ops/deployments# git commit -am "Deploy Kiwi Components"
Deploy the kiwi-system namespace:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi-namespaces/overlays/deploy
Deploy the Kiwi Device Plugin and the Kiwi Scheduler:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi/overlays/deploy
Deploy the Kiwi Admission Webhook:

root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi-webhook/overlays/deploy

Verify ¶

Verify that the Kiwi Device Plugin Pods are up and running.
1. Ensure that the kiwi-device-plugin DaemonSet exists:
  
  root@rok-tools:~# kubectl get -n kiwi-system daemonset kiwi-device-plugin \
  >     -o json >/dev/null 2>&1 \
  >     && echo OK || echo FAIL
  OK
2. Count the number of GPU-enabled nodes in your cluster:
  
  root@rok-tools:~# GPU_NODE_COUNT=$(kubectl get nodes -o json \
  >     | jq -r '[ .items[] | select(.status.allocatable["nvidia.com/gpu"] != null) ] | length') \
  >     && echo ${GPU_NODE_COUNT?}
  10
3. Count the number of Arrikto vGPU-enabled nodes in your cluster:
  
  root@rok-tools:~# VGPU_NODE_COUNT=$(kubectl get nodes -o json \
  >     | jq -r '[ .items[] | select(.status.allocatable["arrikto.com/gpu"] != null) ] | length') \
  >     && echo ${VGPU_NODE_COUNT?}
  10
4. Ensure that a Kiwi Device Plugin Pod has registered its Arrikto vGPU resources on every GPU-enabled node in your cluster:
  
  root@rok-tools:~# [[ ${VGPU_NODE_COUNT?} == ${GPU_NODE_COUNT?} ]] \
  >     && echo OK || echo FAIL
  OK
  Troubleshooting
  The output is FAIL
  
  This probably means that some or all GPU nodes had no available nvidia.com/gpu devices when you deployed the Kiwi Device Plugin, because other Pods were consuming them. Once a GPU becomes available on a node, the Kiwi Device Plugin will use it. In this case, follow these steps:
  
  Check the number of Arrikto vGPU-enabled nodes:
  
  root@rok-tools:~# echo ${VGPU_NODE_COUNT?}
  7
  
  Kiwi is fully functional for each of these nodes.
  
  Print a list of the GPU nodes on which Kiwi Device Plugin is not yet running:
  
  root@rok-tools:~# kubectl get nodes -o json \
  >     | jq -r '.items[] | select((.status.allocatable["nvidia.com/gpu"] != null) and (.status.allocatable["arrikto.com/gpu"] == null)) | .metadata.name'
  ip-192-168-109-143.eu-central-1.compute.internal
  ip-192-168-78-106.eu-central-1.compute.internal
  
  Optional
  
  For each node found in the previous step, identify Pods consuming nvidia.com/gpu resources.
  
  Set the name of the node on which Kiwi Device Plugin is not yet running:
  
  root@rok-tools:~# NODE_NAME=ip-192-168-109-143.eu-central-1.compute.internal
  
  Identify the Pods for this node that are consuming nvidia.com/gpu resources:
  
  root@rok-tools:~# kubectl get pods --all-namespaces -o json \
  >     --field-selector spec.nodeName=${NODE_NAME?} \
  >     | jq -r '.items[] | select(.status.phase == "Running") | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null)) | "\(.metadata.namespace)/\(.metadata.name)"'
  
  Note
  
  You can terminate these Pods if you do not need them and you want the Kiwi Device Plugin to start running on these nodes sooner.
Verify that the Kiwi Scheduler Pods are up and running. Check the DaemonSet status and verify that the value of field READY is equal to the value of field DESIRED for the DaemonSet:

root@rok-tools:~# kubectl get -n kiwi-system daemonset kiwi-scheduler
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kiwi-scheduler   2         2         2       2            2           <none>          5m
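
If you prefer a scripted check in the same OK/FAIL style used for the Kiwi Device Plugin, the READY versus DESIRED comparison can be sketched as below. The helper name kiwi_ds_ready is hypothetical. The commented usage lines assume the standard DaemonSet status fields numberReady and desiredNumberScheduled, and they are not executed here because they need a live cluster.

```shell
# Hypothetical helper (not part of Kiwi): print OK when the number of ready
# Pods equals the desired number, FAIL otherwise (including empty input).
kiwi_ds_ready() {
    ready=$1
    desired=$2
    if [ -n "$ready" ] && [ "$ready" = "$desired" ]; then
        echo OK
    else
        echo FAIL
    fi
}

# Usage against a live cluster (assumes kubectl access to kiwi-system):
#   set -- $(kubectl get -n kiwi-system daemonset kiwi-scheduler \
#       -o jsonpath='{.status.numberReady} {.status.desiredNumberScheduled}')
#   kiwi_ds_ready "$1" "$2"

# Example with the values from the sample output above.
kiwi_ds_ready 2 2
```
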
Verify that the Kiwi Webhook is properly deployed.
1. Ensure that the Deployment Pod is up and running. Verify that field READY is 1/1:
  
  root@rok-tools:~# kubectl get deploy -n kiwi-system kiwi-webhook
  NAME           READY   AGE
  kiwi-webhook   1/1     1m
2. Ensure that the Service exists:
  
  root@rok-tools:~# kubectl get service -n kiwi-system kiwi-webhook
  NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
  kiwi-webhook   ClusterIP   10.100.64.46   <none>        443/TCP   22d
3. Ensure that the MutatingWebhookConfiguration exists:
  
  root@rok-tools:~# kubectl get mutatingwebhookconfiguration kiwi-webhook
  NAME           WEBHOOKS   AGE
  kiwi-webhook   1          22d
4. Ensure that the Certificate exists:
  
  root@rok-tools:~# kubectl get certificate -n kiwi-system kiwi-webhook-cert
  NAME                READY   SECRET               AGE
  kiwi-webhook-cert   True    kiwi-webhook-certs   22d

Summary ¶

You have successfully deployed the Kiwi components to your cluster.

What’s Next ¶

The next step is to expose running services in your cluster to the outside world.

Expose EKF
