Deploy Kiwi Components
This section will walk you through deploying the Kiwi components. More specifically, you will deploy the following:
- Kiwi Device Plugin: Exposes Arrikto’s virtual GPU (vGPU) resources to users of the cluster.
- Kiwi Scheduler: Manages a Kiwi-enabled NVIDIA GPU. The cluster runs a copy of the Kiwi Scheduler for each NVIDIA GPU that Kiwi manages.
- Kiwi Admission Webhook: Adds a toleration to each Pod that requests a vGPU, ensuring it can run on GPU nodes.
Fast Forward
If you have already deployed the Kiwi components, proceed to the Verify section.
What You’ll Need
- A configured management environment.
- An existing Kubernetes cluster with the NVIDIA device plugin DaemonSet deployed.
- Access to Arrikto’s private registry for the Kiwi components.
Procedure
Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments

Edit rok/kiwi/overlays/deploy/patches/device-plugin.yaml and set the desired value for KIWI_SLOTS_PER_GPU:

    env:
    - name: KIWI_SLOTS_PER_GPU
      value: "10" # <-- Edit this value

Note
The environment variable KIWI_SLOTS_PER_GPU defines how many arrikto.com/gpu resources the device plugin will create for each physical GPU it has consumed from its node. In simpler terms, this dictates how many Kubernetes Pods can concurrently run on the same physical GPU.

Important
Kiwi Device Plugin can currently only use one physical GPU per node.
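
Because the device plugin currently uses one physical GPU per node, the total number of arrikto.com/gpu resources in the cluster can be estimated as the number of Kiwi-enabled GPU nodes multiplied by KIWI_SLOTS_PER_GPU. The following is a minimal sketch of that arithmetic. The function name kiwi_expected_vgpus is hypothetical and not part of Kiwi.

```shell
# Hypothetical helper (not part of Kiwi): estimate how many arrikto.com/gpu
# resources the cluster advertises, assuming one Kiwi-managed GPU per node.
kiwi_expected_vgpus() {
    # $1: number of Kiwi-enabled GPU nodes
    # $2: value of KIWI_SLOTS_PER_GPU
    echo $(( $1 * $2 ))
}
```

For example, `kiwi_expected_vgpus 2 10` prints 20: two GPU nodes with ten slots each.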
Edit rok/kiwi/overlays/deploy/patches/scheduler.yaml and set the desired values for KIWI_SCHEDULER_ON and KIWI_TQ:

    env:
    - name: KIWI_SCHEDULER_ON
      value: "1" # <-- Edit this value
    - name: KIWI_TQ
      value: "30" # <-- Edit this value

Note
KIWI_SCHEDULER_ON defines the initial status for all Kiwi Scheduler instances. This variable accepts the values "1" and "0". A value of "1" means that the scheduler is initially on for all Kiwi-enabled nodes. A value of "0" means that it is initially off. We recommend that you keep the scheduler enabled, that is, use a value of "1". Check out how you can configure the Kiwi Scheduler.

KIWI_TQ defines the initial time quantum for all Kiwi Scheduler instances. This value is the number of seconds for which the scheduler grants each client exclusive access to the GPU, in a round-robin manner. It matters only while the Kiwi Scheduler is on. Check out how you can configure the time quantum for the Kiwi Scheduler.

Important
Turning the Kiwi Scheduler off may cause thrashing and extreme performance degradation when the working sets of the collocated GPU applications do not fit in GPU memory.

Setting KIWI_TQ to a value smaller than five seconds may reduce performance significantly.

Note
To enable the debug logs for the Kiwi Scheduler, add the following snippet as well:

    - name: KIWI_DEBUG
      value: "1"

Commit your changes:

    root@rok-tools:~/ops/deployments# git commit -am "Deploy Kiwi Components"

Deploy the kiwi-system namespace:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi-namespaces/overlays/deploy

Deploy the Kiwi Device Plugin and the Kiwi Scheduler:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi/overlays/deploy

Deploy the Kiwi Admission Webhook:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi-webhook/overlays/deploy
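
The constraints on KIWI_SCHEDULER_ON and KIWI_TQ described in this procedure can be checked before you commit. The sketch below is only an illustration: both function names are hypothetical, the check assumes that KIWI_TQ is a whole number of seconds, and the rotation estimate assumes that every client uses its full time quantum.

```shell
# Hypothetical pre-commit check (not part of Kiwi or rok-deploy).
# $1: intended KIWI_SCHEDULER_ON value, $2: intended KIWI_TQ value (seconds).
kiwi_check_scheduler_env() {
    case "$1" in
        0|1) ;;
        *) echo "ERROR: KIWI_SCHEDULER_ON accepts only \"0\" or \"1\""
           return 1 ;;
    esac
    # Assumption of this sketch: the time quantum is a whole number of seconds.
    case "$2" in
        ''|*[!0-9]*) echo "ERROR: expected a whole number of seconds for KIWI_TQ"
                     return 1 ;;
    esac
    if [ "$2" -lt 5 ]; then
        echo "WARNING: KIWI_TQ below five seconds may reduce performance significantly"
    fi
    if [ "$1" = "0" ]; then
        echo "WARNING: scheduler off; workloads that do not fit in GPU memory may thrash"
    fi
    echo "OK"
}

# Rough length of one round-robin rotation, assuming each client holds the
# GPU for a full time quantum.
# $1: number of clients sharing the GPU, $2: KIWI_TQ in seconds.
kiwi_rotation_seconds() {
    echo $(( $1 * $2 ))
}
```

With the values used in this guide, `kiwi_check_scheduler_env 1 30` prints OK, and `kiwi_rotation_seconds 10 30` prints 300, that is, ten clients on one GPU would each get a turn roughly every five minutes under the stated assumption.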
Verify
Verify that the Kiwi Device Plugin Pods are up and running.

Ensure that the kiwi-device-plugin DaemonSet exists:

    root@rok-tools:~# kubectl get -n kiwi-system daemonset kiwi-device-plugin \
    >     -o json >/dev/null 2>&1 \
    >     && echo OK || echo FAIL
    OK

Count the number of GPU-enabled nodes in your cluster:

    root@rok-tools:~# GPU_NODE_COUNT=$(kubectl get nodes -o json \
    >     | jq -r '[ .items[] | select(.status.allocatable["nvidia.com/gpu"] != null) ] | length') \
    >     && echo ${GPU_NODE_COUNT?}
    10

Count the number of Arrikto vGPU-enabled nodes in your cluster:

    root@rok-tools:~# VGPU_NODE_COUNT=$(kubectl get nodes -o json \
    >     | jq -r '[ .items[] | select(.status.allocatable["arrikto.com/gpu"] != null) ] | length') \
    >     && echo ${VGPU_NODE_COUNT?}
    10

Ensure that a Kiwi Device Plugin Pod has registered its Arrikto vGPU resources on every GPU-enabled node in your cluster:

    root@rok-tools:~# [[ ${VGPU_NODE_COUNT?} == ${GPU_NODE_COUNT?} ]] \
    >     && echo OK || echo FAIL
    OK

Troubleshooting
The output is FAIL

This probably means that some or all GPU nodes did not have any available nvidia.com/gpu devices when you deployed Kiwi Device Plugin, because other Pods were consuming them. Once a GPU becomes available on a node, Kiwi Device Plugin will use it. In this case, run the following steps:

Check the number of Arrikto vGPU-enabled nodes:

    root@rok-tools:~# echo ${VGPU_NODE_COUNT?}
    7

Kiwi is fully functional on each of these nodes.

Print a list of the GPU nodes on which Kiwi Device Plugin is not yet running:

    root@rok-tools:~# kubectl get nodes -o json \
    >     | jq -r '.items[] | select((.status.allocatable["nvidia.com/gpu"] != null) and (.status.allocatable["arrikto.com/gpu"] == null)) | .metadata.name'
    ip-192-168-109-143.eu-central-1.compute.internal
    ip-192-168-78-106.eu-central-1.compute.internal

Optional
For each node found in the previous step, identify the Pods consuming nvidia.com/gpu resources.

Set the name of the node on which Kiwi Device Plugin is not yet running:

    root@rok-tools:~# NODE_NAME=ip-192-168-109-143.eu-central-1.compute.internal

Identify the Pods on this node that are consuming nvidia.com/gpu resources:

    root@rok-tools:~# kubectl get pods --all-namespaces -o json \
    >     --field-selector spec.nodeName=${NODE_NAME?} \
    >     | jq -r '.items[] | select(.status.phase == "Running") | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null)) | "\(.metadata.namespace)/\(.metadata.name)"'

Note
You can terminate these Pods if you do not need them and you want the Kiwi Device Plugin to start running on these nodes sooner.
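
Since Kiwi Device Plugin picks up a GPU once it becomes available, the check can simply be repeated after the blocking Pods terminate. The sketch below shows one way to do that, under stated assumptions: both function names are hypothetical, the retry count and delay are arbitrary, and the node check is a condensed variant of the jq filters above that requires kubectl and jq.

```shell
# Hypothetical helper: re-run a command until it succeeds or the attempts
# run out. $1: maximum attempts, $2: seconds between attempts, rest: command.
retry_until_ok() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            echo OK
            return 0
        fi
        i=$(( i + 1 ))
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    echo FAIL
    return 1
}

# Succeeds only if every node that exposes nvidia.com/gpu also exposes
# arrikto.com/gpu. Note that it also succeeds when there are no GPU nodes.
all_gpu_nodes_have_vgpus() {
    kubectl get nodes -o json \
        | jq -e '[ .items[]
                   | select(.status.allocatable["nvidia.com/gpu"] != null)
                   | .status.allocatable["arrikto.com/gpu"] != null ] | all' \
        >/dev/null
}
```

For example, `retry_until_ok 10 30 all_gpu_nodes_have_vgpus` repeats the check up to ten times, 30 seconds apart, and prints OK as soon as it passes or FAIL if it never does.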
Verify that the Kiwi Scheduler Pods are up and running. Check the DaemonSet status and verify that the value of field READY is equal to the value of field DESIRED for the DaemonSet:

    root@rok-tools:~# kubectl get -n kiwi-system daemonset kiwi-scheduler
    NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    kiwi-scheduler   2         2         2       2            2           <none>          5m

Verify that the Kiwi Webhook is properly deployed.

Ensure that the Deployment Pod is up and running. Verify that field READY is 1/1:

    root@rok-tools:~# kubectl get deploy -n kiwi-system kiwi-webhook
    NAME           READY   AGE
    kiwi-webhook   1/1     1m

Ensure that the Service exists:

    root@rok-tools:~# kubectl get service -n kiwi-system kiwi-webhook
    NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
    kiwi-webhook   ClusterIP   10.100.64.46   <none>        443/TCP   22d

Ensure that the MutatingWebhookConfiguration exists:

    root@rok-tools:~# kubectl get mutatingwebhookconfiguration kiwi-webhook
    NAME           WEBHOOKS   AGE
    kiwi-webhook   1          22d

Ensure that the Certificate exists:

    root@rok-tools:~# kubectl get certificate -n kiwi-system kiwi-webhook-cert
    NAME                READY   SECRET               AGE
    kiwi-webhook-cert   True    kiwi-webhook-certs   22d
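
The READY-versus-DESIRED comparison for the kiwi-scheduler DaemonSet above can also be scripted instead of read by eye. The following is a minimal sketch, assuming the default table layout of `kubectl get daemonset` shown in this section, where DESIRED is the second column and READY is the fourth. The function name daemonset_ready is hypothetical.

```shell
# Hypothetical helper: read `kubectl get daemonset` table output on stdin and
# succeed only if READY equals DESIRED on every data row.
daemonset_ready() {
    awk '
        NR == 1 { next }                    # skip the header row
        { rows++; if ($2 != $4) bad++ }     # $2 is DESIRED, $4 is READY
        END {
            if (rows > 0 && bad == 0) exit 0
            exit 1
        }'
}
```

A possible invocation is `kubectl get -n kiwi-system daemonset kiwi-scheduler | daemonset_ready && echo OK || echo FAIL`, which follows the OK/FAIL pattern used by the other checks in this section.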
What’s Next
The next step is to expose running services in your cluster to the outside world.