Deploy Kiwi Components

This section will walk you through deploying the Kiwi components. More specifically, you will deploy the following:

  • Kiwi Device Plugin: Exposes Arrikto’s virtual GPU (vGPU) resources to users of the cluster.
  • Kiwi Scheduler: Manages a Kiwi-enabled NVIDIA GPU. The cluster runs a copy of the Kiwi Scheduler for each NVIDIA GPU that Kiwi manages.
  • Kiwi Admission Webhook: Adds a toleration to each Pod that requests a vGPU, ensuring it can run on GPU nodes.

Procedure

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. Edit rok/kiwi/overlays/deploy/patches/device-plugin.yaml and set the desired value for KIWI_SLOTS_PER_GPU:

    env:
    - name: KIWI_SLOTS_PER_GPU
      value: "10"  # <-- Edit this value

    Note

    The environment variable KIWI_SLOTS_PER_GPU defines how many arrikto.com/gpu resources the device plugin will create for each physical GPU it has consumed from its node. In simpler terms, this dictates how many Kubernetes Pods can concurrently run on the same physical GPU.

    Important

    Kiwi Device Plugin can currently only use one physical GPU per node.
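As a rough sizing aid, the total number of arrikto.com/gpu resources in the cluster can be estimated from the two statements above: one Kiwi-managed physical GPU per node, and KIWI_SLOTS_PER_GPU resources per GPU. The helper name `total_vgpus` and the sample numbers are illustrative assumptions, not part of Kiwi.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kiwi): estimate how many arrikto.com/gpu
# resources the cluster exposes, assuming one Kiwi-managed GPU per node.
total_vgpus() {
    local gpu_nodes=$1       # number of GPU nodes running the device plugin
    local slots_per_gpu=$2   # the value you set for KIWI_SLOTS_PER_GPU
    echo $(( gpu_nodes * slots_per_gpu ))
}

# Example: 3 GPU nodes with KIWI_SLOTS_PER_GPU="10"
total_vgpus 3 10
```

With these example values, up to 30 Pods requesting one vGPU each could be scheduled at the same time.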

  3. Edit rok/kiwi/overlays/deploy/patches/scheduler.yaml, and set the desired values for KIWI_SCHEDULER_ON and KIWI_TQ:

    env:
    - name: KIWI_SCHEDULER_ON
      value: "1"  # <-- Edit this value
    - name: KIWI_TQ
      value: "30"  # <-- Edit this value

    Note

    KIWI_SCHEDULER_ON defines the initial status for all Kiwi Scheduler instances. This variable accepts the values "1" and "0". A value of "1" means that the scheduler is initially on for all Kiwi-enabled nodes. A value of "0" means it is disabled. We recommend you keep the scheduler enabled, that is, use a value of "1". Check out how you can configure the Kiwi Scheduler.

    KIWI_TQ defines the initial time quantum for all Kiwi Scheduler instances. This value is the number of seconds for which the scheduler grants each client exclusive access to the GPU, in a round-robin manner. It takes effect only while the Kiwi Scheduler is on. Check out how you can configure the time quantum for the Kiwi Scheduler.

    Important

    Turning the Kiwi Scheduler off may cause thrashing and extreme performance degradation when the working set sizes of the collocated GPU applications do not fit in GPU memory.

    Setting KIWI_TQ to a value smaller than five seconds may reduce performance significantly.

    Note

    To enable the debug logs for the Kiwi Scheduler, add the following snippet as well:

    - name: KIWI_DEBUG
      value: "1"
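When choosing KIWI_TQ, it can help to estimate how long a client might wait for its next turn on the GPU. The sketch below assumes a plain round-robin in which every other client uses its full time quantum, and it ignores any switching overhead. The helper name `max_wait_seconds` is hypothetical and the actual Kiwi Scheduler behavior may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kiwi): worst-case seconds a client waits
# for its next turn, assuming a plain round-robin where each of the other
# clients holds the GPU for one full time quantum.
max_wait_seconds() {
    local clients=$1   # number of Pods sharing the same physical GPU
    local tq=$2        # the value you set for KIWI_TQ, in seconds
    if (( clients < 1 )); then
        echo 0
        return
    fi
    echo $(( (clients - 1) * tq ))
}

# Example: 4 Pods sharing one GPU with KIWI_TQ="30"
max_wait_seconds 4 30
```

Under these assumptions, a larger time quantum means fewer switches between clients but a longer wait for each client's next turn.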
  4. Commit your changes:

    root@rok-tools:~/ops/deployments# git commit -am "Deploy Kiwi Components"
  5. Deploy the kiwi-system namespace:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi-namespaces/overlays/deploy
  6. Deploy the Kiwi Device Plugin and the Kiwi Scheduler:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi/overlays/deploy
  7. Deploy the Kiwi Admission Webhook:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/kiwi-webhook/overlays/deploy

Verify

  1. Verify that the Kiwi Device Plugin Pods are up and running.

    1. Ensure that the kiwi-device-plugin DaemonSet exists:

      root@rok-tools:~# kubectl get -n kiwi-system daemonset kiwi-device-plugin \
      >     -o json >/dev/null 2>&1 \
      >     && echo OK || echo FAIL
      OK
    2. Count the number of GPU-enabled nodes in your cluster:

      root@rok-tools:~# GPU_NODE_COUNT=$(kubectl get nodes -o json \
      >     | jq -r '[ .items[] | select(.status.allocatable["nvidia.com/gpu"] != null) ] | length') \
      >     && echo ${GPU_NODE_COUNT?}
      10
    3. Count the number of Arrikto vGPU-enabled nodes in your cluster:

      root@rok-tools:~# VGPU_NODE_COUNT=$(kubectl get nodes -o json \
      >     | jq -r '[ .items[] | select(.status.allocatable["arrikto.com/gpu"] != null) ] | length') \
      >     && echo ${VGPU_NODE_COUNT?}
      10
    4. Ensure that a Kiwi Device Plugin Pod has registered its Arrikto vGPU resources on every GPU-enabled node in your cluster:

      root@rok-tools:~# [[ ${VGPU_NODE_COUNT?} == ${GPU_NODE_COUNT?} ]] \
      >     && echo OK || echo FAIL
      OK

      Troubleshooting

      The output is FAIL

      This probably means that some or all GPU nodes had no available nvidia.com/gpu devices when you deployed the Kiwi Device Plugin, because other Pods are currently consuming them. Once a GPU becomes available on a node, the Kiwi Device Plugin will start using it. In this case, follow these steps:

      1. Check the number of Arrikto vGPU-enabled nodes:

        root@rok-tools:~# echo ${VGPU_NODE_COUNT?}
        7

        Kiwi is fully functional for each of these nodes.

      2. Print a list of the GPU nodes on which Kiwi Device Plugin is not yet running:

        root@rok-tools:~# kubectl get nodes -o json \
        >     | jq -r '.items[] | select((.status.allocatable["nvidia.com/gpu"] != null) and (.status.allocatable["arrikto.com/gpu"] == null)) | .metadata.name'
        ip-192-168-109-143.eu-central-1.compute.internal
        ip-192-168-78-106.eu-central-1.compute.internal
      3. Optional

        For each node found in the previous step, identify Pods consuming nvidia.com/gpu resources.

        1. Set the name of the node on which Kiwi Device Plugin is not yet running:

          root@rok-tools:~# NODE_NAME=ip-192-168-109-143.eu-central-1.compute.internal
        2. Identify the Pods for this node that are consuming nvidia.com/gpu resources:

          root@rok-tools:~# kubectl get pods --all-namespaces -o json \
          >     --field-selector spec.nodeName=${NODE_NAME?} \
          >     | jq -r '.items[] | select(.status.phase == "Running") | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null)) | "\(.metadata.namespace)/\(.metadata.name)"'

        Note

        You can terminate these Pods if you do not need them and you want the Kiwi Device Plugin to start running on these nodes sooner.
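To see how many GPU nodes are still waiting for the Kiwi Device Plugin, you can subtract the two node counts gathered earlier in this verification. This is a minimal sketch: the helper name `pending_gpu_nodes` is hypothetical and the numbers in the example call are sample values only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kiwi): number of GPU nodes that expose
# nvidia.com/gpu but do not yet expose arrikto.com/gpu.
pending_gpu_nodes() {
    local gpu_nodes=$1    # e.g. the value of GPU_NODE_COUNT
    local vgpu_nodes=$2   # e.g. the value of VGPU_NODE_COUNT
    echo $(( gpu_nodes - vgpu_nodes ))
}

# Example with sample values: 5 GPU nodes, 3 of them already vGPU-enabled
pending_gpu_nodes 5 3
```

In your own shell session you would call it as `pending_gpu_nodes "${GPU_NODE_COUNT?}" "${VGPU_NODE_COUNT?}"`, using the variables set in the steps above. A result of 0 corresponds to the OK case.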

  2. Verify that the Kiwi Scheduler Pods are up and running. Check the DaemonSet status and verify that the value in the READY column equals the value in the DESIRED column:

    root@rok-tools:~# kubectl get -n kiwi-system daemonset kiwi-scheduler
    NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    kiwi-scheduler   2         2         2       2            2           <none>          5m
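If you prefer an OK/FAIL answer like the one used for the Kiwi Device Plugin, the READY versus DESIRED comparison can be scripted. The sketch below is an assumption-laden example, not an official check: the helper name `daemonset_ready` is hypothetical, and it expects the two counts on standard input, for instance from the standard DaemonSet status fields `desiredNumberScheduled` and `numberReady`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kiwi): read "DESIRED READY" from stdin
# and print OK when both counts are present and equal, FAIL otherwise.
daemonset_ready() {
    local desired ready
    # read returns non-zero on input without a trailing newline; ignore that.
    read -r desired ready || true
    if [[ -n ${desired} && ${desired} == "${ready}" ]]; then
        echo OK
    else
        echo FAIL
    fi
}

# Against a live cluster (not executed in this example):
# kubectl get -n kiwi-system daemonset kiwi-scheduler \
#     -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}' \
#     | daemonset_ready

# Example with the sample values from the output above
echo "2 2" | daemonset_ready
```

The helper prints FAIL when the input is empty, so a missing DaemonSet is reported as a failure rather than silently passing.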
  3. Verify that the Kiwi Webhook is properly deployed.

    1. Ensure that the Deployment Pod is up and running. Verify that field READY is 1/1:

      root@rok-tools:~# kubectl get deploy -n kiwi-system kiwi-webhook
      NAME           READY   AGE
      kiwi-webhook   1/1     1m
    2. Ensure that the Service exists:

      root@rok-tools:~# kubectl get service -n kiwi-system kiwi-webhook
      NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
      kiwi-webhook   ClusterIP   10.100.64.46   <none>        443/TCP   22d
    3. Ensure that the MutatingWebhookConfiguration exists:

      root@rok-tools:~# kubectl get mutatingwebhookconfiguration kiwi-webhook
      NAME           WEBHOOKS   AGE
      kiwi-webhook   1          22d
    4. Ensure that the Certificate exists:

      root@rok-tools:~# kubectl get certificate -n kiwi-system kiwi-webhook-cert
      NAME                READY   SECRET               AGE
      kiwi-webhook-cert   True    kiwi-webhook-certs   22d

Summary

You have successfully deployed the Kiwi components to your cluster.

What’s Next

The next step is to expose running services in your cluster to the outside world.