Currently, Kubernetes assigns GPUs exclusively to Pods. This is especially inefficient in interactive scenarios, such as development on a Jupyter notebook server, in which a Pod, or the application running in it, is idle for long periods.

Kiwi is a GPU sharing mechanism that enables multiple containers (belonging to the same or different Pods) to run on the same GPU concurrently, each one having the whole GPU memory available for use. It achieves this by transparently paging out the GPU memory of idle processes, using the system RAM as swap space.


This feature is a Tech Preview, so it is not fully supported by Arrikto and may not be functionally complete. While it is not intended for production use, we encourage you to try it out and provide us feedback.

Read our Tech Preview Policy for more information.

Also, check out the Kiwi User Guide to find out more on how to use vGPUs.

Schedule Arrikto vGPUs on Kubernetes

The Kiwi Device Plugin advertises multiple Arrikto virtual GPUs (vGPUs) for each Kiwi-enabled GPU it manages. As a result, the Kubernetes scheduler can assign multiple Pods that request a vGPU to the same physical GPU. Here’s a simple visualization:

Kiwi Scheduler


Each instance of the Kiwi Scheduler manages a single Kiwi-enabled NVIDIA GPU. When we refer to the Kiwi Scheduler in the singular, we mean an arbitrary instance of it.

When the combined GPU memory usage of the collocated applications fits in GPU memory, the applications can run in parallel without any intervention.

However, when the combined memory usage exceeds the total GPU memory, Kiwi must enforce serialization of GPU work among the applications in order to avoid thrashing. Thrashing is a situation in which time spent handling page faults overwhelms time spent doing useful computations.
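
The condition above can be sketched as a simple check. This is only an illustration of the reasoning, not part of Kiwi: the function name `needs_serialization` and the memory figures are hypothetical.

```python
def needs_serialization(working_sets_mib, gpu_memory_mib):
    """Return True when the combined working sets exceed GPU memory.

    working_sets_mib: GPU memory used by each collocated application (MiB).
    gpu_memory_mib: total memory of the physical GPU (MiB).
    """
    return sum(working_sets_mib) > gpu_memory_mib


# Hypothetical example: a 16 GiB GPU shared by two notebooks.
GPU_MEMORY_MIB = 16 * 1024

# 6 GiB + 8 GiB fits, so both applications can run in parallel.
print(needs_serialization([6 * 1024, 8 * 1024], GPU_MEMORY_MIB))    # False

# 10 GiB + 12 GiB does not fit, so GPU work must be serialized.
print(needs_serialization([10 * 1024, 12 * 1024], GPU_MEMORY_MIB))  # True
```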

Kiwi offers an anti-thrashing mechanism via the Kiwi Scheduler. The Kiwi Scheduler assigns exclusive usage of the whole GPU to a single application at a time, rotating between competing applications in a round-robin fashion. Each application can use the GPU for a time quantum (TQ seconds).
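
To make the rotation concrete, here is a minimal simulation of round-robin time slicing. It is a sketch under simplifying assumptions, not the Kiwi Scheduler's implementation: all applications are assumed to request the GPU at time 0 with a positive amount of GPU work, and the names `round_robin`, `bursts`, and `tq` are hypothetical.

```python
from collections import deque


def round_robin(bursts, tq):
    """Simulate exclusive GPU access rotated in round-robin order.

    bursts: dict mapping an application name to the seconds of GPU work
            it wants to run (all requested at time 0, all positive).
    tq: time quantum in seconds.
    Returns a list of (app, start, end) slots in the order they are granted.
    """
    queue = deque(bursts.items())
    now = 0
    slots = []
    while queue:
        app, remaining = queue.popleft()
        # The application holds the GPU for at most one time quantum.
        run = min(remaining, tq)
        slots.append((app, now, now + run))
        now += run
        # Unfinished work goes to the back of the queue.
        if remaining > run:
            queue.append((app, remaining - run))
    return slots


for slot in round_robin({"A": 25, "B": 10}, tq=10):
    print(slot)
```

With these inputs, A and B alternate in slots of at most 10 seconds until B finishes, after which A keeps receiving the GPU until its work is done.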


The Kiwi Scheduler is not related to the Kubernetes scheduler in any way. The Kiwi Scheduler manages one Kiwi-enabled physical NVIDIA GPU within a single node. It “schedules” exclusive access to the GPU for each time quantum (TQ).


You can configure the Kiwi Scheduler’s TQ. See the related Operations guide.


By default, the Kiwi Scheduler is enabled, meaning that anti-thrashing is enabled. If you disable it without ensuring that the working sets (GPU memory) of collocated applications fit in GPU memory, you can cause thrashing and, hence, severe performance degradation.

See how you can enable or disable the Kiwi Scheduler.

Example Timeline of Kiwi Applications

Let’s examine a graph that shows the execution timeline of two different applications using vGPUs and running on the same physical GPU. We start examining their behavior at an arbitrary point in time, T0. Let’s assume both of these two applications are Jupyter notebooks on which an ML engineer is experimenting.


We assume that the Kiwi Scheduler is enabled. Therefore, when GPU bursts overlap, the scheduler serializes work on the physical GPU, giving exclusive access to one application at a time.

Let’s examine what happens at each point in time:

  1. T0-T1:

    • Application A is doing CPU work, for example data preprocessing.
    • Application B is idle. The developer might be tweaking their code or taking a break.
  2. T1:

    At point T1, application A starts running a cell that does GPU computations. It requests the GPU from the Kiwi Scheduler, and since no other application is using it at the moment, the scheduler immediately grants it access for TQ seconds.

  3. T1-T2:

    • Application A runs GPU code.
    • Application B runs CPU code.
  4. T2:

    Application B wants to run GPU code, so it requests access from the Kiwi Scheduler. However, the scheduler has already granted access to A, so B has to wait either for the TQ to elapse or for application A to release the GPU early, which happens if A's GPU burst is shorter than TQ seconds.

  5. T2-T3:

    • Application A runs GPU code.
    • Application B waits for the GPU.
  6. T3 (T1 + TQ):

    The time quantum (TQ) elapses.

    • Application A relinquishes the GPU. Since it still has GPU work to run, it requests the GPU again from the Kiwi Scheduler and enters a waiting state.
    • The Kiwi Scheduler gives access to application B.
  7. T3-T4:

    • Application A waits for the GPU.
    • Application B runs GPU code.
  8. T4 (< T3 + TQ):

    • Application B no longer needs to run GPU work. Since it did not need the whole TQ to finish its GPU burst, it relinquishes the GPU early.
    • The Kiwi Scheduler gives exclusive access to application A once more.
  9. T > T4:

    There are no more overlapping GPU bursts, so the applications do not have to wait for access to the GPU. They run both their CPU and GPU parts unhindered.
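
The timeline above can be reproduced with a small simulation. This is a hedged sketch rather than Kiwi's actual logic: the function `simulate`, the request times, and the burst lengths are all hypothetical, chosen so that T1 = 10, T2 = 14, and TQ = 10.

```python
from collections import deque


def simulate(requests, tq):
    """Simulate serialized GPU access with a time quantum.

    requests: list of (app, request_time, burst_seconds) tuples.
    tq: time quantum in seconds.
    Returns a list of (app, start, end) slots of exclusive GPU access.
    """
    pending = deque(sorted(requests, key=lambda r: r[1]))
    waiting = deque()
    now = 0
    timeline = []

    def admit():
        # Move every request that has arrived by `now` into the wait queue.
        while pending and pending[0][1] <= now:
            app, _, burst = pending.popleft()
            waiting.append((app, burst))

    while pending or waiting:
        admit()
        if not waiting:
            # The GPU is free: jump to the next request.
            now = pending[0][1]
            continue
        app, remaining = waiting.popleft()
        # Grant exclusive access for one TQ, or less if the burst ends early.
        run = min(remaining, tq)
        timeline.append((app, now, now + run))
        now += run
        # Applications that asked during this slot are served first.
        admit()
        if remaining > run:
            # The preempted application requests the GPU again and waits.
            waiting.append((app, remaining - run))
    return timeline


# A requests the GPU at T1 = 10 with a 15 s burst (longer than TQ).
# B requests the GPU at T2 = 14 with a 4 s burst (shorter than TQ).
for slot in simulate([("A", 10, 15), ("B", 14, 4)], tq=10):
    print(slot)
```

In this run, A holds the GPU from 10 to 20 (T1 to T3 = T1 + TQ), B holds it from 20 to 24 (T3 to T4, which is earlier than T3 + TQ), and A then resumes from 24 until its remaining work completes.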