Rok Disk Manager

When you install Arrikto Enterprise Kubeflow, you also deploy Rok Disk Manager (RDM), a component that runs on all nodes of your Kubernetes cluster and prepares the underlying storage for Rok.

This guide describes the behavior of Rok Disk Manager based on its default configuration for each one of the supported cloud platforms (AWS, Azure, Google Cloud). It also contains commands that you can run to inspect the state of the storage resources that Rok Disk Manager creates in your Kubernetes cluster and, thus, gain a deeper understanding of how it operates internally.

To follow the commands in this guide, you will need access to a rok-tools management environment in your Kubernetes cluster.

Contact Arrikto

Making changes to the default configuration of RDM is an advanced operation that affects your cluster data. If, for any reason, you wish to modify the default configuration of RDM, you should first coordinate with Arrikto to do so.

Introduction

Rok Disk Manager runs as a DaemonSet on each one of your cluster nodes, decides which disks to manage, and configures them for Rok. To do so, RDM periodically applies a Python-like script that contains a declarative disk configuration.

For each one of the supported cloud platforms, RDM runs a slightly different script that depends on the underlying infrastructure.
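Note

Follow Along: You can confirm that the Rok Disk Manager DaemonSet is up and running one Pod per node directly with kubectl. This is a minimal example that assumes the default namespace and DaemonSet name used throughout this guide; the output is omitted, since it depends on your cluster:

root@rok-tools:~# kubectl get ds -n rok-system rok-disk-manager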

Selecting Disks

Rok Disk Manager consults the disk management script to select which of the available disks it will manage on every node. Rok will exclusively use these disks to provision volumes and take snapshots on Kubernetes.

To this end, RDM selects both

  • fast, ephemeral disks that are bound to the node (for example, local NVMe SSDs), and
  • slower, persistent ones that do not depend on the lifetime of the node (for example, EBS volumes, Google Persistent Disks, etc.).

Note

Working with both ephemeral and persistent disks is a prerequisite for Rok to run on heterogeneous clusters, that is, clusters whose nodes have disks of different types attached to them.

It is important to note that RDM does not manage disks that Rok should not use at all, that is, disks that contain critical system data. These are the following (the note after this list shows how to list all block devices on a node):

  • The root disk, which contains the root filesystem of the node.
  • Any disks attached to the node as a result of volume provisioning on Kubernetes. For example, requesting a PersistentVolumeClaim with the default storage class of the cloud platform leads to the creation of a PersistentVolume that is backed by a persistent disk (for example, an EBS volume, an Azure data disk, a Google Persistent Disk, etc.). Ultimately, this persistent disk appears at a well-known location inside the filesystem of the node.
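Note

Follow Along: To see every block device on a node, including the root disk and any Kubernetes-provisioned persistent disks that RDM leaves untouched, you can run lsblk from within an RDM Pod. This is a minimal example; the Pod name in the prompt is illustrative and the output depends entirely on your nodes:

root@rok-disk-manager-nxp2v:/# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT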

The default RDM configuration comes with different disk requirements and decisions for each one of the supported cloud platforms, due to the heterogeneity of the underlying infrastructure. The requirements and decisions for each platform are the following:

AWS (EKS)

  • Local NVMe SSDs for Rok must appear under /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage* on each cluster node that supports local NVMe storage.
  • Extra EBS volumes for Rok must appear under /dev/sd[f-p] on each cluster node. The admin of the EKS cluster must follow the official Amazon recommendations on naming storage devices when they add extra EBS volumes for Rok.

Azure (AKS)

  • Local NVMe SSDs for Rok must appear under /dev/disk/by-id/nvme-Microsoft_NVMe_Direct_Disk* on each cluster node that supports local NVMe storage.
  • Extra data disks for Rok must appear under /dev/disk/azure/scsi[0-3]/lun6[0-3] on each cluster node. The admin of the AKS cluster must configure Azure to assign extra data disks to well-known SCSI controllers with IDs 0-3 and add them with LUNs 60-63 on each cluster node.

Google Cloud (GKE)

  • One or more local NVMe SSDs for Rok must appear under /dev/disk/by-id/google-local-nvme-ssd-* or /dev/disk/by-id/google-local-ssd-* on each cluster node. GKE supports adding local SSDs on all instance types. The admin of the GKE cluster must specify the interface (NVMe or SCSI) of the local SSD, based on performance requirements.
  • RDM will ignore all Persistent Disks that are attached to each cluster node, that is, RDM will not manage and Rok will not use any Persistent Disk.


Step-by-Step Analysis

In this section we go through the disk management script that RDM applies, in chunks, grouping semantically related commands together. We explain the rationale behind each group and provide commands to view the state of the storage resources that RDM creates on each of your cluster nodes.

Note

Follow Along: The easiest way to inspect storage resources that RDM creates and manages in every node of your Kubernetes cluster is to start from a rok-tools management environment and exec into a running RDM Pod:

root@rok-tools:~# kubectl exec -ti -n rok-system ds/rok-disk-manager -- /bin/bash
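If you prefer to target the RDM Pod that runs on a specific node, you can first list the Pods of the DaemonSet along with the nodes they run on, and then exec into the one you are interested in. This is a simple example; the output is omitted:

root@rok-tools:~# kubectl get pods -n rok-system -o wide | grep rok-disk-manager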

The disk management script that RDM applies in your cluster performs the following core operations that are common across all supported cloud platforms:

  1. Select Disks for Rok
  2. Assemble RAID Array
  3. Allocate Rok Snapshot Space
  4. Format Rok Snapshot Space

Note

Follow Along: Here is how you can retrieve the disk management script that RDM currently applies in your Kubernetes cluster:

  1. Inspect the ConfigMap to retrieve the Rok Disk Manager script:

    root@rok-tools:~# kubectl get cm -n rok-system disk-script -o jsonpath="{.data.disk-script}"
    nvme = get_disks(devices="/dev/disk/by-id/google-local-nvme-ssd-*");
    scsi = get_disks(devices="/dev/disk/by-id/google-local-ssd-*");
    md = raid("/dev/md/rok-disk-manager:rok", bdevs=nvme + scsi, level=0);
    rokpv = pv(md);
    rokvg = vg("rokvg", pvs=rokpv);
    fiskslv_size = min(200 * GiB, 0.3 * rokvg.size);
    fiskslv = lv(rokvg, "rok-fisks", size=fiskslv_size);
    filesystem = fs(fiskslv, "ext4");
    mountpoint = mount(filesystem, "/mnt/data", persistent=False);
    _ = dir("/mnt/data/rok");

Below, you can view the default disk management script for each one of the supported cloud platforms:

ssds = get_disks(devices="/dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage*"); ebs = get_disks(devices="/dev/sd[f-p]"); md0 = raid("/dev/md0", bdevs=ssds + ebs, level=0); rokpv = pv(md0); rokvg = vg("rokvg", pvs=rokpv); fiskslv_size = min(200 * GiB, 0.3 * rokvg.size); fiskslv = lv(rokvg, "rok-fisks", size=fiskslv_size); filesystem = fs(fiskslv, "ext4"); mountpoint = mount(filesystem, "/mnt/data", persistent=False); _ = dir("/mnt/data/rok");
ssds = get_disks(devices="/dev/disk/by-id/nvme-Microsoft_NVMe_Direct_Disk*"); data_disks = get_disks(devices="/dev/disk/azure/scsi[0-3]/lun6[0-3]"); md = raid("/dev/md/rok-disk-manager:rok", bdevs=ssds + data_disks, level=0); rokpv = pv(md); rokvg = vg("rokvg", pvs=rokpv); fiskslv_size = min(200 * GiB, 0.3 * rokvg.size); fiskslv = lv(rokvg, "rok-fisks", size=fiskslv_size); filesystem = fs(fiskslv, "ext4"); mountpoint = mount(filesystem, "/mnt/data", persistent=False); _ = dir("/mnt/data/rok");
nvme = get_disks(devices="/dev/disk/by-id/google-local-nvme-ssd-*"); scsi = get_disks(devices="/dev/disk/by-id/google-local-ssd-*"); md = raid("/dev/md/rok-disk-manager:rok", bdevs=nvme + scsi, level=0); rokpv = pv(md); rokvg = vg("rokvg", pvs=rokpv); fiskslv_size = min(200 * GiB, 0.3 * rokvg.size); fiskslv = lv(rokvg, "rok-fisks", size=fiskslv_size); filesystem = fs(fiskslv, "ext4"); mountpoint = mount(filesystem, "/mnt/data", persistent=False); _ = dir("/mnt/data/rok");

Select Disks for Rok

Rok Disk Manager uses pattern matching to decide which disks it will manage on each node, given the disk requirements described above. Below, you can view the exact patterns that RDM uses for each one of the supported cloud platforms:

AWS (EKS)

Note

To discover EBS volumes, RDM searches under /dev/sd[f-p]. To discover local NVMe SSDs in a consistent manner, RDM always works with persistent disk identifiers that Amazon creates under /dev/disk/by-id/.

  • Rok uses all local NVMe SSDs on the node by default:

    ssds = get_disks(devices="/dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage*");
  • Rok uses all persistent disks (EBS volumes) under /dev/sd[f-p] on the node by default:

    ebs = get_disks(devices="/dev/sd[f-p]");

Note

Follow Along: Let’s assume an EKS cluster with m5d.4xlarge instances, each having 2 x 300 GB local NVMe SSD. Here is how you can verify that 2 x 279.4 GiB local NVMe SSD are attached to your EKS cluster node:

  1. List all local NVMe SSDs:

    root@rok-disk-manager-nxp2v:/# ls -lah /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage*
    lrwxrwxrwx 1 root root 13 Dec 7 12:31 /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1B36EF6A69359BFC0 -> ../../nvme2n1
    lrwxrwxrwx 1 root root 13 Dec 7 12:31 /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1B36EF6A69359BFC0-ns-1 -> ../../nvme2n1
    lrwxrwxrwx 1 root root 13 Dec 7 12:31 /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS275C52DAB795EAA4A -> ../../nvme1n1
    lrwxrwxrwx 1 root root 13 Dec 7 12:31 /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS275C52DAB795EAA4A-ns-1 -> ../../nvme1n1
  2. List all block devices:

    root@rok-disk-manager-nxp2v:/# lsblk
    NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    nvme1n1 259:0    0 279.4G  0 disk
    ...
    nvme2n1 259:1    0 279.4G  0 disk
    ...

Azure (AKS)

Note

To distinguish between Azure data disks and local SSDs in a consistent manner, RDM always works with persistent disk identifiers that Microsoft creates under /dev/disk/by-id/.

  • Rok uses all local NVMe SSDs on the node by default:

    ssds = get_disks(devices="/dev/disk/by-id/nvme-Microsoft_NVMe_Direct_Disk*");
  • Rok uses all persistent disks (Azure data disks) with LUNs 60-63 under all well-known SCSI controllers on the node by default:

    data_disks = get_disks(devices="/dev/disk/azure/scsi[0-3]/lun6[0-3]");

Note

Follow Along: Let’s assume an AKS cluster with Standard_L8s_v2 instances, each having 1 x 100 GiB data disk and no local NVMe SSDs. Here is how you can verify that 1 x 100 GiB data disk is attached to your AKS cluster node:

  1. List all persistent disks with LUNs 60-63 under all well-known SCSI controllers:

    root@rok-disk-manager-vv2t4:/# ls -lah /dev/disk/azure/scsi*/lun6*
    lrwxrwxrwx 1 root root 12 Dec 15 11:07 /dev/disk/azure/scsi1/lun63 -> ../../../sda
  2. List all block devices:

    root@rok-disk-manager-vv2t4:/# lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sda    8:0    0 100G  0 disk
    ...

Google Cloud (GKE)

Note

To distinguish between NVMe and SCSI local SSDs in a consistent manner, RDM always works with persistent disk identifiers that Google creates under /dev/disk/by-id/.

  • Rok uses all local NVMe SSDs on the node by default:

    nvme = get_disks(devices="/dev/disk/by-id/google-local-nvme-ssd-*");
  • Rok uses all local SCSI SSDs on the node by default:

    scsi = get_disks(devices="/dev/disk/by-id/google-local-ssd-*");

Note

Follow Along: Let’s assume a GKE cluster with n1-standard-8 instances, each having 1 x 375 GiB local NVMe SSD. Here is how you can verify that 1 x 375 GiB local NVMe SSD is attached to your GKE cluster node:

  1. List all local NVMe SSDs:

    root@rok-disk-manager-265n2:/# ls -lah /dev/disk/by-id/google-local-nvme-ssd-*
    lrwxrwxrwx 1 root root 13 Dec 14 11:09 /dev/disk/by-id/google-local-nvme-ssd-0 -> ../../nvme0n1
  2. List all block devices:

    root@rok-disk-manager-265n2:/# lsblk
    NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    ...
    nvme0n1 259:0    0 375G  0 disk

Note

lsblk expresses the size of devices in Gibibytes (GiB).
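For example, this is why the 2 x 300 GB local NVMe SSDs of an m5d.4xlarge instance show up as 2 x 279.4 GiB in the output above: 300 GB equals 300 × 10^9 bytes, which is approximately 279.4 × 2^30 bytes, that is, 279.4 GiB.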

Assemble RAID Array

Rok Disk Manager assembles the previously selected extra disks for Rok into a RAID0 (data striping) array to boost performance.

Important

RDM requires that the extra disks for Rok are of the same size. Using disks of unequal size to assemble the RAID array will cause errors or result in a waste of storage space.

Choose one of the following options to inspect this configuration on your preferred cloud platform:

AWS (EKS)

md0 = raid("/dev/md0", bdevs=ssds + ebs, level=0);

Note

Follow Along: Let’s, again, assume an EKS cluster with m5d.4xlarge instances, each having 2 x 300 GB local NVMe SSD. Here is how you can verify that a RAID0 device with a size of 558.6 GiB appears at /dev/md0 in your EKS cluster node:

  1. List the /dev/md0 block device and verify that its type is raid0:

    root@rok-disk-manager-nxp2v:/# lsblk /dev/md0
    NAME MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    md0    9:0    0 558.6G  0 raid0
    ...
md = raid("/dev/md/rok-disk-manager:rok", bdevs=ssds + data_disks, level=0);

Note

Follow Along: Let’s, again, assume an AKS cluster with Standard_L8s_v2 instances, each having 1 x 100 GiB data disk and no local NVMe SSDs. Here is how you can verify that a RAID0 device with a size of 100 GiB appears at /dev/md/rok-disk-manager:rok and points to an underlying md* device in your AKS cluster node:

  1. List the /dev/md/rok-disk-manager:rok block device and verify that its type is raid0:

    root@rok-disk-manager-vv2t4:/# lsblk /dev/md/rok-disk-manager\:rok
    NAME  MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
    md127   9:127  0 100G  0 raid0
    ...
md = raid("/dev/md/rok-disk-manager:rok", bdevs=nvme + scsi, level=0);

Note

Follow Along: Let’s, again, assume a GKE cluster with n1-standard-8 instances, each having 1 x 375 GiB local NVMe SSD. Here is how you can verify that a RAID0 device with a size of 374.9 GiB appears at /dev/md/rok-disk-manager:rok and points to an underlying md* device in your GKE cluster node:

  1. List the /dev/md/rok-disk-manager:rok block device and verify that its type is raid0:

    root@rok-disk-manager-265n2:/# lsblk /dev/md/rok-disk-manager:rok
    NAME  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    md127   9:127  0 374.9G  0 raid0
    ...
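Note

Follow Along: Regardless of the cloud platform, you can also inspect the RAID array, its level, and its member disks directly through the kernel's md driver. This is a minimal example from within an RDM Pod; the Pod name in the prompt is illustrative and the output is omitted, since it depends on your node's disks:

root@rok-disk-manager-nxp2v:/# cat /proc/mdstat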

Allocate Rok Snapshot Space

When taking snapshots of PersistentVolumeClaims, Rok needs some space to maintain transient data before uploading snapshots to the object storage service. The size of this storage space depends on the snapshot frequency and whether the disk to be snapshotted is mostly read or written.

Important

RDM preallocates space for Rok to store transient snapshot data. This space takes up part of the total storage available on each node, which means that only the remaining space is left as raw storage for Rok to provision volumes on Kubernetes.

Also, when a Rok snapshot operation is active, Rok allocates additional space from the total available storage to store live snapshot data. The size of this space defaults to 10 GiB and is immediately reclaimed once the Rok snapshot operation finishes.

RDM leverages the Logical Volume Manager (LVM) framework to create and manage logical volumes on top of the previously assembled RAID0 array: it first turns the array into an LVM physical volume and then creates a volume group, named rokvg, on top of it. Choose one of the following options to inspect this configuration on your preferred cloud platform:

AWS (EKS)

rokpv = pv(md0);
rokvg = vg("rokvg", pvs=rokpv);

Azure (AKS)

rokpv = pv(md);
rokvg = vg("rokvg", pvs=rokpv);

Google Cloud (GKE)

rokpv = pv(md);
rokvg = vg("rokvg", pvs=rokpv);
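Note

Follow Along: To inspect the LVM physical volume and the rokvg volume group that RDM creates on a node, you can use the standard LVM reporting commands from within an RDM Pod. This is a minimal example and assumes the LVM tools are available inside the container; the Pod name is illustrative and the output is omitted:

root@rok-disk-manager-nxp2v:/# pvs
root@rok-disk-manager-nxp2v:/# vgs rokvg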

RDM has to determine a proper size for the Rok snapshot space based on the characteristics of each environment, that is, the number and size of the extra disks for Rok, the total amount of storage that running applications need, and so on. Therefore, RDM first uses a heuristic to calculate the size of the logical volume that will serve as the Rok snapshot space and then creates the logical volume inside the existing volume group:

fiskslv_size = min(200 * GiB, 0.3 * rokvg.size);
fiskslv = lv(rokvg, "rok-fisks", size=fiskslv_size);
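In other words, the Rok snapshot space is 30% of the volume group size, capped at 200 GiB. For example, on an EKS node with a 558.6 GiB volume group, 0.3 × 558.6 GiB ≈ 167.6 GiB, which is below the 200 GiB cap, so the logical volume gets 167.6 GiB. On an AKS node with a 100 GiB volume group the result is 30 GiB, and on a GKE node with a 374.9 GiB volume group it is roughly 112.5 GiB, which matches the sizes you will see in the examples below.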

Choose one of the following options to inspect this configuration on your preferred cloud platform:

AWS (EKS)

Note

Follow Along: Let’s, again, assume an EKS cluster with m5d.4xlarge instances, each having 2 x 300 GB local NVMe SSD. Here is how you can verify that a logical volume for transient Rok snapshot data exists in your EKS cluster node and that its size is 167.6 GiB:

  1. List the /dev/md0 block device and verify that a device of type lvm exists under it, mounted under /mnt/data:

    root@rok-disk-manager-nxp2v:/# lsblk /dev/md0
    NAME               MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    md0                  9:0    0 558.6G  0 raid0
    `-rokvg-rok--fisks 253:0    0 167.6G  0 lvm   /mnt/data

Azure (AKS)

Note

Follow Along: Let’s, again, assume an AKS cluster with Standard_L8s_v2 instances, each having 1 x 100 GiB data disk and no local NVMe SSDs. Here is how you can verify that a logical volume for transient Rok snapshot data exists in your AKS cluster node and that its size is 30 GiB:

  1. List the /dev/md/rok-disk-manager:rok block device and verify that a device of type lvm exists under it, mounted under /mnt/data:

    root@rok-disk-manager-vv2t4:/# lsblk /dev/md/rok-disk-manager:rok
    NAME               MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
    md127                9:127  0 100G  0 raid0
    `-rokvg-rok--fisks 253:0    0  30G  0 lvm   /mnt/data

Google Cloud (GKE)

Note

Follow Along: Let’s, again, assume a GKE cluster with n1-standard-8 instances, each having 1 x 375 GiB local NVMe SSD. Here is how you can verify that a logical volume for transient Rok snapshot data exists in your GKE cluster node and that its size is 112.5 GiB:

  1. List the /dev/md/rok-disk-manager:rok block device and verify that a device of type lvm exists under it, mounted under /mnt/data:

    root@rok-disk-manager-265n2:/# lsblk /dev/md/rok-disk-manager:rok
    NAME               MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    md127                9:127  0 374.9G  0 raid0
    `-rokvg-rok--fisks 253:0    0 112.5G  0 lvm   /mnt/data

Format Rok Snapshot Space

After creating the necessary logical volume entities, Rok Disk Manager needs to make the space allocated for transient snapshot data available to Rok, that is, format the previously created logical volume and mount it at the location where Rok is configured to find it.

To this end, RDM formats the logical volume with an ext4 filesystem and mounts it under /mnt/data. Finally, it creates the /mnt/data/rok/ subdirectory, which is the default data path that the Rok file daemon uses.

filesystem = fs(fiskslv, "ext4");
mountpoint = mount(filesystem, "/mnt/data", persistent=False);
_ = dir("/mnt/data/rok");
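Note

Follow Along: You can also confirm the filesystem type and size of the Rok snapshot space directly. This is a minimal example from within an RDM Pod; it assumes df is available in the container, the Pod name in the prompt is illustrative, and the exact sizes depend on your node:

root@rok-disk-manager-nxp2v:/# df -hT /mnt/data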

Note

Follow Along: Here is how you can verify that Rok is able to access the space allocated for transient snapshot data under /mnt/data and that the Rok file daemon has successfully adopted the /mnt/data/rok/ subdirectory:

  1. Verify that the logical volume mounted under /mnt/data is properly formatted:

    root@rok-disk-manager-nxp2v:/# ls -lah /mnt/data/
    total 24K
    drwxr-xr-x 4 root root 4.0K Dec 7 12:31 .
    drwxr-xr-x 3 root root   18 Dec 7 12:31 ..
    drwx------ 2 root root  16K Dec 7 12:31 lost+found
    drwxr-xr-x 3 root root 4.0K Dec 7 12:35 rok
  2. Verify that the /mnt/data/rok subdirectory exists and that Rok has successfully adopted it:

    root@rok-disk-manager-nxp2v:/# ls -lah /mnt/data/rok/
    total 12K
    drwxr-xr-x 3 root root 4.0K Dec 7 12:35 .
    drwxr-xr-x 4 root root 4.0K Dec 7 12:31 ..
    -rwxr-xr-x 1 root root    0 Dec 7 12:35 .APP_FISKS
    drwxr-xr-x 5 root root 4.0K Dec 7 12:35 filed

Summary

In this guide you gained insight into how Rok Disk Manager works on each supported cloud platform, how it prepares disks for Rok, and how to inspect the underlying storage resources that RDM creates on every node of your Kubernetes cluster.

What’s Next

To learn more about EKF and its components, check out the rest of our user guides.