Recover when Rok Runs Out of Snapshot Space

In order to take snapshots of PersistentVolumeClaims, Rok needs dedicated space for storing transient data before uploading snapshots to the object storage service.

This transient data consists of the disk blocks that have changed since the previous snapshot of the volume.

The size of this snapshot space determines the maximum amount of changed volume data Rok can snapshot.

For example, if the available Rok Snapshot Space is 200GiB and you write 250GiB between two consecutive snapshots of a volume, you won’t be able to snapshot the volume unless you increase the size of the Rok Snapshot Space by changing the Disk Setup Script of Rok Disk Manager.

Currently, however, any changes to the Disk Setup Script of Rok Disk Manager only affect new nodes, not existing ones.

This guide will walk you through temporarily increasing the available Rok Snapshot Space of an existing node, to recover from situations where you have written more data than Rok can snapshot.

To achieve this, you will create a volume (for example, an EBS volume) twice the size of the node’s Rok storage space, attach it to the node, and use it to extend the available Rok Snapshot Space.

Important

Note that this is a one-time-only procedure and you have to drain and remove the node afterwards. If you fail to remove the node, Rok will be unusable on it after a node reboot, since Rok cannot currently handle the extra disk you will use to extend the Rok Volume Group.
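
Before you start, you can check how much Rok Snapshot Space a node currently has. The following is a minimal sketch; it assumes you are working from a rok-tools shell and that RDM holds the name of the node’s Rok Disk Manager Pod, which the Check Your Environment section below shows how to set:

    root@rok-tools:~# # Show the total, used, and available Rok Snapshot Space on the node.
    root@rok-tools:~# kubectl exec -n rok-system ${RDM:?} -- \
    >     df -h /mnt/data/rok/ --output=file,size,used,avail,pcent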

Check Your Environment

  1. Identify the PVC involved in the failure. Choose one of the following options, based on whether a VolumeSnapshot or the unpinning of a PVC has failed:

    Option 1: A VolumeSnapshot has failed.

    1. Specify the name of the VolumeSnapshot:

      root@rok-tools:~# export VS=<VS_NAME>

      Replace <VS_NAME> with the name of the VolumeSnapshot, for example:

      root@rok-tools:~# export VS=test-notebook-workspace-jxjpm-snap
    2. Specify the namespace of the VolumeSnapshot:

      root@rok-tools:~# export NAMESPACE=<VS_NAMESPACE>

      Replace <VS_NAMESPACE> with the namespace of the VolumeSnapshot, for example:

      root@rok-tools:~# export NAMESPACE=kubeflow-user
    3. Get the PVC that is used as the source of the VolumeSnapshot:

      root@rok-tools:~# export PVC=$(kubectl get volumesnapshot ${VS:?} \
      >     -n ${NAMESPACE:?} -o jsonpath={.spec.source.persistentVolumeClaimName})
    Option 2: The unpinning of a PVC has failed.

    1. Specify the name of the PVC:

      root@rok-tools:~# export PVC=<PVC_NAME>

      Replace <PVC_NAME> with the name of the PVC, for example:

      root@rok-tools:~# export PVC=test-notebook-workspace-jxjpm
    2. Specify the namespace of the PVC:

      root@rok-tools:~# export NAMESPACE=<PVC_NAMESPACE>

      Replace <PVC_NAMESPACE> with the namespace of the PVC, for example:

      root@rok-tools:~# export NAMESPACE=kubeflow-user
  2. Get the PV that is bound to the PVC:

    root@rok-tools:~/ops/deployments# export PV=$(kubectl get pvc \
    >     -n ${NAMESPACE:?} ${PVC:?} -o jsonpath={.spec.volumeName})
  3. Find the node where the volume lives:

    root@rok-tools:~# export NODE=$(kubectl get pv ${PV:?} -o json \
    >     | jq -r '.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[].values[]')
  4. Retrieve the name of the Rok Disk Manager Pod running on the node:

    root@rok-tools:~# export RDM=$(kubectl get pod -n rok-system \
    >     --field-selector spec.nodeName==${NODE:?} -l name==rok-disk-manager \
    >     -o custom-columns=NAME:.metadata.name --no-headers)
  5. Get the total size of the Rok Snapshots filesystem and note it down, as you are going to need it later:

    root@rok-tools:~# kubectl exec -n rok-system ${RDM:?} -- df -h /mnt/data/rok/ --output=file,size
    File            Size
    /mnt/data/rok/  164G
  6. Describe the events on the corresponding resource. Choose one of the following options, based on whether a VolumeSnapshot or the unpinning of a PVC has failed:

    Option 1: Describe the VolumeSnapshot. If the node doesn’t have enough free snapshot space to snapshot the volume, you will see events of type Warning with reason JobFailed containing SCSI command ... (errno: No space left on device (28)) in their message, like the following:

    root@rok-tools:~# kubectl describe volumesnapshot -n ${NAMESPACE:?} ${VS:?}
    Events:
      Type     Reason            Age    From                 Message
      ----     ------            ----   ----                 -------
      Normal   CreatingSnapshot  10m    snapshot-controller  Waiting for a snapshot kubeflow-user/test-notebook-workspace-jxjpm-snap to be created by the CSI driver.
      Normal   JobStarted        10m    rok-csi              Starting job `493cb10c-6110-46b4-8bb8-cd017bce9f14' (worker `ip-192-168-148-206.eu-central-1.compute.internal')...
      Normal   PROGRESS          10m    rok-csi              [ 1%] Retrieving volume metadata... Taking reference to volume's internal snapshot... Suspending volume...
      Normal   SnapshotCreated   10m    snapshot-controller  Snapshot kubeflow-user/test-notebook-workspace-jxjpm-snap was successfully created by the CSI driver.
      Normal   INFO              10m    rok-csi              Performing pre-snapshot verification for device `/dev/mapper/roklvm-ad183a25-6220-43be-876c-53070c82f093-era'... Successfully completed pre-snapshot verification
      Normal   PROGRESS          10m    rok-csi              [ 12%] Taking data snapshot... Taking dm-era metadata snapshot...
      Normal   INFO              10m    rok-csi              Performing post-snapshot verification of snapshot device `/dev/mapper/roklvm-c1fcff07-8459-4f25-a48c-d0930b0c5047-data-snap'... Successfully completed post-snapshot verification of snapshot device `/dev/mapper/roklvm-c1fcff07-8459-4f25-a48c-d0930b0c5047-data-snap'
      Normal   PROGRESS          10m    rok-csi              [ 31%] Uploading snapshot... Retrieving volume metadata... Cloning old snapshot... Retrieving list of changed blocks... Dropping metadata snapshot... Filtering out unused FS blocks...
      Normal   INFO              10m    rok-csi              Found 51200 blocks of size 4MiB (200GiB) reported as changed by the underlying storage device Found 52424704 blocks of size 4KiB (200GiB) in use by the filesystem
      Normal   PROGRESS          9m59s  rok-csi              [ 31%] About to copy 52424704 changed blocks of size 4KiB (200GiB)...
      Normal   PROGRESS          8m42s  rok-csi              [ 34%] Copying 52424704 changed blocks of size 4KiB (200GiB)... (20%)
      Normal   PROGRESS          7m19s  rok-csi              [ 38%] Copying 52424704 changed blocks of size 4KiB (200GiB)... (40%)
      Normal   PROGRESS          5m56s  rok-csi              [ 42%] Copying 52424704 changed blocks of size 4KiB (200GiB)... (60%)
      Normal   PROGRESS          4m34s  rok-csi              [ 45%] Copying 52424704 changed blocks of size 4KiB (200GiB)... (80%)
      Warning  JobFailed         4m26s  rok-csi              Job Failed: SCSI command 0x8A(SCSI_OPCODE_WRITE16) failed with status `SCSI_STATUS_CHECK_CONDITION' and KCQ `SCSI_KCQ_LOGICAL_UNIT_NOT_READY_SPACE_ALLOCATION_IN_PROGRESS' (errno: No space left on device (28)): Run `kubectl logs -n rok rok-csi-node-ndlvl -c csi-node' for more information

    Option 2: Describe the PVC. If the node doesn’t have enough free snapshot space to snapshot the volume, you will see events of type Warning with reason JobFailed containing SCSI command ... (errno: No space left on device (28)) in their message, like the following:

    root@rok-tools:~# kubectl describe pvc -n ${NAMESPACE:?} ${PVC:?}
    Events:
      Type     Reason      Age    From     Message
      ----     ------      ----   ----     -------
      Normal   INFO        51m    rok-csi  Marking PVC 'kubeflow-user/test-notebook-workspace-jxjpm' (PV 'pvc-894803e1-c992-4c05-be25-f79e7f9d8873'), bound on cordoned node 'ip-192-168-185-67.eu-central-1.compute.internal', for unpinning (unpin reason: 'NodeCordoned')...
      Normal   INFO        51m    rok-csi  Unpinning PVC `test-notebook-workspace-jxjpm' (PV `pvc-894803e1-c992-4c05-be25-f79e7f9d8873') from node `ip-192-168-185-67.eu-central-1.compute.internal'...
      Normal   JobStarted  51m    rok-csi  Starting job `26143f83-791b-4496-a180-c6c279184657' (worker `ip-192-168-185-67.eu-central-1.compute.internal')...
      Normal   PROGRESS    51m    rok-csi  [ 1%] Retrieving volume metadata... Taking reference to volume's internal snapshot... Suspending volume...
      Normal   INFO        51m    rok-csi  PV `pvc-894803e1-c992-4c05-be25-f79e7f9d8873' already marked for unpinning
      Normal   INFO        51m    rok-csi  Performing pre-snapshot verification for device `/dev/mapper/roklvm-de6adfc5-be4a-4d45-938d-17deb81f4b7c-era'... Successfully completed pre-snapshot verification
      Normal   PROGRESS    51m    rok-csi  [ 12%] Taking data snapshot... Taking dm-era metadata snapshot...
      Normal   INFO        51m    rok-csi  Performing post-snapshot verification of snapshot device `/dev/mapper/roklvm-27bbcd91-4ad3-43b8-932b-e7e36b39337b-data-snap'... Successfully completed post-snapshot verification of snapshot device `/dev/mapper/roklvm-27bbcd91-4ad3-43b8-932b-e7e36b39337b-data-snap'
      Normal   PROGRESS    51m    rok-csi  [ 31%] Uploading snapshot... Retrieving volume metadata... Cloning old snapshot... Retrieving list of changed blocks... Dropping metadata snapshot... Filtering out unused FS blocks...
      Normal   INFO        51m    rok-csi  Found 52836 blocks of size 4MiB (206GiB) reported as changed by the underlying storage device Found 53941051 blocks of size 4KiB (206GiB) in use by the filesystem
      Normal   PROGRESS    51m    rok-csi  [ 31%] About to copy 53941051 changed blocks of size 4KiB (206GiB)...
      Normal   PROGRESS    50m    rok-csi  [ 34%] Copying 53941051 changed blocks of size 4KiB (206GiB)... (20%)
      Normal   PROGRESS    48m    rok-csi  [ 38%] Copying 53941051 changed blocks of size 4KiB (206GiB)... (40%)
      Normal   PROGRESS    47m    rok-csi  [ 42%] Copying 53941051 changed blocks of size 4KiB (206GiB)... (60%)
      Warning  JobFailed   45m    rok-csi  Job Failed: SCSI command 0x8A(SCSI_OPCODE_WRITE16) failed with status `SCSI_STATUS_CHECK_CONDITION' and KCQ `SCSI_KCQ_LOGICAL_UNIT_NOT_READY_SPACE_ALLOCATION_IN_PROGRESS' (errno: No space left on device (28)): Run `kubectl logs -n rok rok-csi-node-xrdw6 -c csi-node' for more information
      ...
      Normal   INFO        32s    rok-csi  Marking PVC 'kubeflow-user/test-notebook-workspace-jxjpm' (PV 'pvc-894803e1-c992-4c05-be25-f79e7f9d8873'), bound on cordoned node 'ip-192-168-185-67.eu-central-1.compute.internal', for unpinning (unpin reason: 'NodeCordoned')...
      Normal   JobStarted  32s    rok-csi  Starting job `8aca7bd2-3f70-4563-8a74-84dddf3647ce' (worker `ip-192-168-185-67.eu-central-1.compute.internal')...
      Normal   INFO        32s    rok-csi  Unpinning PVC `test-notebook-workspace-jxjpm' (PV `pvc-894803e1-c992-4c05-be25-f79e7f9d8873') from node `ip-192-168-185-67.eu-central-1.compute.internal'...
      Normal   INFO        31s    rok-csi  PV `pvc-894803e1-c992-4c05-be25-f79e7f9d8873' already marked for unpinning
      Normal   PROGRESS    31s    rok-csi  [ 1%] Retrieving volume metadata... Taking reference to volume's internal snapshot... Suspending volume...
      Normal   INFO        30s    rok-csi  Performing pre-snapshot verification for device `/dev/mapper/roklvm-de6adfc5-be4a-4d45-938d-17deb81f4b7c-era'... Successfully completed pre-snapshot verification
      Normal   PROGRESS    29s    rok-csi  [ 12%] Taking data snapshot... Taking dm-era metadata snapshot...
      Normal   INFO        28s    rok-csi  Performing post-snapshot verification of snapshot device `/dev/mapper/roklvm-1bc96dda-017f-45f6-825c-4a11dba9a78e-data-snap'... Successfully completed post-snapshot verification of snapshot device `/dev/mapper/roklvm-1bc96dda-017f-45f6-825c-4a11dba9a78e-data-snap'
      Normal   PROGRESS    27s    rok-csi  [ 21%] Uploading snapshot... Retrieving volume metadata... Cloning old snapshot...
      Warning  JobFailed   26s    rok-csi  Job Failed: SCSI command 0xC3(SCSI_OPCODE_ROK_CLONE_FISK) failed with status `SCSI_STATUS_CHECK_CONDITION' and KCQ `SCSI_KCQ_LOGICAL_UNIT_NOT_READY_SPACE_ALLOCATION_IN_PROGRESS' (errno: No space left on device (28)): Run `kubectl logs -n rok rok-csi-node-xrdw6 -c csi-node' for more information
  7. Verify that the data written to the volume since its previous snapshot exceeds the total size of the Rok Snapshots filesystem. Look for the following event in the output of the previous step to determine how much data has changed since the previous snapshot:

    Normal PROGRESS 51m rok-csi [ 31%] About to copy 53941051 changed blocks of size 4KiB (206GiB)...

    If the amount of changed data is less than the size of the Rok Snapshots filesystem you noted down earlier, you do not need to increase the Rok Snapshot Space. Wait until Rok GC runs on the node and frees up space by deleting old snapshots (see the monitoring sketch after this list).

    If it is more, continue with the Procedure section of this guide.

    Troubleshooting

    The aforementioned event doesn’t exist in the output of the previous step.

    Kubernetes removes events if they are older than a specific age (usually 1 hour). As a result, if you can’t find such an event, it might be that Kubernetes has removed it.

    Continue with the following steps:

    1. Get the usage of the Rok Snapshots filesystem and verify that the node is out of space, that is, that the usage of the Rok Snapshots filesystem is at 100%:

      root@rok-tools:~# kubectl exec -n rok-system ${RDM:?} -- df -h /mnt/data/rok/ --output=file,pcent
      File            Use%
      /mnt/data/rok/  100%
    2. Get the volume size:

      root@rok-tools:~# kubectl get pv ${PV:?} -o json | jq -r '.spec.capacity.storage'
      300Gi
    3. Get the total size of the Rok Snapshots filesystem:

      root@rok-tools:~# kubectl exec -n rok-system ${RDM:?} -- df -h /mnt/data/rok/ --output=file,size
      File            Size
      /mnt/data/rok/  164G
    4. Verify that the volume size is larger than the total size of the Rok Snapshots filesystem (see the comparison sketch after this list).

    Note

    The fact that the usage of the filesystem is at 100% or that the size of the volume is larger than the size of the filesystem doesn’t necessarily mean that there is not enough space to snapshot the volume.

    The filesystem might be out of space until Rok GC runs, and after that snapshots will recover.

    It all depends on whether the data written to the volume since its previous snapshot is more than the size of the Rok Snapshots filesystem.

    However, if the issue persists for a long time, these two checks are good indications that you need more Rok Snapshot Space to snapshot the volume.
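
If you decided in step 7 to wait for Rok GC to free up space, the following is a minimal monitoring sketch. It reuses the RDM variable from step 4 and the watch utility already available in rok-tools; the 10-second interval is only a suggestion:

    root@rok-tools:~# # Refresh the usage of the Rok Snapshots filesystem every 10 seconds.
    root@rok-tools:~# # Interrupt with Ctrl-C once Use% drops well below 100%.
    root@rok-tools:~# watch -n 10 kubectl exec -n rok-system ${RDM:?} -- \
    >     df -h /mnt/data/rok/ --output=file,size,used,avail,pcent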
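
As a convenience, the following comparison sketch pulls together the troubleshooting checks above and prints the size of the volume next to the size and usage of the Rok Snapshots filesystem, so you can compare them at a glance. It reuses the PV and RDM variables you set earlier; the exact output depends on your cluster:

    root@rok-tools:~# # Print the capacity of the PV backing the failed PVC...
    root@rok-tools:~# echo "Volume size: $(kubectl get pv ${PV:?} -o jsonpath={.spec.capacity.storage})"
    root@rok-tools:~# # ...and the total size and usage of the Rok Snapshots filesystem.
    root@rok-tools:~# kubectl exec -n rok-system ${RDM:?} -- \
    >     df -h /mnt/data/rok/ --output=file,size,used,pcent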

Procedure

  1. Identify the PVC involved in the failure. Choose one of the following options, based on whether a VolumeSnapshot or the unpinning of a PVC has failed:

    Option 1: A VolumeSnapshot has failed.

    1. Specify the name of the VolumeSnapshot:

      root@rok-tools:~# export VS=<VS_NAME>

      Replace <VS_NAME> with the name of the VolumeSnapshot, for example:

      root@rok-tools:~# export VS=test-notebook-workspace-jxjpm-snap
    2. Specify the namespace of the VolumeSnapshot:

      root@rok-tools:~# export NAMESPACE=<VS_NAMESPACE>

      Replace <VS_NAMESPACE> with the namespace of the VolumeSnapshot, for example:

      root@rok-tools:~# export NAMESPACE=kubeflow-user
    3. Get the PVC that is used as the source of the VolumeSnapshot:

      root@rok-tools:~# export PVC=$(kubectl get volumesnapshot ${VS:?} \
      >     -n ${NAMESPACE:?} -o jsonpath={.spec.source.persistentVolumeClaimName})
    Option 2: The unpinning of a PVC has failed.

    1. Specify the name of the PVC:

      root@rok-tools:~# export PVC=<PVC_NAME>

      Replace <PVC_NAME> with the name of the PVC, for example:

      root@rok-tools:~# export PVC=test-notebook-workspace-jxjpm
    2. Specify the namespace of the PVC:

      root@rok-tools:~# export NAMESPACE=<PVC_NAMESPACE>

      Replace <PVC_NAMESPACE> with the namespace of the PVC, for example:

      root@rok-tools:~# export NAMESPACE=kubeflow-user
  2. Get the PV that is bound to the PVC:

    root@rok-tools:~/ops/deployments# export PV=$(kubectl get pvc \
    >     -n ${NAMESPACE:?} ${PVC:?} -o jsonpath={.spec.volumeName})
  3. Find the node where the volume lives:

    root@rok-tools:~# export NODE=$(kubectl get pv ${PV:?} -o json \
    >     | jq -r '.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[].values[]')
  4. Set the name of the block device that you are going to use for the new volume on the node:

    root@rok-tools:~# export DEVICE="/dev/sdx"
  5. Retrieve the availability zone of the node:

    root@rok-tools:~# export ZONE=$(kubectl get nodes ${NODE:?} \
    >     -ojsonpath="{.metadata.labels.topology\.kubernetes\.io/zone}")
  6. Retrieve the instance ID of the node. Choose one of the following options, based on your platform.

    root@rok-tools:~# export INSTANCE=$(kubectl get nodes ${NODE:?} \
    >     -o jsonpath={.spec.providerID} | sed 's|aws:///.*/||')

    For platforms other than AWS, this section is a work in progress.

  7. Retrieve the name of the Rok Disk Manager Pod running on the node:

    root@rok-tools:~# export RDM=$(kubectl get pod -n rok-system \
    >     --field-selector spec.nodeName==${NODE:?} -l name==rok-disk-manager \
    >     -o custom-columns=NAME:.metadata.name --no-headers)
  8. Retrieve the name of the Rok CSI Node Pod running on the node:

    root@rok-tools:~# export CSI_NODE=$(kubectl get pod -n rok \
    >     --field-selector spec.nodeName==${NODE:?} -l app==rok-csi-node -o \
    >     custom-columns=NAME:.metadata.name --no-headers)
  9. Set the size of the new volume to twice the size of the node’s Rok storage space:

    root@rok-tools:~# export VOLUME_SIZE=$(( $(kubectl exec -n rok-system \
    >     ${RDM:?} -- vgs rokvg --units G -o size --noheadings --nosuffix \
    >     | xargs -n1 printf "%.0f") * 2 ))
  10. Cordon the node to ensure no new volumes are provisioned on it:

    root@rok-tools:~# kubectl cordon ${NODE:?}
    node/ip-192-168-185-67.eu-central-1.compute.internal cordoned
  11. Create the new volume which you will use to extend the Rok Snapshot Space. Choose one of the following options, based on your platform.

    root@rok-tools:~# aws ec2 create-volume --volume-type gp2 \
    >     --size ${VOLUME_SIZE:?} --availability-zone ${ZONE:?} \
    >     --tag-specifications "ResourceType=volume,Tags=[{Key="rok.arrikto.com/node",Value=${NODE:?}}]"
    {
        "AvailabilityZone": "eu-central-1a",
        "CreateTime": "2023-03-03T10:30:35+00:00",
        "Encrypted": false,
        "Size": 1200,
        "SnapshotId": "",
        "State": "creating",
        "VolumeId": "vol-022f9cef32682337f",
        "Iops": 3600,
        "Tags": [
            {
                "Key": "rok.arrikto.com/node",
                "Value": "ip-192-168-185-67.eu-central-1.compute.internal"
            }
        ],
        "VolumeType": "gp2",
        "MultiAttachEnabled": false
    }

    For platforms other than AWS, this section is a work in progress.

  12. Retrieve the ID of the volume. Choose one of the following options, based on your platform.

    root@rok-tools:~# export VOLUME_ID=$(aws ec2 describe-volumes \
    >     --filters Name=tag:"rok.arrikto.com/node",Values="${NODE:?}" \
    >     | jq -r '.Volumes[] | .VolumeId')

    For platforms other than AWS, this section is a work in progress.

  13. Attach the volume to the node. Choose one of the following options, based on your platform.

    root@rok-tools:~# aws ec2 attach-volume \
    >     --volume-id ${VOLUME_ID:?} \
    >     --device ${DEVICE:?} \
    >     --instance-id ${INSTANCE:?}
    {
        "AttachTime": "2023-03-03T10:52:24.079000+00:00",
        "Device": "/dev/sdx",
        "InstanceId": "i-09c640bd00df1a90e",
        "State": "attaching",
        "VolumeId": "vol-022f9cef32682337f"
    }

    For platforms other than AWS, this section is a work in progress.

  14. Wait until the volume has been attached to the node. Choose one of the following options, based on your platform.

    Wait until the State field of the output becomes attached:

    root@rok-tools:~# watch aws ec2 describe-volumes --volume-ids ${VOLUME_ID:?} --query Volumes[0].Attachments[0].{State:State}
    Every 2.0s: aws ec2 describe-volumes --volume-ids vol-022f9cef32682337f --query Volumes[0].Attachments[0].{State:State}

    {
        "State": "attached"
    }

    For platforms other than AWS, this section is a work in progress.

  15. Enable volume deletion upon instance termination. Choose one of the following options, based on your platform.

    root@rok-tools:~# aws ec2 modify-instance-attribute \
    >     --instance-id ${INSTANCE:?} \
    >     --block-device-mappings "[{\"DeviceName\":\"${DEVICE:?}\",\"Ebs\":{\"DeleteOnTermination\":true,\"VolumeId\":\"${VOLUME_ID:?}\"}}]"

    For platforms other than AWS, this section is a work in progress.

  16. Wait until delete on termination has been enabled. Choose one of the following options, based on your platform.

    Wait until the DeleteOnTermination field of the output becomes true. This might take a while:

    root@rok-tools:~# watch aws ec2 describe-volumes --volume-ids ${VOLUME_ID:?} \
    >     --query Volumes[0].Attachments[0].{DeleteOnTermination:DeleteOnTermination}
    Every 2.0s: aws ec2 describe-volumes --volume-ids vol-022f9cef32682337f --query Volumes[0].Attachments[0].{DeleteOnTermination:DeleteOnTermination}

    {
        "DeleteOnTermination": true
    }

    For platforms other than AWS, this section is a work in progress.

  17. Extend the Volume Group used by Rok with the volume you just attached to the node:

    root@rok-tools:~# kubectl exec -n rok ${CSI_NODE:?} -c csi-node -- vgextend rokvg ${DEVICE:?}
    Physical volume "/dev/sdx" successfully created.
    Volume group "rokvg" successfully extended
  18. Extend the Logical Volume backing the Rok Snapshots filesystem:

    root@rok-tools:~# kubectl exec -n rok ${CSI_NODE:?} -c csi-node -- lvextend rokvg/rok-fisks -L +${VOLUME_SIZE:?}G
    Size of logical volume rokvg/rok-fisks changed from 167.56 GiB (42896 extents) to <1.34 TiB (350096 extents).
    Logical volume rokvg/rok-fisks successfully resized.
  19. Grow the Rok Snapshots filesystem:

    root@rok-tools:~# kubectl exec -n rok-system ${RDM:?} -- resize2fs /dev/rokvg/rok-fisks
    resize2fs 1.44.5 (15-Dec-2018)
    Filesystem at /dev/rokvg/rok-fisks is mounted on /mnt/data; on-line resizing required
    old_desc_blocks = 21, new_desc_blocks = 171
    The filesystem on /dev/rokvg/rok-fisks is now 358498304 (4k) blocks long.
  20. Drain and remove the node. Rok CSI will snapshot and unpin the volume, and, since you have extended the Rok Snapshot Space, the snapshot should succeed and the drain operation should complete successfully after a while (see the verification and cleanup sketches after this list). Choose one of the following options, based on your platform.

    Important

    Failure to remove the node will render Rok unusable on it after a node reboot, since Rok cannot handle the extra disk we used to extend the Rok Volume Group.

    Follow the Scale In EKS Cluster guide to drain and remove the node.

    For platforms other than AWS, this section is a work in progress.
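
Before draining the node, you may want to verify that the Rok Snapshots filesystem has actually grown, and, if a VolumeSnapshot had failed, confirm afterwards that it eventually completes. The following is a minimal verification sketch; it reuses the RDM, NAMESPACE, and VS variables from this guide, and readyToUse is the standard status field of a VolumeSnapshot:

    root@rok-tools:~# # Confirm that the Rok Snapshots filesystem now reflects the extra space.
    root@rok-tools:~# kubectl exec -n rok-system ${RDM:?} -- \
    >     df -h /mnt/data/rok/ --output=file,size,avail
    root@rok-tools:~# # If a VolumeSnapshot had failed, check that it eventually becomes ready (true).
    root@rok-tools:~# kubectl get volumesnapshot -n ${NAMESPACE:?} ${VS:?} \
    >     -o jsonpath={.status.readyToUse}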
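
After the node has been removed, you can optionally verify that AWS also deleted the temporary EBS volume along with the instance, since you enabled delete on termination in step 15. This cleanup sketch applies to AWS only and reuses the VOLUME_ID variable from step 12:

    root@rok-tools:~# # Expect an InvalidVolume.NotFound error once the volume has been deleted.
    root@rok-tools:~# aws ec2 describe-volumes --volume-ids ${VOLUME_ID:?}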

Summary

You have successfully snapshotted and unpinned a volume with more changed data than the configured size of Rok Snapshot Space.

What’s Next

Check out the rest of the maintenance operations that you can perform on your cluster.