Recover RWX Volume After Node Failure

Volumes created by Rok are bound to a specific node. This is true even for RWX volumes: their data live on a specific node, but the volumes are accessible from any node in the cluster.

If that node goes away, for example due to node failure, the volume data become inaccessible.

Rok supports Volume Auto-recovery to automatically recover these volumes and move them to a new node. However, for RWX volumes you have to follow a manual procedure to ensure that all Pods have a consistent view of the volumes and are able to write to them again.

Here’s how Rok automatically recovers a volume:

  • The home node of an RWX volume, that is, the node where the volume data live, goes away due to node failure.

  • Pods using the volume on remote nodes, that is, nodes other than the one where the volume data live, continue to operate, although attempts to read or write to the volume might block until Rok has recovered the volume.

    Important

    Pods on remote nodes access the volume through the network, but cache some of its data locally. This can lead to Pods having an inconsistent view of the volume after Rok recovers it from its latest snapshot.

  • Rok detects the failure and unpins the volume from the failed node.

  • Rok rolls the volume back to its latest snapshot, and recovers it in read-only mode on a new node.

    Important

    Recovering the volume read-only is important to prevent Pods on remote nodes, which might have an inconsistent view of the filesystem due to cached data, from writing to the volume and corrupting it.

  • Existing Pods using the volume cannot write to it, and may have an inconsistent view of the filesystem.

  • Rok fails requests from new Pods to mount the volume until all existing users go away.

To recover from this situation, you have to force a remount of the filesystem in all Pods that use the volume. This ensures that all Pods have a consistent view of the filesystem, so it's safe to mount the volume read-write again and continue normal operation.

Check Your Environment

Identify affected RWX PVCs by checking for any of the following conditions:

  1. Existing Pods using the volume return EROFS (Read-only file system) or EBADF (Bad file descriptor) I/O errors. For a quick way to check this from inside a running Pod, see the sketch after this list.

  2. You can see an event like the following in the events of the RWX PVC. Replace <PVC_NAMESPACE> with the namespace and <PVC_NAME> with the name of the PVC you want to inspect:

    root@rok-tools:/# kubectl describe pvc -n <PVC_NAMESPACE> <PVC_NAME>
    Warning  WARNING  9m59s  rok-csi  About to recover access-server volume
    'pvc-06dad9fe-4a24-4a85-a123-b5426581a856': You will need to stop any Pods
    which are currently using RWX volume 'pvc-6201fa6e-59b4-4d36-a4b8-ac0af27ca4a1'
  3. You can see events like the following in the events of new Pods trying to use the volume. Replace <POD_NAMESPACE> with the namespace and <POD_NAME> with the name of the Pod you want to inspect:

    root@rok-tools:/# kubectl describe pods -n <POD_NAMESPACE> <POD_NAME>
    Warning  FailedMount  6s (x20 over 9m30s)  kubelet  MountVolume.SetUp failed
    for volume "pvc-6201fa6e-59b4-4d36-a4b8-ac0af27ca4a1" : kubernetes.io/csi:
    mounter.SetupAt failed: rpc error: code = FailedPrecondition desc = Cannot
    publish recovered RWX volume 'pvc-6201fa6e-59b4-4d36-a4b8-ac0af27ca4a1' yet:
    You have to stop any Pods which are currently using the volume
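
If you want to confirm the first condition from inside a running Pod, a minimal check, sketched below, is to attempt a write on the volume's mount path. The Pod name and the mount path (/data) are placeholders; adjust them to your workload:

    root@rok-tools:/# kubectl exec -n <POD_NAMESPACE> <POD_NAME> -- \
    >    sh -c 'touch /data/.rok-rw-test && rm /data/.rok-rw-test'

If the volume has been recovered read-only, the command fails with a "Read-only file system" (EROFS) error.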

Procedure

  1. Specify the name of the RWX PVC. Replace <PVC_NAME> with the name of the PVC you want to recover:

    root@rok-tools:/# export PVC_NAME=<PVC_NAME>
  2. Specify the namespace of the RWX PVC. Replace <PVC_NAMESPACE> with the namespace of the PVC:

    root@rok-tools:/# export PVC_NAMESPACE=<PVC_NAMESPACE>
  3. If you are using Kubeflow, stop all Notebooks and Pipelines that use the RWX PVC.
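
    If you are not sure which Notebooks mount the PVC, the following sketch lists them. It assumes the upstream Kubeflow Notebook CRD (notebooks.kubeflow.org), which embeds a Pod template under .spec.template.spec; you can then stop the Notebooks it reports from the Kubeflow UI:

    root@rok-tools:/# kubectl get notebooks.kubeflow.org \
    >    -n ${PVC_NAMESPACE?} \
    >    -ojson \
    >    | jq -r '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "'${PVC_NAME?}'") | .metadata.name'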

  4. List Deployments using the RWX PVC:

    root@rok-tools:/# kubectl get deployments.apps \
    >    -n ${PVC_NAMESPACE?} \
    >    -ojson \
    >    | jq -r '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "'${PVC_NAME?}'") | .metadata.name'
    nginx-deployment
    my-app
  5. Scale down to zero all Deployments that are using the RWX PVC.

    Repeat steps 1-4 below for each of the Deployments in the list shown in the previous step. Alternatively, you can scale all of them down with the loop sketched after these steps.

    1. Pick a Deployment from the list:

      root@rok-tools:/# export DEPLOYMENT=<DEPLOYMENT>
    2. Scale the Deployment down to zero:

      root@rok-tools:/# kubectl scale deployment \
      >    -n ${PVC_NAMESPACE?} \
      >    --replicas=0 \
      >    ${DEPLOYMENT?}
    3. Wait until the Deployment has scaled down to zero:

      root@rok-tools:/# kubectl get deployments.apps \
      >    -n ${PVC_NAMESPACE?} \
      >    ${DEPLOYMENT?}
    4. Go back to step 1, and repeat the steps for the remaining Deployments.
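
    If you have many affected Deployments, the following sketch combines the two commands above into a single loop that scales them all down to zero; note each Deployment's current replica count first, so that you can restore it later. The same pattern applies to the StatefulSets and Pods in the following steps:

    root@rok-tools:/# kubectl get deployments.apps \
    >    -n ${PVC_NAMESPACE?} \
    >    -ojson \
    >    | jq -r '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "'${PVC_NAME?}'") | .metadata.name' \
    >    | while read -r dep; do
    >        kubectl scale deployment -n ${PVC_NAMESPACE?} --replicas=0 "${dep}"
    >    done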

  6. List StatefulSets that use the RWX PVC:

    root@rok-tools:/# kubectl get statefulsets.apps \
    >    -n ${PVC_NAMESPACE?} \
    >    -ojson \
    >    | jq -r '.items[] | select(.spec.volumeClaimTemplates[]?.metadata.name == "'${PVC_NAME?}'") | .metadata.name'
    web-app
  7. Scale down to zero all StatefulSets using the RWX PVC.

    Repeat steps 1-4 below for each of the StatefulSets in the list shown in the previous step.

    1. Pick a StatefulSet from the list:

      root@rok-tools:/# export STS=<STATEFULSET>
    2. Scale the StatefulSet down to zero:

      root@rok-tools:/# kubectl scale statefulset \
      >    -n ${PVC_NAMESPACE?} \
      >    --replicas=0 \
      >    ${STS?}
    3. Wait until the StatefulSet has scaled down to zero:

      root@rok-tools:/# kubectl get statefulsets.apps \
      >    -n ${PVC_NAMESPACE?} \
      >    ${STS?}
    4. Go back to step 1, and repeat the steps for the remaining StatefulSets.

  8. Scale down to zero any custom controllers that manage Pods using the RWX PVC, and wait until they have scaled down.
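
    How you scale a custom controller down depends on the controller itself. As a sketch, if the controller's custom resource implements the scale subresource, you can use kubectl scale; the resource type and name below are placeholders:

    root@rok-tools:/# kubectl scale <RESOURCE_TYPE> \
    >    -n ${PVC_NAMESPACE?} \
    >    --replicas=0 \
    >    <RESOURCE_NAME>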

  9. List all remaining Pods that use the RWX PVC:

    root@rok-tools:/# kubectl get pods \
    >    -n ${PVC_NAMESPACE?} \
    >    -ojson \
    >    | jq -r '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "'${PVC_NAME?}'") | .metadata.name'
    user-pod-1
    user-pod-2
    user-pod-3
  10. Delete any remaining Pods that use the RWX PVC.

    Repeat steps 1-4 below for each of the Pods in the list shown in the previous step.

    1. Pick a Pod to delete from the list:

      root@rok-tools:/# export POD=<POD>
    2. Delete the Pod:

      root@rok-tools:/# kubectl delete pod -n ${PVC_NAMESPACE?} ${POD?}
    3. Wait until the Pod has been deleted:

      root@rok-tools:/# kubectl get pod -n ${PVC_NAMESPACE?} ${POD?}
    4. Go back to step 1, and repeat the steps for the remaining Pods.

Verify

  1. Verify that all users of the RWX PVC are gone by inspecting the events of the PVC:

    root@rok-tools:/# kubectl describe pvc -n ${PVC_NAMESPACE?} ${PVC_NAME?}
    Normal  INFO  55s  rok-csi  All Pods using RWX volume
    'pvc-6201fa6e-59b4-4d36-a4b8-ac0af27ca4a1' are gone: You can start using the
    volume again

Summary

You have successfully recovered the RWX PVC and you can now start using it again. For example, you can now scale up (or recreate) all the workloads using the PVC that you scaled down (or deleted) in the previous steps.
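
For example, to bring back a Deployment that you scaled down, a sketch using the variables from the procedure above is the following; replace <REPLICAS> with the Deployment's original replica count, and repeat for any other workloads you scaled down or deleted:

    root@rok-tools:/# kubectl scale deployment \
    >    -n ${PVC_NAMESPACE?} \
    >    --replicas=<REPLICAS> \
    >    ${DEPLOYMENT?}

Once the Pods are running again, you can repeat the write check from the Check Your Environment section to confirm that the volume is writable.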

What’s Next

Check out the rest of the maintenance operations that you can perform on your cluster.