Restore Failing Rok etcd Member

This guide will walk you through restoring a failing Rok etcd member without disrupting the availability of the cluster.

Important

This guide assumes that the etcd cluster has one failing member, but the remaining members are up and thus the cluster is still operational.

Check Your Environment

  1. Ensure that the etcd cluster is currently healthy despite having failing members. Inspect the etcd endpoint and verify that the HEALTH field is true:

    root@rok-tools:~# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl endpoint health -w table
    +----------------+--------+------------+-------+
    |    ENDPOINT    | HEALTH |    TOOK    | ERROR |
    +----------------+--------+------------+-------+
    | 127.0.0.1:2379 | true   | 1.320642ms |       |
    +----------------+--------+------------+-------+

Procedure

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. List the members of the etcd cluster:

    root@rok-tools:~/ops/deployments# kubectl get pods -n rok -l app=etcd
    NAME         READY   STATUS    RESTARTS   AGE
    rok-etcd-0   2/2     Running   0          27m
    rok-etcd-1   1/2     Error     1          17s
    rok-etcd-2   2/2     Running   0          29m
  3. Specify the name of the failing Pod:

    root@rok-tools:~/ops/deployments# export ETCD_POD=<POD_NAME>

    Replace <POD_NAME> with the name of the failing Pod. For example, to restore member rok-etcd-1 in the above example, specify the following:

    root@rok-tools:~/ops/deployments# export ETCD_POD=rok-etcd-1
  4. Set the name of the etcd member:

    root@rok-tools:~/ops/deployments# export NAME=${ETCD_POD?}.rok-etcd-cluster.rok
  5. Set the URL of the etcd member:

    root@rok-tools:~/ops/deployments# export PEER_URL=http://${ETCD_POD?}.rok-etcd-cluster.rok:2380
  6. Retrieve the ID of the failing member:

    root@rok-tools:~/ops/deployments# export ID=$(kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl member list -w json --hex \
    >     | jq -r '.members[] | select(.name == "'${NAME?}'") | .ID') \
    >     && echo ${ID?}
    39212b442e2e4e54
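If you want to see how the jq filter above behaves before running it against the live cluster, you can try it on canned input. The following is a minimal sketch: the JSON document is a hypothetical, trimmed-down stand-in for real `etcdctl member list -w json --hex` output, not a capture from a cluster.

```shell
# Hypothetical, trimmed-down sample of `etcdctl member list -w json --hex`
# output; a real response carries more fields per member.
JSON='{"members":[{"ID":"39212b442e2e4e54","name":"rok-etcd-1.rok-etcd-cluster.rok"},{"ID":"b2ff88bb2eae13b7","name":"rok-etcd-0.rok-etcd-cluster.rok"}]}'

NAME=rok-etcd-1.rok-etcd-cluster.rok

# Select the member whose name matches $NAME and print its ID.
# The single quotes are closed around ${NAME?} so the shell expands it
# inside the jq program.
ID=$(echo "$JSON" | jq -r '.members[] | select(.name == "'${NAME?}'") | .ID')
echo "$ID"   # 39212b442e2e4e54
```

The same filter is what the step above pipes the live `etcdctl` output through; only the input source differs.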
  7. Remove the member from the etcd cluster:

    root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl member remove ${ID?}
    Member 39212b442e2e4e54 removed from cluster 844c2991de84c0b
  8. Add a new member with the same name and URL to the cluster:

    root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl member add --learner ${NAME?} --peer-urls ${PEER_URL?}
    Member 49a1544e41ae84e4 added to cluster 844c2991de84c0b

    ETCD_NAME="rok-etcd-1.rok-etcd-cluster.rok"
    ETCD_INITIAL_CLUSTER="rok-etcd-2.rok-etcd-cluster.rok=http://rok-etcd-2.rok-etcd-cluster.rok:2380,rok-etcd-1.rok-etcd-cluster.rok=http://rok-etcd-1.rok-etcd-cluster.rok:2380,rok-etcd-0.rok-etcd-cluster.rok=http://rok-etcd-0.rok-etcd-cluster.rok:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="http://rok-etcd-1.rok-etcd-cluster.rok:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
  9. Set the etcd cluster state:

    root@rok-tools:~/ops/deployments# export ETCD_CLUSTER_STATE=existing
  10. Render the patch for the etcd cluster state:

    root@rok-tools:~/ops/deployments# j2 \
    >     rok/rok-external-services/etcd/overlays/deploy/patches/cluster-state.yaml.j2 \
    >     -o rok/rok-external-services/etcd/overlays/deploy/patches/cluster-state.yaml
  11. Edit rok/rok-external-services/etcd/overlays/deploy/kustomization.yaml and ensure that both cluster-size and cluster-state patches are enabled:

    patches:
    - path: patches/cluster-size.yaml
      target:
        kind: StatefulSet
        name: etcd
    - path: patches/cluster-state.yaml
  12. Commit your changes:

    root@rok-tools:~/ops/deployments# git commit -am "Update Rok etcd cluster state"
  13. Apply the kustomization:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-external-services/etcd/overlays/deploy
  14. Set the PVC name of the failing etcd Pod:

    root@rok-tools:~/ops/deployments# export PVC=data-${ETCD_POD?}
  15. Delete the PVC of the failing etcd Pod:

    root@rok-tools:~/ops/deployments# kubectl delete pvc -n rok ${PVC?} --wait=false

    Note

    The PVC is in use by the failing Pod, which you are about to delete. Use --wait=false; otherwise, kubectl will hang until the Pod is deleted.

  16. Delete the failing etcd Pod:

    root@rok-tools:~/ops/deployments# kubectl delete pod -n rok ${ETCD_POD?}
  17. Wait for a few minutes to give the new member a chance to join the cluster and retrieve its member ID. Ensure the following command outputs SUCCESS:

    root@rok-tools:~/ops/deployments# export ID=$(kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl member list -w json --hex \
    >     | jq -r '.members[] | select(.name == "'${NAME?}'") | .ID') \
    >     && [[ -z "${ID?}" ]] && echo ERROR || echo SUCCESS
    SUCCESS

    Troubleshooting

    The command output is ERROR

    If the new member has not yet managed to join the cluster, then its name will be empty and the above command will output ERROR. In this case, wait for a few minutes to allow the new member to start and join the cluster, and try again.
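Instead of re-running the check by hand, you can poll until the member appears. The following is a sketch only: `check_member` is a hypothetical stub standing in for the `kubectl`/`etcdctl`/`jq` pipeline from the step above, and the interval and attempt count are arbitrary choices.

```shell
# Hypothetical stand-in for the real check. In practice, replace the body
# with the pipeline from the step above:
#   kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
#       etcdctl member list -w json --hex \
#       | jq -r '.members[] | select(.name == "'${NAME?}'") | .ID'
# It prints the member ID once the member has joined, and nothing before.
check_member() {
    echo 49a1544e41ae84e4
}

# Poll every 10 seconds, for up to 30 attempts (5 minutes).
for attempt in $(seq 1 30); do
    ID=$(check_member)
    if [ -n "$ID" ]; then
        echo SUCCESS
        break
    fi
    sleep 10
done
[ -n "$ID" ] || echo ERROR
```

When the loop prints SUCCESS, `ID` holds the new member's ID and you can proceed to promote it.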

  18. Promote the new member to a voting member:

    root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl member promote ${ID?}
    Member 49a1544e41ae84e4 promoted in cluster 844c2991de84c0b

    Troubleshooting

    The member is not in sync with the leader

    If you try to promote the new member before it has managed to catch up with the cluster, then the command will fail with the following error:

    Error: etcdserver: can only promote a learner member which is in sync with leader

    In this case, wait for a few more minutes and try again.

Verify

  1. Ensure that all Rok etcd Pods are ready. Verify that field READY is 2/2 and field STATUS is Running for all Pods:

    root@rok-tools:~/ops/deployments# watch kubectl get pods -n rok -l app=etcd
    Every 2.0s: kubectl get pods -n rok -l app=etcd    rok-tools: Mon Aug 8 12:36:35 2022

    NAME         READY   STATUS    RESTARTS   AGE
    rok-etcd-0   2/2     Running   0          2d22h
    rok-etcd-1   2/2     Running   0          2d22h
    rok-etcd-2   2/2     Running   0          2d22h
  2. Retrieve the endpoints of all etcd cluster members:

    root@rok-tools:~/ops/deployments# export ETCD_ENDPOINTS=$(kubectl \
    >     exec -ti -n rok sts/rok-etcd -- etcdctl member list -w json \
    >     | jq -r '.members[].clientURLs[]' | paste -sd, -)
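The jq and paste pipeline above flattens every member's client URLs into one comma-separated string, which is the format etcdctl expects for --endpoints. A minimal sketch of that transformation, run against a hypothetical, trimmed-down sample of `etcdctl member list -w json` output:

```shell
# Hypothetical, trimmed-down sample of `etcdctl member list -w json`
# output; a real response carries more fields per member.
JSON='{"members":[
  {"name":"rok-etcd-0.rok-etcd-cluster.rok","clientURLs":["http://rok-etcd-0.rok-etcd-cluster.rok:2379"]},
  {"name":"rok-etcd-1.rok-etcd-cluster.rok","clientURLs":["http://rok-etcd-1.rok-etcd-cluster.rok:2379"]}
]}'

# Emit every client URL of every member, one per line, then join the
# lines with commas into a single endpoints string.
ETCD_ENDPOINTS=$(echo "$JSON" | jq -r '.members[].clientURLs[]' | paste -sd, -)
echo "$ETCD_ENDPOINTS"
```

With the sample above this yields `http://rok-etcd-0.rok-etcd-cluster.rok:2379,http://rok-etcd-1.rok-etcd-cluster.rok:2379`.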
  3. Ensure that the etcd cluster is currently healthy. Inspect the etcd endpoints and verify that the HEALTH field is true for all endpoints:

    root@rok-tools:~# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table
    +--------------------------------------+--------+------------+-------+
    |               ENDPOINT               | HEALTH |    TOOK    | ERROR |
    +--------------------------------------+--------+------------+-------+
    | rok-etcd-0.rok-etcd-cluster.rok:2379 | true   | 9.302141ms |       |
    | rok-etcd-1.rok-etcd-cluster.rok:2379 | true   | 9.325642ms |       |
    | rok-etcd-2.rok-etcd-cluster.rok:2379 | true   | 9.317423ms |       |
    +--------------------------------------+--------+------------+-------+
  4. Ensure that the Rok etcd cluster has the expected member count. Verify that the output of the following command matches the expected number of members, for example 3:

    root@rok-tools:~# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl member list | wc -l
    3
  5. List the members of the etcd cluster. Verify that field STATUS is started and field IS LEARNER is false for all members:

    root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    >     etcdctl member list -w table
    +------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+
    |        ID        | STATUS  |              NAME               |                 PEER ADDRS                  |                CLIENT ADDRS                 | IS LEARNER |
    +------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+
    | 28e727eb210f314d | started | rok-etcd-2.rok-etcd-cluster.rok | http://rok-etcd-2.rok-etcd-cluster.rok:2380 | http://rok-etcd-2.rok-etcd-cluster.rok:2379 | false      |
    | b2ff88bb2eae13b7 | started | rok-etcd-0.rok-etcd-cluster.rok | http://rok-etcd-0.rok-etcd-cluster.rok:2380 | http://rok-etcd-0.rok-etcd-cluster.rok:2379 | false      |
    | f823900dacf44825 | started | rok-etcd-1.rok-etcd-cluster.rok | http://rok-etcd-1.rok-etcd-cluster.rok:2380 | http://rok-etcd-1.rok-etcd-cluster.rok:2379 | false      |
    +------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+

Summary

You have successfully restored a failing member of the Rok etcd cluster.

What’s Next

Check out the rest of the maintenance operations that you can perform on your cluster.