Scale Up Rok etcd

This guide will walk you through increasing the size of your Rok etcd cluster by adding one member. To add more members, repeat this guide once per member.

Important

To withstand a node failure, use a cluster with at least three members. We highly recommend you use an odd number of members and no more than seven members.
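The odd-number recommendation follows from quorum arithmetic: a cluster of n members needs floor(n/2) + 1 members for quorum, so growing to an even size raises the quorum without raising fault tolerance. A quick sketch of the math:

```shell
# Quorum and fault tolerance for etcd cluster sizes 1..7:
# quorum is floor(n/2) + 1, and the cluster survives n - quorum failures.
for n in 1 2 3 4 5 6 7; do
    quorum=$(( n / 2 + 1 ))
    tolerated=$(( n - quorum ))
    echo "${n} members: quorum=${quorum}, tolerates ${tolerated} failure(s)"
done
```

Note that 4 members tolerate exactly as many failures as 3, and 6 as many as 5, which is why odd sizes are preferred.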

Check Your Environment

  1. Retrieve the endpoints of all etcd cluster members:

    root@rok-tools:~/ops/deployments# export ETCD_ENDPOINTS=$(kubectl \
    > exec -ti -n rok sts/rok-etcd -- etcdctl member list -w json \
    > | jq -r '.members[].clientURLs[]' | paste -sd, -)
  2. Ensure that the etcd cluster is currently healthy. Inspect the etcd endpoints and verify that the HEALTH field is true for all endpoints:

    root@rok-tools:~/# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    > etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table
    +--------------------------------------+--------+------------+-------+
    |               ENDPOINT               | HEALTH |    TOOK    | ERROR |
    +--------------------------------------+--------+------------+-------+
    | rok-etcd-0.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
    | rok-etcd-1.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
    +--------------------------------------+--------+------------+-------+
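If you script this check, grepping the table for a false HEALTH column is a simple guard. Below is a minimal sketch; `check_health_table` is a hypothetical helper (not part of the Rok tooling) and assumes the table layout shown above:

```shell
# Hypothetical helper: succeed only if no endpoint in an
# `etcdctl endpoint health -w table` dump reports HEALTH=false.
check_health_table() {
    table=$1
    if printf '%s\n' "${table}" | grep -q '| *false *|'; then
        echo UNHEALTHY
        return 1
    fi
    echo HEALTHY
}

# Usage sketch against the live cluster:
# check_health_table "$(kubectl exec -i -n rok sts/rok-etcd -c etcd -- \
#     etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table)"
```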

Procedure

  1. Go to your GitOps repository, inside your rok-tools management environment:

    root@rok-tools:~# cd ~/ops/deployments
  2. Retrieve the current size of the etcd cluster:

    root@rok-tools:~/ops/deployments# export ETCD_CLUSTER_SIZE=$(kubectl get sts \
    > -n rok rok-etcd -o jsonpath="{.spec.replicas}") \
    > && echo ${ETCD_CLUSTER_SIZE?}
    2
  3. Set the name of the new etcd member:

    root@rok-tools:~/ops/deployments# export \
    > NAME=rok-etcd-${ETCD_CLUSTER_SIZE?}.rok-etcd-cluster.rok
  4. Set the URL of the new etcd member:

    root@rok-tools:~/ops/deployments# export \
    > PEER_URL=http://rok-etcd-${ETCD_CLUSTER_SIZE?}.rok-etcd-cluster.rok:2380
  5. Add a new member to the etcd cluster:

    root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    > etcdctl member add --learner ${NAME?} --peer-urls ${PEER_URL?}
    Member 49a1544e41ae84e4 added to cluster 844c2991de84c0b
    ETCD_NAME="rok-etcd-2.rok-etcd-cluster.rok"
    ETCD_INITIAL_CLUSTER="rok-etcd-2.rok-etcd-cluster.rok=http://rok-etcd-2.rok-etcd-cluster.rok:2380,rok-etcd-1.rok-etcd-cluster.rok=http://rok-etcd-1.rok-etcd-cluster.rok:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="http://rok-etcd-2.rok-etcd-cluster.rok:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"

    Troubleshooting

    Error: etcdserver: unhealthy cluster

    In some cases, mostly due to a network hiccup, an existing member rejoins the cluster, for example after a Pod restart, and the other members end up considering it inactive. In this case, member add fails with:

    {"level":"warn","ts":"2022-09-23T09:52:00.805Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000458a80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
    Error: etcdserver: unhealthy cluster

    At the same time, the etcd cluster remains operational and clients are able to access it and make read/write requests.

    To recover, follow the steps below:

    1. Retrieve the endpoints of all etcd cluster members:

      root@rok-tools:~/ops/deployments# export ETCD_ENDPOINTS=$(kubectl \
      > exec -ti -n rok sts/rok-etcd -- etcdctl member list -w json \
      > | jq -r '.members[].clientURLs[]' | paste -sd, -)
    2. Ensure that the etcd cluster is currently healthy. Inspect the etcd endpoints and verify that the HEALTH field is true for all endpoints:

      root@rok-tools:~/# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
      > etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table
      +--------------------------------------+--------+------------+-------+
      |               ENDPOINT               | HEALTH |    TOOK    | ERROR |
      +--------------------------------------+--------+------------+-------+
      | rok-etcd-0.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
      | rok-etcd-1.rok-etcd-cluster.rok:2379 |  true  | 9.325642ms |       |
      +--------------------------------------+--------+------------+-------+
    3. Restart the etcd Pods:

      root@rok-tools:~/# kubectl delete pods -n rok -l app=etcd
    4. Rerun the command to add a member to the etcd cluster.

  6. Increase the etcd cluster size:

    root@rok-tools:~/ops/deployments# let ETCD_CLUSTER_SIZE++
  7. Render the patch for the cluster size:

    root@rok-tools:~/ops/deployments# j2 \
    > rok/rok-external-services/etcd/overlays/deploy/patches/cluster-size.yaml.j2 \
    > -o rok/rok-external-services/etcd/overlays/deploy/patches/cluster-size.yaml
  8. Set the cluster state:

    root@rok-tools:~/ops/deployments# export ETCD_CLUSTER_STATE=existing
  9. Render the patch for the cluster state:

    root@rok-tools:~/ops/deployments# j2 \
    > rok/rok-external-services/etcd/overlays/deploy/patches/cluster-state.yaml.j2 \
    > -o rok/rok-external-services/etcd/overlays/deploy/patches/cluster-state.yaml
  10. Edit rok/rok-external-services/etcd/overlays/deploy/kustomization.yaml and ensure that both cluster-size and cluster-state patches are enabled:

    patches:
    - path: patches/cluster-size.yaml
      target:
        kind: StatefulSet
        name: etcd
    - path: patches/cluster-state.yaml
  11. Commit your changes:

    root@rok-tools:~/ops/deployments# git commit -am "Scale Rok etcd to ${ETCD_CLUSTER_SIZE?} members"
  12. Apply the kustomization:

    root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-external-services/etcd/overlays/deploy
  13. Wait for a few minutes to give the new member a chance to join the cluster, then retrieve its member ID. Ensure that the following command outputs SUCCESS:

    root@rok-tools:~/ops/deployments# export ID=$(kubectl \
    > exec -ti -n rok sts/rok-etcd -c etcd -- \
    > etcdctl member list -w json --hex \
    > | jq -r '.members[] | select(.name == "'${NAME?}'") | .ID') \
    > && [[ -z "${ID?}" ]] && echo ERROR || echo SUCCESS
    SUCCESS

    Troubleshooting

    The command output is ERROR

    If the new member has not yet managed to join the cluster, then its name will be empty and the above command will output ERROR. In this case, wait for a few minutes to allow the new member to start and join the cluster, and try again.
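If jq is unavailable, the member ID can also be pulled out of the plain `etcdctl member list` output, which prints one comma-separated line per member (ID, status, name, peer URLs, client URLs, learner flag). `member_id_by_name` below is a hypothetical helper, not part of the Rok tooling:

```shell
# Hypothetical helper: extract a member ID from plain
# `etcdctl member list` output, whose fields are comma-separated:
# ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS, IS LEARNER.
member_id_by_name() {
    name=$1
    listing=$2
    printf '%s\n' "${listing}" | awk -F', ' -v n="${name}" '$3 == n { print $1 }'
}

# Usage sketch against the live cluster:
# member_id_by_name "${NAME?}" \
#     "$(kubectl exec -i -n rok sts/rok-etcd -c etcd -- etcdctl member list)"
```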

  14. Promote the new member to a voting member:

    root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    > etcdctl member promote ${ID?}
    Member 49a1544e41ae84e4 promoted in cluster 4c194b295a903d33

    Troubleshooting

    The member is not in sync with the leader

    If the above command fails with the following error:

    Error: etcdserver: can only promote a learner member which is in sync with leader

    it means that you tried to promote the new member before it had caught up with the leader. In this case, wait a few more minutes and try again.
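Both "wait and try again" situations above (waiting for the new member to join, and waiting for it to sync before promotion) can be scripted with a generic retry loop. `wait_for` below is a hypothetical helper, not part of the Rok tooling:

```shell
# Hypothetical helper: retry a command up to MAX times,
# sleeping DELAY seconds between attempts.
wait_for() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "${attempt}" -le "${max}" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$(( attempt + 1 ))
        sleep "${delay}"
    done
    return 1
}

# Usage sketch: retry the promotion every 30 seconds, up to 10 minutes.
# wait_for 20 30 kubectl exec -i -n rok sts/rok-etcd -c etcd -- \
#     etcdctl member promote ${ID?}
```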

Verify

  1. Ensure that all Rok etcd Pods are ready. Verify that field READY is 2/2 and field STATUS is Running for all Pods:

    root@rok-tools:~/ops/deployments# kubectl get pods -n rok -l app=etcd
    NAME         READY   STATUS    RESTARTS   AGE
    rok-etcd-0   2/2     Running   0          2d22h
    rok-etcd-1   2/2     Running   0          2d22h
    rok-etcd-2   2/2     Running   0          2d22h
  2. Retrieve the endpoints of all etcd cluster members:

    root@rok-tools:~/ops/deployments# export ETCD_ENDPOINTS=$(kubectl \
    > exec -ti -n rok sts/rok-etcd -- etcdctl member list -w json \
    > | jq -r '.members[].clientURLs[]' | paste -sd, -)
  3. Ensure that the etcd cluster is currently healthy. Inspect the etcd endpoints and verify that the HEALTH field is true for all endpoints:

    root@rok-tools:~/# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    > etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table
    +--------------------------------------+--------+------------+-------+
    |               ENDPOINT               | HEALTH |    TOOK    | ERROR |
    +--------------------------------------+--------+------------+-------+
    | rok-etcd-0.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
    | rok-etcd-1.rok-etcd-cluster.rok:2379 |  true  | 9.325642ms |       |
    | rok-etcd-2.rok-etcd-cluster.rok:2379 |  true  | 9.325642ms |       |
    +--------------------------------------+--------+------------+-------+
  4. Ensure that the Rok etcd cluster has the expected member count. Verify that the output of the following command equals the new cluster size, in this example 3:

    root@rok-tools:~# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    > etcdctl member list | wc -l
    3
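When scripting this verification, the count can be compared against ${ETCD_CLUSTER_SIZE?} rather than eyeballed. `member_count_ok` below is a hypothetical helper, not part of the Rok tooling:

```shell
# Hypothetical helper: succeed only if the member listing has
# exactly the expected number of (non-empty) lines.
member_count_ok() {
    expected=$1
    listing=$2
    count=$(printf '%s\n' "${listing}" | grep -c .)
    [ "${count}" -eq "${expected}" ]
}

# Usage sketch against the live cluster:
# member_count_ok "${ETCD_CLUSTER_SIZE?}" \
#     "$(kubectl exec -i -n rok sts/rok-etcd -c etcd -- etcdctl member list)"
```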
  5. List the members of the etcd cluster. Verify that field STATUS is started and field IS LEARNER is false for all members:

    root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
    > etcdctl member list -w table
    +------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+
    |        ID        | STATUS  |              NAME               |                 PEER ADDRS                  |                CLIENT ADDRS                 | IS LEARNER |
    +------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+
    | b2ff88bb2eae13b7 | started | rok-etcd-0.rok-etcd-cluster.rok | http://rok-etcd-0.rok-etcd-cluster.rok:2380 | http://rok-etcd-0.rok-etcd-cluster.rok:2379 |   false    |
    | f823900dacf44825 | started | rok-etcd-1.rok-etcd-cluster.rok | http://rok-etcd-1.rok-etcd-cluster.rok:2380 | http://rok-etcd-1.rok-etcd-cluster.rok:2379 |   false    |
    | 49a1544e41ae84e4 | started | rok-etcd-2.rok-etcd-cluster.rok | http://rok-etcd-2.rok-etcd-cluster.rok:2380 | http://rok-etcd-2.rok-etcd-cluster.rok:2379 |   false    |
    +------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+

Summary

You have successfully added a member to the Rok etcd cluster.

What’s Next

Check out the rest of the maintenance operations you can perform on your Rok etcd cluster.