Disable Rok on a Node Group
Rok runs as a DaemonSet, so it runs on all the nodes of a cluster, except for any nodes that have taints that the DaemonSet does not tolerate.
EKF ships its own custom Cluster Autoscaler, which expects that every node of the cluster runs Rok, except for nodes that are marked with a specific label to indicate that Rok is disabled.
In order to disable Rok on a specific node group, while allowing the Cluster Autoscaler to seamlessly scale every node group in the cluster, the node group needs appropriate configuration.
This guide will walk you through configuring an existing node group to disable Rok on it, while allowing the seamless autoscaling of the node group.
What You’ll Need
- A configured management environment.
- An existing EKS cluster.
- An existing node group that has desired size set to zero.
- An existing Cluster Autoscaler deployment scaled down to zero replicas.
Check Your Environment
Before proceeding to the configuration of the existing node group, you need to ensure that the desired size of the node group is set to zero and that the deployment of the Cluster Autoscaler is scaled down to zero.
This ensures that no nodes of the node group already run Rok, and that the Cluster Autoscaler does not trigger an undesired scale-up while you configure the node group.
Attention
By disabling Rok on a node group, any Pods requesting Rok storage will not be scheduled on the nodes of the node group.
Go to your GitOps repository, inside your rok-tools management environment:
root@rok-tools:~# cd ~/ops/deployments
Restore the required context:
root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
root@rok-tools:~/ops/deployments# export EKS_CLUSTER
Check that the Cluster Autoscaler deployment is scaled down to zero:
root@rok-tools:~/ops/deployments# kubectl get deployments \
>     -n kube-system cluster-autoscaler -ojson \
>     | jq -e '.spec.replicas == 0' >/dev/null && echo OK || echo FAIL
OK
Troubleshooting
The output of the command is FAIL
The Cluster Autoscaler is not scaled down to zero replicas. Scale it down by running:
root@rok-tools:~/ops/deployments# kubectl scale deployment -n kube-system cluster-autoscaler --replicas=0
Ensure that the node group for which you want to disable Rok has desired size zero. Choose one of the following options based on your node group type.
Set the name of the node group:
root@rok-tools:~/ops/deployments# export NODEGROUP=<NODEGROUP>
Replace <NODEGROUP> with the node group name. For example:
root@rok-tools:~/ops/deployments# export NODEGROUP=general-non-rok-workers
Check that the desired size of the node group is zero:
root@rok-tools:~/ops/deployments# [[ $(aws eks describe-nodegroup \
>     --cluster-name ${EKS_CLUSTER?} \
>     --nodegroup-name ${NODEGROUP?} \
>     --query nodegroup.scalingConfig.desiredSize) == 0 ]] && echo OK || echo FAIL
OK
Troubleshooting
The output of the command is FAIL
The desired size of the node group is not zero. Scale down the node group by following the Scale In EKS Cluster guide.
It’s important to drain the existing nodes of the node group, as the guide instructs you, to unpin any Rok volumes that may live on these nodes.
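The two checks in this section can be combined into a single helper. The following is only a sketch: `preflight_check` is a hypothetical name that is not part of rok-tools, and it assumes that `aws` and `kubectl` are configured as described in this guide.

```shell
# Sketch: run both pre-flight checks of this section in one go.
# `preflight_check` is a hypothetical helper, not part of rok-tools.
# Usage: preflight_check <cluster name> <node group name>
preflight_check() {
    local cluster="$1"
    local nodegroup="$2"
    local replicas
    local desired

    # Replica count requested for the Cluster Autoscaler deployment.
    replicas=$(kubectl get deployment -n kube-system cluster-autoscaler \
        -o jsonpath='{.spec.replicas}') || return 2

    # Desired size of the managed node group.
    desired=$(aws eks describe-nodegroup \
        --cluster-name "${cluster}" \
        --nodegroup-name "${nodegroup}" \
        --query nodegroup.scalingConfig.desiredSize \
        --output text) || return 2

    # Both values must be zero before the node group is reconfigured.
    if [ "${replicas}" = 0 ] && [ "${desired}" = 0 ]; then
        echo OK
    else
        echo "FAIL (autoscaler replicas=${replicas}, desired size=${desired})"
        return 1
    fi
}
```

For example, `preflight_check "${EKS_CLUSTER?}" "${NODEGROUP?}"` prints OK only when both the Cluster Autoscaler replicas and the desired size of the node group are zero.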
This section is work in progress.
Procedure
Restore the required context from previous sections:
root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
Update the node group configuration. Choose one of the following options based on your node group type.
Set the name of the node group:
root@rok-tools:~/ops/deployments# export NODEGROUP=<NODEGROUP>
Replace <NODEGROUP> with the node group name. For example:
root@rok-tools:~/ops/deployments# export NODEGROUP=general-non-rok-workers
Set the taints and labels to be added to the node group:
root@rok-tools:~/ops/deployments# export LABEL_KEY=rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export LABEL_VALUE=true
root@rok-tools:~/ops/deployments# export TAINT_KEY=rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export TAINT_EFFECT=NO_SCHEDULE
Note
The rok.arrikto.com/disabled: true label is used by the Cluster Autoscaler in order to determine if Rok is disabled on a node group. By configuring the managed node group with the label, any new nodes of it will have this label. The value of the label must be set to true, otherwise the Cluster Autoscaler will ignore it.
The rok.arrikto.com/disabled:NoSchedule taint is used in order to prevent Rok from running on the nodes of the node group. By configuring the node group with this taint, any new nodes of it will have this taint.
Update the node group configuration with the labels and the taints:
root@rok-tools:~/ops/deployments# aws eks update-nodegroup-config \
>     --cluster-name ${EKS_CLUSTER?} \
>     --nodegroup-name=${NODEGROUP?} \
>     --labels addOrUpdateLabels="{${LABEL_KEY?}=${LABEL_VALUE?}}" \
>     --taints addOrUpdateTaints="{key=${TAINT_KEY?},effect=${TAINT_EFFECT?}}"
Retrieve the underlying Auto Scaling group of the managed node group:
root@rok-tools:~/ops/deployments# ASG=$(aws eks describe-nodegroup \
>     --cluster-name ${EKS_CLUSTER?} \
>     --nodegroup-name ${NODEGROUP?} \
>     | jq -r '.nodegroup.resources.autoScalingGroups[0].name')
Set the tags to be added to the Auto Scaling group:
root@rok-tools:~/ops/deployments# export LABEL_TAG_KEY=k8s.io/cluster-autoscaler/node-template/label/rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export LABEL_TAG_VALUE=true
root@rok-tools:~/ops/deployments# export TAINT_TAG_KEY=k8s.io/cluster-autoscaler/node-template/taint/rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export TAINT_TAG_VALUE=:NoSchedule
Note
The Cluster Autoscaler relies on these tags of the Auto Scaling group to understand what labels and taints a node of the node group will have. The Cluster Autoscaler uses these tags only when it scales the node group up from zero. As soon as a live node joins the cluster, the Cluster Autoscaler uses that node to determine the actual labels and taints that nodes of the node group have. EKS does not derive these tags from the taints and the labels of the managed node group, so they have to be added manually.
Moreover, even though EKS supports tags for both managed node groups and their underlying Auto Scaling groups, it does not propagate any tags of the managed node group to the underlying Auto Scaling group, so the user has to explicitly add the tags to the Auto Scaling group. See https://github.com/aws/containers-roadmap/issues/608.
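The tag format described in the note above can be sketched as a few shell helpers. This is an illustration only: the function names are hypothetical, the prefix is taken from the tag keys used in this guide, and the mapping of effect names assumes the EKS API spelling (for example NO_SCHEDULE) on one side and the Kubernetes spelling (for example NoSchedule) on the other.

```shell
# Sketch: derive the Auto Scaling group tags that mirror a node label and a
# node taint. All function names here are hypothetical helpers.

NODE_TEMPLATE_PREFIX="k8s.io/cluster-autoscaler/node-template"

# Tag key for a node label: <prefix>/label/<label key>
label_tag_key() {
    printf '%s/label/%s\n' "${NODE_TEMPLATE_PREFIX}" "$1"
}

# Tag key for a node taint: <prefix>/taint/<taint key>
taint_tag_key() {
    printf '%s/taint/%s\n' "${NODE_TEMPLATE_PREFIX}" "$1"
}

# Tag value for a node taint: "<taint value>:<effect>". An empty taint
# value yields ":NoSchedule", which is the value used in this guide.
taint_tag_value() {
    printf '%s:%s\n' "$1" "$2"
}

# Translate an EKS taint effect into the Kubernetes spelling used in the tag.
k8s_taint_effect() {
    case "$1" in
        NO_SCHEDULE)        echo NoSchedule ;;
        PREFER_NO_SCHEDULE) echo PreferNoSchedule ;;
        NO_EXECUTE)         echo NoExecute ;;
        *) echo "unknown effect: $1" >&2; return 1 ;;
    esac
}
```

With the variables of this guide, `taint_tag_value "" "$(k8s_taint_effect "${TAINT_EFFECT?}")"` would print the same `:NoSchedule` value that the procedure sets by hand.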
Update the tags of the Auto Scaling Group:
root@rok-tools:~/ops/deployments# aws autoscaling create-or-update-tags \
>     --tags ResourceId=${ASG?},ResourceType=auto-scaling-group,Key=${TAINT_TAG_KEY?},Value=${TAINT_TAG_VALUE?},PropagateAtLaunch=true \
>     ResourceId=${ASG?},ResourceType=auto-scaling-group,Key=${LABEL_TAG_KEY?},Value=${LABEL_TAG_VALUE?},PropagateAtLaunch=true
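A possible extra step, which is not part of the original procedure: EKS applies node group configuration updates asynchronously, so it may help to wait until the node group is active again before moving on to verification. The sketch below wraps the `aws eks wait nodegroup-active` waiter; `wait_for_nodegroup` is a hypothetical helper name.

```shell
# Sketch: block until the managed node group reports an active status.
# `wait_for_nodegroup` is a hypothetical helper, not part of rok-tools.
# Usage: wait_for_nodegroup <cluster name> <node group name>
wait_for_nodegroup() {
    aws eks wait nodegroup-active \
        --cluster-name "$1" \
        --nodegroup-name "$2"
}
```

For example, `wait_for_nodegroup "${EKS_CLUSTER?}" "${NODEGROUP?}" && echo OK || echo FAIL` follows the OK/FAIL convention used by the checks in this guide.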
This section is work in progress.
Verify
Go to your GitOps repository, inside your rok-tools management environment:
root@rok-tools:~# cd ~/ops/deployments
Restore the required context:
root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
root@rok-tools:~/ops/deployments# export EKS_CLUSTER
Choose one of the following options based on your node group type.
Set the name of the node group:
root@rok-tools:~/ops/deployments# export NODEGROUP=<NODEGROUP>
Replace <NODEGROUP> with the node group name. For example:
root@rok-tools:~/ops/deployments# export NODEGROUP=general-non-rok-workers
Set the taints and labels that must have been added to the node group:
root@rok-tools:~/ops/deployments# export LABEL_KEY=rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export LABEL_VALUE=true
root@rok-tools:~/ops/deployments# export TAINT_KEY=rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export TAINT_EFFECT=NO_SCHEDULE
Verify that the rok.arrikto.com/disabled: true label exists:
root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
>     --cluster-name ${EKS_CLUSTER?} \
>     --nodegroup-name ${NODEGROUP?} \
>     --query nodegroup.labels \
>     --output json \
>     | jq "to_entries[] | select(.key == \"${LABEL_KEY?}\" and .value == \"${LABEL_VALUE?}\" )"
{
  "key": "rok.arrikto.com/disabled",
  "value": "true"
}
Verify that the rok.arrikto.com/disabled:NoSchedule taint exists:
root@rok-tools:~/ops/deployments# aws eks describe-nodegroup \
>     --cluster-name ${EKS_CLUSTER?} \
>     --nodegroup-name ${NODEGROUP?} \
>     --query nodegroup.taints \
>     --output json \
>     | jq "values | .[] | select(.key == \"${TAINT_KEY?}\" and .effect == \"${TAINT_EFFECT?}\")"
{
  "key": "rok.arrikto.com/disabled",
  "effect": "NO_SCHEDULE"
}
Retrieve the underlying Auto Scaling group of the managed node group:
root@rok-tools:~/ops/deployments# export ASG=$(aws eks describe-nodegroup \
>     --cluster-name ${EKS_CLUSTER?} \
>     --nodegroup-name ${NODEGROUP?} \
>     | jq -r '.nodegroup.resources.autoScalingGroups[0].name')
Set the tags that must have been added to the Auto Scaling group:
root@rok-tools:~/ops/deployments# export LABEL_TAG_KEY=k8s.io/cluster-autoscaler/node-template/label/rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export LABEL_TAG_VALUE=true
root@rok-tools:~/ops/deployments# export TAINT_TAG_KEY=k8s.io/cluster-autoscaler/node-template/taint/rok.arrikto.com/disabled
root@rok-tools:~/ops/deployments# export TAINT_TAG_VALUE=:NoSchedule
Verify that the label tag of the underlying Auto Scaling group exists:
root@rok-tools:~/ops/deployments# aws autoscaling describe-tags \
>     --filters Name=auto-scaling-group,Values=${ASG?} Name=key,Values=${LABEL_TAG_KEY?} \
>     Name=value,Values=${LABEL_TAG_VALUE?} --query=Tags
[
    {
        "ResourceId": "eks-general-non-rok-workers-64c306b8-788c-043a-ff32-4d8e853647d6",
        "ResourceType": "auto-scaling-group",
        "Key": "k8s.io/cluster-autoscaler/node-template/label/rok.arrikto.com/disabled",
        "Value": "true",
        "PropagateAtLaunch": true
    }
]
Verify that the taint tag of the underlying Auto Scaling group exists:
root@rok-tools:~/ops/deployments# aws autoscaling describe-tags \
>     --filters Name=auto-scaling-group,Values=${ASG?} Name=key,Values=${TAINT_TAG_KEY?} \
>     Name=value,Values=${TAINT_TAG_VALUE?} --query=Tags
[
    {
        "ResourceId": "eks-general-non-rok-workers-64c306b8-788c-043a-ff32-4d8e853647d6",
        "ResourceType": "auto-scaling-group",
        "Key": "k8s.io/cluster-autoscaler/node-template/taint/rok.arrikto.com/disabled",
        "Value": ":NoSchedule",
        "PropagateAtLaunch": true
    }
]
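The two tag checks above can also be expressed as a single helper that prints OK or FAIL, in the style of the earlier checks in this guide. This is a sketch under assumptions: `asg_has_tag` is a hypothetical name, and it relies on the JMESPath `length()` function to count the tags that match the filters.

```shell
# Sketch: report whether an Auto Scaling group carries a given tag.
# `asg_has_tag` is a hypothetical helper, not part of rok-tools.
# Usage: asg_has_tag <Auto Scaling group name> <tag key> <tag value>
asg_has_tag() {
    local asg="$1"
    local key="$2"
    local value="$3"
    local count

    # Count the tags matching the group, key, and value filters.
    count=$(aws autoscaling describe-tags \
        --filters "Name=auto-scaling-group,Values=${asg}" \
                  "Name=key,Values=${key}" \
                  "Name=value,Values=${value}" \
        --query 'length(Tags)' \
        --output text) || return 2

    # A tag key is unique per Auto Scaling group, so expect exactly one match.
    if [ "${count}" = 1 ]; then
        echo OK
    else
        echo FAIL
        return 1
    fi
}
```

For example, `asg_has_tag "${ASG?}" "${LABEL_TAG_KEY?}" "${LABEL_TAG_VALUE?}"` and `asg_has_tag "${ASG?}" "${TAINT_TAG_KEY?}" "${TAINT_TAG_VALUE?}"` should both print OK after the procedure has completed.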
This section is work in progress.
What’s Next
Check out the rest of the maintenance operations that you can perform on your cluster.