In Kubernetes, efficient resource management and optimal scheduling of Pods are crucial for maintaining a well-functioning cluster. However, simply relying on default scheduling policies is often insufficient for more complex workloads and environments. This is where Taints, Tolerations, and Node Affinity come into play. Let's explore these concepts in detail to understand how they can be effectively leveraged in a Kubernetes cluster.
Taints and Tolerations
Taints are applied to nodes and allow a node to repel a set of pods. Think of taints as a way to mark a node with specific characteristics that make it unsuitable for certain pods.
Tolerations are applied to pods and allow the pods to tolerate (i.e., be scheduled on) nodes with specific taints. Tolerations enable exceptions to the rules set by taints.
How Taints and Tolerations Work
Taints and Tolerations work together to ensure that pods are only scheduled on appropriate nodes. Here’s how they interact:
Applying Taints: A taint is added to a node to mark it with a key-value pair and an effect. This indicates that only pods with matching tolerations should be scheduled on that node.
Applying Tolerations: Tolerations are added to pods, allowing them to be scheduled on nodes with matching taints.
Scheduling Decision: The Kubernetes scheduler checks the taints on each node and the tolerations on each pod. If a pod tolerates a node’s taint, it can be scheduled on that node.
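As a quick sketch of how the two sides pair up before the detailed walkthrough below (the node name, key, and value here are placeholders):

# Taint a node so that only tolerating pods may be scheduled on it
kubectl taint node <node-name> dedicated=experimental:NoSchedule

# Matching toleration in the Pod spec
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "experimental"
  effect: "NoSchedule"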
Taints
A taint is a key-value pair with an effect that is applied to a node. The key-value pair can represent any condition or attribute, and the effect determines what happens to Pods that do not tolerate the taint. There are three possible effects:
NoSchedule: The Pod will not be scheduled on the node unless it tolerates the taint.
PreferNoSchedule: The system will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not a hard requirement.
NoExecute: The Pod will be evicted if it is already running on the node and does not tolerate the taint.
Example of Applying a Taint
To apply a taint to a node, you use the kubectl taint command. For example, to taint a node named node1 with the key key1, value value1, and effect NoSchedule:
kubectl taint node node1 key1=value1:NoSchedule
Suppose we have three nodes: one control plane node and two worker nodes. We taint both worker nodes with a gpu=true key-value pair. When we create a pod, it stays in the Pending state.
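For reference, the taints on the two worker nodes could be applied like this (the node names are placeholders for your own worker node names):

kubectl taint node worker-node1 gpu=true:NoSchedule
kubectl taint node worker-node2 gpu=true:NoSchedule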
When we describe the pod using the command below:
kubectl describe pod/<pod-name>
The reason the pod is not being scheduled is that the control plane node carries the taint node-role.kubernetes.io/control-plane (with no value), so only control plane components that tolerate it are scheduled there. Our worker nodes carry the gpu=true taint we specified, so they will only accept pods that have a matching gpu=true toleration.
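You can also confirm the taints directly on the nodes; one quick way (the node name is a placeholder) is:

kubectl describe node <node-name> | grep Taints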
Tolerations
A toleration is applied to a Pod to indicate that it can tolerate specific taints. This is done by adding a toleration section to the Pod's specification.
Example of Applying a Toleration
Let's add a toleration to our pod. Create a file named pod1.yaml with the following content:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
In this YAML file, we add a toleration to our redis pod for the gpu=true taint with the NoSchedule effect. When we apply this file (kubectl apply -f pod1.yaml), the pod can be scheduled either on an untainted node or on a node tainted with gpu=true, and it moves to the Running status.
Important Points to Remember about Taints and Tolerations
Taints are set on Nodes.
Tolerations are set on Pods.
Tainted nodes will only accept pods that have a matching toleration set.
A pod (with or without a particular toleration value) may be scheduled on an untainted node.
In essence, taints on nodes repel pods whose tolerations do not match the taint. However, nodes that do not have any taints will accept any pod, with or without tolerations set on them.
Differences Between NoSchedule, PreferNoSchedule, and NoExecute
NoSchedule: Ensures that pods without the toleration will never be scheduled on the tainted node.
PreferNoSchedule: Indicates a preference to avoid scheduling pods without the toleration on the tainted node, but it is not enforced.
NoExecute: Applies to both new and already running pods. Pods without the toleration will be evicted if they are running and will not be scheduled if they are new.
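For the NoExecute effect in particular, a toleration may also set tolerationSeconds, which lets an already-running pod stay on the tainted node for a limited time before eviction. A minimal sketch, assuming an illustrative maintenance=true taint:

tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  # Without tolerationSeconds the pod tolerates the taint indefinitely;
  # with it, the pod is evicted after the given number of seconds.
  tolerationSeconds: 300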
To delete a taint from a node, append a - to the end of the taint specification in the previous command:
kubectl taint node node1 key1=value1:NoSchedule-
NodeSelector
nodeSelector is the simplest form of node selection constraint in Kubernetes. It is used to specify a key-value pair that must match the labels on a node for a Pod to be scheduled on that node.
Characteristics:
Simple and straightforward to use.
Only supports equality-based requirements.
It is a hard constraint, meaning if no node matches the specified labels, the Pod will remain unscheduled.
Example
Create a YAML file named pod2.yaml with the following content:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis-new
spec:
  containers:
  - image: redis
    name: redis-new
  nodeSelector:
    disktype: "ssd"
Apply this YAML file and you will see that the pod stays in a Pending state because it requires a node with the label disktype=ssd.
When you describe the pod using kubectl describe pod <pod-name>, it will provide more information.
Now, label one of the worker nodes with disktype=ssd:
kubectl label node <node-name> disktype=ssd
You will see that the pod starts running on the node with the disktype=ssd label.
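If you want to double-check which node carries the label, you can show it as a column (the label key here matches the example above):

kubectl get nodes -L disktype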
If you want to remove the label from the node, append a - to the end of the label key:
kubectl label node <node-name> disktype-
Node Affinity
Node Affinity is a feature in Kubernetes that allows you to constrain which nodes your Pods are eligible to be scheduled on based on node labels. It provides more flexible and expressive ways to influence Pod placement compared to nodeSelector.
Characteristics:
More expressive and flexible than nodeSelector.
Supports a broader range of operators (e.g., In, NotIn, Exists, DoesNotExist).
Can define both hard and soft constraints.
Types of Node Affinity:
requiredDuringSchedulingIgnoredDuringExecution: This type of node affinity is a hard requirement. The pod will only be scheduled on nodes that meet the specified affinity rules. If no such node is available, the pod will not be scheduled.
preferredDuringSchedulingIgnoredDuringExecution: This type of node affinity is a soft preference. The pod will prefer to be scheduled on nodes that meet the specified affinity rules, but it can still be scheduled on nodes that do not meet these rules if no preferred nodes are available.
Node Affinity Rules
Node affinity rules are defined using nodeAffinity within the pod specification. These rules use node labels to determine where the pod should be scheduled, and they are built from the following components:
nodeSelectorTerms: A list of node selector terms, each containing a list of match expressions.
matchExpressions: These are the actual conditions that need to be met. Each expression consists of three parts:
key: The label key that the rule applies to.
operator: The relationship between the key and values. Common operators include In, NotIn, Exists, and DoesNotExist (a short sketch of these operators follows after this list).
values: The list of values associated with the key (only used with the In and NotIn operators).
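As a small illustration of how these operators read in practice, here is a sketch of a nodeSelectorTerms block; the label keys and values are made up for this example:

nodeSelectorTerms:
- matchExpressions:
  # Only nodes whose disktype label is ssd or nvme
  - key: disktype
    operator: In
    values:
    - ssd
    - nvme
  # ...and whose zone label is not edge
  - key: zone
    operator: NotIn
    values:
    - edge
  # ...and that have a gpu label present, regardless of its value
  - key: gpu
    operator: Exists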
Example
Using requiredDuringSchedulingIgnoredDuringExecution
Create a file named affinity.yaml and paste the following content in it:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis1
spec:
  containers:
  - image: redis
    name: redis1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
In this YAML file, we add an affinity section that requires the label key disktype with the operator In; the values list may contain multiple entries, but for this example we use only ssd.
Now, when you apply this file, you will see that the pod is in a Pending state because no node has the disktype label. But as soon as you label a node, the pod is scheduled and runs on that node.
So, with requiredDuringSchedulingIgnoredDuringExecution, the Pod will only be scheduled on nodes that match the specified criteria.
Using preferredDuringSchedulingIgnoredDuringExecution
Create a file named affinity2.yaml and paste the following content in it:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis2
spec:
  containers:
  - image: redis
    name: redis2
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - hdd
In this YAML file, we prefer nodes with the label disktype=hdd. The scheduler will first look for a node with that label; if it does not find one, the pod will be scheduled on any of the available nodes. Apply this file:
kubectl apply -f affinity2.yaml
In this case, you will see that our redis2 pod is running on a worker node that has no disktype=hdd label.
So, with preferredDuringSchedulingIgnoredDuringExecution, the scheduler will try to place the Pod on nodes that match the criteria, but it is not mandatory.
In both types, the IgnoredDuringExecution part is the same. It means that if you remove the label from a node after a pod has been scheduled, the already-running pod is not affected; only newly created pods are evaluated against the affinity rules. The existing pod remains in the Running state.
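You can observe this behaviour yourself by removing the label and checking that the pod keeps running (the node name is a placeholder; redis1 is the pod from the earlier example):

kubectl label node <node-name> disktype-
kubectl get pod redis1 -o wide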
Important points to remember about Node Affinity
Nodes are labeled.
Affinity is a property on a pod specified in the pod specification/manifest file.
Pods that have an affinity specified will be scheduled on the nodes that are labeled with the same value.
A pod that does not have affinity specified might get scheduled on any nodes irrespective of whether the nodes are labeled.
In essence, node affinity is a property on a pod that attracts it to a labeled node with the same value. However, pods that do not have any affinity specified might get scheduled on any nodes irrespective of whether the nodes are labeled.
Combination of Taint, Toleration and Node Affinity
Suppose we have nodes with specific labels and taints, and pods with corresponding tolerations and node affinity. Our goal is for each pod with a toleration and an affinity to be scheduled only on the node with the matching taint and label.
Consider the first pod, which has the toleration color=green. The toleration only allows it onto the node tainted color=green; it does not force it there. So it could be scheduled on Node Green, but also on Node3, because Node3 has no taint and therefore accepts any pod. Similarly, the pod with the toleration color=blue could be scheduled on Node2 Blue, but also on Node3.
These outcomes do not satisfy our goal. To prevent this problem, we use Taints and Tolerations together with Node Affinity: the taint keeps pods without the matching toleration off the node, and the affinity keeps the pod away from nodes without the matching label. A combined sketch is given below.
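As a sketch of the combined approach, assume a node that has been tainted with color=green:NoSchedule and labeled color=green (the node name, pod name, and the color key are illustrative):

kubectl taint node <node-name> color=green:NoSchedule
kubectl label node <node-name> color=green

apiVersion: v1
kind: Pod
metadata:
  name: green-pod
spec:
  containers:
  - image: redis
    name: green-pod
  # Toleration: allows this pod onto the node tainted color=green
  tolerations:
  - key: "color"
    operator: "Equal"
    value: "green"
    effect: "NoSchedule"
  # Node affinity: restricts this pod to nodes labeled color=green
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: color
            operator: In
            values:
            - green

The taint keeps pods without the toleration off the green node, while the affinity keeps this pod off every node that is not labeled color=green.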
Conclusion
Often, Taints and Tolerations or Node Affinity on their own are enough to schedule pods on the nodes of your choice. But if your requirement is more complex, consider applying both concepts together.