Understanding Taints, Tolerations, and Node Affinity in K8s☸️ - CKA

In Kubernetes, efficient resource management and optimal scheduling of Pods are crucial for maintaining a well-functioning cluster. However, simply relying on default scheduling policies is often insufficient for more complex workloads and environments. This is where Taints, Tolerations, and Node Affinity come into play. Let's explore these concepts in detail to understand how they can be effectively leveraged in a Kubernetes cluster.

Taints and Tolerations

Taints are applied to nodes and allow a node to repel a set of pods. Think of taints as a way to mark a node with specific characteristics that make it unsuitable for certain pods.

Tolerations are applied to pods and allow the pods to tolerate (i.e., be scheduled on) nodes with specific taints. Tolerations enable exceptions to the rules set by taints.

How Taints and Tolerations Work

Taints and Tolerations work together to ensure that pods are only scheduled on appropriate nodes. Here’s how they interact:

  1. Applying Taints: A taint is added to a node to mark it with a key-value pair and an effect. This indicates that only pods with matching tolerations should be scheduled on that node.

  2. Applying Tolerations: Tolerations are added to pods, allowing them to be scheduled on nodes with matching taints.

  3. Scheduling Decision: The Kubernetes scheduler checks the taints on each node and the tolerations on each pod. If a pod tolerates a node’s taint, it can be scheduled on that node.

Taints

A taint is a key-value pair with an effect that is applied to a node. The key-value pair can represent any condition or attribute, and the effect determines what happens to Pods that do not tolerate the taint. There are three possible effects:

  • NoSchedule: The Pod will not be scheduled on the node unless it tolerates the taint.

  • PreferNoSchedule: The system will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not a hard requirement.

  • NoExecute: The Pod will be evicted if it is already running on the node and does not tolerate the taint.

Example of Applying a Taint

To apply a taint to a node, you use the kubectl taint command. For example, to taint a node named node1 with the key key1, value value1, and effect NoSchedule:

kubectl taint node node1 key1=value1:NoSchedule
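
To confirm the taint was applied, you can describe the node and check its Taints field (this assumes the node is named node1, as above):

kubectl describe node node1 | grep Taints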

Suppose we have three nodes: one control plane and two worker nodes. We taint both worker nodes with the key-value pair gpu=true and the NoSchedule effect. When we then create a pod, it stays in a Pending state.

When we describe the pod using the command below:

kubectl describe pod/<pod-name>

The pod is not scheduled because the control-plane node carries the taint node-role.kubernetes.io/control-plane (with effect NoSchedule), so only control-plane components are scheduled there, while both worker nodes carry the gpu=true taint we applied and will only accept pods that tolerate it.
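
One quick way to see the taints on every node at once is a custom-columns query (the column names here are only illustrative):

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'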

Tolerations

A toleration is applied to a Pod to indicate that it can tolerate specific taints. This is done by adding a toleration section to the Pod's specification.

Example of Applying a Toleration

Let's apply toleration to our pod. Create a file pod1.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis
spec:
  containers:
    - image: redis
      name: redis
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

In this YAML file, we add a toleration to the redis pod for the gpu=true taint with effect NoSchedule. When we apply this file (kubectl apply -f pod1.yaml), the pod can be scheduled either on an untainted node or on a node tainted with gpu=true, and it moves to the Running state.
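
To verify where the pod landed, apply the manifest and check the NODE column (the pod name redis comes from the manifest above):

kubectl apply -f pod1.yaml
kubectl get pod redis -o wide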

Important Points to Remember about Taints and Tolerations
  • Taints are set on Nodes.

  • Tolerations are set on Pods.

  • Tainted nodes will only accept pods that have a matching toleration set.

  • A pod (with or without a particular toleration value) may be scheduled on an untainted node.

In essence, a taint on a node repels any pod whose tolerations do not match it. Nodes without any taints, however, will accept any pod, whether or not it has tolerations set.
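
If you only care about the taint key and not its value, a toleration can use the Exists operator instead of Equal. The fragment below is a minimal sketch of the tolerations section of a pod spec, reusing the gpu key from the earlier example:

tolerations:
  - key: "gpu"
    operator: "Exists"   # matches any taint with key "gpu" and effect NoSchedule, regardless of value
    effect: "NoSchedule"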

Differences Between NoSchedule, PreferNoSchedule, and NoExecute

  • NoSchedule: Ensures that pods without the toleration will never be scheduled on the tainted node.

  • PreferNoSchedule: Indicates a preference to avoid scheduling pods without the toleration on the tainted node, but it is not enforced.

  • NoExecute: Applies to both new and already running pods. Pods without the toleration will be evicted if they are running and will not be scheduled if they are new.
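
For NoExecute taints, a toleration can also set tolerationSeconds, which lets a pod that is already running stay on the tainted node for a limited time before being evicted. A minimal sketch (the key, value, and duration are only illustrative):

tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 300   # evicted 300 seconds after the NoExecute taint is added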

To remove a taint from a node, append a - to the taint specification in the previous command:

kubectl taint node node1 key1=value1:NoSchedule-

NodeSelector

nodeSelector is the simplest form of node selection constraint in Kubernetes. It is used to specify a key-value pair that must match the labels on a node for a Pod to be scheduled on that node.

Characteristics:

  • Simple and straightforward to use.

  • Only supports equality-based requirements.

  • It is a hard constraint, meaning if no node matches the specified labels, the Pod will remain unscheduled.

Example

Create a YAML file named pod2.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis-new
spec:
  containers:
    - image: redis
      name: redis-new
  nodeSelector:
    disktype: "ssd"

Apply this YAML file and the pod will stay in a Pending state, because no node currently has the disktype=ssd label.

When you describe the pod using kubectl describe pod <pod-name>, the events will show why it could not be scheduled.

Now, label one of the worker nodes with disktype=ssd:

kubectl label node <node-name> disktype=ssd

You will see that the pod will start running on the node with the disktype=ssd label.
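
You can confirm the placement with -o wide, which shows the node the pod is running on:

kubectl get pod redis-new -o wide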

To remove the label from the node, append a - to the label key:

kubectl label node <node-name> disktype-

Node Affinity

Node Affinity is a feature in Kubernetes that allows you to constrain which nodes your Pods are eligible to be scheduled on based on node labels. It provides more flexible and expressive ways to influence Pod placement compared to nodeSelector.

Characteristics:
  • More expressive and flexible than nodeSelector.

  • Supports a broader range of operators (e.g., In, NotIn, Exists, DoesNotExist).

  • Can define both hard and soft constraints.

Types of Node Affinity:

  1. requiredDuringSchedulingIgnoredDuringExecution: This type of node affinity is a hard requirement. The pod will only be scheduled on nodes that meet the specified affinity rules. If no such node is available, the pod will not be scheduled.

  2. preferredDuringSchedulingIgnoredDuringExecution: This type of node affinity is a soft preference. The pod will prefer to be scheduled on nodes that meet the specified affinity rules, but it can still be scheduled on nodes that do not meet these rules if no preferred nodes are available.

Node Affinity Rules

Node affinity rules are defined using nodeAffinity within the pod specification. These rules use node labels to determine where the pod can be scheduled and are built from two main components:

  1. nodeSelectorTerms: A list of node selector terms, each containing a list of match expressions.

  2. matchExpressions: These are the actual conditions that need to be met. Each expression consists of three parts:

    • key: The label key that the rule applies to.

    • operator: The relationship between the key and values. Common operators include In, NotIn, Exists, and DoesNotExist.

    • values: The list of values associated with the key (only used with In and NotIn operators).
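
As an illustration of the other operators, a matchExpressions list can mix several conditions. The fragment below is only a sketch that would sit under nodeSelectorTerms as in the full examples that follow; the zone key is a hypothetical label used purely for illustration:

- matchExpressions:
    - key: disktype
      operator: Exists      # the node must have a "disktype" label, with any value
    - key: zone
      operator: NotIn       # the node's "zone" label must not be "zone-a"
      values:
        - zone-a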

Example

  1. Using requiredDuringSchedulingIgnoredDuringExecution

    Create a file named affinity.yaml and paste the following content in it:

      apiVersion: v1
      kind: Pod
      metadata:
        labels:
          run: redis
        name: redis1
      spec:
        containers:
          - image: redis
            name: redis1
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: disktype
                      operator: In
                      values:
                        - ssd
    

    In the above YAML file, we add an affinity section with the key disktype and the operator In. The values list can contain multiple entries, but for this example we use only ssd.

    Now, when you apply this file, you will see that your pod is in a pending state because no node has the disktype label.

    But once you label one of the nodes, the pod is scheduled onto that node and starts running.
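
    As before, you can label a node and then check the placement with -o wide (replace <node-name> with one of your worker nodes):

      kubectl label node <node-name> disktype=ssd
      kubectl get pod redis1 -o wide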

    So, in requiredDuringSchedulingIgnoredDuringExecution, the Pod will only be scheduled on nodes that match the specified criteria.

  2. Using preferredDuringSchedulingIgnoredDuringExecution

    Create a file affinity2.yaml and paste the following content in it:

      apiVersion: v1
      kind: Pod
      metadata:
        labels:
          run: redis
        name: redis2
      spec:
        containers:
          - image: redis
            name: redis2
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 1
                preference:
                  matchExpressions:
                    - key: disktype
                      operator: In
                      values:
                        - hdd
    

    In the above YAML file, we use the key disktype with the value hdd. In this case, the scheduler first prefers a node that carries this label; if no such node exists, the pod is scheduled on any available node.

    Apply this file:

      kubectl apply -f affinity2.yaml
    

    In this case, you will see that the redis2 pod is running on a worker node that has no disktype=hdd label.
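
    You can confirm that no node carries the hdd label and still see where the pod is running:

      kubectl get nodes --show-labels
      kubectl get pod redis2 -o wide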

    So, in preferredDuringSchedulingIgnoredDuringExecution, the scheduler will try to place the Pod on nodes that match the criteria, but it is not mandatory.

Both types share the IgnoredDuringExecution suffix. This means that if you remove the label from a node after scheduling, pods that are already running are not affected and stay in the Running state; only the scheduling of new pods is influenced.

Important points to remember about Node Affinity

  • Nodes are labeled.

  • Affinity is a property on a pod specified in the pod specification/manifest file.

  • Pods that have an affinity specified will be scheduled on the nodes that are labeled with the same value.

  • A pod that does not have affinity specified might get scheduled on any nodes irrespective of whether the nodes are labeled.

In essence, node affinity is a property on a pod that attracts it to a labeled node with the same value. However, pods that do not have any affinity specified might get scheduled on any nodes irrespective of whether the nodes are labeled.

Combination of Taint, Toleration and Node Affinity

Suppose we have several nodes, each with its own labels and taints, and several pods, each with its own tolerations and affinity. Our aim is for each pod to be scheduled only on the node whose taint and label it matches.

  • Consider the first pod, which tolerates color=green. The toleration lets it pass the matching taint, so it could be scheduled on Node Green, but it could also land on Node3, because Node3 has no taint and therefore accepts any pod.

  • Similarly, the pod that tolerates color=blue could be scheduled on Node2 Blue or on Node3.

Tolerations alone therefore do not satisfy our aim. To solve this, we combine taints and tolerations with node affinity: the taint keeps other pods off the node, and the affinity keeps the pod off other nodes, as sketched below.
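
A sketch of a pod that uses both mechanisms together is shown below. The pod name green-pod, the color=green taint, and the matching color=green node label are assumptions taken from the scenario above; the toleration lets the pod onto the tainted green node, while the required node affinity keeps it away from untainted nodes such as Node3:

apiVersion: v1
kind: Pod
metadata:
  name: green-pod
spec:
  containers:
    - image: redis
      name: redis
  tolerations:
    - key: "color"
      operator: "Equal"
      value: "green"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: color
                operator: In
                values:
                  - green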

Conclusion

Often, either Taints and Tolerations or Node Affinity alone is enough to place pods on the nodes of your choice. For more complex requirements, consider combining both.