Troubleshooting Worker Plane Failure - CKA

This blog focuses on troubleshooting Kubernetes worker plane (worker node) failures, specifically from a CKA exam perspective. It covers common scenarios and how to identify and resolve them.

Scenario: Worker Nodes Not Ready

This scenario involves a Kubernetes cluster where the master node is Ready, but the worker nodes are in a "NotReady" state. This indicates a problem with the worker nodes.

NOTE: In each image, the symbol M denotes the master node, W1 denotes Worker01, and W2 denotes Worker02.

  • Symptom: Worker nodes are not reporting as ready.

  • Possible causes:

    • Issues with the worker nodes themselves, such as hardware failures or software problems.

    • Network connectivity issues between the worker nodes and the master node.

    • Network add-ons, such as Calico, Flannel, or Weave, may not be installed or configured correctly.
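
Before digging into a specific node, it helps to confirm the symptom from the control plane. A minimal check (the node name worker01 is a placeholder for whatever your cluster reports):

     kubectl get nodes                  # affected workers show STATUS as NotReady
     kubectl describe node worker01     # the Conditions and Events sections hint at the reason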

Troubleshoot Worker Node - 1

Network add-ons are essential for pod-to-pod communication and overall cluster functionality. If they are not installed or configured correctly, it can lead to workload failures.

  • Check if the network add-on is installed: Use kubectl get pods -A to list all pods in the cluster. Look for pods related to the network add-on (e.g., Calico, Flannel, Weave).

  • Verify the network add-on is running: Check the status of the network add-on pods. They should be in a "Running" state.

The output of kubectl get pods -A can be overwhelming because of the number of pods, so let's filter pods by namespace.
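
For example, with a Calico installation managed by the Tigera operator, the namespace-level checks might look like this (the namespace names depend on how the CNI was installed):

     kubectl get ns                         # look for CNI namespaces such as tigera-operator and calico-system
     kubectl get pods -n tigera-operator    # the operator pod should be Running
     kubectl get pods -n calico-system      # calico-node and related pods should be Running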

  • First, list all namespaces with kubectl get ns

    From this image we can conclude that our cluster uses the Calico CNI, installed via the Tigera operator.

  • Then check the pod status in each namespace belonging to the CNI add-on

    As the image shows, everything is fine with the network plugin pods.

  • You can also check with kubectl get pods -n kube-system | grep calico, which lists all pods related to Calico.

  • Checking Configuration Files

    • /etc/cni/net.d: This directory contains the configuration files for network add-ons.

      • 10-calico.conflist: The CNI configuration for the Calico network add-on (the plugin chain kubelet loads).

      • calico-kubeconfig: The kubeconfig that the Calico CNI plugin uses to talk to the API server.

If you have another plugin such as Flannel or Weave Net, its files look similar to these.
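
To confirm this on a node, a quick listing is enough (the exact file names vary with the CNI plugin and its version):

     ls /etc/cni/net.d/
     # e.g. 10-calico.conflist  calico-kubeconfig
     cat /etc/cni/net.d/10-calico.conflist    # the CNI plugin chain kubelet loads on this node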

Since these files exist, there is no error in the network configuration, so let's SSH into the worker node and look for the problem there.

SSH into Worker Node

After establishing the SSH connection, verify that the worker node is accessible.

  • kubectl Configuration: Ensure that kubectl is configured correctly so you can interact with the Kubernetes cluster.

  • kubectl get pods: Use this command to check if any pods are running on the worker node.

  • kubelet: A node-level agent responsible for:

    • Reporting the status of the worker node to the Kubernetes API server.

    • Establishing communication between the worker node and the control plane node.

    • Executing tasks on the worker node.

  • Status Check: To check the status of kubelet, use the command service kubelet status.

  • Logs: To inspect the kubelet logs and status output:

    • Press Shift + G to jump to the last line of the log file.

    • Use the arrow keys to scroll to the right so you can read the full lines.

  • In the status output, Active shows as inactive (the service is not running).

  • The exit Code shows as exited.

From this we can conclude that the kubelet service is not running, which is why the node reports NotReady.

  • Restarting kubelet: To restart the kubelet service, use the command sudo service kubelet start.

  • Verifying kubelet Status: After restarting kubelet, run service kubelet status again to confirm that it is running.

  • Control Plane Node Verification: Return to the control plane node and use kubectl get nodes to verify that the worker node is now reporting a healthy (Ready) status.

The Workflow is like this:
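
Condensed into commands, the worker-node-1 fix is roughly the following (whether you use service or systemctl depends on the distribution):

     # on the control plane
     kubectl get nodes                  # the worker shows NotReady

     # on the affected worker node, over SSH
     sudo service kubelet status        # Active: inactive (dead), code=exited
     sudo service kubelet start
     sudo service kubelet status        # should now be active (running)

     # back on the control plane
     kubectl get nodes                  # the worker returns to Ready after a short delay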

Troubleshoot Worker Node - 2

SSH into the Worker Node-2

First, check the status of the kubelet service.
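
On a systemd-based node that check looks like this (output paraphrased):

     sudo systemctl status kubelet
     # Active: activating (auto-restart) ... – systemd keeps retrying, but the service never reaches "running"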

  • kubelet Service Status: The kubelet service is not active and is stuck in the "activating" state. This indicates that something is preventing the service from starting properly.

  • kubelet Logs: To investigate the issue further, we need to check the kubelet logs.

  • Checking kubelet Logs

    • Journalctl Command: The journalctl command is used to check the logs of services running on the system.

    • kubelet Log Command: To view the kubelet logs, use the command journalctl -u kubelet.

    • Navigating Logs: Use the shift + G key combination to jump to the last line of the log file.

    • Log Types: Logs are categorized by type:

      • I: Info messages

      • E: Error messages

  • Debugging the Error

    • Error Message: The error message indicates that the kubelet service failed to construct its dependencies. Specifically, it was unable to locate the client CA file at /etc/kubernetes/pki/.

    • Cause: The error suggests that the client CA file path specified in the kubelet configuration is incorrect.

  • Checking the kubelet Configuration File

    • kubelet Configuration File: The kubelet configuration file is located at /var/lib/kubelet/config.yaml.

      Note: If you are wondering how to find the path of this configuration file, you can get it from the output of the service status command itself.

    • File Path Verification: Check the client CA file path within the config.yaml file to ensure it points to the correct location.

  • Confirm the file name: Usually all .crt files are stored in the /etc/kubernetes/pki/ directory; to verify that the client CA file exists, check that directory.

    • kubelet.conf - The kubeconfig that the kubelet service uses to connect to the API server.

    • pki/ca.crt - This file contains the Certificate Authority (CA) certificate.

  • Change the file name in config.yaml so it points to the correct CA file, which is ca.crt.

  • Steps (condensed into a command sketch after this list):

    1. Check the kubelet configuration: Examine /var/lib/kubelet/config.yaml for any errors or misconfigurations.

    2. Verify the CA certificate: Ensure the pki/ca.crt file is present and valid.

    3. Restart the kubelet service: Use the command sudo service kubelet restart to restart the kubelet service.

    4. Check the status: After restarting, use sudo service kubelet status to verify that the service is running correctly.

    5. Check the worker node status: Run kubectl get nodes to confirm that the worker node is now healthy.
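
Put together, the worker-node-2 fix might look like the sketch below. The file paths are the kubeadm defaults; the exact wrong value you find in config.yaml will differ in your cluster:

     # 1. find the failing dependency in the kubelet logs
     sudo journalctl -u kubelet | tail -n 50      # look for E... lines mentioning the client CA file

     # 2. confirm the real CA file name
     ls /etc/kubernetes/pki/                      # ca.crt should be present

     # 3. point clientCAFile at the correct certificate
     sudo vi /var/lib/kubelet/config.yaml
     # authentication:
     #   x509:
     #     clientCAFile: /etc/kubernetes/pki/ca.crt

     # 4. restart kubelet and verify
     sudo systemctl daemon-reload
     sudo systemctl restart kubelet
     sudo systemctl status kubelet

     # 5. from the control plane, confirm the node is Ready again
     kubectl get nodes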

Key Takeaways

Finding the Root Cause

Exam-Oriented Approach: Focus on understanding the common issues related to kubelet configuration and how to troubleshoot them.

Key Concepts:

  • kubelet Configuration: The kubelet service is the node agent responsible for registering the node and running pods on it. Its configuration files (kubelet.conf and /var/lib/kubelet/config.yaml) define how it interacts with the cluster.

  • kubelet Service Details: The kubelet service status and unit details show which configuration files it uses, including the kubelet.conf kubeconfig and the config.yaml configuration file.
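
If you are unsure which files the kubelet service actually reads, the systemd unit itself tells you; on a kubeadm cluster the usual defaults are /etc/kubernetes/kubelet.conf and /var/lib/kubelet/config.yaml:

     systemctl status kubelet    # the Drop-In: line points to the kubeadm unit file
     systemctl cat kubelet       # shows the --kubeconfig and --config flags passed to kubelet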

Troubleshooting Steps:

  1. Identify the Error: Determine the specific error message or issue you are encountering.

  2. Locate the Relevant File: Use the kubelet service details to find the configuration file associated with the error.

  3. Examine the File: Inspect the file for any errors or misconfigurations.

  4. Fix the Issue: Correct the error or misconfiguration in the file.

  5. Restart the kubelet service: Restart the kubelet service to apply the changes.

  6. To restart kubelet, you can also use these commands:

     sudo systemctl daemon-reload
     sudo systemctl restart kubelet