This blog focuses on troubleshooting Kubernetes cluster workload failures, specifically from an exam perspective. It covers common scenarios and how to identify and resolve issues.
Scenario: Worker Nodes Not Ready
This scenario involves a Kubernetes cluster where the master node is `Ready`, but the worker nodes are in a `NotReady` state. This indicates a problem with the worker nodes.
NOTE: In each image, the symbol M marks the master node, W1 marks Worker01, and W2 marks Worker02.
Symptom: Worker nodes are not reporting as ready.
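To make the symptom concrete, here is a minimal sketch that picks out the nodes that are not ready. The helper name `not_ready_nodes` is hypothetical. It assumes the usual `kubectl get nodes --no-headers` column layout, with NAME first and STATUS second.

```shell
# Print the name of every node whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` style output on stdin.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also printed.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# On a live cluster (assumes kubectl is already configured):
#   kubectl get nodes --no-headers | not_ready_nodes
```

If the command prints nothing, every node reports `Ready`.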
Possible causes:
Issues with the worker nodes themselves, such as hardware failures or software problems.
Network connectivity issues between the worker nodes and the master node.
Network add-ons, such as Calico, Flannel, or Weave, may not be installed or configured correctly.
Troubleshoot Worker Node - 1
Network add-ons are essential for pod-to-pod communication and overall cluster functionality. If they are not installed or configured correctly, it can lead to workload failures.
Check if the network add-on is installed: Use `kubectl get pods -A` to list all pods in the cluster. Look for pods related to the network add-on (e.g., Calico, Flannel, Weave).
Verify the network add-on is running: Check the status of the network add-on pods. They should be in a "Running" state.
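Scanning a long pod list by eye is error-prone, so a small filter can help. The sketch below uses a hypothetical helper name, `unhealthy_pods`. It assumes the `kubectl get pods -A --no-headers` column order: NAMESPACE, NAME, READY, STATUS.

```shell
# Print "namespace/pod STATUS" for every pod that is neither Running nor Completed.
# Reads `kubectl get pods -A --no-headers` style output on stdin.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# On a live cluster (assumes kubectl is already configured):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

An empty result means every pod is either running or has completed.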
Listing every pod at once can be confusing because there are so many, so let’s filter pods by namespace.
First, list all namespaces with `kubectl get ns`.
From this image we can conclude that our cluster uses the Calico CNI with the Tigera operator.
Then check the pod status under each namespace that belongs to the CNI add-on.
As the image shows, all network plugin pods are healthy.
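The namespace check can be sketched as a small filter. The helper name `cni_namespaces` is hypothetical, and the name patterns it matches (calico, tigera, flannel, weave) are an assumption about how these add-ons typically name their namespaces. Adjust them to whatever `kubectl get ns` shows on your cluster.

```shell
# Print namespaces whose name suggests a CNI add-on.
# Reads `kubectl get ns --no-headers` style output on stdin.
# The patterns are assumptions, not an exhaustive list.
cni_namespaces() {
  awk '{ print $1 }' | grep -E 'calico|tigera|flannel|weave'
}

# On a live cluster (assumes kubectl is already configured):
#   for ns in $(kubectl get ns --no-headers | cni_namespaces); do
#     kubectl get pods -n "$ns"
#   done
```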
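You can also check with `kubectl get pods -n kube-system | grep calico`. This command lists all pods related to Calico in the `kube-system` namespace. On operator-based installs, the Calico pods usually live in the `calico-system` namespace instead.
Checking Configuration Files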
`/etc/cni/net.d`: This directory contains the configuration files for network add-ons.
`10-calico.conflist`: This file contains the CNI configuration for the Calico network add-on.
`calico-kubeconfig`: This file contains the kubeconfig that the Calico CNI plugin uses to reach the API server.
If you use another plugin such as Flannel or Weave Net, its files look similar.
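The file check above can be sketched as a tiny helper. The function name `has_cni_config` is hypothetical. It only tests that at least one `.conf` or `.conflist` file is present and does not validate the contents.

```shell
# Succeed if the directory holds at least one CNI config file.
# Defaults to /etc/cni/net.d when no argument is given.
has_cni_config() {
  dir="${1:-/etc/cni/net.d}"
  # find prints matching paths; grep -q . succeeds if anything was printed.
  find "$dir" -maxdepth 1 -type f \( -name '*.conf' -o -name '*.conflist' \) 2>/dev/null | grep -q .
}

# On a node:
#   has_cni_config && echo "CNI config present" || echo "no CNI config found"
```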
Since these files exist, the network configuration looks fine, so let’s SSH into the worker node and check for problems there.
SSH into Worker Node
After establishing the SSH connection, verify that the worker node is accessible.
kubectl Configuration: Ensure that the `kubectl` configuration is set up correctly. This allows you to interact with the Kubernetes cluster.
`kubectl get pods`: Use this command to check if any pods are running on the worker node.
`kubelet`: A node-level agent responsible for:
Reporting the status of the worker node to the Kubernetes API server.
Establishing communication between the worker node and the control plane node.
Executing tasks on the worker node.
Status Check: To check the status of `kubelet`, use the command `service kubelet status`.
Logs: To view the `kubelet` logs, use the following steps:
Press `Shift + G` to jump to the last line of the log file.
Use the arrow keys to navigate to the right side of the log file.
In the status output, `Active` shows as `inactive` and `Code` shows as `exited`.
From this we can conclude that the kubelet service is not running, which is why the worker node is not ready.
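Reading the state out of the status text can be sketched as follows. The helper name `kubelet_active_state` is hypothetical. It assumes the systemd-style status layout, where one line starts with `Active:` followed by the state.

```shell
# Print the word that follows "Active:" in service status output.
# Reads the status text on stdin.
kubelet_active_state() {
  awk '$1 == "Active:" { print $2; exit }'
}

# On a node:
#   service kubelet status | kubelet_active_state
# On systemd hosts, `systemctl is-active kubelet` prints the same state directly.
```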
Restarting `kubelet`: To start the `kubelet` service again, use the command `sudo service kubelet start`.
Verifying `kubelet` Status: After restarting `kubelet`, run `service kubelet status` again to confirm that it is running.
Control Plane Node Verification: Return to the control plane node and use `kubectl get nodes` to verify that the worker node is reporting a healthy status.
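A node can take a few seconds to report `Ready` after the kubelet starts, so polling is more reliable than a single check. This is a minimal sketch. `wait_until` and `node_is_ready` are hypothetical helper names, and the node name `worker01` in the usage line is a placeholder.

```shell
# Retry a command up to TRIES times, sleeping DELAY seconds between attempts.
# Usage: wait_until TRIES DELAY command [args...]
wait_until() {
  tries=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Succeed if the named node reports exactly "Ready" (assumes kubectl is configured).
node_is_ready() {
  kubectl get nodes --no-headers |
    awk -v n="$1" '$1 == n && $2 == "Ready" { ok = 1 } END { exit !ok }'
}

# On the control plane node:
#   wait_until 30 2 node_is_ready worker01 && echo "worker01 is Ready"
```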
The Workflow is like this:
Troubleshoot Worker Node - 2
SSH into Worker Node-2.
First, check the status of the kubelet service.
KUBELET Service Status: The KUBELET service is not active and is in the "activating" state. This indicates that something is preventing the service from starting properly.
KUBELET Logs: To investigate the issue further, we need to check the KUBELET logs.
Checking KUBELET Logs
Journalctl Command: The `journalctl` command is used to check the logs of services running on the system.
KUBELET Log Command: To view the KUBELET logs, use the command `journalctl -u kubelet`.
Navigating Logs: Use the `Shift + G` key combination to jump to the last line of the log file.
Log Types: Logs are categorized by type:
I: Info messages
E: Error messages
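Instead of paging through the whole log, the error lines can be filtered out. The sketch below uses a hypothetical helper name, `kubelet_errors`. It assumes the kubelet's usual log header, where the severity letter is followed by a four-digit month and day (for example `E0912`).

```shell
# Keep only kubelet log lines whose severity prefix is E (error).
# Reads `journalctl -u kubelet` style output on stdin.
# The kubelet's logger also emits W (warning) and F (fatal) lines; widen the
# character class to [WEF] if you want those as well.
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: E[0-9]{4} '
}

# On a node:
#   journalctl -u kubelet --no-pager -n 200 | kubelet_errors
```

The `--no-pager` and `-n` options print the last lines straight to the terminal, which avoids the `Shift + G` step.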
Debugging the Error
Error Message: The error message indicates that the KUBELET service failed to construct its dependencies. Specifically, it was unable to locate the client CA file at `/etc/kubernetes/pki/`.
Cause: The error suggests that the file path specified in the KUBELET configuration is incorrect.
Checking the KUBELET Configuration File
KUBELET Configuration File: The kubelet configuration file is located at `/var/lib/kubelet/config.yaml`.
Note: If you are wondering how to find the path of the configuration file, you can read it from the output of the service status command.
File Path Verification: Check the client CA file path within the `config.yaml` file to ensure it points to the correct location.
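The path verification can be sketched like this. The helper name `client_ca_path` is hypothetical. It assumes the setting appears in `config.yaml` as a `clientCAFile:` key followed by the path on the same line.

```shell
# Print the clientCAFile value from a kubelet config file.
# Surrounding double quotes, if any, are stripped.
client_ca_path() {
  awk '$1 == "clientCAFile:" { gsub(/"/, "", $2); print $2; exit }' "$1"
}

# On the worker node:
#   ca=$(client_ca_path /var/lib/kubelet/config.yaml)
#   if [ -f "$ca" ]; then echo "found: $ca"; else echo "missing: $ca"; fi
```

A "missing" result means the configured path does not point at a real file.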
Confirm file name: Certificate (`.crt`) files are usually stored in the `/etc/kubernetes/pki/` directory. To verify that the client CA file exists, check the files under `/etc/kubernetes/`.
`kubelet.conf` - This file contains the kubeconfig that the `kubelet` service uses to connect to the API server.
`pki/ca.crt` - This file contains the Certificate Authority (CA) certificate.
Change the file name in `config.yaml` to the correct CA file, which is `ca.crt`.
Steps:
Check the `kubelet` configuration: Examine the kubelet configuration file (`config.yaml`) for any errors or misconfigurations.
Verify the CA certificate: Ensure the `pki/ca.crt` file is present and valid.
Restart the `kubelet` service: Use the command `sudo service kubelet restart` to restart the `kubelet` service.
Check the status: After restarting, use `sudo service kubelet status` to verify that the service is running correctly.
Check the worker node status: Run `kubectl get nodes` to confirm that the worker node is now healthy.
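Editing `config.yaml` by hand in an editor is fine. If you prefer a scripted fix, the sketch below rewrites only the `clientCAFile` value. The helper name `set_client_ca` is hypothetical, and it assumes the new path contains no `|` or `&` characters, which `sed` would treat specially here. Back up the file first.

```shell
# Replace the clientCAFile value in FILE with NEW_PATH, keeping the indentation.
# Usage: set_client_ca FILE NEW_PATH
set_client_ca() {
  file=$1; new_path=$2
  tmp=$(mktemp) || return 1
  if sed "s|^\([[:space:]]*clientCAFile:\).*|\1 $new_path|" "$file" > "$tmp"; then
    # Write back with cat so the original file keeps its owner and mode.
    cat "$tmp" > "$file"
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# On the worker node (as root):
#   cp /var/lib/kubelet/config.yaml /var/lib/kubelet/config.yaml.bak
#   set_client_ca /var/lib/kubelet/config.yaml /etc/kubernetes/pki/ca.crt
```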
Key Takeaways
Finding the Root Cause
Exam-Oriented Approach: Focus on understanding the common issues related to `kubelet` configuration and how to troubleshoot them.
Key Concepts:
`kubelet` Configuration: The `kubelet` service is the node agent that runs and manages workloads on each Kubernetes node. Its kubeconfig file (`kubelet.conf`) defines how it connects to the cluster, and its configuration file (`config.yaml`) holds its runtime settings.
`kubelet` Service Details: The `kubelet` service details provide information about the files it uses, including the `kubelet.conf` kubeconfig file and the `config.yaml` configuration file.
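To pull those file paths out of the service details, a small filter is enough. The helper name `kubelet_flag_paths` is hypothetical. It assumes the kubelet is started with `--config=` and `--kubeconfig=` flags, as on kubeadm-style installs. Other setups may pass these differently.

```shell
# Print the --config=... and --kubeconfig=... flags found in the input.
# Reads service details (unit file or status text) on stdin.
# `-e` keeps grep from parsing the leading "--" of the pattern as an option.
kubelet_flag_paths() {
  grep -oE -e '--(config|kubeconfig)=[^ "]+'
}

# On a node:
#   systemctl cat kubelet | kubelet_flag_paths
```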
Troubleshooting Steps:
Identify the Error: Determine the specific error message or issue you are encountering.
Locate the Relevant File: Use the `kubelet` service details to find the configuration file associated with the error.
Examine the File: Inspect the file for any errors or misconfigurations.
Fix the Issue: Correct the error or misconfiguration in the file.
Restart the `kubelet` Service: Restart the `kubelet` service to apply the changes.
To restart `kubelet`, you can also use these commands: `sudo systemctl daemon-reload` followed by `sudo systemctl restart kubelet`.
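The restart sequence can be wrapped in a function with a dry-run switch, so you can preview the commands before running them. This is only a sketch. `run`, `restart_kubelet`, and the `DRY_RUN` variable are hypothetical names, and the final `systemctl is-active` check is an addition that assumes a systemd host.

```shell
# Run a command, or just print it when DRY_RUN is set to a non-empty value.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Reload unit files, restart the kubelet, then report whether it is active.
restart_kubelet() {
  run sudo systemctl daemon-reload &&
  run sudo systemctl restart kubelet &&
  run systemctl is-active kubelet
}

# Preview only:   DRY_RUN=1 restart_kubelet
# Real restart:   restart_kubelet
```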