This blog walks through common control plane failure scenarios relevant to the Certified Kubernetes Administrator (CKA) exam, showing how to troubleshoot and fix control plane issues in a Kubernetes cluster.
Scenario 1: Connection Refused to the Kubernetes API Server
Scenario: A Kubernetes cluster (one control plane node and two worker nodes) is in a broken state.
Troubleshooting Steps:
Verify Node Status: Run `kubectl get nodes` to check the status of the nodes in the cluster.
Identify the Issue: If the command returns an error saying the connection to the Kubernetes API server was refused, it suggests a problem with the API server itself.
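A minimal sketch of the check and the kind of error you might see (the server address will be your cluster's):

```bash
kubectl get nodes
# Typical output when the API server is down (the address is illustrative):
# The connection to the server 172.30.1.2:6443 was refused - did you specify the right host or port?
```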
Understand the Kubernetes Architecture: Recall the Kubernetes architectural diagram, which shows the API server as the first point of contact for all client interactions.
Check the API Server Process: The API server runs as a static pod, i.e. as a container on the control plane node, so you would typically use `docker ps` to check whether it is running. However, in Kubernetes 1.24 and later, containerd is the default runtime, replacing Docker.
Use the `crictl` Command: Use `crictl` (the CLI client for containerd) to check running containers. `crictl ps` lists the running containers; in this cluster, no kube-apiserver container shows up.
Inspect Container Logs: If the API server container is not running, use `crictl ps -a` to list exited containers and inspect their logs for clues about the failure. Here we can see that the kube-apiserver container exited a minute ago.
Next, we have to check the kube-apiserver manifest to find any issue in it.
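The checks above look roughly like this on the control plane node (container IDs and timestamps are illustrative):

```bash
crictl ps | grep kube-apiserver          # nothing listed: the container is not running
crictl ps -a | grep kube-apiserver       # shows an Exited kube-apiserver container
# 4f2a1b3c4d5e   ...   Exited   kube-apiserver   About a minute ago   ...
crictl logs 4f2a1b3c4d5e                 # inspect the exited container's logs
```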
Location of the Manifest: The API server manifest lives in the `/etc/kubernetes/manifests` directory, which is the default directory for static pods.
Accessing the Manifest: Use `sudo vi kube-apiserver.yaml` to open the manifest file. The file is owned by root, so you need `sudo` to edit it.
Inspecting the Manifest: Review the manifest for any errors or configuration issues that might be causing the API server to fail.
Checking Logs: Use `crictl logs <container_id>` to view the logs of the exited API server container; they might provide insights into the cause of the failure. In this case, the container status shows "runtime failed" and the container ID is not found.
There is also a default log directory on the control plane node under `/var/log`, which you can check as well.
Check Logs: Examine the default log directory (`/var/log/containers`) on the control plane node.
Look for API Server Logs: Verify whether logs for the kube-apiserver exist in the containers directory. If the logs are missing, it indicates the container exited and its logs were deleted. You can also use `ls | grep apiserver` to filter for the apiserver logs.
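For example (these are the default kubelet log paths on a kubeadm node):

```bash
cd /var/log/containers
ls | grep apiserver
# If nothing is listed, the container exited and its log files are gone
```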
Inspect the Manifest: Review the `kube-apiserver.yaml` manifest file in the `/etc/kubernetes/manifests` directory. Ensure the command and its arguments are correct, that the advertise address (`--advertise-address`) matches the IP of the node where the kube-apiserver runs, and that the secure port (`--secure-port`) is set to `6443`, the default secure port for the kube-apiserver.
On inspecting the file, we find that the command is misspelled with an extra `r`: `- kube-apiserverr`. Change `- kube-apiserverr` to `- kube-apiserver` and verify the other settings as well. Now everything looks fine, so save the file and check whether the container is running.
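A quick way to eyeball the relevant fields without opening an editor (output abridged; the address shown is illustrative):

```bash
sudo grep -nE 'kube-apiserver|advertise-address|secure-port' \
    /etc/kubernetes/manifests/kube-apiserver.yaml
# After the fix, the relevant lines should look like:
#    - kube-apiserver
#    - --advertise-address=172.30.1.2
#    - --secure-port=6443
```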
Run `crictl ps` to check the container status. Now the kube-apiserver is running fine, so check that the nodes are healthy and Ready with `kubectl get nodes`.
Scenario 2: kube-apiserver Running but kubectl Throws Errors
Issue: The kube-apiserver is running, but kubectl commands still fail.
Troubleshooting:
Verify the Kubeconfig: Ensure the correct kubeconfig file is being used. The default kubeconfig file is `$HOME/.kube/config`.
Check the Environment Variable: Confirm that the `KUBECONFIG` environment variable, if set, points to the right file.
Network Issues: Investigate potential network connectivity problems between your machine and the API server.
Kubeconfig File: If you are not the Kubernetes administrator, request a kubeconfig file from the administrator.
Check whether the kube-apiserver is running with `crictl ps`. We can see that the kube-apiserver is up and running.
Admin Kubeconfig: Examine the `admin.conf` file in the `/etc/kubernetes` directory and confirm that it exists.
Permissions: Ensure the `admin.conf` file has appropriate permissions so that your user can read it.
You can then copy this file to the default kubeconfig location (or point the `KUBECONFIG` environment variable at it) and check again with `kubectl get nodes`, as shown below.
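The standard kubeadm approach looks roughly like this (assuming `admin.conf` exists on the control plane node):

```bash
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# or, without copying, point kubectl at the admin kubeconfig directly:
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
```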
Ensure the `kubectl` command is using the correct configuration file. This file points to the Kubernetes API server, which is essential for interacting with the cluster.
Key Points:
The `kubectl` command relies on a configuration file (kubeconfig) to connect to the Kubernetes API server. Verify that the server address and file paths specified in the configuration are correct; if the configuration file is wrong, `kubectl` commands will fail to connect to the cluster.
Scenario 3: Kubernetes Pod Scheduling Troubleshooting
Issue: A newly created pod stays in the Pending state and never gets scheduled.
Key Points:
The Kubernetes API server is the central control point for the cluster; it handles requests from `kubectl` and other tools. Ensure the API server is running and accessible.
Let's create a pod with `kubectl run nginx --image=nginx` and check whether it gets scheduled.
The pod is stuck in the Pending state, which indicates a scheduling issue; a rough reproduction is shown below.
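```bash
kubectl run nginx --image=nginx
kubectl get pods
# NAME    READY   STATUS    RESTARTS   AGE
# nginx   0/1     Pending   0          2m
```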
Troubleshooting:
Investigating Pod Status: Use `kubectl describe` on the pod to gather information about its status, its events, and its node assignment. The output shows the pod has not been assigned to a node, which suggests a scheduling problem.
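A sketch of what the describe output can look like when the scheduler itself is down (exact fields and events vary):

```bash
kubectl describe pod nginx
# Node:    <none>          <- never assigned to a node
# Status:  Pending
# Events:  <none>          <- no "Scheduled" event was ever recorded
```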
Identifying the Scheduling Component: The Kubernetes scheduler is responsible for assigning pods to nodes.
The scheduler evaluates pod requirements and node resources to determine the best fit.
If the scheduler is not functioning correctly, pods may fail to be scheduled.
Verifying Scheduler Health: Check the status of the scheduler pod with `kubectl get pods -n kube-system`. If the scheduler pod is not healthy, it may be the cause of the scheduling issues.
Examine the scheduler pod's logs with `kubectl logs kube-scheduler-master -n kube-system` to identify potential problems.
Image Pull Errors: From the logs we find that the scheduler pod is failing to pull the image it needs.
Examine the scheduler pod's events with `kubectl describe pod kube-scheduler-master -n kube-system` to identify the specific error; the error message may indicate an incorrect image name or tag.
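Putting those checks together (the pod name follows the `kube-scheduler-<node-name>` convention; the error text is illustrative):

```bash
kubectl get pods -n kube-system | grep scheduler
# kube-scheduler-master   0/1   ImagePullBackOff   0   5m
kubectl describe pod kube-scheduler-master -n kube-system | grep -A10 Events
# Warning  Failed ... Failed to pull image "registry.k8s.io/kube-scheduler:<wrong-tag>": not found
```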
Updating the Scheduler Image: Correct the image name or tag in the scheduler's manifest file. You can check the correct tag in the kube-apiserver manifest, since the control plane components normally run the same version.
Update the image tag in the `kube-scheduler.yaml` manifest with `sudo vi kube-scheduler.yaml`.
After updating the scheduler's image, it may take a little while for the static pod to be re-created. Check with `kubectl get pods -n kube-system`.
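One way to line the versions up (the tag shown is illustrative):

```bash
# See which image tag the API server runs; the scheduler should normally use the same version
sudo grep 'image:' /etc/kubernetes/manifests/kube-apiserver.yaml
#     image: registry.k8s.io/kube-apiserver:v1.29.0
sudo vi /etc/kubernetes/manifests/kube-scheduler.yaml   # fix the scheduler image tag to match
kubectl get pods -n kube-system -w                      # watch until kube-scheduler is Running
```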
Use `kubectl get pods` to monitor the nginx pod's status. The pod should eventually transition from the Pending state to the Running state.
Troubleshooting Pod Scheduling
Understand the common causes of pod scheduling issues.
Key Points:
Insufficient Resources: The node may not have enough resources (CPU, memory) to accommodate the pod's requirements.
Node Affinity/Anti-affinity: Pod affinity and anti-affinity rules may prevent the pod from being scheduled on certain nodes.
Pod Disruption Budgets: Pod disruption budgets limit voluntary evictions, which can delay pods being moved or rescheduled during maintenance.
Network Connectivity: Network connectivity problems between the control plane and the nodes can also keep pods from being scheduled and started.
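A couple of quick checks that usually reveal which of these is in play (the node name is illustrative):

```bash
kubectl describe pod nginx | grep -A10 Events                   # scheduling events state the reason (e.g. Insufficient cpu)
kubectl describe node node01 | grep -A8 'Allocated resources'   # how much CPU/memory is already requested on the node
```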
Scenario 4: Issue with Controller Manager
Issue: Suppose we have created a deployment whose ReplicaSet maintains 2 nginx pods.
Now let's delete a pod.
If a pod is deleted, the ReplicaSet should automatically create a new pod to maintain the desired number of replicas.
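For example (the pod-name suffix is whatever your ReplicaSet generated):

```bash
kubectl create deployment nginx --image=nginx --replicas=2
kubectl get pods                              # two nginx pods Running
kubectl delete pod nginx-7854ff8877-abcde     # delete one of them
kubectl get pods                              # a replacement normally appears within seconds
```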
Controller Manager and Pod Creation
Controller Manager: A Kubernetes component responsible for ensuring that the desired state of the cluster matches the actual state.
Controller Manager's Role: The Controller Manager monitors the state of Pods and ensures that the desired number of replicas are running.
Troubleshooting:
Diagnosing Controller Manager Issues: The deleted pod is not being replaced, which points to a problem with the Controller Manager itself.
Checking Controller Manager Status: `kubectl get pods -n kube-system` shows that the Controller Manager pod is in a CrashLoopBackOff state.
Investigating Logs: `kubectl logs -n kube-system <controller-manager-pod-name>` shows no logs, indicating a potential issue with the Controller Manager's configuration.
Describing the Pod: `kubectl describe pod <controller-manager-pod-name> -n kube-system` reveals that the container process cannot be started because the executable file is not found.
Inspect the Controller Manager Manifest: Go to `/etc/kubernetes/manifests`, open the `kube-controller-manager.yaml` file in an editor, and check for errors. In this file the `kube-controller-manager` command is misspelled.
Correcting the Command: Fix the command in the Controller Manager's manifest file.
Restarting the Controller Manager: After correcting the command, the kubelet re-creates the static pod in the `kube-system` namespace. Check with `kubectl get pods -n kube-system`, then check the nginx pod status with `kubectl get pods`. A consolidated sketch of these steps is shown below.
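Putting the whole diagnosis together (pod names follow the `kube-controller-manager-<node-name>` convention; the error text is illustrative):

```bash
kubectl get pods -n kube-system | grep controller-manager
# kube-controller-manager-master   0/1   CrashLoopBackOff   5   3m
kubectl describe pod kube-controller-manager-master -n kube-system | tail -n 5
# ... exec: "kube-controller-managerr": executable file not found in $PATH
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml   # fix the misspelled command
kubectl get pods -n kube-system     # controller manager back to Running
kubectl get pods                    # the replacement nginx pod has now been created
```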
Scenario 5: Issue with Scaling the Deployment
Issue: The nginx deployment was scaled to four replicas, but the additional pods are not being created.
The Controller Manager is responsible for scaling the deployment by creating or deleting Pods to match the desired number of replicas. It ensures that the desired state of the cluster matches with the actual state.
Scaling Issues: Here the Controller Manager is unable to scale the deployment, which indicates a potential issue with the Controller Manager.
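Reproducing the symptom (replica counts are from this scenario; the age is illustrative):

```bash
kubectl scale deployment nginx --replicas=4
kubectl get deployment nginx
# NAME    READY   UP-TO-DATE   AVAILABLE   AGE
# nginx   2/4     2            2           10m
kubectl get pods    # still only the original two pods
```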
Troubleshooting:
Investigating Logs: `kubectl logs -n kube-system <controller-manager-pod-name>` shows that the command failed, indicating a potential issue with the Controller Manager's configuration. The logs point to an error with the `ca.crt` file, which is mounted into the container through a volume mount.
Understanding the Role of Volumes: Volumes are used to persist data and configuration files for containers. They allow data sharing between containers and provide a mechanism for accessing files from the host system.
Volume Mounting: Volumes are mounted into containers using the `volumeMounts` section of the pod definition, which sets the mount path inside the container; the corresponding `volumes` entry defines where the data comes from, for example a `hostPath` on the node.
Check the Controller Manager Manifest: Go to `/etc/kubernetes/manifests` and check the `kube-controller-manager.yaml` file, paying attention to both the `volumeMounts` and the `volumes` sections.
Verifying Volume Mounts: To verify that a volume is mounted correctly, check the `volumeMounts` section in the pod definition and make sure each mount's named volume has a matching `volumes` entry with the correct host path.
Here the host path of the volume named `k8s-creds` does not match: the error "no such file or directory" was caused by a typo in the host path. The correct path is `/etc/kubernetes/pki`, but it was mistakenly written as `/etc/kubernetes/pk`.
Resolving the Issue: After correcting the typo in the host path, the static pod is re-created and the error is resolved.
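A quick way to see the mismatch in the manifest (the volume name `k8s-creds` comes from this scenario; surrounding fields will differ):

```bash
sudo grep -B1 -A2 'k8s-creds' /etc/kubernetes/manifests/kube-controller-manager.yaml
#   - mountPath: /etc/kubernetes/pki      # volumeMounts: path inside the container
#     name: k8s-creds
#   - hostPath:
#       path: /etc/kubernetes/pk          # volumes: typo - should be /etc/kubernetes/pki
#     name: k8s-creds
```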
Monitoring Logs: Use the `kubectl logs` command to monitor the container logs and identify any remaining errors or warnings, then check the pods again with `kubectl get pods`.
Kubernetes Cluster Information and Debugging
Cluster Information:
The `kubectl cluster-info` command provides information about the Kubernetes control plane, including the API server address, the DNS service address, and other relevant details.
Detailed Cluster Information:
The `kubectl cluster-info dump` command provides a much more detailed dump of cluster state, including logs and configuration details.
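For example (addresses are illustrative; the dump is large, so redirecting it to a file is usually sensible):

```bash
kubectl cluster-info
# Kubernetes control plane is running at https://172.30.1.2:6443
# CoreDNS is running at https://172.30.1.2:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
kubectl cluster-info dump > cluster-dump.json
```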
Also refer to the official Kubernetes documentation for more details.