Troubleshooting Kubernetes Cluster Issues

A comprehensive runbook to check and alert on common issues with Kubernetes clusters

  1. 1

    Check whether the cluster is in a ready state. This task returns the list of nodes that are not in a ready state. If all nodes are ready, an empty list is returned and none of the sub-task checks are performed.

    ```python
    cmd = r'kubectl get nodes -o wide | grep -v "master\|VERSION"'
    op = _exe(master_ip, cmd)

    context.skip_sub_tasks = True
    problem_nodes = []
    for line in op.split('\n'):
        if not line:
            continue
        fields = line.split()
        nodename = fields[0]  # NAME column
        status = fields[1]    # STATUS column
        nodeip = fields[5]    # INTERNAL-IP column of the `-o wide` output
        if status.lower() == "notready":
            context.log("ERROR", f"Node {nodename} is not ready")
            problem_nodes.append({"nodename": nodename, "nodeip": nodeip})
            context.skip_sub_tasks = False

    if not context.skip_sub_tasks:
        print(problem_nodes)
    ```
    1. 1.1

      For nodes in a not-ready state, determines which of them are down by looking up their node IPs among the running EC2 instances.

      ```python
      cmd = "aws ec2 describe-instances"
      cmd += " --filters Name=instance-state-name,Values=running"
      cmd += " --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,PublicIpAddress,PrivateIpAddress,Tags[?Key==`Name`].Value | [0]]'"
      cmd += " --output text"
      op = _exe(None, cmd)

      context.skip_sub_tasks = True
      down_nodes = []
      new_problem_nodes = []
      for node in problem_nodes:
          if node['nodeip'] not in op:
              # The node IP is not among the running instances: the instance is down.
              context.log("WARNING", f"Node IP {node['nodeip']} not found among running instances")
              context.skip_sub_tasks = False
              down_nodes.append(node)
          else:
              for line in op.split('\n'):
                  if node['nodeip'] in line:
                      # The first column of the text output is the instance ID.
                      node['instance_id'] = line.split()[0]
                      new_problem_nodes.append(node)
                      break

      print(down_nodes)
      print(new_problem_nodes)
      problem_nodes = new_problem_nodes
      if len(problem_nodes) > 0:
          context.skip_sub_tasks = False
      ```
      1. 1.1.1

        Runs checks to see whether node readiness issues are due to the kubelet. Common kubelet issues include unresponsive kubelets, dead (inactive) kubelets, and kubelets that started but have since exited.
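The sub-tasks below distinguish these cases by the `Active:` line of `systemctl status kubelet` output. That classification can be sketched as follows (the state strings are the standard systemd ones):

```python
def kubelet_state(status_output):
    """Classify kubelet health from `systemctl status kubelet` output."""
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("Active:"):
            if "active (running)" in line:
                return "running"
            if "active (exited)" in line:
                return "exited"
            if "inactive (dead)" in line:
                return "dead"
            return "unknown"
    return "unknown"
```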

        1. 1.1.1.1

          Returns nodes whose kubelet has stopped posting status updates.

          ```python
          context.skip_sub_tasks = True
          no_status_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']}"
              op = _exe(master_ip, cmd)
              if "kubelet stopped posting node status" in op.lower():
                  context.skip_sub_tasks = False
                  no_status_nodes.append(node)
          print(no_status_nodes)
          ```
          1. 1.1.1.1.1

            Identifies nodes where the Kubelet is in an inactive or dead state.

            ```python
            dead_kubelet_nodes = []
            for node in no_status_nodes:
                cmd = "sudo systemctl status kubelet | grep 'Active:'"
                op = _exe(node['instance_id'], cmd)
                if "inactive (dead)" in op:
                    dead_kubelet_nodes.append(node)
            context.skip_sub_tasks = len(dead_kubelet_nodes) == 0
            print(dead_kubelet_nodes)
            ```
          2. 1.1.1.1.2

            Identifies kubelets that are in an active (exited) state.

            ```python
            exited_kubelet_nodes = []
            for node in no_status_nodes:
                cmd = "sudo systemctl status kubelet | grep 'Active:'"
                op = _exe(node['instance_id'], cmd)
                if "active (exited)" in op:
                    exited_kubelet_nodes.append(node)
            context.skip_sub_tasks = len(exited_kubelet_nodes) == 0
            print(exited_kubelet_nodes)
            ```
          3. 1.1.1.1.3

            Checks the kubelet on each affected node for failures to register with the API server, then gets the master's IP address and resolves it to the master's EC2 instance ID for the connectivity checks.

            ```python
            for node in no_status_nodes:
                cmd = "sudo systemctl status kubelet"
                op = _exe(node['nodeip'], cmd)
                if "Unable to register node with API server" in op:
                    context.log("WARNING", f"Node {node['nodename']} cannot register with the API server")

            master_ip_addr = _get_ip_addr(master_ip)
            cmd1 = (f'aws ec2 describe-instances --filters "Name=ip-address,Values={master_ip_addr}"'
                    ' --query "Reservations[*].Instances[*].InstanceId" --output text')
            master_instance_ids = [_exe(None, cmd1).strip()]
            print(master_instance_ids)
            ```
            1. 1.1.1.1.3.1

              Inspects the security groups of the master instances (by instance ID) to check for port configuration mismatches that could prevent connectivity. The `port` to verify (typically 6443 for the API server) is expected as a runbook parameter.

              ```python
              import json
              import re

              for instance_id in master_instance_ids:
                  _problem = True
                  cmd = (f"aws ec2 describe-instances --instance-ids {instance_id}"
                         " --query 'Reservations[*].Instances[*].SecurityGroups[*].GroupId'"
                         " --output=text")
                  sg_ids = re.split(r'\s+', _exe(None, cmd).strip())
                  for sg_id in sg_ids:
                      if not sg_id:
                          continue
                      cmd1 = ("aws ec2 describe-security-groups"
                              f" --filter Name=group-id,Values={sg_id}"
                              " --query SecurityGroups[*].IpPermissions[*]")
                      json_op = json.loads(_exe(None, cmd1))
                      for sg in json_op:
                          for rule in sg:
                              if 'FromPort' not in rule:
                                  continue
                              port_lo = int(rule['FromPort'])
                              port_hi = int(rule.get('ToPort', port_lo))
                              # `port` (e.g. 6443 for the API server) is a runbook parameter.
                              if port_lo <= port <= port_hi:
                                  _problem = False
                      if not _problem:
                          # A rule allowing the port was found; stop scanning.
                          break
                  if _problem:
                      context.log("ERROR", f"No security group rule on {instance_id} allows port {port}")
              ```
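The port-range test at the heart of this task can be isolated into a helper. `ip_permissions` below is the parsed JSON from the `describe-security-groups` query above, and rules without a `FromPort` (e.g. protocol `-1`) are skipped, matching the task's behavior:

```python
def port_allowed(ip_permissions, port):
    """Return True if `port` falls inside some rule's FromPort..ToPort range."""
    for group_rules in ip_permissions:
        for rule in group_rules:
            if "FromPort" not in rule:
                # e.g. IpProtocol "-1" rules carry no port range; skipped here.
                continue
            lo = int(rule["FromPort"])
            hi = int(rule.get("ToPort", lo))
            if lo <= port <= hi:
                return True
    return False
```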
      2. 1.1.2

        Runs checks to see whether node readiness issues are due to kube-proxy.
        1. 1.1.2.1

          Returns nodes for which no kube-proxy pod is in the Running state.

          ```python
          cmd = "kubectl get pods -n kube-system -o wide | grep kube-proxy | grep Running"
          op = _exe(master_ip, cmd)
          kube_proxy_down_nodes = []
          for node in problem_nodes:
              # kube-proxy typically runs with host networking, so its pod IP
              # matches the node IP in the `-o wide` output.
              if node['nodeip'] not in op:
                  context.log("ERROR", f"No running kube-proxy pod for node {node['nodename']}")
                  kube_proxy_down_nodes.append(node)
          print("Down Nodes: ", kube_proxy_down_nodes)
          ```
      3. 1.1.3

        Checks for various resource issues in the problem nodes.

        1. 1.1.3.1

          Checks if any of the nodes are constrained on memory (experiencing MemoryPressure)

          ```python
          mem_pressure_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']} | grep MemoryPressure"
              op = _exe(master_ip, cmd)
              # Condition line looks like: "MemoryPressure   False   ..."
              if op.split()[1].lower() == "true":
                  mem_pressure_nodes.append(node)
          if mem_pressure_nodes:
              context.log("ERROR", "Mem Pressure Reached")
          print("Mem Pressure Nodes: ", mem_pressure_nodes)
          ```
        2. 1.1.3.2

          Checks if any of the nodes are constrained on disk space (experiencing DiskPressure)

          ```python
          disk_pressure_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']} | grep DiskPressure"
              op = _exe(master_ip, cmd)
              if op.split()[1].lower() == "true":
                  disk_pressure_nodes.append(node)
          if disk_pressure_nodes:
              context.log("ERROR", "Disk Pressure Reached")
          print("Disk Pressure Nodes: ", disk_pressure_nodes)
          ```
        3. 1.1.3.3

          Checks if any of the nodes are constrained on number of processes (experiencing PIDPressure)

          ```python
          pid_pressure_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']} | grep PIDPressure"
              op = _exe(master_ip, cmd)
              if op.split()[1].lower() == "true":
                  pid_pressure_nodes.append(node)
          if pid_pressure_nodes:
              context.log("ERROR", "PID Pressure Reached")
          print("PID Pressure Nodes: ", pid_pressure_nodes)
          ```
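The three pressure checks above share one pattern, which could be factored into a helper that reads the node's conditions table (a sketch; `describe_output` is the text of `kubectl describe node <name>`):

```python
def condition_status(describe_output, condition):
    """Return True/False for a node condition such as MemoryPressure,
    or None if the condition does not appear in the output."""
    for line in describe_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == condition:
            return fields[1].lower() == "true"
    return None
```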
  2. 2

    Checks for general constraints and errors preventing kubectl itself from running.

    ```python
    import re

    op = _exe(master_ip, "kubectl get nodes")
    # The trailing `?` must be escaped: it is a regex metacharacter.
    m = re.search(r'The connection to the server .* was refused - did you specify the right host or port\?', op)
    if m:
        context.log("ERROR", "Cannot connect to the API server")
    ```
    1. 2.1

      Checks for various resource constraints preventing kubectl from running.

      1. 2.1.1

        Checks disk utilization and logs an error if it has crossed a given threshold.

        ```python
        import shutil

        u = shutil.disk_usage("/")
        utilization = 100 * u.used / u.total
        # `threshold` (a percentage) is a runbook parameter.
        if utilization > threshold:
            context.log("ERROR", "Disk usage too high")
        ```