Troubleshooting Kubernetes Cluster Issues

A comprehensive runbook to check and alert on common issues with Kubernetes clusters

  1. 1

    Check whether the cluster is in a ready state. This task returns the list of nodes that are not in a ready state. If all nodes are ready, an empty list is returned and none of the sub-task checks are performed.

    ```python
    cmd = r'kubectl get nodes -o wide | grep -v "master\|VERSION"'
    op = _exe(master_ip, cmd)

    context.skip_sub_tasks = True
    problem_nodes = []
    for line in op.split('\n'):
        if not line:
            continue
        fields = line.split()
        nodename = fields[0]  # NAME column
        status = fields[1]    # STATUS column
        nodeip = fields[5]    # INTERNAL-IP column of the `-o wide` output
        if status.lower() == "notready":
            context.log("ERROR", f"Node {nodename} is not ready")
            problem_nodes.append({"nodename": nodename, "nodeip": nodeip})
            context.skip_sub_tasks = False

    if not context.skip_sub_tasks:
        print(problem_nodes)
    ```
    1. 1.1

      For nodes in a not-ready state, determines which of them are down by looking up their node IPs among the running EC2 instances.

      ```python
      cmd = "aws ec2 describe-instances"
      cmd += " --filters Name=instance-state-name,Values=running"
      cmd += " --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,PublicIpAddress,PrivateIpAddress,Tags[?Key==`Name`].Value | [0]]'"
      cmd += " --output text"
      op = _exe(None, cmd)

      context.skip_sub_tasks = True
      down_nodes = []
      new_problem_nodes = []
      for node in problem_nodes:
          if node['nodeip'] not in op:
              # The node IP is not among the running instances: the instance is down.
              context.log("WARNING", f"Node IP {node['nodeip']} not found among running instances")
              context.skip_sub_tasks = False
              down_nodes.append(node)
          else:
              for line in op.split('\n'):
                  if node['nodeip'] in line:
                      # The first column of the text output is the instance ID.
                      node['instance_id'] = line.split()[0]
                      new_problem_nodes.append(node)
                      break

      print(down_nodes)
      print(new_problem_nodes)
      problem_nodes = new_problem_nodes
      if len(problem_nodes) > 0:
          context.skip_sub_tasks = False
      ```
      1. 1.1.1

        Runs checks to see whether node readiness issues are due to the kubelet. Common kubelet issues include unresponsive kubelets, dead (inactive) kubelets, and kubelets that started but have since exited.
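The sub-tasks below distinguish these cases by the `Active:` line of `systemctl status kubelet` output. That classification can be sketched as follows (the state strings are the standard systemd ones):

```python
def kubelet_state(status_output):
    """Classify kubelet health from `systemctl status kubelet` output."""
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("Active:"):
            if "active (running)" in line:
                return "running"
            if "active (exited)" in line:
                return "exited"
            if "inactive (dead)" in line:
                return "dead"
            return "unknown"
    return "unknown"
```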

        1. 1.1.1.1

          Returns nodes whose kubelet has stopped posting status updates.

          ```python
          context.skip_sub_tasks = True
          no_status_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']}"
              op = _exe(master_ip, cmd)
              if "kubelet stopped posting node status" in op.lower():
                  context.skip_sub_tasks = False
                  no_status_nodes.append(node)
          print(no_status_nodes)
          ```
          1. 1.1.1.1.1

            Identifies nodes where the Kubelet is in an inactive or dead state.

            ```python
            dead_kubelet_nodes = []
            for node in no_status_nodes:
                cmd = "sudo systemctl status kubelet | grep 'Active:'"
                op = _exe(node['instance_id'], cmd)
                if "inactive (dead)" in op:
                    dead_kubelet_nodes.append(node)
            context.skip_sub_tasks = len(dead_kubelet_nodes) == 0
            print(dead_kubelet_nodes)
            ```
          2. 1.1.1.1.2

            Identifies kubelets that are in an active (exited) state.

            ```python
            exited_kubelet_nodes = []
            for node in no_status_nodes:
                cmd = "sudo systemctl status kubelet | grep 'Active:'"
                op = _exe(node['instance_id'], cmd)
                if "active (exited)" in op:
                    exited_kubelet_nodes.append(node)
            context.skip_sub_tasks = len(exited_kubelet_nodes) == 0
            print(exited_kubelet_nodes)
            ```
          3. 1.1.1.1.3

            Checks the kubelet on each affected node for failures to register with the API server, then gets the master's IP address and resolves it to the master's EC2 instance ID for the connectivity checks.

            ```python
            for node in no_status_nodes:
                cmd = "sudo systemctl status kubelet"
                op = _exe(node['nodeip'], cmd)
                if "Unable to register node with API server" in op:
                    context.log("WARNING", f"Node {node['nodename']} cannot register with the API server")

            master_ip_addr = _get_ip_addr(master_ip)
            cmd1 = (f'aws ec2 describe-instances --filters "Name=ip-address,Values={master_ip_addr}"'
                    ' --query "Reservations[*].Instances[*].InstanceId" --output text')
            master_instance_ids = [_exe(None, cmd1).strip()]
            print(master_instance_ids)
            ```
            1. 1.1.1.1.3.1

              Inspects the security groups of the master instances (by instance ID) to check for port configuration mismatches that could prevent connectivity. The `port` to verify (typically 6443 for the API server) is expected as a runbook parameter.

              ```python
              import json
              import re

              for instance_id in master_instance_ids:
                  _problem = True
                  cmd = (f"aws ec2 describe-instances --instance-ids {instance_id}"
                         " --query 'Reservations[*].Instances[*].SecurityGroups[*].GroupId'"
                         " --output=text")
                  sg_ids = re.split(r'\s+', _exe(None, cmd).strip())
                  for sg_id in sg_ids:
                      if not sg_id:
                          continue
                      cmd1 = ("aws ec2 describe-security-groups"
                              f" --filter Name=group-id,Values={sg_id}"
                              " --query SecurityGroups[*].IpPermissions[*]")
                      json_op = json.loads(_exe(None, cmd1))
                      for sg in json_op:
                          for rule in sg:
                              if 'FromPort' not in rule:
                                  continue
                              port_lo = int(rule['FromPort'])
                              port_hi = int(rule.get('ToPort', port_lo))
                              # `port` (e.g. 6443 for the API server) is a runbook parameter.
                              if port_lo <= port <= port_hi:
                                  _problem = False
                      if not _problem:
                          # A rule allowing the port was found; stop scanning.
                          break
                  if _problem:
                      context.log("ERROR", f"No security group rule on {instance_id} allows port {port}")
              ```
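The port-range test at the heart of this task can be isolated into a helper. `ip_permissions` below is the parsed JSON from the `describe-security-groups` query above, and rules without a `FromPort` (e.g. protocol `-1`) are skipped, matching the task's behavior:

```python
def port_allowed(ip_permissions, port):
    """Return True if `port` falls inside some rule's FromPort..ToPort range."""
    for group_rules in ip_permissions:
        for rule in group_rules:
            if "FromPort" not in rule:
                # e.g. IpProtocol "-1" rules carry no port range; skipped here.
                continue
            lo = int(rule["FromPort"])
            hi = int(rule.get("ToPort", lo))
            if lo <= port <= hi:
                return True
    return False
```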
      2. 1.1.2

        Runs checks to see whether node readiness issues are due to kube-proxy.
        1. 1.1.2.1

          Returns nodes for which no kube-proxy pod is in the Running state.

          ```python
          cmd = "kubectl get pods -n kube-system -o wide | grep kube-proxy | grep Running"
          op = _exe(master_ip, cmd)
          kube_proxy_down_nodes = []
          for node in problem_nodes:
              # kube-proxy typically runs with host networking, so its pod IP
              # matches the node IP in the `-o wide` output.
              if node['nodeip'] not in op:
                  context.log("ERROR", f"No running kube-proxy pod for node {node['nodename']}")
                  kube_proxy_down_nodes.append(node)
          print("Down Nodes: ", kube_proxy_down_nodes)
          ```
      3. 1.1.3

        Checks for various resource issues in the problem nodes.

        1. 1.1.3.1

          Checks if any of the nodes are constrained on memory (experiencing MemoryPressure)

          ```python
          mem_pressure_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']} | grep MemoryPressure"
              op = _exe(master_ip, cmd)
              # Condition line looks like: "MemoryPressure   False   ..."
              if op.split()[1].lower() == "true":
                  mem_pressure_nodes.append(node)
          if mem_pressure_nodes:
              context.log("ERROR", "Mem Pressure Reached")
          print("Mem Pressure Nodes: ", mem_pressure_nodes)
          ```
        2. 1.1.3.2

          Checks if any of the nodes are constrained on disk space (experiencing DiskPressure)

          ```python
          disk_pressure_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']} | grep DiskPressure"
              op = _exe(master_ip, cmd)
              if op.split()[1].lower() == "true":
                  disk_pressure_nodes.append(node)
          if disk_pressure_nodes:
              context.log("ERROR", "Disk Pressure Reached")
          print("Disk Pressure Nodes: ", disk_pressure_nodes)
          ```
        3. 1.1.3.3

          Checks if any of the nodes are constrained on number of processes (experiencing PIDPressure)

          ```python
          pid_pressure_nodes = []
          for node in problem_nodes:
              cmd = f"kubectl describe node {node['nodename']} | grep PIDPressure"
              op = _exe(master_ip, cmd)
              if op.split()[1].lower() == "true":
                  pid_pressure_nodes.append(node)
          if pid_pressure_nodes:
              context.log("ERROR", "PID Pressure Reached")
          print("PID Pressure Nodes: ", pid_pressure_nodes)
          ```
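The three pressure checks above share one pattern, which could be factored into a helper that reads the node's conditions table (a sketch; `describe_output` is the text of `kubectl describe node <name>`):

```python
def condition_status(describe_output, condition):
    """Return True/False for a node condition such as MemoryPressure,
    or None if the condition does not appear in the output."""
    for line in describe_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == condition:
            return fields[1].lower() == "true"
    return None
```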
  2. 2

    Checks for general constraints and errors preventing kubectl itself from running.

    ```python
    import re

    op = _exe(master_ip, "kubectl get nodes")
    # The trailing `?` must be escaped: it is a regex metacharacter.
    m = re.search(r'The connection to the server .* was refused - did you specify the right host or port\?', op)
    if m:
        context.log("ERROR", "Cannot connect to the API server")
    ```
    1. 2.1

      Checks for various resource constraints preventing kubectl from running.

      1. 2.1.1

        Checks disk utilization and logs an error if it has crossed a given threshold.

        ```python
        import shutil

        u = shutil.disk_usage("/")
        utilization = 100 * u.used / u.total
        # `threshold` (a percentage) is a runbook parameter.
        if utilization > threshold:
            context.log("ERROR", "Disk usage too high")
        ```