Troubleshooting Kubernetes Cluster Issues

A comprehensive runbook to check and alert on common issues with Kubernetes clusters

  1. 1

    Cluster not ready

    Checks whether the cluster is in a ready state. The task returns a list of nodes that are not in a ready state. If all nodes are ready, an empty list is returned and the checks in the sub-tasks are skipped.
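
A minimal sketch of this check, assuming the node list is available as JSON (for example, the output of `kubectl get nodes -o json` already parsed into a dictionary). The function name `not_ready_nodes` and the sample data are illustrative and not part of the runbook.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    result = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated as not ready.
        if ready is None or ready.get("status") != "True":
            result.append(node["metadata"]["name"])
    return result


# Hypothetical sample data shaped like `kubectl get nodes -o json`.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"name": "node-c"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

print(not_ready_nodes(SAMPLE_NODES))  # ['node-b', 'node-c']
```

An empty result corresponds to the case where no further sub-task checks are needed.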

    1. 1.1

      Check if the node went down

      For nodes in a non-ready state, checks which of them went down by looking up their node IPs.

      1. 1.1.1

        kubelet issue

        Runs checks to see whether node readiness issues are caused by the kubelet. Kubelet issues include unresponsive kubelets, dead kubelets, and kubelets that have exited.

        1. 1.1.1.1

          Kubelet stopped posting node status

          Returns nodes where the kubelet has stopped posting status updates.
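
One way to sketch this check, assuming parsed `kubectl get nodes -o json` output and assuming the cluster reports a silent kubelet as a `Ready` condition with status `Unknown` and reason `NodeStatusUnknown` (the usual behavior, but worth confirming on your cluster version). The function name `kubelet_silent_nodes` is hypothetical.

```python
def kubelet_silent_nodes(node_list):
    """Return names of nodes whose Ready condition is Unknown because
    the kubelet stopped posting status (reason NodeStatusUnknown)."""
    silent = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if (cond.get("type") == "Ready"
                    and cond.get("status") == "Unknown"
                    and cond.get("reason") == "NodeStatusUnknown"):
                silent.append(node["metadata"]["name"])
                break
    return silent


# Hypothetical sample: only node-b has a kubelet that went silent.
SILENT_SAMPLE = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [
             {"type": "Ready", "status": "False", "reason": "KubeletNotReady"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [
             {"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown"}]}},
    ]
}

print(kubelet_silent_nodes(SILENT_SAMPLE))  # ['node-b']
```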

          1. 1.1.1.1.1

            Kubelet Dead

            Identifies nodes where the kubelet is in an inactive or dead state.

          2. 1.1.1.1.2

            Kubelet exited

            Identifies kubelets that are in an active but exited state.
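
The dead and exited kubelet states can be told apart with a sketch like the following. It assumes the kubelet runs as a systemd unit and that the text being parsed comes from `systemctl show kubelet --property=ActiveState,SubState` on the node. The function names and the returned labels are illustrative.

```python
def parse_unit_state(show_output):
    """Parse KEY=VALUE lines from `systemctl show` into (ActiveState, SubState)."""
    props = dict(
        line.split("=", 1) for line in show_output.splitlines() if "=" in line
    )
    return props.get("ActiveState", ""), props.get("SubState", "")


def classify_kubelet(show_output):
    """Classify the kubelet unit as 'dead', 'exited', 'running', or 'other'."""
    active, sub = parse_unit_state(show_output)
    if active == "inactive" or sub == "dead":
        return "dead"
    if active == "active" and sub == "exited":
        return "exited"
    if active == "active" and sub == "running":
        return "running"
    return "other"


print(classify_kubelet("ActiveState=inactive\nSubState=dead\n"))  # dead
print(classify_kubelet("ActiveState=active\nSubState=exited\n"))  # exited
```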

          3. 1.1.1.1.3

            Worker kubelet unable to update master

            Gets the IP addresses of the master instances in the cluster and uses them to perform connectivity checks.
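
A connectivity check of this kind can be sketched as a plain TCP connection attempt. The port `6443` used as a default below is an assumption (it is the common API server port, but your cluster may differ), and the function names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_masters(master_ips, port=6443, timeout=3.0):
    """Return the master IPs that do not accept TCP connections on port."""
    return [ip for ip in master_ips if not can_connect(ip, port, timeout)]
```

A successful TCP connection only shows that the port is reachable. It does not confirm that the API server is healthy or that the kubelet's credentials are accepted.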

            1. 1.1.1.1.3.1

              Security group blocked the API request

              Inspects the security groups of the master instances (by instance ID) for port configuration mismatches that could prevent connectivity.
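
The port check can be sketched as follows, assuming the security groups are AWS-style records with `GroupId` and `IpPermissions` fields (the shape returned by EC2's describe-security-groups call). The runbook does not name a cloud provider, so this shape is an assumption. The sketch only compares protocols and port ranges; it deliberately ignores source CIDR ranges.

```python
def port_allowed(ip_permissions, port, protocol="tcp"):
    """Return True if any ingress rule permits the given port and protocol."""
    for rule in ip_permissions:
        rule_proto = rule.get("IpProtocol")
        if rule_proto == "-1":
            # "-1" means all protocols and all ports.
            return True
        if rule_proto != protocol:
            continue
        if rule.get("FromPort", -1) <= port <= rule.get("ToPort", -1):
            return True
    return False


def groups_blocking_port(security_groups, port):
    """Return IDs of security groups with no ingress rule covering port."""
    return [
        group["GroupId"]
        for group in security_groups
        if not port_allowed(group.get("IpPermissions", []), port)
    ]


# Hypothetical sample data.
SAMPLE_GROUPS = [
    {"GroupId": "sg-open",
     "IpPermissions": [{"IpProtocol": "tcp", "FromPort": 6443, "ToPort": 6443}]},
    {"GroupId": "sg-ssh-only",
     "IpPermissions": [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22}]},
]

print(groups_blocking_port(SAMPLE_GROUPS, 6443))  # ['sg-ssh-only']
```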

      2. 1.1.2

        Kube-proxy issue

        1. 1.1.2.1

          kube-proxy not running

      3. 1.1.3

        System resources issues

        Checks the problem nodes for resource issues.

        1. 1.1.3.1

          MemoryPressure

          Checks whether any of the nodes are constrained on memory (experiencing MemoryPressure).

        2. 1.1.3.2

          DiskPressure

          Checks whether any of the nodes are constrained on disk space (experiencing DiskPressure).

        3. 1.1.3.3

          PIDPressure

          Checks whether any of the nodes are constrained on the number of processes (experiencing PIDPressure).
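
The three pressure checks (MemoryPressure, DiskPressure, PIDPressure) read the same node conditions, so they can be sketched together. This assumes parsed `kubectl get nodes -o json` output; the function name `nodes_under_pressure` and the sample data are illustrative.

```python
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def nodes_under_pressure(node_list):
    """Map node name -> list of pressure conditions whose status is "True"."""
    report = {}
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        active = [
            c["type"]
            for c in conditions
            if c.get("type") in PRESSURE_TYPES and c.get("status") == "True"
        ]
        if active:
            report[node["metadata"]["name"]] = active
    return report


# Hypothetical sample: node-b is low on memory and process IDs.
PRESSURE_SAMPLE = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "False"},
             {"type": "DiskPressure", "status": "False"},
             {"type": "PIDPressure", "status": "False"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "True"},
             {"type": "DiskPressure", "status": "False"},
             {"type": "PIDPressure", "status": "True"}]}},
    ]
}

print(nodes_under_pressure(PRESSURE_SAMPLE))
# {'node-b': ['MemoryPressure', 'PIDPressure']}
```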

  2. 2

    Unable to run kubectl

    Checks for general constraints and errors that prevent kubectl itself from starting.

    1. 2.1

      Machine resources Unavailable

      Checks for resource constraints that prevent kubectl from running.

      1. 2.1.1

        No space left on the device

        Checks disk utilization and returns an error if it exceeds a given threshold.
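
A minimal sketch of such a threshold check using Python's standard library. The default threshold of 90 percent and the function names are assumptions chosen for illustration; the runbook does not state a value.

```python
import shutil


def disk_used_percent(path="/"):
    """Return the percentage of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path="/", threshold_percent=90.0):
    """Return the used percentage, or raise if it exceeds the threshold."""
    used = disk_used_percent(path)
    if used > threshold_percent:
        raise RuntimeError(
            f"disk usage on {path} is {used:.1f}%, "
            f"above the {threshold_percent:.1f}% threshold"
        )
    return used
```

Calling `check_disk("/")` on the machine where kubectl fails gives a quick indication of whether "No space left on the device" is the likely cause.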