Health check for a host

This task performs basic health checks on a host. Three important things to check for any host are its CPU, memory, and disk space. While there are basic commands to check each of these, the following runbook processes the outputs of those commands and extracts the relevant information to see whether the host needs attention.

  1. CPU check

    This command checks the processes running on the Linux host.
  2. Get idle CPU percentage

    The top output above includes the percentage of CPU that is idle. Split the output, find the part that contains 'id', and extract the number from it.
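
The parsing step described here can be sketched as follows. This is a minimal illustration, not the runbook's actual code: it assumes the text comes from something like `top -b -n 1`, that the summary line contains `Cpu(s)` with comma-separated fields, and that decimals use a period. The function name and the sample text are hypothetical.

```python
import re

# Illustrative sample only; real top output varies by version and locale.
SAMPLE_TOP = """top - 10:15:01 up 12 days,  3:02,  1 user,  load average: 0.08, 0.03, 0.01
Tasks: 101 total,   1 running, 100 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.5 us,  0.8 sy,  0.0 ni, 97.4 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
"""


def idle_cpu_percent(top_output):
    """Return the idle CPU percentage found in the Cpu(s) summary line."""
    for line in top_output.splitlines():
        if "Cpu(s)" not in line:
            continue
        # Split the summary line into fields and look for the one labeled 'id'.
        for part in line.split(","):
            match = re.search(r"([\d.]+)\s*%?\s*id\b", part)
            if match:
                return float(match.group(1))
    raise ValueError("no idle CPU field found in top output")


idle = idle_cpu_percent(SAMPLE_TOP)
print(f"idle CPU: {idle}%")
```

The optional `%` in the pattern is there because some older top versions print fields such as `97.4%id`.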
  3. Get CPU utilization of an instance

    This task fetches the CPU utilization data points from AWS CloudWatch and plots them.
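
One way to fetch these data points is the CloudWatch `get_metric_statistics` call in boto3. The sketch below is an assumption about how this step might be implemented, not the runbook's own code: the function names, the one-hour lookback, and the 300-second period are hypothetical defaults. boto3 is imported lazily so the query builder can be exercised without AWS access.

```python
from datetime import datetime, timedelta, timezone


def cpu_metric_query(instance_id, lookback_seconds=3600, period_seconds=300, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(seconds=lookback_seconds),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


def fetch_cpu_datapoints(instance_id, **kwargs):
    """Query CloudWatch and return data points sorted by timestamp."""
    import boto3  # third-party; needs AWS credentials configured

    client = boto3.client("cloudwatch")
    response = client.get_metric_statistics(**cpu_metric_query(instance_id, **kwargs))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Each returned data point carries a `Timestamp` and an `Average` value, which is what a plotting step would consume.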
    1. Get instance ID from instance label
    2. Parse the period string
    3. Plot CPU utilization
  4. Check memory

    Check how much free memory is available on this host.
  5. Get free memory percentage

    Calculate the percentage of free memory and, if it is below a threshold, create an alert.
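
The calculation could look like the sketch below. It assumes the memory check produced `free -m` style output with a header row and a `Mem:` row; the runbook does not name the command, so treat that as an assumption. The function names, the sample figures, and the 10% threshold are hypothetical.

```python
# Illustrative sample only, in the layout printed by `free -m`.
SAMPLE_FREE = """              total        used        free      shared  buff/cache   available
Mem:           8000        2300         400         100        5300        5200
Swap:          2047           0        2047
"""


def memory_percent(free_output, column="free"):
    """Return the given Mem: column as a percentage of total memory."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = [int(value) for value in line.split()[1:]]
            row = dict(zip(header, values))
            return 100.0 * row[column] / row["total"]
    raise ValueError("no Mem: row found in free output")


def check_free_memory(free_output, threshold_percent=10.0):
    """Return (free percentage, whether an alert should be raised)."""
    percent = memory_percent(free_output)
    return percent, percent < threshold_percent


percent, alert = check_free_memory(SAMPLE_FREE)
print(f"free memory: {percent:.1f}% alert={alert}")
```

On Linux the `free` column excludes reclaimable cache, so passing `column="available"` may give a more realistic picture of memory pressure.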
  6. Get the top memory consumers

    Identify the processes that consume the most memory.
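
A possible way to identify these processes is to parse `ps aux` output and sort by the `%MEM` column. The runbook does not say which command it uses, so the command, the function name, and the sample rows below are assumptions made for illustration.

```python
# Illustrative sample only, in the column layout printed by `ps aux`.
SAMPLE_PS = """USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.1 167744 11520 ?        Ss   Jan01   0:05 /sbin/init
mysql        812  1.2 12.4 1789000 1010000 ?     Ssl  Jan01  42:10 /usr/sbin/mysqld
app         1410  3.4 25.0 3456000 2030000 ?     Sl   Jan02  99:01 java -jar app.jar
"""


def top_memory_consumers(ps_output, count=5):
    """Return (pid, mem_percent, command) tuples, largest %MEM first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        # The command is the 11th column and may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        rows.append((int(fields[1]), float(fields[3]), fields[10].strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:count]


for pid, mem, command in top_memory_consumers(SAMPLE_PS, count=2):
    print(f"{pid:>6} {mem:5.1f}% {command}")
```

The resulting list of tuples can be passed directly to a plotting step, with commands on one axis and memory percentages on the other.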
    1. Plot top memory consumers

      Visualize memory consumption by plotting the memory used by each of the top consumers.
  7. Check disk space

    This command checks the disk space consumed on the root filesystem.
    1. Get available disk space

      Process the output of df and extract the available disk space percentage.
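
The extraction might be sketched as below. It assumes `df -h /` style output in which the second-to-last column is `Use%` and the last is the mount point, and it takes the available percentage to be 100 minus `Use%`. The function name and the sample output are hypothetical.

```python
# Illustrative sample only, in the layout printed by `df -h /`.
SAMPLE_DF = """Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       50G   41G  9.0G  82% /
"""


def available_disk_percent(df_output, mount="/"):
    """Return the available-space percentage for the given mount point."""
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[-1] == mount:
            used_percent = int(fields[-2].rstrip("%"))
            return 100.0 - used_percent
    raise ValueError(f"mount point {mount!r} not found in df output")


print(f"available: {available_disk_percent(SAMPLE_DF):.0f}%")
```

Running df with `-P` keeps each filesystem on a single line, which makes this kind of line-based parsing more reliable when device names are long.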
    2. Perform disk cleanup if needed

      Check the available disk space percentage against a threshold and, if it drops below the threshold, trigger disk cleanup.
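
The control flow of this step and its sub-steps (notify, clean up, notify again) can be sketched as a small function. All names here are hypothetical; the notification, cleanup, and measurement actions are passed in as callables so the sketch stays independent of how each one is actually implemented.

```python
def cleanup_if_low(available_percent, threshold_percent, notify, cleanup, measure):
    """Run cleanup only when available space is below the threshold.

    notify:  callable taking a message string
    cleanup: callable that frees disk space
    measure: callable returning the available percentage after cleanup
    Returns True if cleanup ran, False otherwise.
    """
    if available_percent >= threshold_percent:
        return False
    notify(f"Disk space low: {available_percent:.1f}% available, starting cleanup")
    cleanup()
    notify(f"Cleanup finished: {measure():.1f}% available")
    return True


messages = []
ran = cleanup_if_low(
    available_percent=8.0,
    threshold_percent=10.0,
    notify=messages.append,
    cleanup=lambda: None,   # stand-in for the real cleanup command
    measure=lambda: 35.0,   # stand-in for re-running the disk check
)
print(ran, messages)
```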
      1. Notify about disk space before cleaning up
        1. Post a message to a Slack channel

          Post the formatted message to a given Slack channel. Use the cred_label to look up the right credentials stored in the backend.
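
A minimal sketch of such a post, using Slack's `chat.postMessage` Web API method and only the standard library. How the runbook platform resolves `cred_label` is not described in the source, so the `CREDENTIALS` dictionary below is a hypothetical stand-in for that lookup, and the token is a placeholder.

```python
import json
import urllib.request

# Hypothetical stand-in for the backend credential store keyed by cred_label.
CREDENTIALS = {"slack-ops": "xoxb-placeholder-token"}


def build_slack_request(cred_label, channel, text, credentials=CREDENTIALS):
    """Build (but do not send) a chat.postMessage request."""
    token = credentials[cred_label]
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://slack.com/api/chat.postMessage",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


def post_to_slack(cred_label, channel, text):
    """Send the message and return Slack's decoded JSON response."""
    request = build_slack_request(cred_label, channel, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Slack reports failures in the response body, so a caller would typically check the `ok` field of the returned dictionary rather than rely on the HTTP status alone.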
      2. Clean up disk

        This command prunes all unused images that bloat our storage and also deletes stopped containers. It does not touch images that are in use.
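
The source does not name the command. Assuming the images and containers are managed by Docker, `docker system prune --all --force` matches the description: it removes stopped containers and every image not used by a container. Note that it also removes unused networks and build cache. The wrapper below is a hypothetical sketch around that command.

```python
import subprocess


def prune_command(include_all_unused_images=True):
    """Build the docker prune command line as an argument list."""
    command = ["docker", "system", "prune", "--force"]  # --force skips the prompt
    if include_all_unused_images:
        # Without --all, only dangling (untagged) images are removed.
        command.append("--all")
    return command


def run_prune():
    """Run the prune and return docker's summary of what was reclaimed."""
    result = subprocess.run(prune_command(), capture_output=True, text=True, check=True)
    return result.stdout
```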
      3. Notify again after cleaning up the disk
        1. Post a message to a Slack channel

          Post the formatted message to a given Slack channel. Use the cred_label to look up the right credentials stored in the backend.