Health check for a host

This task performs basic health checks on a host. Three important things to check on any host are its CPU, memory, and disk space. While there are basic commands to check each of these, the following runbook processes the output of those commands and extracts the relevant information to determine whether the host needs attention.

  1. 1

    CPU check

    This command runs top once in batch mode on the Linux host and keeps only the CPU summary line.

    top -b -n 1 | grep "%Cpu"
  2. 2

    Get idle CPU percentage

    The top output above includes the idle CPU percentage. Split the line on commas, pick the field labeled 'id', and extract the number.

    # topoutput holds the output of the top command from the previous step.
    lines = topoutput.split('\n')
    line = [x for x in lines if 'Cpu' in x][0]
    parts = line.split(',')
    # Drop the leading "%Cpu(s):" token from the first field.
    parts[0] = ' '.join(parts[0].split()[1:])
    idle_part = [x for x in parts if 'id' in x]
    idle_cpu = float(idle_part[0].split()[0])
    print("Idle CPU: ", idle_cpu)

    # Plot the full breakdown (us, sy, ni, id, ...) as a pie chart.
    labels = [x.split()[1] for x in parts]
    values = [x.split()[0] for x in parts]
    context.plot.xlabel = ""
    context.plot.ylabel = ""
    context.plot.title = "CPU usage"
    context.plot.add_trace(name="CPU usage", xpts=labels, ypts=values, tracetype="pie")
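
The field-splitting described in this step can be tried offline on a single line of top output. The following is a minimal sketch, not the runbook's own code: the function name `parse_cpu_line` and the sample line are made up for illustration, and the exact layout of the `%Cpu(s):` line may differ between top versions.

```python
def parse_cpu_line(line):
    """Return a dict such as {'us': 1.5, 'id': 97.9} from a top '%Cpu(s):' line."""
    # Drop the "%Cpu(s):" prefix, then split the comma-separated "value label" pairs.
    _, _, fields = line.partition(":")
    usage = {}
    for field in fields.split(","):
        value, label = field.split()
        usage[label] = float(value)
    return usage


# Hypothetical sample line, used only to exercise the parser.
sample = "%Cpu(s):  1.5 us,  0.5 sy,  0.0 ni, 97.9 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st"
usage = parse_cpu_line(sample)
print("Idle CPU:", usage["id"])
```

Keeping every field in a dictionary makes the same result usable for both the idle-CPU check and the pie chart labels and values.
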
  3. 3

    Get CPU utilization of an instance

    This task fetches the CPU utilization data points from AWS CloudWatch and plots them.

    1. 3.1

      Get instance ID from instance label

      # Look up the EC2 instance whose public IP address matches hostname.
      cmd = f"aws ec2 describe-instances --filters Name=ip-address,Values={hostname} --query 'Reservations[*].Instances[*].[InstanceId]' --output text --region us-west-2"
      print("cmd: ", cmd)
      cred_label = "aws_read_only_user"
      op = _exe(None, cmd, cred_label=cred_label)
      print("op: ", op)
      instance_id = op.strip()
      print("Instance Id: ", instance_id)
    2. 3.2

      Parse the period string

      import re
      import math
      import datetime

      max_data_points = 1430
      period = None  # length of the time window, in seconds

      # Accept forms such as "2d", "2 days", "6h", "6 hrs", "30m", "30 minutes".
      tt = re.search(r"(\d+)\s*d", period_str, re.IGNORECASE)
      if tt:
          period = int(tt.group(1)) * 3600 * 24
      tt = re.search(r"(\d+)\s*h", period_str, re.IGNORECASE)
      if tt:
          period = int(tt.group(1)) * 3600
      tt = re.search(r"(\d+)\s*m", period_str, re.IGNORECASE)
      if tt:
          period = int(tt.group(1)) * 60
      if period is None:
          raise ValueError("Could not parse period: " + str(period_str))

      # Pick the smallest interval that keeps the number of data points within bounds.
      if math.ceil(period) <= 1440:
          interval = 1
      elif math.ceil(period/5) <= 1440:
          interval = 5
      elif math.ceil(period/10) <= 1440:
          interval = 10
      elif math.ceil(period/30) <= 1440:
          interval = 30
      else:
          interval = math.ceil(period/max_data_points)
          # Round up to a whole number of minutes.
          interval = 60*math.ceil(interval/60)
      interval = str(interval)

      # Compute the window boundaries from a single reading of the clock.
      now = datetime.datetime.now()
      end_time = now.isoformat()
      start_time = (now - datetime.timedelta(seconds=period)).isoformat()
      print("Start time: ", start_time)
      print("End time: ", end_time)
      print("Interval: ", interval)
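
The period parsing in this step can also be written as a small lookup from unit letter to seconds. This is a sketch under assumptions rather than the runbook's implementation: `parse_period`, `pick_interval`, and `UNIT_SECONDS` are hypothetical names, and `pick_interval` simplifies the original logic by always rounding the interval up to a whole number of minutes.

```python
import math
import re

# Seconds per unit, keyed by the first letter of the unit (day, hour, minute).
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60}


def parse_period(period_str):
    """Convert strings like '2 days', '6h', or '30 min' to a number of seconds."""
    match = re.search(r"(\d+)\s*([dhm])", period_str, re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized period: {period_str!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2).lower()]


def pick_interval(period, max_points=1430):
    """Smallest multiple of 60 seconds that yields at most max_points data points."""
    return 60 * math.ceil(period / max_points / 60)


print(parse_period("2 days"), pick_interval(parse_period("2 days")))
```

A single regular expression with a captured unit letter avoids repeating the search once per unit, at the cost of accepting only the first number-unit pair in the string.
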
    3. 3.3

      Plot CPU utilization

      import json

      # Build the CloudWatch query from the values computed in the previous steps.
      cmd = 'aws cloudwatch get-metric-statistics --metric-name CPUUtilization'
      cmd += ' --region ' + 'us-west-2'
      cmd += ' --start-time ' + start_time
      cmd += ' --end-time ' + end_time
      cmd += ' --period ' + interval
      cmd += ' --namespace AWS/EC2 --statistics Maximum --dimensions Name=InstanceId,Value=' + instance_id
      op = _exe(None, cmd, cred_label="aws_read_only_user")

      exception_msg = ""
      try:
          jsn = json.loads(op)
      except Exception as e:
          exception_msg = "Got this exception: \n"
          exception_msg += str(e)
          exception_msg += "\n"
          exception_msg += str(op)

      if not exception_msg:
          # CloudWatch does not guarantee ordering, so sort by timestamp before plotting.
          datapoints = sorted(jsn['Datapoints'], key=lambda i: i['Timestamp'])
          x = [x['Timestamp'] for x in datapoints]
          y = [x['Maximum'] for x in datapoints]
          context.plot.xlabel = 'timestamp'
          context.plot.ylabel = 'CPU % util'
          context.plot.title = 'CPU util for instance: ' + str(instance_id)
          context.plot.add_trace(name="CPU util", xpts=x, ypts=y, tracetype="lines")
      else:
          # Report the parse failure instead of silently skipping the plot.
          print(exception_msg)
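
The sort-then-split handling of the CloudWatch response can be checked without AWS access. The sketch below assumes the response has the `Datapoints` / `Timestamp` / `Maximum` shape used in this step; the helper name `to_series` and the sample values are invented for illustration only.

```python
import json

# Made-up response in the shape this step expects; the values are not real measurements.
sample = """{"Label": "CPUUtilization", "Datapoints": [
  {"Timestamp": "2024-01-01T00:10:00+00:00", "Maximum": 12.5, "Unit": "Percent"},
  {"Timestamp": "2024-01-01T00:00:00+00:00", "Maximum": 7.0, "Unit": "Percent"},
  {"Timestamp": "2024-01-01T00:05:00+00:00", "Maximum": 9.25, "Unit": "Percent"}
]}"""


def to_series(raw):
    """Return (timestamps, maxima) sorted by timestamp, ready to plot as a line."""
    datapoints = sorted(json.loads(raw)["Datapoints"], key=lambda d: d["Timestamp"])
    return [d["Timestamp"] for d in datapoints], [d["Maximum"] for d in datapoints]


xs, ys = to_series(sample)
print(list(zip(xs, ys)))
```

Sorting on the timestamp string works here because all timestamps share the same ISO 8601 format and offset.
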
  4. 4

    Check memory

    Check how much free memory is available on this host.

    free
  5. 5

    Get free memory percentage

    Calculate the percentage of free memory; if it is below a threshold, create an alert.

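    # freeoutput holds the output of the free command from the previous step.
    mem_lines = freeoutput.split('\n')
    mem_line = [x for x in mem_lines if 'Mem:' in x][0]
    # On the "Mem:" line, field 1 is total memory and field 3 is free memory.
    total_mem = float(mem_line.split()[1])
    free_mem = float(mem_line.split()[3])
    free_mem_perc = free_mem*100/total_mem
    print(free_mem_perc)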
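
The threshold comparison mentioned in this step is not shown in the code, so here is one way it might look. This is a sketch only: the 20% threshold, the names `free_memory_percent` and `THRESHOLD`, and the sample output are assumptions, not values from the runbook.

```python
# Made-up output in the layout printed by free; the numbers are illustrative.
SAMPLE_FREE = """\
              total        used        free      shared  buff/cache   available
Mem:        8000000     3000000     2000000      100000     3000000     4500000
Swap:       2000000           0     2000000
"""


def free_memory_percent(free_output):
    """Return free memory as a percentage of total, read from the 'Mem:' line."""
    mem_line = next(line for line in free_output.splitlines() if line.startswith("Mem:"))
    fields = mem_line.split()
    total, free = float(fields[1]), float(fields[3])
    return free * 100 / total


THRESHOLD = 20.0  # hypothetical alert threshold, in percent
pct = free_memory_percent(SAMPLE_FREE)
needs_alert = pct < THRESHOLD
print(pct, needs_alert)
```

Note that the `free` column excludes memory used for buffers and cache; depending on the goal, the `available` column may be a better basis for alerting.
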
  6. 6

    Get the top memory consumers

    Identify the culprits in memory consumption.

    import time

    n = "10"
    # Sort processes by memory and keep the header row plus the top n entries.
    cmd = f'top -b -n 1 -o %MEM | grep "%MEM" -A {n}'
    # Retry up to three times in case the command returns no output.
    for i in range(0, 3):
        op = _exe(hostname, cmd, cred_label=cred_label)
        if op:
            break
        else:
            time.sleep(1)
    print(op)
    procdata = op
    1. 6.1

      Plot top memory consumers

      Visualize the memory consumption by plotting it against the consumers.

      # Skip the header row, then read %MEM (field 9) and COMMAND (field 11) from each line.
      lines = procdata.split('\n')[1:]
      xs = []
      ys = []
      for line in lines:
          parts = line.split()
          if len(parts) == 12:
              ys.append(float(parts[9]))
              xs.append(str(parts[11]))
      context.plot.xlabel = "processes"
      context.plot.ylabel = "% memory consumed"
      context.plot.title = "Top memory consumers"
      context.plot.add_trace(name="Top memory consumers", xpts=xs, ypts=ys, tracetype="bar")
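
The column extraction used for the plot can be exercised on sample process lines. The sketch below is a variation, not the runbook's code: it caps the split at 11 so that a command name containing spaces stays in one piece instead of being skipped. The name `top_consumers` and the sample rows are made up.

```python
# Made-up rows in top's default 12-column layout.
SAMPLE_TOP = """\
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1234 root      20   0 2500000 800000  30000 S   1.0  10.2   5:01.22 java
    987 mysql     20   0 1800000 400000  20000 S   0.5   5.1   2:11.09 mysqld
   2222 alice     20   0 1200000 240000  15000 S   0.3   3.0   0:40.10 Web Content
"""


def top_consumers(procdata):
    """Return (command, percent_memory) pairs from top output, skipping the header."""
    rows = []
    for line in procdata.splitlines()[1:]:
        # At most 11 splits, so everything after TIME+ is kept as the command.
        parts = line.split(None, 11)
        if len(parts) == 12:
            rows.append((parts[11].strip(), float(parts[9])))
    return rows


consumers = top_consumers(SAMPLE_TOP)
print(consumers)
```
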
  7. 7

    Check disk space

    This command checks the disk space consumed on the root filesystem.

    # Retry up to three times in case the command returns no output.
    for i in range(0, 3):
        dfoutput = _exe(hostname, 'df /', cred_label=cred_label)
        if dfoutput:
            break
    print(dfoutput)
    1. 7.1

      Get available disk space

      Process the output of df and extract the available disk space percentage.

      parts = dfoutput.split()
      # The first token containing '%' is the "Use%" header; the second is the value.
      perc_parts = [x for x in parts if '%' in x]
      used_disk_space = float(perc_parts[1].strip('%'))
      available_disk = 100.0 - used_disk_space
      print(available_disk)
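
The df parsing and the 25% check that follows it can be tested together on sample output. This is a sketch under assumptions: the helper name `available_disk_percent` and the sample filesystem line are invented, and it reads the data row directly rather than relying on the position of the `Use%` header token.

```python
# Made-up output in the layout printed by "df /".
SAMPLE_DF = """\
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/xvda1      20000000 16000000   4000000  80% /
"""


def available_disk_percent(df_output):
    """Return 100 minus the Use% value found on the last line of df output."""
    data_line = df_output.strip().splitlines()[-1]
    use_field = next(f for f in data_line.split() if f.endswith("%"))
    return 100.0 - float(use_field.rstrip("%"))


available = available_disk_percent(SAMPLE_DF)
# Same cutoff as the runbook: cleanup is skipped only when more than 25% is available.
needs_cleanup = available <= 25.0
print(available, needs_cleanup)
```
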
    2. 7.2

      Perform disk cleanup if needed

      Compare the available disk space percentage against a threshold. If more than 25% is available, the cleanup sub-tasks are skipped; otherwise disk cleanup is triggered.

      if available_disk > 25.0:
          context.skip_sub_tasks = True
      1. 7.2.1

        Notify about disk space before cleaning up

        channel = "alerts"
        message = "The disk has filled up! Only this much left: " + str(available_disk) + "%"
        message += "\n Going to clean up the disk with docker system prune -a -f --volumes"
        cred_label = "slack_creds"
        1. 7.2.1.1

          Post a message to a Slack channel

          Post the formatted message to a given Slack channel. Use the cred_label to get the right credentials stored in the backend.

          from slack_sdk import WebClient
          from slack_sdk.errors import SlackApiError

          print("Message: ", message)
          print("channel: ", channel)

          # Retrieve credentials based on the label provided.
          creds_resp = _get_creds(cred_label)
          creds = creds_resp['creds']
          slack_token = creds['password']

          # Initialize the Slack client.
          client = WebClient(token=slack_token)

          # Send a message to Slack and report the outcome.
          def send_message_to_slack(channel, message):
              try:
                  response = client.chat_postMessage(
                      channel=channel,
                      text=message
                  )
                  print("Message sent successfully:", response["ts"])
              except SlackApiError as e:
                  print(f"Error sending message: {e.response['error']}")

          send_message_to_slack(channel, message)
      2. 7.2.2

        Clean up disk

        This command prunes all unused images that bloat our storage; images that are in use are not touched. It also deletes stopped containers and, because of the --volumes flag, unused volumes.

        cmd = "docker system prune -a -f --volumes"
        op = _exe(hostname, cmd)
        print(op)
      3. 7.2.3

        Notify again after cleaning up the disk

        channel = "alerts"
        message = "After cleaning up, here's the disk space: \n"
        message += '```\n'
        message += _exe(hostname, "df /")
        message += '```\n'
        cred_label = "slack_creds"
        1. 7.2.3.1

          Post a message to a Slack channel

          Post the formatted message to a given Slack channel. Use the cred_label to get the right credentials stored in the backend.

          from slack_sdk import WebClient
          from slack_sdk.errors import SlackApiError

          print("Message: ", message)
          print("channel: ", channel)

          # Retrieve credentials based on the label provided.
          creds_resp = _get_creds(cred_label)
          creds = creds_resp['creds']
          slack_token = creds['password']

          # Initialize the Slack client.
          client = WebClient(token=slack_token)

          # Send a message to Slack and report the outcome.
          def send_message_to_slack(channel, message):
              try:
                  response = client.chat_postMessage(
                      channel=channel,
                      text=message
                  )
                  print("Message sent successfully:", response["ts"])
              except SlackApiError as e:
                  print(f"Error sending message: {e.response['error']}")

          send_message_to_slack(channel, message)