Disk Space Monitoring on EC2 Instances

• Use Case: Monitor disk usage on EC2 instances and trigger alerts when usage exceeds thresholds.

• Integrate with CloudWatch to monitor disk space usage.

• Automatically trigger cleanup scripts or scale up disk space when thresholds are exceeded.
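The threshold check described in the use case can be sketched with the Python standard library alone. This is a minimal illustration rather than the runbook's own code: the names `used_percent` and `classify` and the 75% and 90% thresholds are assumptions chosen for the example.

```python
import shutil


def used_percent(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used * 100.0 / usage.total


def classify(percent_used, warn_at=75.0, critical_at=90.0):
    """Map a usage percentage to an alert level (thresholds are assumptions)."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    pct = used_percent("/")
    print(f"/ is {pct:.1f}% used: {classify(pct)}")
```

A "warning" result could feed a notification and a "critical" result a cleanup step, which is the flow the tasks below follow.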

  1. 1

    Health check for a host

    This task performs some basic health checks on a host. Three important things to check on any host are its CPU, memory, and disk space. While there are basic commands to check each of these, the following runbook processes the outputs of those commands and extracts the relevant information to see whether the host needs attention.

    1. 1.1

      CPU check

      This command prints the CPU summary line from top on the Linux host.

      top -b -n 1 | grep "%Cpu"
    2. 1.2

      Get idle CPU percentage

      The top output above shows what percentage of the CPU is idle. Split the output, find the part that contains 'id', and extract the number.

      # Find the CPU summary line in the top output captured above.
      lines = topoutput.split('\n')
      line = [x for x in lines if 'Cpu' in x][0]
      parts = line.split(',')
      # Drop the leading "%Cpu(s):" label so every part reads "<value> <label>".
      parts[0] = ' '.join(parts[0].split()[1:])
      idle_part = [x for x in parts if 'id' in x]
      idle_cpu = float(idle_part[0].split()[0])
      print("Idle CPU: ", idle_cpu)

      # Plot the full breakdown (us, sy, ni, id, ...) as a pie chart.
      labels = [x.split()[1] for x in parts]
      values = [float(x.split()[0]) for x in parts]  # numeric values for plotting
      context.plot.xlabel = ""
      context.plot.ylabel = ""
      context.plot.title = "CPU usage"
      context.plot.add_trace(name="CPU usage", xpts=labels, ypts=values, tracetype="pie")
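To try the parsing step outside the runbook, the same idea can be written as a standalone function. This is a sketch under assumptions: `parse_cpu_summary` is a hypothetical helper name, and the sample line only imitates the layout printed by procps top with made-up values.

```python
def parse_cpu_summary(top_output):
    """Return {label: percent} parsed from the '%Cpu(s)' line of `top -b -n 1`."""
    line = next(l for l in top_output.splitlines() if "Cpu" in l)
    # Drop the '%Cpu(s):' prefix, then split the '<value> <label>' pairs on commas.
    _, _, fields = line.partition(":")
    usage = {}
    for field in fields.split(","):
        value, label = field.split()
        usage[label] = float(value)
    return usage


# Sample line in the layout printed by procps top (values are made up).
sample = "%Cpu(s):  1.5 us,  0.5 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st"
usage = parse_cpu_summary(sample)
print("Idle CPU:", usage["id"])
```

Returning a dictionary keeps the labels and values paired, so the idle figure and the pie-chart inputs come from one pass over the line.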
    3. 1.3

      Get CPU utilization of an instance

      This task fetches the CPU utilization data points from AWS CloudWatch and plots them.

      1. 1.3.1

        Get instance ID from instance label

        # Look up the instance whose public IP address matches the hostname.
        cmd = f"aws ec2 describe-instances --filters Name=ip-address,Values={hostname} --query 'Reservations[*].Instances[*].[InstanceId]' --output text --region us-west-2"
        print("cmd: ", cmd)
        cred_label = "aws_read_only_user"
        op = _exe(None, cmd, cred_label=cred_label)
        print("op: ", op)
        instance_id = op.strip()
        print("Instance Id: ", instance_id)
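When no instance matches the filter, the CLI text output is empty and the stripped result is an empty string. A small validation step, sketched below, could catch that before the ID is passed on. The helper name `first_instance_id` is hypothetical; the pattern relies on EC2 instance IDs being `i-` followed by 8 or 17 hexadecimal characters.

```python
import re

# EC2 instance IDs are 'i-' followed by 8 (legacy) or 17 hexadecimal characters.
INSTANCE_ID_RE = re.compile(r"\bi-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")


def first_instance_id(cli_output):
    """Return the first instance ID found in the CLI text output, or None."""
    match = INSTANCE_ID_RE.search(cli_output)
    return match.group(0) if match else None


print(first_instance_id("i-0123456789abcdef0\n"))
print(first_instance_id(""))
```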
      2. 1.3.2

        Parse the period string

        import re
        import math
        import datetime

        # period_str must carry a unit (d, h, or m); otherwise period stays a
        # string and the arithmetic below fails.
        period = period_str
        max_data_points = 1430

        # Convert the requested period to seconds. Matching the first letter of
        # the unit also covers the longer spellings ("day", "hrs", "minutes", ...).
        tt = re.search(r"(\d+)\s*d", period_str, re.IGNORECASE)
        if tt:
            days = tt.group(1)
            period = int(days) * 3600 * 24
        tt = re.search(r"(\d+)\s*h", period_str, re.IGNORECASE)
        if tt:
            hours = tt.group(1)
            period = int(hours) * 3600
        tt = re.search(r"(\d+)\s*m", period_str, re.IGNORECASE)
        if tt:
            minutes = tt.group(1)
            period = int(minutes) * 60

        # Choose the query interval (seconds) so that roughly 1440 data points
        # or fewer come back.
        interval = 1
        if math.ceil(period) <= 1440:
            interval = 1
        elif math.ceil(period/5) <= 1440:
            interval = 5
        elif math.ceil(period/10) <= 1440:
            interval = 10
        elif math.ceil(period/30) <= 1440:
            interval = 30
        else:
            interval = math.ceil(period/max_data_points)
            # Round up to a whole number of minutes.
            interval = 60*math.ceil(interval/60)
        interval = str(interval)

        # Build the query window ending now.
        now = datetime.datetime.now().timestamp()
        end_time = datetime.datetime.now().isoformat()
        then = now - period
        start_time = datetime.datetime.fromtimestamp(then).isoformat()
        print("Start time: ", start_time)
        print("End time: ", end_time)
        print("Interval: ", interval)
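The unit conversion can also be isolated in a small function that is easy to test. This is a simplified sketch, not the task's exact logic: `period_to_seconds` and `UNIT_SECONDS` are hypothetical names, and the function uses the first number-plus-unit pair it finds and raises an error when no unit is present.

```python
import re

# Seconds per unit, keyed by the first letter of the unit name.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60}


def period_to_seconds(period_str):
    """Convert strings such as '3 days', '12h', or '90 min' to seconds."""
    match = re.search(r"(\d+)\s*([dhm])", period_str, re.IGNORECASE)
    if not match:
        raise ValueError(f"no recognizable period in {period_str!r}")
    count, unit = int(match.group(1)), match.group(2).lower()
    return count * UNIT_SECONDS[unit]


for text in ("3 days", "12h", "90 min"):
    print(text, "->", period_to_seconds(text), "seconds")
```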
      3. 1.3.3

        Plot CPU utilization

        import json

        # Assemble the CloudWatch query from the values computed in the previous tasks.
        cmd = 'aws cloudwatch get-metric-statistics --metric-name CPUUtilization'
        cmd += ' --region ' + 'us-west-2'
        cmd += ' --start-time ' + start_time
        cmd += ' --end-time ' + end_time
        cmd += ' --period ' + interval
        cmd += ' --namespace AWS/EC2 --statistics Maximum --dimensions Name=InstanceId,Value=' + instance_id
        op = _exe(None, cmd, cred_label="aws_read_only_user")

        # The CLI returns JSON on success; keep the raw output for the error message otherwise.
        exception_msg = ""
        try:
            jsn = json.loads(op)
        except Exception as e:
            exception_msg = "Got this exception: \n"
            exception_msg += str(e)
            exception_msg += "\n"
            exception_msg += op

        if not exception_msg:
            # Sort the data points by time before plotting.
            datapoints = sorted(jsn['Datapoints'], key = lambda i: i['Timestamp'])
            x = [x['Timestamp'] for x in datapoints]
            y = [x['Maximum'] for x in datapoints]
            context.plot.xlabel = 'timestamp'
            context.plot.ylabel = 'CPU % util'
            context.plot.title = 'CPU util for instance: ' + str(instance_id)
            context.plot.add_trace(name="CPU util", xpts=x, ypts=y, tracetype="lines")
        else:
            # Surface the failure instead of silently skipping the plot.
            print(exception_msg)
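The shape of the data being plotted is easier to see with a sample response. The sketch below parses a hand-written JSON document that imitates the `Datapoints` structure returned by `get-metric-statistics`; the values are made up, and `to_series` is a hypothetical helper name.

```python
import json

# Hand-written sample imitating the CLI response; values are made up.
sample = """{
  "Label": "CPUUtilization",
  "Datapoints": [
    {"Timestamp": "2024-01-01T00:10:00+00:00", "Maximum": 12.5, "Unit": "Percent"},
    {"Timestamp": "2024-01-01T00:00:00+00:00", "Maximum": 7.0, "Unit": "Percent"},
    {"Timestamp": "2024-01-01T00:05:00+00:00", "Maximum": 30.25, "Unit": "Percent"}
  ]
}"""


def to_series(raw_json, statistic="Maximum"):
    """Return (timestamps, values) sorted by timestamp."""
    datapoints = sorted(json.loads(raw_json)["Datapoints"], key=lambda d: d["Timestamp"])
    return [d["Timestamp"] for d in datapoints], [d[statistic] for d in datapoints]


xs, ys = to_series(sample)
peak = max(ys) if ys else None
print("Peak CPU utilization:", peak)
```

Sorting matters because the data points are not guaranteed to arrive in chronological order; ISO 8601 timestamps in the same format and time zone sort correctly as strings.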
    4. 1.4

      Check memory

      Check how much free memory is available on this host.

      free
    5. 1.5

      Get free memory percentage

      Calculate the percentage of free memory so that it can be compared against an alert threshold.

      # Parse the "Mem:" row of the free output captured above.
      mem_lines = freeoutput.split('\n')
      mem_line = [x for x in mem_lines if 'Mem:' in x][0]
      total_mem = float(mem_line.split()[1])  # "total" column
      free_mem = float(mem_line.split()[3])   # "free" column
      free_mem_perc = free_mem*100/total_mem
      print(free_mem_perc)
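The calculation can be checked against a sample of `free` output. The sketch below assumes the column layout of recent procps-ng releases (total, used, free, shared, buff/cache, available); the numbers are made up and `memory_percentages` is a hypothetical helper. It also reports the "available" column, which estimates memory usable by new processes and is often a more useful signal than "free" alone.

```python
# Sample output in the procps-ng layout; the numbers are made up.
sample = """\
              total        used        free      shared  buff/cache   available
Mem:        8000000     3000000     1000000      100000     4000000     4600000
Swap:       2097148           0     2097148
"""


def memory_percentages(free_output):
    """Return (free %, available %) computed from the 'Mem:' row."""
    row = next(l for l in free_output.splitlines() if l.startswith("Mem:")).split()
    total, free, available = float(row[1]), float(row[3]), float(row[6])
    return free * 100 / total, available * 100 / total


free_pct, available_pct = memory_percentages(sample)
print(f"free: {free_pct}%  available: {available_pct}%")
```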
    6. 1.6

      Get the top memory consumers

      Identify the processes that consume the most memory.

      import time

      n = "10"
      # Sort processes by memory and keep the header row plus the top n rows.
      cmd = f'top -b -n 1 -o %MEM | grep "%MEM" -A {n}'
      # Retry up to three times in case the remote command returns nothing.
      for i in range(0,3):
          op = _exe(hostname, cmd, cred_label=cred_label)
          if op:
              break
          else:
              time.sleep(1)
      print(op)
      procdata = op
      1. 1.6.1

        Plot top memory consumers

        Visualize the memory consumption by plotting it against the consumers.

        # Skip the header row, then read %MEM (column 10) and COMMAND (column 12).
        lines = procdata.split('\n')[1:]
        xs = []
        ys = []
        for line in lines:
            parts = line.split()
            if len(parts) == 12:
                ys.append(float(parts[9]))
                xs.append(str(parts[11]))
        context.plot.xlabel = "processes"
        context.plot.ylabel = "% memory consumed"
        context.plot.title = "Top memory consumers"
        context.plot.add_trace(name="Top memory consumers", xpts=xs, ypts=ys, tracetype="bar")
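The row parsing can be exercised on its own with a sample of the filtered `top` output. This is a sketch with made-up process data; `top_consumers` is a hypothetical helper, and unlike the task above it keeps command names that contain spaces by joining everything from the twelfth column onward.

```python
# Sample rows in the default top column layout; the process data is made up.
sample = """\
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1234 mysql     20   0 2500000 800000  20000 S   1.0  10.2  12:34.56 mysqld
   2345 root      20   0 1500000 400000  15000 S   0.5   5.1   1:02.03 dockerd
    345 app       20   0  900000 200000  10000 S   0.0   2.5   0:10.00 python3
"""


def top_consumers(top_output):
    """Return [(command, mem_percent), ...] for each process row."""
    rows = []
    for line in top_output.splitlines():
        parts = line.split()
        # Process rows start with a numeric PID; this skips the header row.
        if len(parts) >= 12 and parts[0].isdigit():
            rows.append((" ".join(parts[11:]), float(parts[9])))
    return rows


for command, mem in top_consumers(sample):
    print(f"{command}: {mem}%")
```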
    7. 1.7

      Check disk space

      This command checks how much disk space is consumed on the root filesystem.

      # Run df on the root filesystem, retrying up to three times if the output is empty.
      for i in range(0,3):
          dfoutput = _exe(hostname, 'df /', cred_label=cred_label)
          if dfoutput:
              break
      print(dfoutput)
      1. 1.7.1

        Get available disk space

        Process the output of df and extract the available disk space percentage.

        parts = dfoutput.split()
        # The first token containing '%' is the "Use%" header; the second is the value.
        perc_parts = [x for x in parts if '%' in x]
        used_disk_space = float(perc_parts[1].strip('%'))
        available_disk = 100.0 - used_disk_space
        print(available_disk)
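The extraction can be verified against a sample of `df /` output. In the sketch below the device name and sizes are made up, and `available_percent` is a hypothetical helper; it reads the Use% field from the right-hand side of the last row, which assumes the mount point contains no spaces.

```python
# Sample `df /` output; the device name and sizes are made up.
sample = """\
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/nvme0n1p1  30000000 24000000   6000000  80% /
"""


def available_percent(df_output):
    """Return 100 minus the Use% value from the last row of df output."""
    rows = [line for line in df_output.splitlines() if line.strip()]
    # Use% is the second-to-last field when the mount point has no spaces.
    use_field = rows[-1].split()[-2]
    return 100.0 - float(use_field.rstrip("%"))


print(available_percent(sample))
```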
      2. 1.7.2

        Perform disk cleanup if needed

        Compare the available disk space percentage against a 25% threshold. If it is at or below the threshold, run the cleanup sub-tasks; otherwise skip them.

        if available_disk > 25.0:
            context.skip_sub_tasks = True
        1. 1.7.2.1

          Notify about disk space before cleaning up

          # Inputs for the Slack sub-task below.
          channel = "alerts"
          message = "The disk has filled up! Only this much left: " + str(available_disk) + "%"
          message += "\n Going to clean up the disk with docker system prune -a -f --volumes"
          cred_label = "slack_creds"
          1. 1.7.2.1.1

            Post a message to a Slack channel

            Post the formatted message to a given Slack channel. Use the cred_label to get the right credentials stored in the backend.

            from slack_sdk import WebClient
            from slack_sdk.errors import SlackApiError

            print("Message: ", message)
            print("channel: ", channel)

            # Retrieve credentials based on the label provided
            creds_resp = _get_creds(cred_label)
            creds = creds_resp['creds']
            slack_token = creds['password']

            # Initialize the Slack client
            client = WebClient(token=slack_token)

            # Send the message to Slack and report the outcome
            def send_message_to_slack(channel, message):
                try:
                    response = client.chat_postMessage(
                        channel=channel,
                        text=message
                    )
                    print("Message sent successfully:", response["ts"])
                except SlackApiError as e:
                    print(f"Error sending message: {e.response['error']}")

            send_message_to_slack(channel, message)
        2. 1.7.2.2

          Clean up disk

          This command prunes all the unused images that bloat our storage and does not touch the ones that are in use. It also deletes stopped containers, unused networks, and the build cache, and, because of --volumes, unused volumes.

          cmd = "docker system prune -a -f --volumes"
          op = _exe(hostname, cmd)
          print(op)
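The prune output normally ends with a "Total reclaimed space" line, which could be pulled out for the follow-up notification. The sketch below is an optional addition, not part of the runbook: `reclaimed_space` is a hypothetical helper and the sample output is hand-written.

```python
import re

# Hand-written sample imitating the tail of `docker system prune` output.
sample = """\
Deleted Images:
untagged: example/app:old

Total reclaimed space: 1.5GB
"""


def reclaimed_space(prune_output):
    """Return the reclaimed-space figure as printed by docker, or None."""
    match = re.search(r"Total reclaimed space:\s*(\S+)", prune_output)
    return match.group(1) if match else None


print(reclaimed_space(sample))
```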
        3. 1.7.2.3

          Notify again after cleaning up the disk

          # Inputs for the Slack sub-task below, including fresh df output.
          channel = "alerts"
          message = "After cleaning up, here's the disk space: \n"
          message += '```\n'
          message += _exe(hostname, "df /")
          message += '```\n'
          cred_label = "slack_creds"
          1. 1.7.2.3.1

            Post a message to a Slack channel

            Post the formatted message to a given Slack channel. Use the cred_label to get the right credentials stored in the backend.

            from slack_sdk import WebClient
            from slack_sdk.errors import SlackApiError

            print("Message: ", message)
            print("channel: ", channel)

            # Retrieve credentials based on the label provided
            creds_resp = _get_creds(cred_label)
            creds = creds_resp['creds']
            slack_token = creds['password']

            # Initialize the Slack client
            client = WebClient(token=slack_token)

            # Send the message to Slack and report the outcome
            def send_message_to_slack(channel, message):
                try:
                    response = client.chat_postMessage(
                        channel=channel,
                        text=message
                    )
                    print("Message sent successfully:", response["ts"])
                except SlackApiError as e:
                    print(f"Error sending message: {e.response['error']}")

            send_message_to_slack(channel, message)