agent: |
Disk Space Monitoring on EC2 Instances
• Use Case: Monitor disk usage on EC2 instances and trigger alerts when usage exceeds thresholds.
• Integrate with CloudWatch to monitor disk space usage.
• Automatically trigger cleanup scripts or scale up disk space when thresholds are exceeded.
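As a rough illustration of the threshold check described in the use case, the sketch below measures disk usage with Python's standard `shutil.disk_usage` and compares it against a limit. The function names and the 75% default are assumptions made for this example, not part of the runbook; publishing the value to CloudWatch and triggering cleanup are left out.

```python
import shutil


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def exceeds_threshold(used_percent, threshold=75.0):
    """True when usage is above the (hypothetical) alert threshold."""
    return used_percent > threshold


if __name__ == "__main__":
    used = disk_used_percent("/")
    print(f"Disk used: {used:.1f}%")
    if exceeds_threshold(used):
        print("Threshold exceeded: an alert or cleanup would be triggered here")
```

The value can differ slightly from the `Use%` column of `df`, which excludes blocks reserved for the root user from its calculation.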
- 1 Health check for a host
This task performs some basic health checks on a host. Three important things to check on any host are its CPU, memory, and disk space. While there are basic commands to check each of these, the following runbook processes the output of those commands and extracts the relevant information to see whether the host needs attention.
- 1.1 CPU check
This command checks CPU usage on the Linux host by taking a single batch-mode snapshot of top and keeping the CPU summary line.
    top -b -n 1 | grep "%Cpu"
- 1.2 Get idle CPU percentage
The top output above includes the idle CPU percentage. Split the output, find the part that contains 'id', and extract the number.
    # Find the CPU summary line, e.g. "%Cpu(s):  1.2 us,  0.3 sy, ... 98.5 id, ..."
    lines = topoutput.split('\n')
    line = [x for x in lines if 'Cpu' in x][0]
    parts = line.split(',')
    # Drop the leading "%Cpu(s):" label so every part reads "<value> <label>"
    parts[0] = ' '.join(parts[0].split()[1:])
    idle_part = [x for x in parts if 'id' in x]
    idle_cpu = float(idle_part[0].split()[0])
    print("Idle CPU: ", idle_cpu)
    # Plot the whole CPU breakdown as a pie chart
    labels = [x.split()[1] for x in parts]
    values = [x.split()[0] for x in parts]
    context.plot.xlabel = ""
    context.plot.ylabel = ""
    context.plot.title = "CPU usage"
    context.plot.add_trace(name="CPU usage", xpts=labels, ypts=values, tracetype="pie")
- 1.3 Get CPU utilization of an instance
This task fetches CPU utilization data points from AWS CloudWatch and plots them.
- 1.3.1 Get instance ID from instance label
    cmd = f"aws ec2 describe-instances --filters Name=ip-address,Values={hostname} --query 'Reservations[*].Instances[*].[InstanceId]' --output text --region us-west-2"
    print("cmd: ", cmd)
    cred_label = "aws_read_only_user"
    op = _exe(None, cmd, cred_label=cred_label)
    print("op: ", op)
    instance_id = op.strip()
    print("Instance Id: ", instance_id)
- 1.3.2 Parse the period string
    import re
    import math
    import datetime

    period = period_str
    max_data_points = 1430

    # Convert the period string to seconds. Later matches override earlier
    # ones: days first, then hours, then minutes.
    tt = re.search(r"(\d+)\s*d", period_str, re.IGNORECASE)  # d, day, days
    if tt:
        period = int(tt.group(1)) * 3600 * 24
    tt = re.search(r"(\d+)\s*h", period_str, re.IGNORECASE)  # h, hr, hrs, hour, hours
    if tt:
        period = int(tt.group(1)) * 3600
    tt = re.search(r"(\d+)\s*m", period_str, re.IGNORECASE)  # m, min, mins, minute, minutes
    if tt:
        period = int(tt.group(1)) * 60
    # Assumption: a period string without a unit is a number of seconds.
    period = int(period)

    # Pick the smallest interval that keeps the number of data points within 1440.
    interval = 1
    if math.ceil(period) <= 1440:
        interval = 1
    elif math.ceil(period/5) <= 1440:
        interval = 5
    elif math.ceil(period/10) <= 1440:
        interval = 10
    elif math.ceil(period/30) <= 1440:
        interval = 30
    else:
        interval = math.ceil(period/max_data_points)
        interval = 60*math.ceil(interval/60)  # round up to a multiple of 60 seconds
    interval = str(interval)

    now = datetime.datetime.now().timestamp()
    end_time = datetime.datetime.now().isoformat()
    then = now - period
    start_time = datetime.datetime.fromtimestamp(then).isoformat()
    print("Start time: ", start_time)
    print("End time: ", end_time)
    print("Interval: ", interval)
- 1.3.3 Plot CPU utilization
    import json

    # Build the CloudWatch query from the values computed in the previous steps
    cmd = 'aws cloudwatch get-metric-statistics --metric-name CPUUtilization'
    cmd += ' --region ' + 'us-west-2'
    cmd += ' --start-time ' + start_time
    cmd += ' --end-time ' + end_time
    cmd += ' --period ' + interval
    cmd += ' --namespace AWS/EC2 --statistics Maximum --dimensions Name=InstanceId,Value=' + instance_id
    op = _exe(None, cmd, cred_label="aws_read_only_user")
    # print(op)

    exception_msg = ""
    try:
        jsn = json.loads(op)
    except Exception as e:
        exception_msg = "Got this exception: \n"
        exception_msg += str(e)
        exception_msg += "\n"
        exception_msg += op

    if not exception_msg:
        # CloudWatch does not return data points in order, so sort by timestamp
        datapoints = sorted(jsn['Datapoints'], key=lambda i: i['Timestamp'])
        x = [x['Timestamp'] for x in datapoints]
        y = [x['Maximum'] for x in datapoints]
        context.plot.xlabel = 'timestamp'
        context.plot.ylabel = 'CPU % util'
        context.plot.title = 'CPU util for instance: ' + str(instance_id)
        context.plot.add_trace(name="CPU util", xpts=x, ypts=y, tracetype="lines")
    else:
        # Surface the parse failure instead of silently skipping the plot
        print(exception_msg)
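The period handling in step 1.3.2 can be sketched as two small standalone functions, which makes the logic easier to test outside the runbook. This is a simplified sketch, not the runbook's code: the function names are hypothetical, only the first number-and-unit pair in the string is used, and a single 1440-point limit is assumed (the runbook itself uses 1430 in its fallback branch as a margin). The candidate values reflect the periods CloudWatch accepts: 1, 5, 10, 30, or a multiple of 60 seconds.

```python
import math
import re

# Seconds per unit; the accepted unit spellings are an assumption of this sketch.
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60}


def parse_period(period_str):
    """Convert strings such as '2 days', '6h' or '30 mins' to seconds."""
    match = re.search(r"(\d+)\s*([dhm])", period_str, re.IGNORECASE)
    if not match:
        raise ValueError(f"Unrecognized period: {period_str!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2).lower()]


def choose_interval(period_seconds, max_points=1440):
    """Pick the smallest allowed interval that yields at most max_points points."""
    for candidate in (1, 5, 10, 30):
        if math.ceil(period_seconds / candidate) <= max_points:
            return candidate
    # Otherwise round the required spacing up to a multiple of 60 seconds.
    raw = math.ceil(period_seconds / max_points)
    return 60 * math.ceil(raw / 60)


if __name__ == "__main__":
    for text in ("30 mins", "6h", "2 days"):
        seconds = parse_period(text)
        print(text, "->", seconds, "s, interval", choose_interval(seconds), "s")
```

Raising `ValueError` on an unrecognized string is a design choice of the sketch; it fails early rather than passing a malformed `--period` to the AWS CLI.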
- 1.4 Check memory
Check how much free memory is available on this host.
    free
- 1.5 Get free memory percentage
Calculate the percentage of free memory; if it is below a threshold, create an alert.
    # Parse the "Mem:" row of the free output: column 2 is total, column 4 is free
    mem_lines = freeoutput.split('\n')
    mem_line = [x for x in mem_lines if 'Mem:' in x][0]
    total_mem = float(mem_line.split()[1])
    free_mem = float(mem_line.split()[3])
    free_mem_perc = free_mem*100/total_mem
    print(free_mem_perc)
- 1.6 Get the top memory consumers
Identify the processes that consume the most memory.
    import time

    n = "10"
    cmd = f'top -b -n 1 -o %MEM | grep "%MEM" -A {n}'
    # Retry up to three times, pausing one second after an empty result
    for i in range(0,3):
        op = _exe(hostname, cmd, cred_label=cred_label)
        if op:
            break
        else:
            time.sleep(1)
    print(op)
    procdata = op
- 1.6.1 Plot top memory consumers
Visualize memory consumption by plotting it per process.
    # Skip the header row, then read %MEM (column 10) and COMMAND (column 12)
    lines = procdata.split('\n')[1:]
    xs = []
    ys = []
    for line in lines:
        parts = line.split()
        if len(parts) == 12:
            ys.append(float(parts[9]))
            xs.append(str(parts[11]))
    context.plot.xlabel = "processes"
    context.plot.ylabel = "% memory consumed"
    context.plot.title = "Top memory consumers"
    context.plot.add_trace(name="Top memory consumers", xpts=xs, ypts=ys, tracetype="bar")
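The parsing in step 1.6.1 keeps only rows that split into exactly 12 fields, so a process whose command contains spaces is skipped. A hedged variant is sketched below: it limits the split so the command stays in one piece. The function name and the sample output are made up for illustration, and the default top column layout (with `%MEM` in column 10) is assumed.

```python
# Made-up sample of `top -b -n 1 -o %MEM` process rows, for illustration only.
SAMPLE_TOP = """\
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1234 root      20   0 1234567 204800  10240 S   1.0  12.5   1:23.45 java
   5678 ubuntu    20   0  654321 102400   5120 S   0.3   6.2   0:10.01 python3 app.py
"""


def parse_top_memory(procdata):
    """Return (command, percent_memory) pairs from top's process table."""
    consumers = []
    for line in procdata.splitlines():
        # Split into at most 12 fields so a command with spaces stays whole.
        fields = line.split(None, 11)
        # Skip the header row and anything that is not a process row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        consumers.append((fields[11].strip(), float(fields[9])))
    return consumers


if __name__ == "__main__":
    for command, mem in parse_top_memory(SAMPLE_TOP):
        print(f"{command}: {mem}% of memory")
```

The resulting pairs can be unzipped into the `xs` and `ys` lists that the plotting step expects.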
- 1.7 Check disk space
This command checks the disk space consumed on the root filesystem.
    # Retry up to three times in case the command returns nothing
    for i in range(0,3):
        dfoutput = _exe(hostname, 'df /', cred_label=cred_label)
        if dfoutput:
            break
    print(dfoutput)
- 1.7.1 Get available disk space
Process the output of df and extract the available disk space percentage.
    parts = dfoutput.split()
    # Two tokens contain '%': the "Use%" column header, then the actual value
    perc_parts = [x for x in parts if '%' in x]
    used_disk_space = float(perc_parts[1].strip('%'))
    available_disk = 100.0 - used_disk_space
    print(available_disk)
- 1.7.2 Perform disk cleanup if needed
Check the available disk space percentage against a threshold; if it drops below the threshold, trigger disk cleanup.
    # More than 25% available: skip the notification and cleanup sub-tasks
    if available_disk > 25.0:
        context.skip_sub_tasks = True
- 1.7.2.1 Notify about disk space before cleaning up
    channel = "alerts"
    # available_disk is a percentage, so say so in the message
    message = "The disk has filled up! Only this much left: " + str(available_disk) + "%"
    message += "\n Going to clean up the disk with docker system prune -a -f --volumes"
    cred_label = "slack_creds"
- 1.7.2.1.1 Post a message to a Slack channel
Post the formatted message to a given Slack channel. Use the cred_label to get the right credentials stored in the backend.
    from slack_sdk import WebClient
    from slack_sdk.errors import SlackApiError

    print("Message: ", message)
    print("channel: ", channel)
    # print("cred_label: ", cred_label)
    # print("num_duplicates: ", num_duplicates)

    # Retrieve credentials based on a label provided
    creds_resp = _get_creds(cred_label)
    creds = creds_resp['creds']
    slack_token = creds['password']

    # Initialize the Slack client
    client = WebClient(token=slack_token)

    # Function to send message to Slack
    def send_message_to_slack(channel, message):
        try:
            response = client.chat_postMessage(
                channel=channel,
                text=message
            )
            print("Message sent successfully:", response["ts"])
        except SlackApiError as e:
            print(f"Error sending message: {e.response['error']}")

    send_message_to_slack(channel, message)
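The threshold check and the notification can also be combined into one small function, which makes the decision easy to test without a Slack workspace. This is only a sketch under stated assumptions: `build_alert` and `post_if_needed` are hypothetical names, and `client` is any object that offers a `chat_postMessage(channel=..., text=...)` method, as slack_sdk's `WebClient` does. The 25% threshold mirrors the check in step 1.7.2.

```python
def build_alert(available_percent, cleanup_cmd):
    """Format the pre-cleanup notification text."""
    return (
        f"The disk has filled up! Only {available_percent:.1f}% left.\n"
        f"Going to clean up the disk with {cleanup_cmd}"
    )


def post_if_needed(client, channel, available_percent, threshold=25.0,
                   cleanup_cmd="docker system prune -a -f --volumes"):
    """Post an alert when available space is at or below the threshold.

    Returns the message timestamp reported by the client, or None when
    enough space is available and nothing was posted.
    """
    if available_percent > threshold:
        return None
    response = client.chat_postMessage(
        channel=channel,
        text=build_alert(available_percent, cleanup_cmd),
    )
    return response["ts"]
```

Passing the client in as an argument, rather than creating it inside the function, lets a stub stand in for Slack during testing.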
- 1.7.2.2 Clean up disk
This command prunes all the unused images that bloat our storage; it does not touch the ones that are in use. It deletes stopped containers too and, because of the --volumes flag, unused volumes as well.
    cmd = "docker system prune -a -f --volumes"
    op = _exe(hostname, cmd)
    print(op)
- 1.7.2.3 Notify again after cleaning up the disk
    channel = "alerts"
    message = "After cleaning up, here's the disk space: \n"
    message += '```\n'
    message += _exe(hostname, "df /")
    message += '```\n'
    cred_label = "slack_creds"
- 1.7.2.3.1 Post a message to a Slack channel
Post the formatted message to a given Slack channel. Use the cred_label to get the right credentials stored in the backend.
    from slack_sdk import WebClient
    from slack_sdk.errors import SlackApiError

    print("Message: ", message)
    print("channel: ", channel)
    # print("cred_label: ", cred_label)
    # print("num_duplicates: ", num_duplicates)

    # Retrieve credentials based on a label provided
    creds_resp = _get_creds(cred_label)
    creds = creds_resp['creds']
    slack_token = creds['password']

    # Initialize the Slack client
    client = WebClient(token=slack_token)

    # Function to send message to Slack
    def send_message_to_slack(channel, message):
        try:
            response = client.chat_postMessage(
                channel=channel,
                text=message
            )
            print("Message sent successfully:", response["ts"])
        except SlackApiError as e:
            print(f"Error sending message: {e.response['error']}")

    send_message_to_slack(channel, message)
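To report how much the cleanup actually reclaimed, the `df /` output captured before and after can be compared. The sketch below is one possible way to do that; the function names and both sample outputs are made up for illustration. It reads the percentage from the last line of the output instead of relying on the position of the `Use%` header token.

```python
# Made-up `df /` outputs from before and after a cleanup, for illustration only.
DF_BEFORE = """\
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root       30298176 25753450   4528342  86% /
"""
DF_AFTER = """\
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root       30298176 12119270  18162522  41% /
"""


def used_percent_from_df(dfoutput):
    """Return the Use% value from the last non-empty line of df output."""
    lines = [ln for ln in dfoutput.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("Empty df output")
    for field in lines[-1].split():
        if field.endswith("%"):
            return float(field.rstrip("%"))
    raise ValueError("No percentage column found in df output")


def reclaimed_points(before, after):
    """Percentage points of disk usage freed between two df snapshots."""
    return used_percent_from_df(before) - used_percent_from_df(after)


if __name__ == "__main__":
    freed = reclaimed_points(DF_BEFORE, DF_AFTER)
    print(f"Cleanup freed {freed:.0f} percentage points of disk usage")
```

The result could be appended to the post-cleanup Slack message alongside the raw `df` output.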