agent: |
Application not responding
This runbook is an end-to-end health check automation for our application.
- 1 Get instance id of a hostname
This task gets the EC2 instance ID for a given host based on its Name tag.
    # Look up the instance whose Name tag matches the hostname.
    cmd = f'aws ec2 describe-instances --filters "Name=tag:Name,Values={hostname}" --query "Reservations[].Instances[].InstanceId" --output text'
    op = _exe(None, cmd)
    instance_id = op.strip()
    context.problem = True
    print(f"Instance id for {hostname} is {instance_id}")
- 2 Ensure that the instance is running
Get the instance state and check whether the instance is running. If it is not running, start it.
    cmd = f'aws ec2 describe-instances --instance-ids {instance_id} --query "Reservations[].Instances[].State.Name" --output text'
    op = _exe(None, cmd)
    # The host counts as up only if its state is "running".
    _problem = True
    if "running" in op:
        _problem = False
    _proceed = not _problem
    host_is_up = not _problem
    if host_is_up:
        msg = "Host is up"
        msg_type = "SUCCESS"
    else:
        msg = "Host is not up!"
        msg_type = "ERROR"
    print(msg)
    context.log(msg_type, msg)
    # Skip the start, wait, and mount sub-tasks when the host is already up.
    context.skip_sub_tasks = host_is_up
    task_title = context.task_title
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
- 2.1 Start an ec2 instance
    cmd = f'aws ec2 start-instances --instance-ids {instance_id}'
    op = _exe(None, cmd)
    msg = "Instance was down. Restarted the instance"
    print(msg)
    task_title = context.task_title
    msg_type = "SUCCESS"
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
- 2.1.1 Wait until the instance is running
It can take a while for the instance to come up. Poll the instance state at a fixed interval. After a specified number of attempts, give up and print an error message.
    import time
    op = ""
    cmd = f'aws ec2 describe-instances --instance-ids {instance_id} --query "Reservations[].Instances[].State.Name" --output text'
    # Check the state up to 3 times, waiting 60 seconds after each failed check.
    for attempt in range(3):
        op = _exe(None, cmd)
        if "running" in op:
            break
        time.sleep(60)
    task_title = context.task_title
    if "running" not in op:
        msg = f"Giving up. The host doesn't seem to be coming up: {instance_id}"
        msg_type = "ERROR"
        print(msg)
        # Stop the rest of the runbook; nothing else can succeed without the host.
        context.proceed = False
    else:
        msg = f"This instance {instance_id} is now running"
        msg_type = "SUCCESS"
        print(msg)
    context.log(msg_type, msg)
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
- 2.1.2 Mount the volumes if needed
    mount /dev/xvdb /data
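The wait step in task 2.1.1 is a bounded polling loop. The sketch below isolates that pattern so it can be exercised without AWS access. The name `wait_for_state` and its parameters are hypothetical, not part of the runbook; in the runbook, `get_state` would wrap the `_exe(None, cmd)` call that queries `State.Name`.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    wanted: str = "running",
    attempts: int = 3,
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state() until it reports `wanted` or the attempts run out."""
    for attempt in range(attempts):
        if wanted in get_state():
            return True
        # Sleep only between attempts, not after the last one.
        if attempt < attempts - 1:
            sleep(interval_s)
    return False


# Usage with a fake state source (assumption: stands in for the AWS CLI call).
states = iter(["pending", "pending", "running"])
naps = []
came_up = wait_for_state(lambda: next(states), sleep=naps.append)
print(came_up, naps)
```

The AWS CLI also provides a built-in waiter, `aws ec2 wait instance-running --instance-ids {instance_id}`, which may be able to replace a hand-written loop like this one.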
- 3 Ensure that the application is running
Check whether our application is running by verifying that every service is up as expected.
    cmd = 'cd /home/ubuntu/dagknows_src/app_docker_compose_build_deploy'
    op = _exei(instance_id, cmd)
    cmd = f'sudo docker-compose -f {docker_compose_file} ps'
    services = [
        "postgres", "adminer", "elasticsearch", "documentation",
        "req-router", "conv-mgr", "apigateway", "ansi-processing",
        "settings", "conv-sse", "proxy-sse", "nlp", "dag", "nginx"
    ]
    broken_services = []
    _problem = False
    # Query each service separately; a service is healthy if its status shows "Up".
    for service in services:
        cmd1 = cmd + f' {service}'
        op1 = _exei(instance_id, cmd1)
        print(op1)
        if 'Up' not in op1:
            broken_services.append(service)
            _problem = True
    if broken_services:
        msg = f"Broken services: {broken_services}"
        msg_type = "ERROR"
    else:
        msg = "All the services are up and running"
        msg_type = "SUCCESS"
    context.log(msg_type, msg)
    print(msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
    app_is_up = not _problem
    # Skip the restart sub-task when every service is already up.
    context.skip_sub_tasks = app_is_up
- 3.1 Start the application
    import time
    op = _exei(instance_id, "cd /home/ubuntu/dagknows_src/app_docker_compose_build_deploy")
    cmd = 'make runstaging'
    op = _exei(instance_id, cmd)
    # Give the services time to come up before the next task runs.
    time.sleep(30)
    msg = "Restarting the application"
    msg_type = "INFO"
    print(msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
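The check in task 3 treats a service as healthy when its `docker-compose ps` output contains the string "Up". A minimal sketch of that decision, separated from command execution so it can be tested on its own, might look like this. The function name `find_broken_services` is hypothetical, and the sample outputs are illustrative, not captured from a real run.

```python
from typing import Dict, List


def find_broken_services(ps_output_by_service: Dict[str, str]) -> List[str]:
    """Return the services whose `ps` output does not report an 'Up' state."""
    broken = []
    for service, output in ps_output_by_service.items():
        if "Up" not in output:
            broken.append(service)
    return broken


# Illustrative outputs only; real `docker-compose ps` output has more columns.
sample = {
    "postgres": "app_postgres_1   Up",
    "nginx": "app_nginx_1   Exit 1",
    "dag": "",
}
broken = find_broken_services(sample)
print(broken)
```

A plain substring match is simple but coarse: it would also accept any output that happens to contain "Up" elsewhere, for example in a container name.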
- 4 Ensure that the application is reachable
Check whether the application answers on its port. If it does not, the sub-tasks check whether the docker compose file and the instance expose the relevant port.
    import requests
    port_str = str(int(port))
    url = f"https://staging.dagknows.com:{port_str}"
    unreachable = False
    try:
        resp = requests.get(url, timeout=5)
        # Anything other than HTTP 200 counts as unreachable.
        unreachable = (resp.status_code != 200)
    except requests.RequestException:
        unreachable = True
    if unreachable:
        msg = "Application unreachable!"
        msg_type = "ERROR"
    else:
        msg = "Application is reachable"
        msg_type = "SUCCESS"
    context.log(msg_type, msg)
    print(msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
    # Skip the port checks when the application is already reachable.
    context.skip_sub_tasks = not unreachable
- 4.1 Ensure the container port is exposed
First check whether the container port is exposed.
    import json
    service = "nginx"
    op = _exei(instance_id, "cd /home/ubuntu/dagknows_src/app_docker_compose_build_deploy")
    # First get the container associated with the service
    cmd = f"sudo docker-compose -f {docker_compose_file} ps -q {service}"
    op = _exei(instance_id, cmd)
    container_id = op.split('\n')[-2].strip()
    # Now inspect the container
    cmd = f"sudo docker inspect {container_id}"
    op1 = _exe(instance_id, cmd)
    jsn = json.loads(op1)
    #print(json.dumps(jsn, indent=4))
    port_str = f"{port}/tcp"
    # ExposedPorts can be absent or null when the container exposes nothing.
    exposed_ports = list((jsn[0]['Config'].get('ExposedPorts') or {}).keys())
    print(exposed_ports)
    if port_str in exposed_ports:
        port_is_exposed = True
        msg = f"Port : {port_str} is exposed by the service {service}"
        msg_type = "SUCCESS"
    else:
        port_is_exposed = False
        msg = f"Port : {port_str} is NOT exposed by the service {service}"
        msg_type = "ERROR"
    context.log(msg_type, msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
    context.skip_sub_tasks = port_is_exposed
- 4.1.1 Expose the container port
    # Here we need to edit the docker-compose file and expose the container port
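Task 4.1.1 only states that the docker-compose file needs editing. One possible shape for that edit is sketched below, assuming the compose file has already been parsed into a Python dict by a YAML library (loading and saving are not shown) and that publishing the port through the service's `ports` key is what is wanted. The function name `add_port_mapping` and the sample service definition are hypothetical.

```python
from typing import Any, Dict


def add_port_mapping(
    compose: Dict[str, Any], service: str, host_port: int, container_port: int
) -> bool:
    """Add "host:container" to the service's ports list; return True if changed."""
    svc = compose["services"][service]
    ports = svc.setdefault("ports", [])
    mapping = f"{host_port}:{container_port}"
    # Compare as strings so both "443:443" and unquoted YAML values match.
    if mapping in [str(p) for p in ports]:
        return False
    ports.append(mapping)
    return True


# Illustrative compose content, not the real staging file.
compose = {"services": {"nginx": {"image": "nginx"}}}
changed = add_port_mapping(compose, "nginx", 443, 443)
print(changed, compose["services"]["nginx"]["ports"])
```

Returning whether anything changed lets the caller decide if the application needs a restart. Note that the check in task 4.1 reads `Config.ExposedPorts`, so whether a `ports` entry or an `expose` entry is the right fix depends on how the service is meant to be reached.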
- 4.2 Ensure that the instance port is exposed
Check if the instance port is exposed in security groups. If not, modify the security group to expose the port.
    # Collect the IDs of all security groups attached to the instance.
    cmd = f"aws ec2 describe-instances --instance-ids {instance_id} --query 'Reservations[0].Instances[0].SecurityGroups[*].GroupId' --output text"
    op = _exe(None, cmd)
    security_group_list = op.strip().split()
    print(security_group_list)
- 4.2.1 Check if the instance port is open
    import json
    import copy
    # Compare ports as integers; the input may arrive as a string.
    port = int(port)
    sgl = copy.deepcopy(security_group_list)
    security_group_id = sgl[0]
    is_allowed = False
    for sg_id in security_group_list:
        cmd = f'aws ec2 describe-security-groups --filters Name=group-id,Values={sg_id} --query "SecurityGroups[*].IpPermissions[*]" --output json'
        op = _exe(None, cmd)
        json_op = json.loads(op)
        for sg in json_op:
            for rule in sg:
                if 'FromPort' in rule:
                    port_lo = int(rule['FromPort'])
                    port_hi = port_lo
                    if 'ToPort' in rule:
                        port_hi = int(rule['ToPort'])
                    # The port is open if it falls inside the rule's range.
                    if port_lo <= port <= port_hi:
                        is_allowed = True
    if not is_allowed:
        msg = f"The inbound port {port} is BLOCKED by security groups {security_group_list}"
        msg_type = "ERROR"
    else:
        msg = f"The inbound port {port} is ALLOWED by security groups {security_group_list}"
        msg_type = "SUCCESS"
    context.log(msg_type, msg)
    print(msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
    context.skip_sub_tasks = is_allowed
- 4.2.1.1 Expose an inbound port on an ec2 instance
    #port = int(port)
    # Open the port for inbound TCP traffic on the first security group.
    cmd = f"aws ec2 authorize-security-group-ingress --group-id {security_group_id} --protocol tcp --port {port} --cidr 0.0.0.0/0"
    op = _exe(None, cmd)
    msg = f"allowing port {port} in security group {security_group_id}"
    print(msg)
    msg_type = "SUCCESS"
    context.log(msg_type, msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg": msg, "msg_type": msg_type}
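The rule check in task 4.2.1 can be factored into a small function that works on the parsed `IpPermissions` list. The sketch below is one way to do that under stated assumptions: the name `port_is_allowed` is hypothetical, only TCP rules and "all traffic" rules (protocol `"-1"`, which carry no port range) are considered, and the source CIDR of each rule is ignored, as in the runbook's own check.

```python
from typing import Any, Dict, List


def port_is_allowed(ip_permissions: List[Dict[str, Any]], port: int) -> bool:
    """Return True if any inbound rule covers the given TCP port."""
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        # A protocol of "-1" means all traffic; such rules have no port range.
        if protocol == "-1":
            return True
        if protocol != "tcp" or "FromPort" not in rule:
            continue
        lo = int(rule["FromPort"])
        hi = int(rule.get("ToPort", lo))
        if lo <= port <= hi:
            return True
    return False


# Illustrative rules, not taken from a real security group.
rules = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22},
    {"IpProtocol": "tcp", "FromPort": 8000, "ToPort": 8100},
]
print(port_is_allowed(rules, 8080), port_is_allowed(rules, 443))
```

Because the source range is not inspected, a port can be reported as allowed even when it is open only to a narrow CIDR. Likewise, the fix in task 4.2.1.1 opens the port to `0.0.0.0/0`; a narrower `--cidr` value may be preferable where the clients are known.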