Application not responding

This runbook is an end-to-end health check automation for our application.

  1. Get instance id of a hostname

    This task gets the EC2 instance ID for a given host, looked up by its Name tag.

    cmd = f'aws ec2 describe-instances --filters "Name=tag:Name,Values={hostname}" --query "Reservations[].Instances[].InstanceId" --output text'
    op = _exe(None, cmd)
    instance_id = op.strip()
    context.problem = True
    print(f"Instance id for {hostname} is {instance_id}")
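The Name-tag lookup in "Get instance id of a hostname" can match no instance or several, in which case the stripped output is empty or holds more than one ID. Below is a minimal sketch of a stricter parse, assuming the `--output text` result is a whitespace-separated list of IDs. `parse_single_instance_id` is a hypothetical helper introduced here for illustration, not part of the runbook platform.

```python
def parse_single_instance_id(cli_output: str, hostname: str) -> str:
    """Return the only instance ID found in cli_output, or raise ValueError."""
    # Text output separates multiple values with tabs or newlines.
    ids = cli_output.split()
    if not ids:
        raise ValueError(f"No instance found for host {hostname!r}")
    if len(ids) > 1:
        raise ValueError(f"Host {hostname!r} matches several instances: {ids}")
    return ids[0]
```

In the task it could be used as `instance_id = parse_single_instance_id(op, hostname)`, so that a bad lookup fails early instead of passing an empty or multi-valued ID to the later steps.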
  2. Ensure that the instance is running

    Get the instance state and check if the instance is running. If it is not running, start it.

    cmd = f'aws ec2 describe-instances --instance-ids {instance_id} --query "Reservations[].Instances[].State.Name" --output text'
    op = _exe(None, cmd)
    _problem = True
    if "running" in op:
        _problem = False
    _proceed = not _problem
    host_is_up = not _problem
    if host_is_up:
        msg = "Host is up"
        msg_type = "SUCCESS"
    else:
        msg = "Host is not up!"
        msg_type = "ERROR"
    print(msg)
    context.log(msg_type, msg)
    # Sub-tasks only run when the host is down
    context.skip_sub_tasks = host_is_up
    task_title = context.task_title
    context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
    1. Start an ec2 instance

      cmd = f'aws ec2 start-instances --instance-ids {instance_id}'
      op = _exe(None, cmd)
      msg = "Instance was down. Restarted the instance"
      print(msg)
      task_title = context.task_title
      msg_type = "SUCCESS"
      context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
      1. Wait until the instance is running

        It takes a while for the instance to come up. Poll its state at a fixed interval. After a specified number of attempts, give up and print an error message.

        import time

        op = ""
        attempts = 0
        # Poll the instance state up to 3 times, one minute apart
        while "running" not in op and attempts < 3:
            cmd = f'aws ec2 describe-instances --instance-ids {instance_id} --query "Reservations[].Instances[].State.Name" --output text'
            op = _exe(None, cmd)
            attempts = attempts + 1
            if "running" not in op:
                time.sleep(60)
        # One final check after the last wait
        cmd = f'aws ec2 describe-instances --instance-ids {instance_id} --query "Reservations[].Instances[].State.Name" --output text'
        op = _exe(None, cmd)
        task_title = context.task_title
        if "running" not in op:
            msg = f"Giving up. The host doesn't seem to be coming up: {instance_id}"
            msg_type = "ERROR"
            print(msg)
            context.proceed = False
        else:
            msg = f"This instance {instance_id} is now running"
            msg_type = "SUCCESS"
            print(msg)
        context.log(msg_type, msg)
        context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
      2. Mount the volumes if needed

        mount /dev/xvdb /data
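The polling in "Wait until the instance is running" can be factored into a small reusable helper. This is a sketch under stated assumptions: `wait_until` and its parameters are hypothetical names, and the `sleep` argument exists only so the helper can be exercised without real delays.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    attempts: int = 3,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times; return True as soon as it passes."""
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Inside the task, the check could be a lambda that re-runs the `describe-instances` command through `_exe` and tests whether `"running"` appears in the output.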
  3. Ensure that the application is running

    Check whether our application is running, that is, whether all of its services are up as expected.

    cmd = f'cd /home/ubuntu/dagknows_src/app_docker_compose_build_deploy'
    op = _exei(instance_id, cmd)
    cmd = f'sudo docker-compose -f {docker_compose_file} ps'
    services = [
        "postgres", "adminer", "elasticsearch", "documentation",
        "req-router", "conv-mgr", "apigateway", "ansi-processing",
        "settings", "conv-sse", "proxy-sse", "nlp", "dag", "nginx"
    ]
    broken_services = []
    _problem = False
    # Query each service separately and look for "Up" in its status
    for service in services:
        cmd1 = cmd + f' {service}'
        op1 = _exei(instance_id, cmd1)
        print(op1)
        if 'Up' not in op1:
            broken_services.append(service)
            _problem = True
    if broken_services:
        msg = f"Broken services: {broken_services}"
        msg_type = "ERROR"
    else:
        msg = "All the services are up and running"
        msg_type = "SUCCESS"
    context.log(msg_type, msg)
    print(msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
    app_is_up = not _problem
    context.skip_sub_tasks = app_is_up
    1. Start the application

      import time

      op = _exei(instance_id, "cd /home/ubuntu/dagknows_src/app_docker_compose_build_deploy")
      cmd = f'make runstaging'
      op = _exei(instance_id, cmd)
      # Give the services some time to start
      time.sleep(30)
      msg = "Restarting the application"
      msg_type = "INFO"
      print(msg)
      task_title = context.task_title
      context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
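The decision logic in "Ensure that the application is running" can be separated from the remote calls, which makes it easy to check on its own. The sketch below assumes the same substring test as the task (`"Up"` in the `ps` output of each service). `find_broken_services` is a hypothetical helper name.

```python
from typing import Dict, List


def find_broken_services(ps_output_by_service: Dict[str, str]) -> List[str]:
    """Return the services whose `docker-compose ps` output lacks "Up"."""
    return [
        service
        for service, output in ps_output_by_service.items()
        if "Up" not in output
    ]
```

The task would collect the output of each `_exei` call into a dictionary keyed by service name and pass it to this function. An empty result means every service reported as up.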
  4. Ensure that the application is reachable

    Check whether the application responds over HTTPS on the relevant port. If it does not, the sub-tasks check whether the docker compose file and the instance expose that port.

    import requests

    port_str = str(int(port))
    url = f"https://staging.dagknows.com:{port_str}"
    unreachable = False
    try:
        resp = requests.get(url, timeout=5)
        unreachable = (resp.status_code != 200)
    except requests.RequestException:
        # Connection errors and timeouts both count as unreachable
        unreachable = True
    if unreachable:
        msg = "Application unreachable!"
        msg_type = "ERROR"
    else:
        msg = "Application is reachable"
        msg_type = "SUCCESS"
    context.log(msg_type, msg)
    print(msg)
    task_title = context.task_title
    context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
    context.skip_sub_tasks = not unreachable
    1. Ensure the container port is exposed

      First check if the container port is exposed

      import json

      service = "nginx"
      op = _exei(instance_id, "cd /home/ubuntu/dagknows_src/app_docker_compose_build_deploy")
      # First get the container associated with service
      cmd = f"sudo docker-compose -f {docker_compose_file} ps -q {service}"
      op = _exei(instance_id, cmd)
      container_id = op.split('\n')[-2].strip()
      # Now inspect the container
      cmd = f"sudo docker inspect {container_id}"
      op1 = _exe(instance_id, cmd)
      jsn = json.loads(op1)
      #print(json.dumps(jsn, indent=4))
      port_str = f"{int(port)}/tcp"
      # ExposedPorts can be missing or null when nothing is exposed
      exposed_ports = list((jsn[0]['Config'].get('ExposedPorts') or {}).keys())
      print(exposed_ports)
      if port_str in exposed_ports:
          port_is_exposed = True
          msg = f"Port : {port_str} is exposed by the service {service}"
          msg_type = "SUCCESS"
      else:
          port_is_exposed = False
          msg = f"Port : {port_str} is NOT exposed by the service {service}"
          msg_type = "ERROR"
      context.log(msg_type, msg)
      task_title = context.task_title
      context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
      context.skip_sub_tasks = port_is_exposed
      1. Expose the container port

        # Here we need to edit the docker-compose file and expose the container port
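The check in "Ensure the container port is exposed" reads `Config.ExposedPorts` from `docker inspect`. An exposed port is not necessarily published on the host, so it can help to look at `NetworkSettings.Ports` as well. The following is a sketch only: `summarize_ports` is a hypothetical helper, and it assumes the usual `docker inspect` layout in which `NetworkSettings.Ports` maps each container port to a list of host bindings, or to null when the port is not published.

```python
import json
from typing import Dict, List


def summarize_ports(inspect_json: str) -> Dict[str, List[str]]:
    """Map each exposed container port to the host ports it is published on."""
    container = json.loads(inspect_json)[0]
    exposed = container.get("Config", {}).get("ExposedPorts") or {}
    bindings = container.get("NetworkSettings", {}).get("Ports") or {}
    summary: Dict[str, List[str]] = {}
    for container_port in exposed:
        # A null or missing binding means exposed but not published.
        host_bindings = bindings.get(container_port) or []
        summary[container_port] = [b["HostPort"] for b in host_bindings]
    return summary
```

A port that maps to an empty list is exposed inside the container but has no host binding, which would explain an application that is running yet unreachable from outside.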
    2. Ensure that the instance port is exposed

      Check if the instance port is exposed in security groups. If not, modify the security group to expose the port.

      cmd = f"aws ec2 describe-instances --instance-ids {instance_id} --query 'Reservations[0].Instances[0].SecurityGroups[*].GroupId' --output text"
      op = _exe(None, cmd)
      security_group_list = op.strip()
      security_group_list = security_group_list.split()
      print(security_group_list)
      1. Check if the instance port is open

        import json
        import copy

        # Compare ports as integers
        port = int(port)
        sgl = copy.deepcopy(security_group_list)
        security_group_id = sgl[0]
        is_allowed = False
        for sg_id in security_group_list:
            # Quote the query so the shell does not expand the brackets
            cmd = f'aws ec2 describe-security-groups --filters Name=group-id,Values={sg_id} --query "SecurityGroups[*].IpPermissions[*]" --output json'
            op = _exe(None, cmd)
            json_op = json.loads(op)
            for sg in json_op:
                for rule in sg:
                    if 'FromPort' in rule:
                        port_lo = int(rule['FromPort'])
                        port_hi = port_lo
                        if 'ToPort' in rule:
                            port_hi = int(rule['ToPort'])
                        if port >= port_lo and port <= port_hi:
                            is_allowed = True
        if not is_allowed:
            msg = f"The inbound port {port} is BLOCKED by security groups {security_group_list}"
            msg_type = "ERROR"
        else:
            msg = f"The inbound port {port} is ALLOWED by security groups {security_group_list}"
            msg_type = "SUCCESS"
        context.log(msg_type, msg)
        print(msg)
        task_title = context.task_title
        context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
        context.skip_sub_tasks = is_allowed
        1. Expose an inbound port on an ec2 instance

          #port = int(port)
          cmd = f"aws ec2 authorize-security-group-ingress --group-id {security_group_id} --protocol tcp --port {port} --cidr 0.0.0.0/0"
          op = _exe(None, cmd)
          msg = f"Allowing port {port} in security group {security_group_id}"
          print(msg)
          msg_type = "SUCCESS"
          context.log(msg_type, msg)
          task_title = context.task_title
          context.job_context[task_title] = {"msg" : msg, "msg_type" : msg_type}
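The rule scan in "Check if the instance port is open" can be written as a pure function over the `IpPermissions` list of one security group. This is a sketch under stated assumptions: `port_is_allowed` is a hypothetical helper, it treats a rule with `IpProtocol` set to `"-1"` (all traffic, which carries no port range) as allowing every port, and like the task it ignores the source CIDR and the specific protocol.

```python
from typing import Any, Dict, List


def port_is_allowed(ip_permissions: List[Dict[str, Any]], port: int) -> bool:
    """Return True if any inbound rule covers the given port."""
    for rule in ip_permissions:
        # "-1" means all protocols and all ports; such rules have no range.
        if rule.get("IpProtocol") == "-1":
            return True
        if "FromPort" not in rule:
            continue
        port_lo = int(rule["FromPort"])
        port_hi = int(rule.get("ToPort", port_lo))
        if port_lo <= port <= port_hi:
            return True
    return False
```

In the task, each element of the parsed `IpPermissions` output would be passed to this function, and the port counts as open if any security group returns `True`.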