AI agent session 2024-09-12T09:36:14-07:00 by Sarang Dharmapurikar
Check if the node went down
For nodes that are in a NotReady state, checks which of them went down by checking their node IPs.
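The tasks below iterate over a `problem_nodes` list whose entries carry `nodename`, `nodeip`, and (in some tasks) `instance_id` keys, but the runbook does not show how that list is built. A minimal sketch, assuming it is parsed from `kubectl get nodes -o wide` output; the `find_not_ready_nodes` helper and the sample output are hypothetical:

```python
# Hypothetical helper: build the problem_nodes list used by the tasks
# below from `kubectl get nodes -o wide` output. Column positions follow
# kubectl's default wide layout (NAME, STATUS, ..., INTERNAL-IP at index 5).
def find_not_ready_nodes(kubectl_output):
    problem_nodes = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        name, status, internal_ip = fields[0], fields[1], fields[5]
        if status != "Ready":
            problem_nodes.append({"nodename": name, "nodeip": internal_ip})
    return problem_nodes

# Abbreviated sample output, for illustration only.
sample = """\
NAME     STATUS     ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
node-a   Ready      control-plane   10d   v1.30.0   10.0.1.10     <none>        linux      6.8.0            containerd://1.7
node-b   NotReady   <none>          10d   v1.30.0   10.0.1.11     <none>        linux      6.8.0            containerd://1.7
"""
print(find_not_ready_nodes(sample))  # → [{'nodename': 'node-b', 'nodeip': '10.0.1.11'}]
```

The `instance_id` values used by tasks 1.1.1 and 1.1.2 would need a separate lookup (e.g. `aws ec2 describe-instances` filtered by private IP).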
1 Kubelet issue

Runs checks to see whether node readiness issues are due to the kubelet. Common kubelet issues include unresponsive kubelets, dead kubelets, and kubelets that have exited shortly after starting.
1.1 Kubelet stopped posting node status

Returns nodes where the kubelet has stopped posting status updates.
    # Describe each problem node and collect those whose kubelet has
    # stopped posting status; skip the sub-tasks if none are found.
    context.skip_sub_tasks = True
    no_status_nodes = []
    for node in problem_nodes:
        nodename = node['nodename']
        cmd = f"kubectl describe node {nodename}"
        op = _exe(master_ip, cmd)
        if "kubelet stopped posting node status" in op.lower():
            context.skip_sub_tasks = False
            no_status_nodes.append(node)
    print(no_status_nodes)
1.1.1 Kubelet Dead

Identifies nodes where the kubelet is in an inactive or dead state.
    # On each no-status node, check the kubelet service's "Active:" line
    # for an "inactive (dead)" state.
    dead_kubelet_nodes = []
    for node in no_status_nodes:
        instance_id = node['instance_id']
        cmd = "sudo systemctl status kubelet | grep 'Active:'"
        op = _exe(instance_id, cmd)
        if "inactive (dead)" in op:
            dead_kubelet_nodes.append(node)
    context.skip_sub_tasks = len(dead_kubelet_nodes) == 0
    print(dead_kubelet_nodes)
1.1.2 Kubelet exited

Identifies nodes where the kubelet is in an "active (exited)" state.
    # Same check as 1.1.1, but looking for an "active (exited)" state.
    exited_kubelet_nodes = []
    for node in no_status_nodes:
        instance_id = node['instance_id']
        cmd = "sudo systemctl status kubelet | grep 'Active:'"
        op = _exe(instance_id, cmd)
        if "active (exited)" in op:
            exited_kubelet_nodes.append(node)
    context.skip_sub_tasks = len(exited_kubelet_nodes) == 0
    print(exited_kubelet_nodes)
1.1.3 Worker kubelet unable to update master

Gets the master instance IP addresses in the cluster and uses them to perform connectivity checks.
    # Check each no-status node's kubelet status for API-server registration
    # failures, then resolve the master's EC2 instance ID from its IP address.
    for node in no_status_nodes:
        nodeip = node['nodeip']
        cmd = "sudo systemctl status kubelet"
        op = _exe(nodeip, cmd)
        if "Unable to register node with API server" in op:
            pass  # connectivity is checked in the sub-task below
    master_ip_addr = _get_ip_addr(master_ip)
    cmd1 = f'aws ec2 describe-instances --filters "Name=ip-address,Values={master_ip_addr}" --query "Reservations[*].Instances[*].InstanceId" --output text'
    master_instance_ids = [_exe(None, cmd1).strip()]
    print(master_instance_ids)
1.1.3.1 Security group blocked the API request

Looks at the security groups of the master instances (by their IDs) to check for port configuration mismatches preventing connectivity.
    import json
    import re

    # `port` (the API server port, typically 6443) is expected as a task input.
    for instance_id in master_instance_ids:
        _problem = True
        cmd = f"aws ec2 describe-instances --instance-ids {instance_id} --query 'Reservations[*].Instances[*].SecurityGroups[*].GroupId' --output=text"
        sg_ids1 = _exe(None, cmd)
        print(sg_ids1)
        sg_ids = re.split(r'\s', sg_ids1.strip())
        for sg_id in sg_ids:
            if not sg_id:
                continue
            cmd1 = ('aws ec2 describe-security-groups --filter Name=group-id,Values=' + sg_id
                    + ' --query SecurityGroups[*].IpPermissions[*]')
            op = _exe(None, cmd1)
            json_op = json.loads(op)
            for sg in json_op:
                for rule in sg:
                    if 'FromPort' not in rule:
                        continue
                    port_lo = int(rule['FromPort'])
                    port_hi = int(rule.get('ToPort', port_lo))
                    if port_lo <= port <= port_hi:
                        _problem = False  # an ingress rule covers the port
            if _problem:
                break
        if _problem:
            context.log("ERROR", "Found problem")
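The nested rule walk above boils down to one question: does any ingress rule's FromPort–ToPort range cover the API server port? A self-contained sketch of that check, with hypothetical sample rules (real ones come from `aws ec2 describe-security-groups`):

```python
def port_allowed(ip_permissions, port):
    """True if any ingress rule's FromPort..ToPort range covers `port`."""
    for rule in ip_permissions:
        if "FromPort" not in rule:
            continue  # e.g. an all-traffic rule may omit port bounds
        lo = int(rule["FromPort"])
        hi = int(rule.get("ToPort", lo))
        if lo <= port <= hi:
            return True
    return False

# Hypothetical rules, shaped like SecurityGroups[*].IpPermissions[*] entries.
rules = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22},
    {"IpProtocol": "tcp", "FromPort": 6443, "ToPort": 6443},
]
print(port_allowed(rules, 6443))  # → True
print(port_allowed(rules, 80))    # → False
```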
2 Kube-proxy issue

2.1 kube-proxy not running
Checks whether each problem node still has a Running kube-proxy pod.

    # List Running kube-proxy pods and flag problem nodes whose IP does
    # not appear in the output.
    cmd = "kubectl get pods -n kube-system -o wide | grep kube-proxy | grep Running"
    op = _exe(master_ip, cmd)
    kube_proxy_down_nodes = []
    for node in problem_nodes:
        nodeip = node['nodeip']
        if nodeip not in op:
            context.log("ERROR", "Found problem node. nodeip missing")
            kube_proxy_down_nodes.append(node)
    print("Down Nodes: ", kube_proxy_down_nodes)
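One hazard in the substring test above: `nodeip not in op` can false-match when one node IP is a prefix of another (e.g. `10.0.1.1` is a substring of `10.0.1.10`). A sketch that parses the `-o wide` columns exactly instead; the function name and sample output are hypothetical:

```python
def nodes_missing_kube_proxy(pods_wide_output, problem_nodes):
    # Collect IPs of Running kube-proxy pods; kube-proxy runs on the host
    # network, so its pod IP equals the node IP in the `-o wide` IP column.
    running_ips = set()
    for line in pods_wide_output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("kube-proxy") and fields[2] == "Running":
            running_ips.add(fields[5])
    return [n for n in problem_nodes if n["nodeip"] not in running_ips]

# Abbreviated sample of `kubectl get pods -n kube-system -o wide`.
sample = """\
NAME               READY   STATUS             RESTARTS   AGE   IP          NODE
kube-proxy-abc12   1/1     Running            0          10d   10.0.1.10   node-a
kube-proxy-def34   1/1     CrashLoopBackOff   5          10d   10.0.1.11   node-b
"""
print(nodes_missing_kube_proxy(sample, [{"nodename": "node-b", "nodeip": "10.0.1.11"}]))
```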
3 System resources issues

3.1 MemoryPressure

Checks whether any of the nodes are constrained on memory (experiencing MemoryPressure).
    # Grep each node's MemoryPressure condition and collect nodes where
    # its status is True.
    mem_pressure_nodes = []
    for node in problem_nodes:
        nodename = node['nodename']
        cmd = f"kubectl describe node {nodename} | grep MemoryPressure"
        op = _exe(master_ip, cmd)
        mem_pressure = op.split()[1]
        if mem_pressure.lower() == "true":
            mem_pressure_nodes.append(node)
    if mem_pressure_nodes:
        context.log("ERROR", "Mem Pressure Reached")
    print("Mem Pressure Nodes: ", mem_pressure_nodes)
3.2 DiskPressure

Checks whether any of the nodes are constrained on disk space (experiencing DiskPressure).
    # Same pattern as 3.1, for the DiskPressure condition.
    disk_pressure_nodes = []
    for node in problem_nodes:
        nodename = node['nodename']
        cmd = f"kubectl describe node {nodename} | grep DiskPressure"
        op = _exe(master_ip, cmd)
        disk_pressure = op.split()[1]
        if disk_pressure.lower() == "true":
            disk_pressure_nodes.append(node)
    if disk_pressure_nodes:
        context.log("ERROR", "Disk Pressure Reached")
    print("Disk Pressure Nodes: ", disk_pressure_nodes)
3.3 PIDPressure

Checks whether any of the nodes are constrained on the number of processes (experiencing PIDPressure).
    # Same pattern as 3.1, for the PIDPressure condition.
    pid_pressure_nodes = []
    for node in problem_nodes:
        nodename = node['nodename']
        cmd = f"kubectl describe node {nodename} | grep PIDPressure"
        op = _exe(master_ip, cmd)
        pid_pressure = op.split()[1]
        if pid_pressure.lower() == "true":
            pid_pressure_nodes.append(node)
    if pid_pressure_nodes:
        context.log("ERROR", "PID Pressure Reached")
    print("PID Pressure Nodes: ", pid_pressure_nodes)
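Tasks 3.1–3.3 repeat the same grep-and-split pattern, and `op.split()[1]` assumes the status is the second column of the matched line. A combined sketch that parses the Conditions block of a single `kubectl describe node` call for all three pressures at once; the sample output is abbreviated and hypothetical:

```python
def pressure_conditions(describe_output):
    """Map each pressure condition name to True/False from `kubectl describe node` output."""
    result = {}
    for line in describe_output.splitlines():
        fields = line.split()
        if fields and fields[0] in ("MemoryPressure", "DiskPressure", "PIDPressure"):
            result[fields[0]] = fields[1] == "True"
    return result

# Abbreviated Conditions block from `kubectl describe node`.
sample = """\
Conditions:
  Type             Status  Reason
  MemoryPressure   True    KubeletHasInsufficientMemory
  DiskPressure     False   KubeletHasNoDiskPressure
  PIDPressure      False   KubeletHasSufficientPID
"""
print(pressure_conditions(sample))
# → {'MemoryPressure': True, 'DiskPressure': False, 'PIDPressure': False}
```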