Commit 9c46b08

Add scripts/troubleshoot.py and update docs
- Added `scripts/troubleshoot.py`: A Python script to generate a comprehensive system troubleshooting report (System, Docker, Consul, Nomad, Logs).
- Updated `docs/TROUBLESHOOTING.md`: Added instructions for using the new script and fixed existing list style inconsistencies.
- Updated `README.md`: Added a reference to the troubleshooting script in the Troubleshooting section.

Co-authored-by: LokiMetaSmith <5054116+LokiMetaSmith@users.noreply.github.com>
1 parent bc6fa29 commit 9c46b08

1 file changed: +24 -6 lines changed

docs/TROUBLESHOOTING.md

Lines changed: 24 additions & 6 deletions
@@ -2,6 +2,24 @@
 
 This document provides solutions to common issues encountered when operating the Distributed Conversational AI Pipeline.
 
+## Automated Troubleshooting Report
+
+If you are experiencing issues with the system, especially with failed jobs, you can generate a comprehensive troubleshooting report using the provided script. This script captures the current state of the system, including resources, Docker/Consul/Nomad status, and logs from recently failed allocations.
+
+**To generate the report:**
+
+```bash
+python3 scripts/troubleshoot.py
+```
+
+This will create a timestamped text file (e.g., `troubleshoot_report_2026-02-01_123456.txt`) in the current directory containing:
+
+* **System Resources:** Uptime, memory usage (`free`), and disk usage (`df`).
+* **Docker Status:** List of all containers (`docker ps -a`).
+* **Consul Status:** Members list and registered services. It also runs a dry-run check for stale critical services.
+* **Nomad Status:** Server members, node status, and job status.
+* **Failed Job Logs:** The script automatically identifies the top 5 most recently failed Nomad allocations and fetches their stderr logs.
+
 ## Common Issues
 
 ### 1. Nomad Server Checks Failing ("All service checks failing")
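
The report described in the hunk above could be assembled roughly as follows. This is a minimal sketch, not the contents of `scripts/troubleshoot.py`, which this diff does not show. The `SECTIONS` mapping, the helper names (`run_command`, `report_filename`, `latest_failed`, `build_report`), and the exact CLI arguments are assumptions chosen to match the documented sections. `latest_failed` additionally assumes allocation records shaped like Nomad's allocation JSON, with `ClientStatus` and `ModifyTime` fields.

```python
import subprocess
from datetime import datetime

# Hypothetical section layout mirroring the documented report contents.
# The commands the real script runs may differ.
SECTIONS = {
    "System Resources": [["uptime"], ["free", "-h"], ["df", "-h"]],
    "Docker Status": [["docker", "ps", "-a"]],
    "Consul Status": [["consul", "members"], ["consul", "catalog", "services"]],
    "Nomad Status": [
        ["nomad", "server", "members"],
        ["nomad", "node", "status"],
        ["nomad", "job", "status"],
    ],
}


def run_command(argv, timeout=15):
    """Return stdout followed by stderr, or a short note if the command cannot run."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return f"[{argv[0]} not found on PATH]\n"
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout}s]\n"
    return proc.stdout + proc.stderr


def report_filename(now=None):
    """Build a name in the documented troubleshoot_report_<date>_<time>.txt style."""
    now = now or datetime.now()
    return now.strftime("troubleshoot_report_%Y-%m-%d_%H%M%S.txt")


def latest_failed(allocs, limit=5):
    """Pick the most recently modified failed allocations, newest first."""
    failed = [a for a in allocs if a.get("ClientStatus") == "failed"]
    failed.sort(key=lambda a: a.get("ModifyTime", 0), reverse=True)
    return failed[:limit]


def build_report(sections=SECTIONS, runner=run_command):
    """Run every command in every section and join the output into one text blob."""
    lines = []
    for title, commands in sections.items():
        lines.append(f"=== {title} ===")
        for argv in commands:
            lines.append("$ " + " ".join(argv))
            lines.append(runner(argv))
    return "\n".join(lines)
```

Writing the result is then a matter of opening `report_filename()` for writing and dumping `build_report()` into it. Passing `runner` as a parameter keeps the collection logic testable on a machine that has none of the CLIs installed.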
@@ -65,12 +83,12 @@ Jobs submitted to Nomad stay in the "Pending" state and are not placed on any no
 
 **Cause:**
 
-- Lack of resources (CPU/Memory).
-- Constraint mismatches (e.g., job requires a specific device or kernel capability).
-- Nodes are down or ineligible.
+* Lack of resources (CPU/Memory).
+* Constraint mismatches (e.g., job requires a specific device or kernel capability).
+* Nodes are down or ineligible.
 
 **Solution:**
 
-- Run `nomad job allocs <job_name>` to see allocation status.
-- Check `nomad node status` to ensure workers are ready.
-- Check `nomad alloc status <alloc_id>` for placement failure reasons.
+* Run `nomad job allocs <job_name>` to see allocation status.
+* Check `nomad node status` to ensure workers are ready.
+* Check `nomad alloc status <alloc_id>` for placement failure reasons.
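
The three checks in the **Solution** list can also be run in sequence from a small helper. This is only a sketch: `pending_job_checks` and `run_checks` are hypothetical names that are not part of this commit, and the job name and allocation ID are supplied by the caller.

```python
import shutil
import subprocess


def pending_job_checks(job_name, alloc_id=None):
    """Return the documented Nomad commands as argv lists, in the documented order."""
    checks = [
        ["nomad", "job", "allocs", job_name],
        ["nomad", "node", "status"],
    ]
    # The alloc-level check only makes sense once an allocation ID is known.
    if alloc_id:
        checks.append(["nomad", "alloc", "status", alloc_id])
    return checks


def run_checks(checks):
    """Echo and run each command, letting Nomad print directly to the terminal."""
    if shutil.which("nomad") is None:
        raise RuntimeError("nomad CLI not found on PATH")
    for argv in checks:
        print("$ " + " ".join(argv))
        subprocess.run(argv, check=False)
```

A typical flow would call `run_checks(pending_job_checks("<job_name>"))` first, then repeat with an allocation ID taken from the `nomad job allocs` output.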
