|
2 | 2 |
|
3 | 3 | This document provides solutions to common issues encountered when operating the Distributed Conversational AI Pipeline. |
4 | 4 |
|
| 5 | +## Automated Troubleshooting Report |
| 6 | + |
| 7 | +If the system is misbehaving, especially when jobs are failing, you can generate a troubleshooting report with the provided script. The script captures the current state of the system: resource usage, Docker, Consul, and Nomad status, and logs from recently failed allocations. |
| 8 | + |
| 9 | +**To generate the report:** |
| 10 | + |
| 11 | +```bash |
| 12 | +python3 scripts/troubleshoot.py |
| 13 | +``` |
| 14 | + |
| 15 | +This creates a timestamped text file (e.g., `troubleshoot_report_2026-02-01_123456.txt`) in the current directory. The report contains: |
| 16 | + |
| 17 | +* **System Resources:** Uptime, memory usage (`free`), and disk usage (`df`). |
| 18 | +* **Docker Status:** List of all containers (`docker ps -a`). |
| 19 | +* **Consul Status:** Members list and registered services, plus a dry-run check for stale critical services. |
| 20 | +* **Nomad Status:** Server members, node status, and job status. |
| 21 | +* **Failed Job Logs:** The script identifies the five most recently failed Nomad allocations and fetches their stderr logs. |
| 22 | + |
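The "Failed Job Logs" step can be pictured with a minimal sketch. This is not the code in `scripts/troubleshoot.py`; the helper names `most_recent_failed` and `stderr_command` are hypothetical, and the field names (`ID`, `ClientStatus`, `ModifyTime`) are assumed to match the allocation list that Nomad returns as JSON.

```python
def most_recent_failed(allocs, limit=5):
    """Return up to `limit` failed allocations, newest first.

    `allocs` is a list of dicts shaped like Nomad allocation stubs.
    The field names used here (ID, ClientStatus, ModifyTime) are assumptions.
    """
    failed = [a for a in allocs if a.get("ClientStatus") == "failed"]
    # ModifyTime is treated as a sortable timestamp; larger means more recent.
    failed.sort(key=lambda a: a.get("ModifyTime", 0), reverse=True)
    return failed[:limit]


def stderr_command(alloc):
    """Build the CLI call that would fetch the stderr log of one allocation."""
    return ["nomad", "alloc", "logs", "-stderr", alloc["ID"]]
```

Each command list returned by `stderr_command` could be passed to `subprocess.run` to collect the log text. Keeping the selection logic separate from the CLI call makes it possible to check the ordering without a running cluster.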
5 | 23 | ## Common Issues |
6 | 24 |
|
7 | 25 | ### 1. Nomad Server Checks Failing ("All service checks failing") |
@@ -65,12 +83,12 @@ Jobs submitted to Nomad stay in the "Pending" state and are not placed on any no |
65 | 83 |
|
66 | 84 | **Cause:** |
67 | 85 |
|
68 | | -- Lack of resources (CPU/Memory). |
69 | | -- Constraint mismatches (e.g., job requires a specific device or kernel capability). |
70 | | -- Nodes are down or ineligible. |
| 86 | +* Lack of resources (CPU/Memory). |
| 87 | +* Constraint mismatches (e.g., job requires a specific device or kernel capability). |
| 88 | +* Nodes are down or ineligible. |
71 | 89 |
|
72 | 90 | **Solution:** |
73 | 91 |
|
74 | | -- Run `nomad job allocs <job_name>` to see allocation status. |
75 | | -- Check `nomad node status` to ensure workers are ready. |
76 | | -- Check `nomad alloc status <alloc_id>` for placement failure reasons. |
| 92 | +* Run `nomad job allocs <job_name>` to see allocation status. |
| 93 | +* Check `nomad node status` to ensure workers are ready. |
| 94 | +* Check `nomad alloc status <alloc_id>` for placement failure reasons. |
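
The node check above can also be scripted. The following is a rough sketch, assuming the node list is available as JSON (for example from Nomad's node listing) with `Name`, `ID`, `Status`, and `SchedulingEligibility` fields; the function name `unschedulable_nodes` is hypothetical and not part of this repository.

```python
def unschedulable_nodes(nodes):
    """Return (name, reason) pairs for nodes that cannot accept new work.

    `nodes` is a list of dicts; the field names are assumptions about the
    JSON shape of Nomad's node listing.
    """
    problems = []
    for node in nodes:
        name = node.get("Name", node.get("ID", "<unknown>"))
        status = node.get("Status")
        eligibility = node.get("SchedulingEligibility")
        if status != "ready":
            # Covers down or still-initializing nodes.
            problems.append((name, f"status is {status}"))
        elif eligibility != "eligible":
            # Node is up but has been marked ineligible for scheduling.
            problems.append((name, f"scheduling eligibility is {eligibility}"))
    return problems
```

An empty result suggests that the pending job is blocked by resources or constraints rather than by node availability, in which case the placement failure reasons from `nomad alloc status <alloc_id>` are the next place to look.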