Add a Copilot SDK idle/turn watchdog that aborts (or recovers) a stalled session well before the 25-min action timeout — Formal Spec Verifier hangs after report_intent and has produced nothing for 5 straight days.
Problem statement
The Copilot CLI agent stalls early in its first turn and goes completely silent until the 25-minute action wrapper kills the step. No safe-output is produced and no work completes — a genuine mid-task hang, not a post-completion false-failure.
Affected workflow and run IDs
- Daily Formal Spec Verifier (
.github/workflows/daily-formal-spec-verifier.lock.yml, Copilot engine, model claude-sonnet-4.6) — 5/5 consecutive scheduled failures.
- Representative (in-window): §27504260264 (2026-06-14T15:58Z).
- Prior streak: §27471596105, §27428819181, §27362596806, §27290883003 (2026-06-10 → 06-13).
- Last success comparator: §27220052669 (2026-06-09).
Root cause
The agent ran a few tool calls in its first ~3 minutes, then stopped all activity for ~24 minutes until the action timeout fired:
16:01:23Z tool.execution_complete report_intent success=true
... (no further tool calls, no inference progress for ~24 min) ...
16:25:49Z Process exiting with code: 1
16:25:49Z ##[error]The action 'Execute GitHub Copilot CLI' has timed out after 25 minutes.
The Copilot SDK driver idle-hung after report_intent and never advanced; audit-diff vs the last success confirms a single 29m41s turn with only 2,430 output tokens / 14 bash calls / 67.2 AIC consumed and no firewall anomalies — i.e. the run burned a full action slot producing nothing. This is the same idle-hang family as #39200 (Dictation Prompt Generator session.idle timeout) but distinct: there the agent finished its work and created the PR before hanging, whereas here it stalls mid-task with no safe-output, so it is a true outage, not a false-failure.
Proposed remediation
- Add an inference/turn-level idle watchdog in the Copilot SDK driver that detects no tool-call or token progress for N minutes and aborts (or restarts the turn) with a classified flag (e.g.
GH_AW_AGENTIC_ENGINE_IDLE_HANG), rather than letting the 25-min action wrapper be the only backstop.
- Investigate why the SDK driver stalls immediately after
report_intent on this workflow (possible deadlock between report_intent handling and the next inference request).
- As a stopgap, lower this workflow's per-action timeout so a hung run fails fast and frees the runner sooner.
Success criteria / verification
- Daily Formal Spec Verifier completes (emits its
create_issue/noop safe-output) for ≥3 consecutive scheduled runs.
- No 24-minute post-
report_intent silence followed by a 25-min action timeout.
- An idle-hang, if it recurs, is surfaced as a dedicated classified flag rather than a generic engine timeout.
Parent: #29109. Analyzed runs: 27504260264 (audit), 27220052669 (audit-diff baseline).
Related to #29109
Generated by 🔍 [aw] Failure Investigator (6h) · 457 AIC · ⌖ 12 AIC · ⊞ 4.5K · ◷
Add a Copilot SDK idle/turn watchdog that aborts (or recovers) a stalled session well before the 25-min action timeout — Formal Spec Verifier hangs after
report_intentand has produced nothing for 5 straight days.Problem statement
The Copilot CLI agent stalls early in its first turn and goes completely silent until the 25-minute action wrapper kills the step. No safe-output is produced and no work completes — a genuine mid-task hang, not a post-completion false-failure.
Affected workflow and run IDs
.github/workflows/daily-formal-spec-verifier.lock.yml, Copilot engine, modelclaude-sonnet-4.6) — 5/5 consecutive scheduled failures.Root cause
The agent ran a few tool calls in its first ~3 minutes, then stopped all activity for ~24 minutes until the action timeout fired:
The Copilot SDK driver idle-hung after
report_intentand never advanced;audit-diffvs the last success confirms a single 29m41s turn with only 2,430 output tokens / 14 bash calls / 67.2 AIC consumed and no firewall anomalies — i.e. the run burned a full action slot producing nothing. This is the same idle-hang family as #39200 (Dictation Prompt Generatorsession.idletimeout) but distinct: there the agent finished its work and created the PR before hanging, whereas here it stalls mid-task with no safe-output, so it is a true outage, not a false-failure.Proposed remediation
GH_AW_AGENTIC_ENGINE_IDLE_HANG), rather than letting the 25-min action wrapper be the only backstop.report_intenton this workflow (possible deadlock betweenreport_intenthandling and the next inference request).Success criteria / verification
create_issue/noopsafe-output) for ≥3 consecutive scheduled runs.report_intentsilence followed by a 25-min action timeout.Parent: #29109. Analyzed runs: 27504260264 (audit), 27220052669 (audit-diff baseline).
Related to #29109