[aw-failures] [aw] PR Code Quality Reviewer fails on Copilot SDK session.idle timeout (870s) despite active agent execution #40418

@github-actions

Parent: #39883 (6h failure investigation report).

Problem statement

The PR Code Quality Reviewer agent job failed at Execute GitHub Copilot CLI after the Copilot SDK driver hung for 14m30s waiting for a session.idle event and hit the 870000 ms idle-timeout watchdog. The agent was actively executing tools (bash/grep/view) right up to the timeout — hasOutput=true, 77.3k tokens, 7 tool types, 1 turn — so this is a hang in the SDK idle-signal path, not an agent-logic failure.

Affected workflow and run IDs

  • Workflow: PR Code Quality Reviewer (.github/workflows/pr-code-quality-reviewer.lock.yml), engine copilot CLI, trigger pull_request, branch copilot/add-codemod-rename-allowed-team-members.
  • Failed run: §27852098051 — 2026-06-19 22:56 UTC. duration 19.4m, turns=1, 77.3k tokens, exit code 1.
  • Nearest successful baseline (same config, 12 min earlier): 27851753205 (success, audit classification stable). This points to an intermittent failure, not a config regression.

Evidence (agent-stdio.log)

[copilot-sdk-driver] [sdk-driver] error: Timeout after 870000ms waiting for session.idle
[copilot-harness] attempt 1: process closed exitCode=1 duration=14m 30s stdout=0B stderr=505092B hasOutput=true
[copilot-harness] attempt 1 failed: exitCode=1 failureClass=sdk_session_idle_timeout ... isSDKSessionIdleTimeoutError=true ... retriesRemaining=3
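For triage, the key=value fields in the harness lines above can be pulled out mechanically. The following is a minimal sketch, not part of the harness: the helper names `parse_harness_line` and `is_idle_timeout` are hypothetical, and the only assumption about the log format is the `key=value` shape visible in the excerpt. Values that contain spaces (such as `duration=14m 30s`) are truncated at the first space by this simple pattern.

```python
import re

# Matches key=value tokens such as exitCode=1 or failureClass=sdk_session_idle_timeout.
_KV = re.compile(r"(\w+)=(\S+)")


def parse_harness_line(line: str) -> dict:
    """Return the key=value fields found in one harness log line (hypothetical helper)."""
    return dict(_KV.findall(line))


def is_idle_timeout(fields: dict) -> bool:
    """True when the attempt was classified as an SDK session.idle timeout."""
    return (
        fields.get("failureClass") == "sdk_session_idle_timeout"
        or fields.get("isSDKSessionIdleTimeoutError") == "true"
    )


# Line copied from the evidence above, including its elisions.
line = (
    "[copilot-harness] attempt 1 failed: exitCode=1 "
    "failureClass=sdk_session_idle_timeout ... "
    "isSDKSessionIdleTimeoutError=true ... retriesRemaining=3"
)
fields = parse_harness_line(line)
print(fields["failureClass"], fields["retriesRemaining"], is_idle_timeout(fields))
```

A check like this separates this failure from #39946, where no classifier matches.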

Probable root cause

The SDK driver waits for a session.idle event that never arrives while the agent is mid-execution: the final tool calls complete but no idle signal is emitted, so the 870s watchdog fires and exits 1. This is distinct from #39946 (where all classifiers are false) — here the failure is explicitly classified sdk_session_idle_timeout. Likely a missed/dropped idle event in the SDK driver on long single-turn runs.

Proposed remediation

  1. Investigate why session.idle is not emitted after the final tool completes on long turns; resolve the driver on terminal tool completion, not only on an idle event.
  2. The attempt reports retriesRemaining=3 yet the run ended in failure — confirm the sdk_session_idle_timeout retry path actually re-runs the turn; if not, fix it.
  3. Preserve and surface the partial agent output on idle-timeout instead of discarding it.
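Remediation item 1 could be sketched roughly as follows. This is an illustration under stated assumptions, not the Copilot SDK driver's actual code: the `TurnWaiter` class, its callback names, and the `quiet_period_s` grace period are all hypothetical, and "terminal tool completion" is approximated here as "no tool in flight for a short quiet period". The wait resolves on whichever comes first, the idle event or tool quiescence, and only raises once the watchdog elapses.

```python
import asyncio
from typing import Optional


class TurnWaiter:
    """Hypothetical stand-in for the driver's wait logic; not the SDK's API."""

    def __init__(self, idle_timeout_s: float, quiet_period_s: float) -> None:
        self.idle_timeout_s = idle_timeout_s  # watchdog (870 s in the report)
        self.quiet_period_s = quiet_period_s  # assumed grace period after the last tool
        self._idle = asyncio.Event()
        self._in_flight = 0
        self._last_done: Optional[float] = None

    def on_tool_start(self) -> None:
        self._in_flight += 1

    def on_tool_complete(self) -> None:
        self._in_flight -= 1
        self._last_done = asyncio.get_running_loop().time()

    def on_session_idle(self) -> None:
        self._idle.set()

    async def wait(self, poll_s: float = 0.01) -> str:
        """Resolve on the idle event or on tool quiescence, whichever comes first."""
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self.idle_timeout_s
        while loop.time() < deadline:
            if self._idle.is_set():
                return "idle_event"
            quiet = (
                self._in_flight == 0
                and self._last_done is not None
                and loop.time() - self._last_done >= self.quiet_period_s
            )
            if quiet:
                return "tools_quiescent"
            await asyncio.sleep(poll_s)
        raise TimeoutError(
            f"Timeout after {int(self.idle_timeout_s * 1000)}ms waiting for session.idle"
        )


async def demo() -> str:
    waiter = TurnWaiter(idle_timeout_s=1.0, quiet_period_s=0.05)
    waiter.on_tool_start()
    waiter.on_tool_complete()  # final tool finishes; session.idle is never emitted
    return await waiter.wait()


result = asyncio.run(demo())
print(result)
```

A quiet period is only a proxy: a model that pauses between tool calls longer than `quiet_period_s` would be cut off early, so a real fix would more likely key off an explicit end-of-turn signal if the SDK exposes one.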

Success criteria / verification

Generated by 🔍 [aw] Failure Investigator (6h) · 172.3 AIC · ⌖ 11.7 AIC · ⊞ 4.9K

  • expires on Jun 26, 2026, 5:40 PM UTC-08:00

Closing — resolved (fresh evidence 2026-06-22)

PR Code Quality Reviewer ran green across the 6 most recent runs on 2026-06-22 (02:00–06:25Z, pull_request). The Copilot SDK session.idle 870s timeout no longer reproduces.

  1. Recent runs (6): all success, including 2026-06-22 06:25Z.
  2. No session.idle timeout observed in current run history.

Closing as fixed. Reopen if the idle-timeout reappears.

Auto-closed by 6h Failure Investigator after correlating fresh run history.

Generated by 🔍 [aw] Failure Investigator (6h) · 253.8 AIC · ⊞ 4.9K
