Make the "Fetch open non-draft PR queue" pre-fetch step tolerate transient GitHub GraphQL errors. A single 502 Bad Gateway from api.github.com/graphql currently fails the entire workflow before the agent runs. PR Sous Chef failed twice in the last 6h window (2/2 scheduled runs), both times before the agent executed (0 tokens consumed), purely because a transient GitHub API blip aborted the deterministic pre-fetch shell step.
The "Fetch open non-draft PR queue" step runs under bash -e and calls gh api .../graphql to build the candidate PR queue. When GitHub returns a transient "HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)", the unguarded gh invocation exits non-zero, bash -e propagates it, and the step ends with "##[error]Process completed with exit code 1". The agent job is marked failure even though nothing in the workflow logic or model was at fault: the agent never started.
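The abort described above can be reproduced without touching GitHub. The sketch below is an illustration only: fake_gh_502 is a hypothetical stand-in for the real gh call, and the one-line script stands in for the step body, which this report does not show.

```shell
#!/usr/bin/env bash
# Hypothetical miniature of the step: a function that mimics the failing
# gh call (it prints the error text quoted in this report and returns 1),
# followed by the work that should happen after the queue is fetched.
script='
fake_gh_502() {
  echo "HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)" >&2
  return 1
}
fake_gh_502            # unguarded call: errexit stops the script here
echo "queue built"     # never reached
'

# Run it in a child shell with -e, as the step is described as doing.
status=0
output=$(bash -e -c "$script" 2>/dev/null) || status=$?

echo "exit status: $status"   # 1: the child shell aborted on the failed call
echo "stdout: [$output]"      # empty: "queue built" was never printed
```

Because the non-zero status comes from a plain command rather than from a guarded expression, nothing after it runs, which matches the 0-token observation.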
audit-diff of the two runs shows no firewall anomalies, identical API-call volume (11 calls each), and 0 token usage in both — confirming the agent never ran and the failure is entirely in the pre-fetch shell step, not in agent behavior or model/provider.
Probable root cause
The pre-fetch step issues gh api/gh aw checks calls with no retry or soft-fail wrapper for the primary queue query. Some sub-queries already use 2>/dev/null || echo "..." fallbacks (per-PR checks/comments), but the top-level GraphQL queue fetch is not guarded, so any transient 5xx from GitHub aborts the whole step under bash -e.
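For contrast, here is a minimal sketch of the fallback style that the per-PR sub-queries are described as using. The function flaky_call and the placeholder string "unavailable" are assumptions made for illustration; the report elides the actual fallback text.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for any gh call that hits a transient 5xx.
flaky_call() {
  echo "HTTP 502: 502 Bad Gateway" >&2
  return 1
}

# The body runs in a subshell with errexit on, like the real step.
guarded_step() (
  set -e
  # stderr is discarded and a placeholder is substituted on failure, so the
  # overall expression succeeds and errexit never fires.
  checks=$(flaky_call 2>/dev/null || echo "unavailable")
  echo "checks=$checks"
)

guarded_step   # prints: checks=unavailable
```

The top-level queue fetch lacks an equivalent guard, which is why it alone can take the whole step down.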
Proposed remediation
Wrap the top-level GraphQL queue fetch in a bounded retry (e.g. 3 attempts with backoff) for transient 5xx responses (including 502) and timeouts.
On exhausted retries, soft-fail: emit an empty eligible queue (eligible_count=0) and exit 0 so the run is a clean no-op rather than a red failure — matching how the per-PR sub-queries already degrade.
Optionally classify transient GitHub-API pre-fetch errors so they are not counted as agent/workflow defects in failure reporting.
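The three remediation items above could be combined roughly as follows. This is a sketch under stated assumptions, not the workflow's actual code: the function names, the MAX_ATTEMPTS and BACKOFF_BASE knobs, the exit code 75, and the choice to print eligible_count=0 on stdout are all hypothetical, since the report does not show how the real step publishes its queue.

```shell
#!/usr/bin/env bash
# Sketch only: every name below is hypothetical.

# Classify captured stderr: succeed if it looks like a transient upstream error.
is_transient() {
  printf '%s' "$1" | grep -Eqi 'HTTP 5[0-9][0-9]|timeout|timed out'
}

# fetch_queue_with_retry CMD [ARGS...]
# Runs CMD up to MAX_ATTEMPTS times (default 3) with doubling backoff.
# Returns 0 and prints CMD's stdout on success, 75 when retries on a
# transient error are exhausted, 1 on any non-transient error.
fetch_queue_with_retry() {
  local max delay attempt errfile out err
  max=${MAX_ATTEMPTS:-3}
  delay=${BACKOFF_BASE:-2}
  attempt=1
  errfile=$(mktemp)
  while :; do
    if out=$("$@" 2>"$errfile"); then
      rm -f "$errfile"
      printf '%s\n' "$out"
      return 0
    fi
    err=$(cat "$errfile")
    if ! is_transient "$err" || [ "$attempt" -ge "$max" ]; then
      printf '%s\n' "$err" >&2
      rm -f "$errfile"
      is_transient "$err" && return 75
      return 1
    fi
    echo "attempt $attempt hit a transient error; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# run_prefetch CMD [ARGS...]
# Soft-fails to an empty queue when only transient errors were seen.
run_prefetch() {
  local queue rc
  if queue=$(fetch_queue_with_retry "$@"); then
    printf '%s\n' "$queue"
  else
    rc=$?
    if [ "$rc" -eq 75 ]; then
      echo "transient GitHub API error; emitting empty queue" >&2
      echo "eligible_count=0"
      return 0
    fi
    return "$rc"
  fi
}

# Possible usage (QUEUE_QUERY is a hypothetical variable holding the query):
#   run_prefetch gh api graphql -f query="$QUEUE_QUERY"
```

Exit code 75 mirrors the EX_TEMPFAIL convention from sysexits.h, but any value distinct from an ordinary failure would serve. Non-transient errors such as authentication failures still return non-zero, so genuine defects keep failing the step.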
Success criteria / verification
A simulated/transient 502 on the queue GraphQL call no longer fails the agent job; the run completes as a no-op (eligible_count=0) or after a successful retry.
PR Sous Chef scheduled runs stop showing failure at "Fetch open non-draft PR queue" for transient GitHub API errors over a subsequent 24–48h window.
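One way to simulate the 502 for the first criterion is to shadow gh with a stub placed ahead of it on PATH. The sketch below is hypothetical test scaffolding: the FAIL_TIMES knob, the counter file, and the placeholder JSON are assumptions, and the real query response will look different.

```shell
#!/usr/bin/env bash
# Throwaway directory holding a fake gh and its call counter.
stub_dir=$(mktemp -d)
echo 0 > "$stub_dir/calls"

cat > "$stub_dir/gh" <<'EOF'
#!/usr/bin/env bash
# Fake gh: the first FAIL_TIMES invocations (default 2) mimic the 502
# quoted in this report; later invocations print placeholder JSON.
dir=$(dirname "$0")
calls=$(( $(cat "$dir/calls") + 1 ))
echo "$calls" > "$dir/calls"
if [ "$calls" -le "${FAIL_TIMES:-2}" ]; then
  echo "HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)" >&2
  exit 1
fi
echo '{"data":{}}'
EOF
chmod +x "$stub_dir/gh"

# Put the stub ahead of the real gh for this shell only.
PATH="$stub_dir:$PATH"
export PATH

# Three calls: the first two fail with status 1, the third succeeds.
for i in 1 2 3; do
  if out=$(gh api graphql 2>/dev/null); then
    echo "call $i: ok $out"
  else
    echo "call $i: failed with status $?"
  fi
done
```

Running the pre-fetch script with this PATH and FAIL_TIMES set below the retry limit should exercise the successful-retry path. Setting FAIL_TIMES to a large value should exercise the soft-fail path.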
The original failure was a transient GitHub GraphQL 502 in PR Sous Chef pre-fetch. Sous Chef ran green across the 5 most recent scheduled runs on 2026-06-22 (01:01–07:18Z).
Recent runs (5): all success.
Root cause was upstream/transient (GraphQL 502), not a workflow defect.
Closing as stale/resolved. Reopen if pre-fetch hard-fails on a transient 5xx again. The durable fix (retry/soft-fail on 5xx in pre-fetch) remains worth tracking if the failure recurs.
Auto-closed by 6h Failure Investigator after correlating fresh run history.
Parent: #39883 (6h failure investigation report).
Related to [aw-failures] [aw] Failure Investigation Report - 6h window (2026-06-17 19:34 UTC) #39883