[aw-failures] [aw] PR Sous Chef hard-fails on transient GitHub GraphQL 502 in pre-fetch step — agent never runs #40310

@github-actions

Make the Fetch open non-draft PR queue pre-fetch step tolerate transient GitHub GraphQL errors — a single 502 Bad Gateway from api.github.com/graphql currently fails the entire workflow before the agent runs. PR Sous Chef failed twice in the last 6h window (2/2 scheduled runs), both before the agent executed (0 tokens consumed), purely because a transient GitHub API blip aborted the deterministic pre-fetch shell step.

Parent: #39883 (6h failure investigation report).

Problem statement

The Fetch open non-draft PR queue step runs under bash -e and calls gh api .../graphql to build the candidate PR queue. When GitHub returns a transient HTTP 502: 502 Bad Gateway (https://api.github.com/graphql), the unguarded gh invocation exits non-zero, bash -e propagates it, and the step ends with ##[error]Process completed with exit code 1. The agent job is marked failure even though nothing in the workflow logic or model was at fault — the agent never started.

Affected workflow and run IDs

| Run | Created (UTC) | Engine | Failing step | Tokens | Signature |
|---|---|---|---|---|---|
| §27827037838 | 2026-06-19 12:56 | copilot | Fetch open non-draft PR queue | 0 | HTTP 502 ... api.github.com/graphql → exit 1 |
| §27823332294 | 2026-06-19 11:35 | copilot | Fetch open non-draft PR queue | 0 | same step, agent never executed |

audit-diff of the two runs shows no firewall anomalies, identical API-call volume (11 calls each), and 0 token usage in both — confirming the agent never ran and the failure is entirely in the pre-fetch shell step, not in agent behavior or model/provider.

Probable root cause

The pre-fetch step issues gh api and gh aw checks calls, and the primary queue query has no retry or soft-fail wrapper. Some sub-queries (per-PR checks/comments) already use 2>/dev/null || echo "..." fallbacks, but the top-level GraphQL queue fetch is not guarded, so any transient 5xx from GitHub aborts the whole step under bash -e.
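The difference between the unguarded and guarded forms can be reproduced without touching the network. The sketch below is illustrative only: fake_gh_502, demo_unguarded_step and demo_guarded_step are hypothetical names, and the "[]" fallback is a placeholder, not the value the workflow actually emits.

```shell
#!/usr/bin/env bash
# Illustration only: fake_gh_502 stands in for the real `gh api .../graphql`
# call so that nothing here touches the network. All names are hypothetical.

# Unguarded: under `bash -e` the failed command substitution aborts the script.
demo_unguarded_step() {
  bash -e -c '
    fake_gh_502() {
      echo "HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)" >&2
      return 1
    }
    queue="$(fake_gh_502)"   # non-zero exit status aborts the script here
    echo "queue=$queue"      # never reached
  '
}

# Guarded: the `|| fallback` form already used by the per-PR sub-queries.
demo_guarded_step() {
  bash -e -c '
    fake_gh_502() {
      echo "HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)" >&2
      return 1
    }
    queue="$(fake_gh_502 2>/dev/null || echo "[]")"   # placeholder fallback
    echo "queue=$queue"
  '
}
```

Calling demo_unguarded_step exits non-zero and prints nothing to stdout, which matches the Process completed with exit code 1 signature; demo_guarded_step continues past the failed call with the fallback value.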

Proposed remediation

  1. Wrap the top-level GraphQL queue fetch in a bounded retry (e.g. 3 attempts with backoff) for transient 5xx responses (such as 502) and timeouts.
  2. On exhausted retries, soft-fail: emit an empty eligible queue (eligible_count=0) and exit 0 so the run is a clean no-op rather than a red failure — matching how the per-PR sub-queries already degrade.
  3. Optionally classify transient GitHub-API pre-fetch errors so they are not counted as agent/workflow defects in failure reporting.
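Items 1 and 2 could be sketched roughly as below. This is a minimal sketch, not the workflow's actual code: retry_with_backoff, fetch_queue_or_empty, RETRY_MAX_ATTEMPTS and RETRY_BASE_DELAY are hypothetical names, and the JSON shape of the empty queue is an assumption (the report only specifies eligible_count=0). For simplicity it retries on any non-zero exit rather than inspecting the error for 5xx specifically.

```shell
#!/usr/bin/env bash
# Sketch under stated assumptions; every name here is hypothetical.

# Run "$@" up to RETRY_MAX_ATTEMPTS times (default 3), doubling the delay
# between attempts. Prints the command's stdout on success.
retry_with_backoff() {
  local max_attempts="${RETRY_MAX_ATTEMPTS:-3}"
  local delay="${RETRY_BASE_DELAY:-2}"
  local attempt=1
  local output

  while true; do
    # Testing the command inside `if` keeps `set -e` from aborting the step.
    if output="$("$@")"; then
      printf '%s\n' "$output"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    echo "attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Soft-fail: on exhausted retries, emit an empty queue and return 0 so the
# run becomes a clean no-op. The JSON shape is assumed for illustration.
fetch_queue_or_empty() {
  local result
  if result="$(retry_with_backoff "$@")"; then
    printf '%s\n' "$result"
  else
    echo "queue fetch failed after retries; emitting empty queue" >&2
    printf '%s\n' '{"eligible_count":0,"queue":[]}'
  fi
  return 0
}
```

In the pre-fetch step this might be invoked as fetch_queue_or_empty gh api graphql -f query="$QUERY", where QUERY is a placeholder for the existing queue query. Logging the soft-fail to stderr keeps the degraded run visible in the step log, which would also help with the classification suggested in item 3.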

Success criteria / verification

  • A simulated/transient 502 on the queue GraphQL call no longer fails the agent job; the run completes as a no-op (eligible_count=0) or after a successful retry.
  • PR Sous Chef scheduled runs stop showing failure at Fetch open non-draft PR queue for transient GitHub API errors over a subsequent 24–48h window.
  • No change to behavior when the GraphQL call succeeds (queue is built as before).

Related to [aw-failures] [aw] Failure Investigation Report — 6h window (2026-06-17 19:34 UTC) #39883
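One way to exercise the first success criterion locally is to put a stub gh ahead of the real one on PATH. The helper below is a hedged sketch: make_flaky_gh is a hypothetical name, the stub ignores its arguments, and the {"data":{}} payload is a placeholder rather than a real GraphQL response.

```shell
#!/usr/bin/env bash
# make_flaky_gh DIR FAILS
# Writes a stub `gh` executable into DIR that fails its first FAILS calls with
# a 502-style message on stderr, then succeeds with a placeholder payload.
# Hypothetical helper for local testing; it never contacts GitHub.
make_flaky_gh() {
  local dir="$1" fails="$2"
  printf '0\n' > "$dir/calls"
  cat > "$dir/gh" <<EOF
#!/bin/sh
n=\$(cat "$dir/calls")
n=\$((n + 1))
echo "\$n" > "$dir/calls"
if [ "\$n" -le "$fails" ]; then
  echo "HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)" >&2
  exit 1
fi
echo '{"data":{}}'
EOF
  chmod +x "$dir/gh"
}
```

With the stub directory prepended to PATH, running the pre-fetch step with FAILS=1 should exercise the retry path, and a FAILS value above the retry limit should exercise the soft-fail path. The call count recorded in DIR/calls shows how many attempts were made.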

Generated by 🔍 [aw] Failure Investigator (6h) · 115.7 AIC · ⌖ 13.1 AIC · ⊞ 4.9K

  • expires on Jun 26, 2026, 6:03 AM UTC-08:00

Closing — transient, resolved (fresh evidence 2026-06-22)

The original failure was a transient GitHub GraphQL 502 in PR Sous Chef pre-fetch. Sous Chef ran green across the 5 most recent scheduled runs on 2026-06-22 (01:01–07:18Z).

  1. Recent runs (5): all success.
  2. Root cause was upstream/transient (GraphQL 502), not a workflow defect.

Closing as stale/resolved. Reopen if pre-fetch hard-fails on a transient 5xx again. The durable fix (retry/soft-fail on 5xx in pre-fetch) has not been applied and remains worth tracking if the failure recurs.

Auto-closed by 6h Failure Investigator after correlating fresh run history.

Generated by 🔍 [aw] Failure Investigator (6h) · 253.8 AIC · ⊞ 4.9K
