[aw-failures] [aw] PR Sous Chef hard-fails on transient GitHub GraphQL 502 in pre-fetch step — agent never runs

**Make the `Fetch open non-draft PR queue` pre-fetch step tolerate transient GitHub GraphQL errors — a single `502 Bad Gateway` from `api.github.com/graphql` currently fails the entire workflow before the agent runs.** PR Sous Chef failed twice in the last 6h window (2/2 scheduled runs), both before the agent executed (0 tokens consumed), purely because a transient GitHub API blip aborted the deterministic pre-fetch shell step.

Parent: #39883 (6h failure investigation report).

### Problem statement

The `Fetch open non-draft PR queue` step runs under `bash -e` and calls `gh api .../graphql` to build the candidate PR queue. When GitHub returns a transient `HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)`, the unguarded `gh` invocation exits non-zero, `bash -e` propagates it, and the step ends with `##[error]Process completed with exit code 1`. The agent job is marked `failure` even though nothing in the workflow logic or model was at fault — the agent never started.

### Affected workflow and run IDs

| Run | Created (UTC) | Engine | Failing step | Tokens | Signature |
|---|---|---|---|---|---|
| [§27827037838](https://github.com/github/gh-aw/actions/runs/27827037838) | 2026-06-19 12:56 | copilot | Fetch open non-draft PR queue | 0 | `HTTP 502 ... api.github.com/graphql` → exit 1 |
| [§27823332294](https://github.com/github/gh-aw/actions/runs/27823332294) | 2026-06-19 11:35 | copilot | Fetch open non-draft PR queue | 0 | same step, agent never executed |

`audit-diff` of the two runs shows **no firewall anomalies**, identical API-call volume (11 calls each), and **0 token usage in both** — confirming the agent never ran and the failure is entirely in the pre-fetch shell step, not in agent behavior or model/provider.

### Probable root cause

The pre-fetch step issues `gh api`/`gh aw checks` calls with no retry or soft-fail wrapper for the primary queue query. Some sub-queries already use `2>/dev/null || echo "..."` fallbacks (per-PR checks/comments), but the **top-level GraphQL queue fetch is not guarded**, so any transient 5xx from GitHub aborts the whole step under `bash -e`.
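The difference between the guarded sub-queries and the unguarded queue fetch can be reproduced in isolation. The sketch below is an illustration only: `fake_gh` is a hypothetical stand-in for the real `gh api .../graphql` call, and the `[]` fallback value is an assumption, not the string the workflow actually echoes.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the `gh api .../graphql` call during a 502.
STUB='fake_gh() { echo "HTTP 502: 502 Bad Gateway" >&2; return 1; }'

# Unguarded: the failing assignment trips `bash -e`, so the echo never runs
# and the child shell exits non-zero.
unguarded_status=0
unguarded_out=$(bash -e -c "${STUB}"'
queue=$(fake_gh)
echo "queue=${queue}"' 2>/dev/null) || unguarded_status=$?

# Guarded: the `|| echo` fallback gives the assignment a zero status,
# so the script continues with a placeholder value.
guarded_status=0
guarded_out=$(bash -e -c "${STUB}"'
queue=$(fake_gh 2>/dev/null || echo "[]")
echo "queue=${queue}"') || guarded_status=$?

echo "unguarded: status=${unguarded_status} output='${unguarded_out}'"
echo "guarded:   status=${guarded_status} output='${guarded_out}'"
```

The unguarded variant exits with status 1 and prints nothing, which mirrors the `exit code 1` seen in the failed runs; the guarded variant exits 0 with `queue=[]`.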

### Proposed remediation

1. Wrap the top-level GraphQL queue fetch in a bounded retry (e.g. 3 attempts with backoff) for transient `5xx` responses (including `502`) and timeouts.
2. On exhausted retries, **soft-fail**: emit an empty eligible queue (`eligible_count=0`) and exit `0` so the run is a clean no-op rather than a red failure — matching how the per-PR sub-queries already degrade.
3. Optionally classify transient GitHub-API pre-fetch errors so they are not counted as agent/workflow defects in failure reporting.
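One possible shape for steps 1 to 3 is sketched below. Everything in it is an assumption made for illustration: the function names (`is_transient_error`, `retry_transient`, `fetch_queue_soft`), the `RETRY_ATTEMPTS` and `RETRY_DELAY` variables, the sentinel exit code 75, and the error-text patterns are not taken from the PR Sous Chef workflow.

```shell
#!/usr/bin/env bash
# Sketch only: names, attempt counts, delays, and exit code 75 are
# illustrative assumptions, not taken from the PR Sous Chef workflow.

# Succeeds when the captured stderr text looks like a transient API failure.
is_transient_error() {
  case "$1" in
    *"HTTP 5"[0-9][0-9]*|*[Tt]imeout*|*"timed out"*) return 0 ;;
    *) return 1 ;;
  esac
}

# retry_transient MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Prints COMMAND's stdout on success. Returns 75 when every attempt failed
# with a transient error, or COMMAND's own status for any other failure.
retry_transient() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1 out err status err_file
  err_file=$(mktemp)
  while :; do
    status=0
    out=$("$@" 2>"$err_file") || status=$?
    if [ "$status" -eq 0 ]; then
      rm -f "$err_file"
      printf '%s\n' "$out"
      return 0
    fi
    err=$(cat "$err_file")
    printf '%s\n' "$err" >&2
    if ! is_transient_error "$err"; then
      rm -f "$err_file"
      return "$status"          # not transient: keep the hard failure
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      rm -f "$err_file"
      return 75                 # transient, and out of attempts
    fi
    sleep "$delay"
    delay=$((delay * 2))        # exponential backoff
    attempt=$((attempt + 1))
  done
}

# fetch_queue_soft COMMAND [ARGS...]
# Soft-fail wrapper: on exhausted transient retries, report an empty queue
# and return 0 so the step ends as a clean no-op.
fetch_queue_soft() {
  local status=0 queue
  queue=$(retry_transient "${RETRY_ATTEMPTS:-3}" "${RETRY_DELAY:-2}" "$@") || status=$?
  case "$status" in
    0)  printf '%s\n' "$queue" ;;
    75) echo "eligible_count=0" ;;
    *)  return "$status" ;;
  esac
}

# Usage (the query variable is a placeholder for the real queue query):
#   fetch_queue_soft gh api graphql -f query="$QUEUE_QUERY"
```

Keeping the classification separate from the retry loop means a non-transient error (for example an authentication failure) still fails the step, while only exhausted transient retries degrade to `eligible_count=0`. The same classifier could feed the failure-reporting distinction in step 3.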

### Success criteria / verification

- A simulated/transient `502` on the queue GraphQL call no longer fails the `agent` job; the run completes as a no-op (`eligible_count=0`) or after a successful retry.
- PR Sous Chef scheduled runs stop showing `failure` at `Fetch open non-draft PR queue` for transient GitHub API errors over a subsequent 24–48h window.
- No change to behavior when the GraphQL call succeeds (queue is built as before).

Related to #39883
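The first criterion could be checked locally without waiting for a real outage by placing a fake `gh` ahead of the real one on `PATH`. This is a sketch under stated assumptions: the fake `gh` script, the `step.sh` placeholder body, and its example query are all hypothetical, and the real pre-fetch script would be substituted for `step.sh`.

```shell
#!/usr/bin/env bash
# Sketch only: the fake `gh`, the step body, and the file names are hypothetical.
workdir=$(mktemp -d)

# A fake `gh` that always answers the way the failed runs did.
cat > "$workdir/gh" <<'EOF'
#!/bin/sh
echo "HTTP 502: 502 Bad Gateway (https://api.github.com/graphql)" >&2
exit 1
EOF
chmod +x "$workdir/gh"

# Placeholder for the step under test; substitute the real pre-fetch script.
cat > "$workdir/step.sh" <<'EOF'
queue=$(gh api graphql -f query='{ viewer { login } }' 2>/dev/null) || {
  echo "eligible_count=0"
  exit 0
}
echo "$queue"
EOF

# Run the step under `bash -e`, as the workflow does, with the fake first on PATH.
step_status=0
step_output=$(PATH="$workdir:$PATH" bash -e "$workdir/step.sh") || step_status=$?

echo "status=${step_status} output=${step_output}"
rm -rf "$workdir"
```

With a soft-failing step the check reports status 0 and `eligible_count=0`; pointing it at the current unguarded step would be expected to report a non-zero status instead.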

> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/27829815434) · 115.7 AIC · ⌖ 13.1 AIC · ⊞ 4.9K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires  on Jun 26, 2026, 6:03 AM UTC-08:00

---

### Closing — transient, resolved (fresh evidence 2026-06-22)

The original failure was a **transient GitHub GraphQL 502** in PR Sous Chef pre-fetch. Sous Chef ran **green across the 5 most recent scheduled runs on 2026-06-22** (01:01–07:18Z).

1. Recent runs (5): all `success`.
2. Root cause was upstream/transient (GraphQL 502), not a workflow defect.

Closing as stale/resolved. Reopen if pre-fetch hard-fails on a transient 5xx again; the durable fix (retry/soft-fail on 5xx in pre-fetch) remains worth tracking if the failure recurs.

_Auto-closed by 6h Failure Investigator after correlating fresh run history._

> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/27940490237) · 253.8 AIC · ⊞ 4.9K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
