fix(doc-maintainer): prevent maxRuns 403 from wasted shell turns #5564

Merged
lpcox merged 3 commits into main from fix/doc-maintainer-maxruns-403-5552
Jun 26, 2026
@lpcox lpcox commented Jun 26, 2026

Collaborator

Problem

The Documentation Maintainer workflow has failed every day since 2026-06-20 (last success 06-19), each time auto-filing a failure issue (#5552). The reported error is misleading:

Authentication failed with provider at http://172.30.0.30:10002 (HTTP 403).
  Check your COPILOT_PROVIDER_API_KEY or COPILOT_PROVIDER_BEARER_TOKEN.

This is not a credentials problem — the api-proxy health-check passes and the BYOK key is correctly configured.

Root cause

  1. The agent (claude-haiku-4.5, max-turns: 8) tried to "verify code examples" by running shell commands — npm test, ls -la src/*.test.ts — but the workflow sets bash: false. Each attempt returned Permission denied and could not request permission from user, wasting turns and producing no output (Changes +0 -0).
  2. gh-aw maps max-turns → the api-proxy maxRuns hard cap. The run's reflect snapshot confirms exhaustion:
    "runs": { "max_runs": 8, "invocation_count": 8, "remaining_runs": 0 }
  3. Once the 8-invocation budget is exhausted, the api-proxy returns a terminal, non-retryable 403 (max_runs_exceeded — deliberately 403, not 429, so SDK clients don't retry-storm; see containers/api-proxy/guards/common-guard-checks.js). The Copilot CLI surfaces that 403 as a generic "Authentication failed" error, and the failure-issue reporter treats it as an engine crash.

So a benign turn-budget exhaustion (made worse by wasted shell attempts) masquerades as an auth failure.
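
The budget mechanics described above can be sketched in a few lines. This is a minimal illustration, not the api-proxy's actual guard: `createRunBudgetGuard`, `GuardResult`, and `RunsSnapshot` are hypothetical names, and only the snapshot field names, the 403 status, and the `max_runs_exceeded` code are taken from this description.

```typescript
// Hypothetical sketch of a run-budget guard; not the real api-proxy code.
interface RunsSnapshot {
  max_runs: number;
  invocation_count: number;
  remaining_runs: number;
}

interface GuardResult {
  allowed: boolean;
  status: number; // HTTP status the proxy would answer with
  code?: string; // error code, set only when the request is refused
}

function createRunBudgetGuard(maxRuns: number) {
  let invocationCount = 0;
  return {
    // Called once per LLM invocation. A turn spent on a denied shell
    // command still consumes one invocation from the budget.
    check(): GuardResult {
      if (invocationCount >= maxRuns) {
        // Terminal refusal: 403 rather than 429, so clients do not retry.
        return { allowed: false, status: 403, code: "max_runs_exceeded" };
      }
      invocationCount += 1;
      return { allowed: true, status: 200 };
    },
    snapshot(): RunsSnapshot {
      return {
        max_runs: maxRuns,
        invocation_count: invocationCount,
        remaining_runs: maxRuns - invocationCount,
      };
    },
  };
}

const guard = createRunBudgetGuard(8);
for (let turn = 1; turn <= 8; turn++) {
  guard.check(); // turns 1 through 8 are allowed
}
console.log(guard.snapshot()); // max_runs 8, invocation_count 8, remaining_runs 0
console.log(guard.check()); // refused with status 403 and code max_runs_exceeded
```

With a budget of 8, the ninth call is the first one refused, which matches the exhausted snapshot quoted in the root-cause list.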

Fix (workflow-scoped)

  • Stop wasting turns on disabled shell: the prompt now explicitly tells the agent it has no shell access and must not attempt git, npm test, ls, or any shell command, and that code-example verification is read-only.
  • Give budget headroom: max-turns 8 → 15, so the agent can finish before tripping the api-proxy maxRuns cap.
  • Recompiled doc-maintainer.lock.yml (maxRuns/GH_AW_MAX_TURNS → 15) and ran the smoke-workflow post-processor (no other lock files changed).
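
Because the source workflow and the compiled lock must agree on the turn budget, a consistency check along the following lines may be useful. It is a sketch under assumptions: `readNumber` and `turnsInSync` are hypothetical helpers, the key names (`max-turns`, `maxRuns`, `GH_AW_MAX_TURNS`) come from this description, and the exact quoting and layout of the real files are not known here, so the pattern tolerates optional quotes.

```typescript
// Hypothetical helpers; file layout and quoting are assumptions.
function readNumber(text: string, key: string): number | undefined {
  // Escape regex metacharacters in the key before building the pattern.
  const escaped = key.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const match = new RegExp(`${escaped}["']?:\\s*["']?(\\d+)`).exec(text);
  return match ? Number(match[1]) : undefined;
}

// True only when the source budget is present and both lock values match it.
function turnsInSync(sourceText: string, lockText: string): boolean {
  const sourceTurns = readNumber(sourceText, "max-turns");
  const lockTurns = readNumber(lockText, "GH_AW_MAX_TURNS");
  const lockRuns = readNumber(lockText, "maxRuns");
  return (
    sourceTurns !== undefined &&
    sourceTurns === lockTurns &&
    sourceTurns === lockRuns
  );
}

// Inline stand-ins for the two files, for illustration only.
const sourceSample = "max-turns: 15\n";
const lockSample = "GH_AW_MAX_TURNS: 15\nmaxRuns: 15\n";
console.log(turnsInSync(sourceSample, lockSample)); // true
```

A stale lock (for example one still carrying 8) makes `turnsInSync` return false, which is the drift a recompile is meant to remove.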

Why not change the api-proxy?

The 403-on-max_runs_exceeded behavior is intentional and documented in common-guard-checks.js (429 would cause retry-storms against a non-recovering cap). The mislabeling of that 403 as "Authentication failed" happens in the Copilot CLI / gh-aw harness, outside this repo. The correct, in-scope fix is to keep this workflow within its budget.
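
The reasoning for 403 over 429 can be illustrated with a client-side response classifier. This is a hypothetical policy, not the logic of the Copilot CLI or of any particular SDK: `classifyResponse` and the `Verdict` labels are invented for illustration, and it assumes the client can read an error code such as `max_runs_exceeded` from the response, which this PR does not confirm.

```typescript
// Hypothetical client policy; not the Copilot CLI's or any SDK's real logic.
type Verdict = "ok" | "retry" | "fail-budget" | "fail-auth" | "fail-other";

function classifyResponse(status: number, errorCode?: string): Verdict {
  if (status >= 200 && status < 300) return "ok";
  // Typical retry policies treat 429 and 5xx as transient and try again.
  // Against a cap that never recovers, that would loop without progress.
  if (status === 429 || status >= 500) return "retry";
  // A 403 is terminal. With the error code available, budget exhaustion
  // can be told apart from a genuine credentials problem.
  if (status === 403 && errorCode === "max_runs_exceeded") return "fail-budget";
  if (status === 401 || status === 403) return "fail-auth";
  return "fail-other";
}

console.log(classifyResponse(429)); // "retry"
console.log(classifyResponse(403, "max_runs_exceeded")); // "fail-budget"
console.log(classifyResponse(403)); // "fail-auth"
```

A client that ignores the error code falls through to the `fail-auth` branch for every 403, which is the mislabeling described above.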

Fixes #5552

The Documentation Maintainer workflow failed daily (since 2026-06-20)
with a misleading "Authentication failed with provider ... (HTTP 403)"
engine error that auto-files a failure issue (#5552).

Root cause: the agent (claude-haiku-4.5, max-turns: 8) tried to run
shell commands (`npm test`, `ls -la src/*.test.ts`) to "verify code
examples", but the workflow sets `bash: false`. Each attempt returned
"Permission denied and could not request permission from user",
burning the agent's limited turn budget without producing any output.
gh-aw maps `max-turns` to the api-proxy `maxRuns` hard cap, so once the
8 LLM invocations were exhausted the api-proxy returned a terminal,
non-retryable 403 (max_runs_exceeded). The Copilot CLI surfaces that
403 as a generic "Authentication failed" credentials error, which the
failure-issue reporter treats as an engine crash.

Fix (workflow-scoped):
- Tell the agent explicitly it has no shell access and must not attempt
  `git`, `npm test`, `ls`, or any shell command (they fail and waste
  turns), and that code-example verification is read-only.
- Raise max-turns 8 -> 15 so the agent has budget headroom to finish
  before tripping the api-proxy maxRuns cap.

Recompiled doc-maintainer.lock.yml (maxRuns/GH_AW_MAX_TURNS -> 15).

Fixes #5552

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 26, 2026 15:35

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Documentation Maintainer agentic workflow to avoid wasting turns on disallowed shell usage (which can exhaust the api-proxy maxRuns budget and surface as a misleading Copilot “Authentication failed (403)” error), and increases the turn budget to provide more headroom for completing the documentation sync task.

Changes:

  • Increase the workflow agent budget from max-turns: 8 to 15.
  • Update the agent prompt to explicitly forbid shell-based verification and emphasize read-only verification via precomputed context and file reading.
  • Regenerate doc-maintainer.lock.yml so maxRuns/GH_AW_MAX_TURNS match the new max-turns value.
Show a summary per file
| File | Description |
| --- | --- |
| .github/workflows/doc-maintainer.md | Bumps max-turns and strengthens the prompt instructions to avoid shell attempts and wasted turns. |
| .github/workflows/doc-maintainer.lock.yml | Regenerated lock to reflect maxRuns: 15 and GH_AW_MAX_TURNS: 15. |

Review details

  • Files reviewed: 2/2 changed files
  • Comments generated: 2
  • Review effort level: Low

Comment thread .github/workflows/doc-maintainer.md Outdated
Comment thread .github/workflows/doc-maintainer.md Outdated
lpcox and others added 2 commits June 26, 2026 08:51
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

Contributor

✅ Copilot review passed with no inline comments.

@lpcox Add the ready-for-aw label to this PR to trigger agentic CI smoke tests.

@github-actions

Contributor

🚀 Security Guard has started processing this pull request

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Build Test Suite completed successfully!

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Smoke Gemini completed. All facets verified. 💎

Testing bridge connectivity

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded.

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

🔌 Smoke Services — All services reachable! ✅

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

🔑 Smoke Copilot PAT: PAT auth validated. All systems operational. ✅

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Contribution Check completed successfully!

Contribution guidelines review complete: PR #5564 follows the applicable CONTRIBUTING.md requirements based on the provided metadata, diff, and contributing guide; no PR comment needed.

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Smoke Claude passed

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤

@github-actions

github-actions Bot commented Jun 26, 2026

Contributor

Smoke Copilot BYOK AOAI (api-key) completed. Copilot AOAI BYOK (api-key) mode operational. 🔓

@github-actions

Contributor

Smoke Test: Claude Engine Validation

  • API check: ✅ PASS
  • gh check: ✅ PASS
  • File check: ✅ PASS

Overall result: PASS

Generated by Smoke Claude for issue #5564 · 61.4 AIC · ⊞ 6.5K ·

@github-actions

Contributor

Smoke Test: Copilot BYOK (Direct) ✅ PASS

Tests:

Mode: Direct BYOK (COPILOT_PROVIDER_API_KEY) via api-proxy sidecar

Result: All checks passed. Workflow is processing correctly.

🔑 BYOK report filed by Smoke Copilot BYOK

@github-actions

Contributor

🔬 Smoke Test Results — PASS

PR: fix(doc-maintainer): prevent maxRuns 403 from wasted shell turns by @lpcox

| Test | Result |
| --- | --- |
| GitHub MCP connectivity | |
| GitHub.com HTTP | ✅ 200 |
| File write/read | ⚠️ Pre-step vars not substituted |

Overall: PASS (2/2 live checks passed; pre-step data unavailable due to unsubstituted template vars)

📰 BREAKING: Report filed by Smoke Copilot

@github-actions

Contributor

Smoke Test

  • fix: propagate apiProxy.auth OIDC config fields to all layers
  • [Test Coverage] security: test coverage for compose-sanitizer, domain-validation, and domain-matchers
  • GitHub title check: ✅
  • PR query check: ✅
  • Discussion lookup/comment: ✅
  • File write/read: ✅
  • Build: ✅
  • Overall status: PASS

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

@github-actions

Contributor

🔥 Smoke Test Results — Auth mode: PAT (COPILOT_GITHUB_TOKEN)

| Test | Result |
| --- | --- |
| GitHub MCP connectivity | |
| GitHub.com HTTP | ✅ 200 |
| File write/read | ⚠️ pre-step vars not expanded |

Overall: PASS (core connectivity verified)

PR: fix(doc-maintainer): prevent maxRuns 403 from wasted shell turns
Author: @lpcox · No assignees

🔑 PAT report filed by Smoke Copilot PAT

@github-actions

Contributor

🔭 Smoke Test: API Proxy OpenTelemetry Tracing

| Scenario | Result | Notes |
| --- | --- | --- |
| 1. Module Loading | ✅ Pass | otel.js loads cleanly; exports: startRequestSpan, setTokenAttributes, setBudgetAttributes, endSpan, endSpanError, shutdown, isEnabled + internal helpers |
| 2. Test Suite | ✅ Pass | otel.test.js (39 tests) + otel-fanout.test.js (20 tests): 59/59 passed |
| 3. Env Var Forwarding | ✅ Pass | api-proxy-env-config.ts forwards OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, GITHUB_AW_OTEL_TRACE_ID, GITHUB_AW_OTEL_PARENT_SPAN_ID, OTEL_SERVICE_NAME; observability-environment.ts forwards all OTEL_* vars to agent container |
| 4. Token Tracker Integration | ✅ Pass | token-tracker-http.js onUsage callback present; upstream-token.js wires otel.setTokenAttributes / setBudgetAttributes / endSpan to the callback |
| 5. OTEL Diagnostics | ✅ Pass (graceful degradation) | No OTEL_EXPORTER_OTLP_ENDPOINT configured in sandbox; otel.js loaded without error, confirming no-config degradation works |

All scenarios pass. OTEL tracing integration is functional on this PR's branch.

📡 OTel tracing validated by Smoke OTel Tracing

@github-actions

Contributor

Smoke Test Results

  • PRs: Could not retrieve titles (Network/Tool error) ❌
  • GitHub Connectivity: ❌
  • File Writing: ✅
  • Bash Tool: ✅

Overall status: FAIL

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • localhost

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "localhost"

See Network Configuration for more information.

💎 Faceted by Smoke Gemini

@github-actions

Contributor

@lpcox ✅ GitHub MCP test
✅ GitHub.com connectivity (HTTP 200)
✅ File write/read test
✅ Direct BYOK inference mode
Running in direct BYOK mode (AWF_AUTH_TYPE=github-oidc + AWF_AUTH_AZURE_* + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw) authenticated via Microsoft Entra
Overall: PASS

🪪 BYOK (AOAI Entra) report filed by Smoke Copilot BYOK AOAI (Entra)

@github-actions

Contributor

🧪 Chroot Version Comparison Results

| Runtime | Host Version | Chroot Version | Match? |
| --- | --- | --- | --- |
| Python | Python 3.12.13 | Python 3.12.3 | ❌ NO |
| Node.js | v24.17.0 | v22.23.0 | ❌ NO |
| Go | go1.22.12 | go1.22.12 | ✅ YES |

Overall: ❌ FAILED — Python and Node.js versions differ between host and chroot environments.

Tested by Smoke Chroot

@github-actions

Contributor

Smoke Test: GitHub Actions Services Connectivity

| Check | Result |
| --- | --- |
| Redis PING | ❌ TCP timeout (no PONG) |
| PostgreSQL pg_isready | ❌ No response |
| PostgreSQL SELECT 1 | ❌ Skipped (pg_isready failed) |

Overall: FAIL

host.docker.internal resolves to 172.17.0.1 but both ports 6379 and 5432 are unreachable (TCP timeout). The service containers do not appear to be running or accessible from this environment.

🔌 Service connectivity validated by Smoke Services

@github-actions

Contributor

Smoke Test Results for PR #5564

@lpcox

  • GitHub MCP connectivity: ✅
  • GitHub.com HTTP: ✅
  • File write/read: ✅
  • BYOK inference: ✅

Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw)

Overall: PASS

🔑 BYOK (AOAI api-key) report filed by Smoke Copilot BYOK AOAI (api-key)

@github-actions

Contributor

🏗️ Build Test Suite Results

Ecosystem Project Build/Install Tests Status
Bun elysia 1/1 passed ✅ PASS
Bun hono 1/1 passed ✅ PASS
C++ fmt N/A ✅ PASS
C++ json N/A ✅ PASS
Deno oak N/A 1/1 passed ✅ PASS
Deno std N/A 1/1 passed ✅ PASS
.NET hello-world N/A ✅ PASS
.NET json-parse N/A ✅ PASS
Go color ok ✅ PASS
Go env ok ✅ PASS
Go uuid ok ✅ PASS
Java gson 1/1 passed ✅ PASS
Java caffeine 1/1 passed ✅ PASS
Node.js clsx all passed ✅ PASS
Node.js execa all passed ✅ PASS
Node.js p-limit all passed ✅ PASS
Rust fd 1/1 passed ✅ PASS
Rust zoxide 1/1 passed ✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Notes
  • Java: Initial run failed with LocalRepositoryNotAccessibleException because ~/.m2 was root-owned and ~/.m2/repository could not be created. Fixed by specifying -Dmaven.repo.local=/tmp/gh-aw/agent/m2-repo.
  • Rust: Test repos (fd, zoxide) used lightweight wrapper crates — builds completed quickly.
  • Deno: Dependencies downloaded from deno.land/std@0.208.0 on first run.

Generated by Build Test Suite for issue #5564 · 82 AIC · ⊞ 7.8K ·

@lpcox lpcox merged commit 8115e88 into main Jun 26, 2026
87 of 89 checks passed
@lpcox lpcox deleted the fix/doc-maintainer-maxruns-403-5552 branch June 26, 2026 19:42
lpcox added a commit that referenced this pull request Jun 26, 2026
#5587)

PR #5564 raised the Documentation Maintainer `max-turns` from 8 to 15
(to stop the agent exhausting its maxRuns budget on denied shell turns
and surfacing a misleading 403) and rewrote the shell-restriction prompt
line, but did not update scripts/ci/doc-maintainer-workflow.test.ts.

This left `npm test` red with two stale assertions:
- `max-turns: 8` / `GH_AW_MAX_TURNS: 8` -> now 15
- `**Do not run any \`git\` commands**` -> replaced by the new
  `**Do not use the \`shell\` tool** ...` wording

Update the assertions to match the current source workflow and lock.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[aw] Documentation Maintainer failed

2 participants