Feature: classify failure causes (infra vs agent) before report-failure-as-issue triggers #38565

## Summary

`safe-outputs.report-failure-as-issue` (default `true`) currently fires on **any** non-success job status, including transient infra failures that are clearly not actionable by the maintainer (Docker pulls of MCP container images timing out, Copilot/AI provider 5xx, squid firewall startup failures, etc.). For unattended scheduled workflows this produces a steady stream of noise issues that have to be manually closed.

## Motivation

We have had to opt three different scheduled workflows out of `report-failure-as-issue` for exactly this reason:

| Workflow | Concrete cause | Workaround PR |
|---|---|---|
| `build-failure-analysis` | Copilot/AI server flake during agent run on otherwise-green PRs | microsoft/testfx#8726 |
| `adhoc-qa` | Transient Docker registry timeouts pulling MCP images | microsoft/testfx#8734 |
| `sub-issue-closer` | Same — daily scheduled, MCP image pull intermittent | microsoft/testfx#9000 |

Each of those PRs is a one-line `report-failure-as-issue: false` opt-out, which is the wrong tradeoff: it also silences *real* agent-side failures that would be worth a tracking issue.

The closest existing tracker I found is #26069 ("Systemic MCP registry 401 failures block all agentic workflow safe outputs"), which is about the symptom but not about the noise classification.

## Proposed solution (any of these would help)

1. **Failure classification.** Tag each failure with a category (`infra-pull`, `mcp-401`, `firewall-unhealthy`, `agent-error`, `user-script-error`, `safe-output-validation`, …) based on which step failed, and let `report-failure-as-issue` filter:

   ```yaml
   safe-outputs:
     report-failure-as-issue:
       categories: [agent-error, safe-output-validation]   # not infra
   ```

2. **Smarter default for scheduled triggers.** When `on:` is `schedule` (and only `schedule`), default `report-failure-as-issue` to `false` and require explicit opt-in. Scheduled workflows are the noisiest pattern; PR-triggered ones already have the PR-check signal.

3. **Built-in dedupe.** If an issue is already open for the same `(workflow, failure-category)`, comment on it instead of opening a new one.
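
To make the three options concrete, here is a rough sketch of how they could compose. It is illustrative Python, not gh-aw's actual implementation. The step names in `STEP_CATEGORIES` and the functions `classify`, `default_report`, and `plan_action` are all hypothetical; only the category names come from the list above.

```python
from typing import Iterable, Set, Tuple

# Hypothetical step names: the real compiled job steps may be named differently.
# Category names are the ones proposed in option 1.
STEP_CATEGORIES = {
    "pull-mcp-images": "infra-pull",
    "mcp-registry-auth": "mcp-401",
    "start-firewall": "firewall-unhealthy",
    "run-agent": "agent-error",
    "user-script": "user-script-error",
    "validate-safe-outputs": "safe-output-validation",
}


def classify(failed_step: str) -> str:
    """Option 1: map the failed step to a category.

    Unknown steps fall back to agent-error so that an unclassified
    failure is still reported rather than silently dropped.
    """
    return STEP_CATEGORIES.get(failed_step, "agent-error")


def default_report(triggers: Iterable[str]) -> bool:
    """Option 2: default to False only when `schedule` is the sole trigger."""
    return set(triggers) != {"schedule"}


def plan_action(
    workflow: str,
    failed_step: str,
    categories: Set[str],
    open_issues: Set[Tuple[str, str]],
) -> str:
    """Return "skip", "comment", or "create" for a failed run."""
    category = classify(failed_step)
    if category not in categories:
        return "skip"  # filtered out, e.g. infra noise
    if (workflow, category) in open_issues:
        return "comment"  # option 3: dedupe on (workflow, failure-category)
    return "create"


allowed = {"agent-error", "safe-output-validation"}
open_issues = {("adhoc-qa", "agent-error")}
print(plan_action("adhoc-qa", "pull-mcp-images", allowed, open_issues))   # skip
print(plan_action("adhoc-qa", "run-agent", allowed, open_issues))         # comment
print(plan_action("sub-issue-closer", "run-agent", allowed, open_issues)) # create
```

Falling back to `agent-error` for unknown steps is a deliberate choice in this sketch: it keeps the current behavior for anything the classifier does not recognize, so the filter can only remove noise, never hide a new kind of failure.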

## Related

- #26069 — MCP registry 401 failures (infra root cause, not noise classification)
- #28776 — Circuit breaker for repeatedly failing workflows (adjacent but different scope)

## Environment

- gh-aw: v0.75.x

