Summary
safe-outputs.report-failure-as-issue (default true) currently fires on any non-success job status, including transient infra failures that are clearly not actionable by the maintainer (Docker pulls of MCP container images timing out, Copilot/AI provider 5xx, squid firewall startup failures, etc.). For unattended scheduled workflows this produces a steady stream of noise issues that have to be manually closed.
Motivation
We have had to opt three different scheduled workflows out of report-failure-as-issue for exactly this reason:
| Workflow | Concrete cause | Workaround PR |
| --- | --- | --- |
| build-failure-analysis | Copilot/AI server flake during agent run on otherwise-green PRs | microsoft/testfx#8726 |
| adhoc-qa | Transient Docker registry timeouts pulling MCP images | microsoft/testfx#8734 |
| sub-issue-closer | Same (daily scheduled, MCP image pull intermittent) | microsoft/testfx#9000 |
Each of those PRs is a one-line report-failure-as-issue: false opt-out, which is the wrong tradeoff: it also silences real agent-side failures that would be worth a tracking issue.
The closest existing tracker I found is #26069 ("Systemic MCP registry 401 failures block all agentic workflow safe outputs"), which is about the symptom but not about the noise classification.
Proposed solution (any of these would help)
- Failure classification. Tag each failure with a category (infra-pull, mcp-401, firewall-unhealthy, agent-error, user-script-error, safe-output-validation, …) based on which step failed, and let report-failure-as-issue filter on it:

      safe-outputs:
        report-failure-as-issue:
          categories: [agent-error, safe-output-validation] # not infra

- Smarter default for scheduled triggers. When on: is schedule (and only schedule), default report-failure-as-issue to false and require explicit opt-in. Scheduled workflows are the noisiest pattern; PR-triggered ones already have the PR-check signal.

- Built-in dedupe. If an issue is already open for the same (workflow, failure-category) pair, add a comment to it instead of opening a new one.
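
To make the three options concrete, here is a rough sketch of how they could compose. It is only an illustration of the proposal, not how the project implements anything today: the step-name fragments, the function names, the `unclassified` fallback, and the in-memory `open_issues` lookup are all hypothetical. Only the category names and the `categories` filter come from the proposal above.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple

# Hypothetical step-name fragments mapped to the proposed categories.
# The category names come from the proposal; the step names are invented.
STEP_RULES = (
    ("pull mcp image", "infra-pull"),
    ("mcp registry auth", "mcp-401"),
    ("start firewall", "firewall-unhealthy"),
    ("run agent", "agent-error"),
    ("user script", "user-script-error"),
    ("validate safe outputs", "safe-output-validation"),
)
FALLBACK_CATEGORY = "unclassified"  # hypothetical name for unmatched steps


def classify_failure(failed_step: str) -> str:
    """Option 1: derive a category from the name of the step that failed."""
    name = failed_step.lower()
    for fragment, category in STEP_RULES:
        if fragment in name:
            return category
    return FALLBACK_CATEGORY


def default_report_failure(triggers: Iterable[str]) -> bool:
    """Option 2: default to False only when `schedule` is the sole trigger."""
    return set(triggers) != {"schedule"}


def should_report(
    category: str,
    enabled: bool,
    categories: Optional[Sequence[str]] = None,
) -> bool:
    """Apply the `categories` filter; no filter means report every category."""
    if not enabled:
        return False
    return categories is None or category in categories


def plan_issue_action(
    workflow: str,
    category: str,
    open_issues: Dict[Tuple[str, str], int],
) -> Tuple[str, Optional[int]]:
    """Option 3: comment on an open issue for the same key, else create one."""
    existing = open_issues.get((workflow, category))
    if existing is not None:
        return ("comment", existing)
    return ("create", None)
```

With a filter of `[agent-error, safe-output-validation]`, a failed MCP image pull would classify as `infra-pull` and be skipped, while a failed agent step would still produce (or update) a tracking issue. In a real implementation the `open_issues` lookup would presumably be an issue search keyed on a label or a hidden marker in the issue body; that part is left out here.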
Related
Environment