Skip to content

[ci-scan] Skip stackoverflowtester under interpreter mode (refs #127899)#128737

Closed
github-actions[bot] wants to merge 2 commits into
mainfrom
ci-scan/disable-stackoverflowtester-interpreter-203b9d4d60370ea7
Closed

[ci-scan] Skip stackoverflowtester under interpreter mode (refs #127899)#128737
github-actions[bot] wants to merge 2 commits into
mainfrom
ci-scan/disable-stackoverflowtester-interpreter-203b9d4d60370ea7

Conversation

@github-actions
Copy link
Copy Markdown
Contributor

Reasoning

The stackoverflowtester test hits a StackFrameIterator assert (m_crawl.GetCodeInfo()->IsValid()) when running under the CoreCLR interpreter. The interpreter does not produce valid CodeInfo for its frames during stack overflow unwinding, causing the process to fail-fast (exit code 0xC0000602) instead of producing the expected stack overflow exit code (0xC00000FD). This is a known interpreter-mode incompatibility, not a product bug in the non-interpreter path.

Linked KBE: #127899

Match verification (from Step 4.8):

  1. Same test/family: yes — baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester.cmd in runtime-interpreter pipeline
  2. Same failure signature: yes — m_crawl.GetCodeInfo()->IsValid() assert followed by 0xC0000602 exit code (2 matches in failure.log)
  3. Same OS: yes — Windows (KBE lists windows-x64 and windows-arm64; current failure is windows-arm64)
  4. Same architecture: yes — arm64 (also x64 in prior builds)

Impact on platforms

  • runtime-interpreter (def 316) / windows arm64 Checked / Windows.11.Arm64.Open / interpreter mode / exit code 1
  • runtime-interpreter (def 316) / windows x64 Checked / Windows.11.Amd64.Open / interpreter mode / exit code 1

Errors log

System.Exception: Exit code: 0xC0000602, expected 0xC00000FD or 0x800703E9
   at TestStackOverflow.Program.TestStackOverflow(String testName, String testArgs, List`1& stderrLines)
   at TestStackOverflow.Program.TestStackOverflowSmallFrameSecondaryThread()
Expected: 100
Actual: 101
END EXECUTION - FAILED

First build it occurred

Linked issue

#127899


Filed by ci-failure-scan, which scans dnceng-public outer-loop pipelines on main and converts stable failures into KBEs and test-disable PRs. Comment here or on the workflow file to suggest changes; ci-failure-scan-feedback reads in-scope feedback daily and opens (or updates) a PR with prompt edits.

Note

🔒 Integrity filter blocked 1 item

The following item was blocked because it doesn't meet the GitHub integrity level.

  • #125825 search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by CI Outer-Loop Failure Scanner · ● 30.6M ·

The stackoverflowtester test hits a StackFrameIterator assert
(m_crawl.GetCodeInfo()->IsValid()) when running under the CoreCLR
interpreter, causing the process to fail-fast with 0xC0000602 instead
of the expected stack overflow exit code. This has been consistently
failing on windows-arm64 and windows-x64 in the runtime-interpreter
pipeline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @BrzVlad, @janvorli, @kg
See info in area-owners.md if you want to be subscribed.

@janvorli
Copy link
Copy Markdown
Member

This issue needs to be understood, there is no inherent reason why the codeinfo would be invalid. The cases when it occured in the CI don't have the dumps / logs available anymore, so I don't think we should disable the test without trying to understand the problem.

@BrzVlad
Copy link
Copy Markdown
Member

BrzVlad commented May 29, 2026

windows runtime tests was completely broken up until recently so these failures started being reported now. Here is the failure as I'm seeing it in the latest run:

BEGIN EXECUTION
 "C:\h\w\A15608B8\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true"  stackoverflowtester.dll 
Running stackoverflow test(smallframe main)
"Stack overflow."
""
"Assert failure(PID 6572 [0x000019ac], Thread: 7320 [0x1c98]): m_crawl.GetCodeInfo()->IsValid()"
""
"CORECLR! StackFrameIterator::NextRaw + 0x724 (0x00007ffc`4cdafcbc)"
"CORECLR! StackFrameIterator::Filter + 0xBB0 (0x00007ffc`4cdae6f0)"
"CORECLR! StackFrameIterator::Init + 0x258 (0x00007ffc`4cdaefd8)"
"CORECLR! Thread::StackWalkFramesEx + 0x178 (0x00007ffc`4cdb0d28)"
"CORECLR! Thread::StackWalkFrames + 0x12C (0x00007ffc`4cdb0b1c)"
"CORECLR! LogCallstackForLogWorker + 0x19C (0x00007ffc`4ce45234)"
"CORECLR! LogStackOverflowStackTraceThread + 0x10 (0x00007ffc`4ce45e70)"
"KERNEL32! BaseThreadInitThunk + 0x40 (0x00007ffc`c3028740)"
"<no module>! <no symbol> + 0x0 (0x1a3f7ffc`c6b24714)"
"    File: D:\a\_work\1\s\src\coreclr\vm\stackwalk.cpp:2289"
"    Image: C:\h\w\A15608B8\p\corerun.exe"
""
""
System.Exception: Exit code: 0xC0000602, expected 0xC00000FD or 0x800703E9
   at TestStackOverflow.Program.TestStackOverflow(String testName, String testArgs, List`1& stderrLines)
   at TestStackOverflow.Program.TestStackOverflowSmallFrameMainThread()
   at __GeneratedMainWrapper.Main()
Expected: 100
Actual: 101
END EXECUTION - FAILED

I'm also very aggressive with fixing these pipelines without disabling any tests. I would normally take a look at this next week unless you want to look into this @janvorli

@janvorli
Copy link
Copy Markdown
Member

@BrzVlad I'll try to take a look today.

kotlarmilos added a commit that referenced this pull request Jun 1, 2026
…enced-issue checks (#128760)

## Description

Three prompt edits to `.github/workflows/ci-failure-scan.md` proposed by
the ci-failure-scan-feedback meta-loop in #128755. Step 4.7 now also
reads the body and latest 5 comments of any issue referenced in the KBE,
so maintainer "do not disable" signals on the root-cause issue override
the KBE candidate (motivated by #128737). The `ActiveIssue` section
gains the full 4-parameter overload `(string, TestPlatforms,
TargetFrameworkMonikers, TestRuntimes)` with an explicit warning not to
place a `TestRuntimes` value in the `TargetFrameworkMonikers` slot
(motivated by #128469, which failed to compile from that exact swap). A
short pre-emit compile-validation checklist is added for test-disable
PRs. No `gh aw compile` needed: the scanner workflow loads this file at
runtime via `{{#runtime-import}}`. Closes #128755.

Co-authored-by: Milos Kotlar <miloskotlar@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kotlarmilos added a commit that referenced this pull request Jun 1, 2026
…signal (#128837)

## Description

Extends Step 4.7 of `.github/workflows/ci-failure-scan.md` so the CI
failure scanner respects maintainer closes of prior `[ci-scan]`
test-disable PRs. A `MEMBER`/`OWNER` close of a recent (within 30 days)
test-disable PR for the same test or KBE is now treated as a
do-not-disable signal, and re-filing requires fresh evidence such as a
new maintainer comment on the KBE greenlighting the disable or a clearly
different failure signature.

PR #128793 re-filed the `stackoverflowtester` interpreter-mode disable
that @BrzVlad had closed the day before in PR #128737 with a pushback to
investigate the assert rather than mute the test. The earlier prompt fix
in PR #128760 did not catch this case for two reasons:
1. It was merged on 2026-06-01, after the scanner run on 2026-05-30 that
produced PR #128793.
2. Even if applied, it only inspects the KBE issue body and the bodies
of issues referenced from it. KBE #127899 has no maintainer comments.
The do-not-disable signal lived on closed PR #128737, not on the KBE.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kotlarmilos added a commit to kotlarmilos/runtime that referenced this pull request Jun 1, 2026
Rewrites Step 7 of ci-failure-scan-feedback.md to produce a smaller,
plainer tracker focused on what maintainers actually want to see:
opened/closed counts, a single wrong-closure rate, and a red/green
table of outage signals on the analyzed CI.

Dropped:
- Acceptance classification with Wilson 95% lower bound. The bucket
  rules were opinionated and the Wilson math was overkill for an
  N=27 sample. Replaced by a flat 'wrong-closure rate' in the
  Quality block.
- 90-day window. Identical to 30d while the scanner is < 30 days
  old; will become useful later but adds rows now.
- CI environment section sourced from agent-log tally extraction.
  This rendered as '— (tally extraction unavailable)' most ticks
  because the parser was fragile; drop it rather than ship n/a.
- Coverage and time-to-KBE percentiles. Also rendered as n/a.
- Per-artifact rubric scorecard table in the agent log.

Kept:
- Tracker bootstrap, window-start caching, in-scope search,
  integrity-gated comment reads.

Added: an explicit Outage signals table reflecting the health of
the CI being monitored (not the scanner workflow). Five signals
with fixed thresholds and a 🔴/🟢 status icon:
- New-KBE burst (day > 2x trailing 30d median)
- Build-break spike (>= 2 in any 24h)
- Multi-pipeline outage (>= 3 distinct pipelines in 24h)
- KBE re-filed after maintainer close (any in 7d) — would have
  caught dotnet#128737/dotnet#128793
- Wrong-closure rate >= 30% with N>=10

Red rows must append a 'details:' line citing the offending
artifact so maintainers can jump straight to it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kotlarmilos added a commit that referenced this pull request Jun 1, 2026
## Description

Rewrites Step 7 of `.github/workflows/ci-failure-scan-feedback.md` to
produce a smaller, plainer KPI tracker (#128742) focused on what
maintainers actually want to see at a glance.

## Motivation

Current tracker headline metrics are overengineered for a scanner that's
< 30 days old and mostly render as `n/a` (tally extraction unavailable,
distinct_signatures < 10, etc.). Maintainers want simple counts plus a
clear signal when CI itself is degrading. Per direct feedback: simple
measures for now, plus signals for bigger outages.

## What changes

### Removed
- Acceptance classification with Wilson 95% lower bound. The bucket
rules were opinionated and the Wilson math is overkill at N=27. Replaced
by a flat wrong-closure rate.
- 90-day window. Identical to 30d while the scanner is < 30 days old.
- CI environment section sourced from agent-log tally extraction.
Renders as `— (tally extraction unavailable)` most ticks; dropped rather
than shipped as n/a.
- Coverage ratio and time-to-KBE percentiles. Also n/a most ticks.
- Per-artifact rubric scorecard table appended to the agent log.

### Kept
- Tracker bootstrap, window-start caching, in-scope search,
integrity-gated comment reads.

### Added: Outage signals (analyzed CI)
Five signals with fixed thresholds and a 🔴/🟢 status icon. These reflect
the health of the **CI being monitored**, not the scanner workflow
itself:

| signal | threshold |
|---|---|
| New-KBE burst | day > 2x trailing 30d median (min 3) |
| Build-break spike | >= 2 in any 24h |
| Multi-pipeline outage | >= 3 distinct pipelines in 24h |
| KBE re-filed after maintainer close | any in 7d |
| Wrong-closure rate | >= 30% with N>=10 |

The re-file signal would have flagged the #128737 / #128793 episode.
When a row is 🔴, the tracker appends a `details:` line citing the
offending artifact so maintainers can jump straight to it.

## New body shape

```
## Snapshot — <UTC timestamp>

### Activity (last 7d)
| artifact | opened | closed (good) | closed (wrong) |
| Issues   | …      | completed     | not_planned/duplicate |
| PRs      | …      | merged        | closed unmerged       |

### Quality (last 30d)
| metric                        | count | rate |
| Total artifacts opened        |       | —    |
| Wrong closures                |       | %    |
| Maintainer rejection comments |       | —    |
| Duplicate KBEs                |       | —    |

### Outage signals (analyzed CI)
| signal                                          | threshold                | 24h | 7d | status |
| New-KBE burst                                   | day > 2x 30d median (min 3) |     |    |        |
| Build-break spike                               | >= 2 in any 24h          |     |    |        |
| Multi-pipeline outage                           | >= 3 distinct in 24h     |     |    |        |
| KBE re-filed after maintainer close             | any in 7d                |     |    |        |
| Wrong-closure rate (30d)                        | >= 30% with N>=10        | —   |    |        |
```

cc @kg

> [!NOTE]
> This PR was authored with assistance from GitHub Copilot CLI.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kotlarmilos added a commit that referenced this pull request Jun 2, 2026
…line dedup (#128840)

Two precision fixes to the CI Outer-Loop Failure Scanner prompt
(`.github/workflows/ci-failure-scan.md`), driven by observed scanner
misbehavior: a test-disable PR rejected by maintainers who wanted to
investigate (PR #128737), and a duplicate KBE filed for the same
signature under two pipeline definitions (#128697, dup of #128531).

### Step 4.7 — Stronger do-not-disable detection
- Added six rejection phrases to the MEMBER/OWNER phrase list: `trying
to understand`, `without disabling`, `i'm fixing`, `i am fixing`,
`understand the problem`, `root cause`.
- Added a heuristic catch-all so the scanner skips a test-disable when
any MEMBER/OWNER comment reads as investigation or fix-forward intent,
even when no exact phrase matches.

### Step 4.0 — Cross-definition dedup
- Added a second dedup check on the definition-independent key
`<queue>|<stress_mode>|<signature_norm>`. Since a KBE matches on
signature text regardless of pipeline definition, the same signature
surfacing under different definition IDs in one run is now skipped as a
cross-def dup instead of filed as a second KBE.
- Clarified scope: cross-def dedup applies to KBE filing (Branch A); a
distinct test-disable PR per definition is still permitted.

Markdown-only change to the workflow prompt. The body is read from the
repo at runtime and is not embedded in `ci-failure-scan.lock.yml`, and
frontmatter is unchanged, so the generated lock file is unaffected.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kotlarmilos <11523312+kotlarmilos@users.noreply.github.com>
Co-authored-by: Milos Kotlar <kotlarmilos@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Milos Kotlar <115233127+kotlarmilos@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants