Skip to content

Wait for running state before deleting in lifecycle E2E test#4745

Merged
JAORMX merged 1 commit into
mainfrom
jdrouillard/deflake-workload-lifecycle
Apr 13, 2026
Merged

Wait for running state before deleting in lifecycle E2E test#4745
JAORMX merged 1 commit into
mainfrom
jdrouillard/deflake-workload-lifecycle

Conversation

@jerm-dro
Copy link
Copy Markdown
Contributor

Summary

  • The workload lifecycle E2E test (should track workload through create-list-delete lifecycle) intermittently times out on main — 12/147 runs (8.2% failure rate), always at 120s
  • Root cause: the test deletes the workload before the proxy-runner finishes setting status to "running". The proxy-runner then overwrites the "removing" status back to "running", causing the delete verification poll to never complete
  • Add a "wait for running" poll between the list verification and delete steps, matching the pattern used by the stable "delete running workload" test

Type of change

  • Bug fix

Test plan

  • Linting (task lint-fix)
  • Manual testing (describe below)

Verified the fix compiles. The flake requires CI to reproduce — this PR should reduce the api-workloads E2E failure rate from ~8% to near zero for this test.

Special notes for reviewers

The underlying production race (proxy-runner can overwrite "removing" status with "running" during concurrent delete + startup) is a separate issue worth tracking. This PR only fixes the test-side trigger.

Generated with Claude Code

The workload lifecycle test intermittently times out (12/147 runs on
main, 8.2% failure rate) because it deletes the workload before the
proxy-runner finishes setting status to "running". The proxy-runner
then overwrites the "removing" status back to "running", causing the
delete verification to never complete.

Add a "wait for running" poll between the list verification and delete
steps, matching the pattern used by the stable "delete running
workload" test. This closes the race window between proxy-runner
startup and deletion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jerm-dro added a commit that referenced this pull request Apr 10, 2026
Adds a /deflake skill that analyzes GitHub Actions failures on main to
discover, rank, and plan fixes for flaky tests. The skill includes a
Python collection script that deterministically fetches failed run logs
in parallel, extracts test names from Ginkgo and gotestfmt output, and
aggregates failures into a ranked report.

Used this skill to identify and fix the #1 flake (workload lifecycle
E2E test, 12/147 runs) in #4745.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/XS Extra small PR: < 100 lines changed label Apr 10, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.76%. Comparing base (3d7b916) to head (394fe0b).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4745      +/-   ##
==========================================
+ Coverage   68.75%   68.76%   +0.01%     
==========================================
  Files         516      516              
  Lines       54229    54307      +78     
==========================================
+ Hits        37283    37343      +60     
- Misses      14099    14110      +11     
- Partials     2847     2854       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — the added poll correctly gates deletion on WorkloadStatusRunning, directly fixing the documented 8.2% flake.

Extended reasoning...

Overview

Single-file change to test/e2e/api_workloads_test.go: adds 12 lines inserting a "Step 3.5" Eventually poll that waits for WorkloadStatusRunning before the delete step in the lifecycle test.

Security risks

None — this is an E2E test file with no production code changes.

Level of scrutiny

Low. The change is mechanical, isolated to a test, and follows the exact same polling pattern used elsewhere in the same file (e.g., the "should successfully delete stopped workload" test). The fix is a textbook solution to the described race condition.

Other factors

The inline bug comment flags a pre-existing issue in a different test ("should successfully delete running workload") that was not introduced by this PR and is out of scope for this fix. The PR description already acknowledges the underlying production race as a separate tracked issue. The fix itself is correct and sufficient for its stated goal.

Comment thread test/e2e/api_workloads_test.go
jerm-dro added a commit that referenced this pull request Apr 10, 2026
* Add deflake skill for finding and fixing flaky tests

Adds a /deflake skill that analyzes GitHub Actions failures on main to
discover, rank, and plan fixes for flaky tests. The skill includes a
Python collection script that deterministically fetches failed run logs
in parallel, extracts test names from Ginkgo and gotestfmt output, and
aggregates failures into a ranked report.

Used this skill to identify and fix the #1 flake (workload lifecycle
E2E test, 12/147 runs) in #4745.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Address review feedback on collect-flakes script

- Extract per-test log context (50 lines before the failure marker)
  before classifying failure mode, so tests in the same run get
  accurate individual mode labels instead of all inheriting the
  first match from the full run log
- Add try/except around future.result() so one failed run doesn't
  crash the script and lose all collected data
- Fix misleading comment about MAX_PAGES covering 300 Main build
  runs — the API returns all workflows' runs, not just Main build

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix per-test failure mode extraction in collect-flakes

The previous attempt to extract per-test failure context used a 50-line
window before the [FAIL] summary line, but Ginkgo's [FAILED] reason
line (e.g., "Timed out after 120s") can appear thousands of lines
earlier. Also needed ANSI stripping when searching for [FAILED] markers.

Now searches backwards from the [FAIL] summary to find all [FAILED]
lines in the failure block, uses the earliest one (which has the
actual failure reason), and extracts context spanning all of them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX merged commit b79c6bf into main Apr 13, 2026
40 checks passed
@JAORMX JAORMX deleted the jdrouillard/deflake-workload-lifecycle branch April 13, 2026 05:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants