Wait for running state before deleting in lifecycle E2E test by jerm-dro · Pull Request #4745 · stacklok/toolhive

jerm-dro · 2026-04-10T20:06:04Z

Summary

The workload lifecycle E2E test (should track workload through create-list-delete lifecycle) intermittently times out on main — 12/147 runs (8.2% failure rate), always at 120s
Root cause: the test deletes the workload before the proxy-runner finishes setting status to "running". The proxy-runner then overwrites the "removing" status back to "running", causing the delete verification poll to never complete
Add a "wait for running" poll between the list verification and delete steps, matching the pattern used by the stable "delete running workload" test

Type of change

Bug fix

Test plan

Linting (task lint-fix)
Manual testing (describe below)

Verified the fix compiles. The flake requires CI to reproduce — this PR should reduce the api-workloads E2E failure rate from ~8% to near zero for this test.

Special notes for reviewers

The underlying production race (proxy-runner can overwrite "removing" status with "running" during concurrent delete + startup) is a separate issue worth tracking. This PR only fixes the test-side trigger.

Generated with Claude Code

The workload lifecycle test intermittently times out (12/147 runs on main, 8.2% failure rate) because it deletes the workload before the proxy-runner finishes setting status to "running". The proxy-runner then overwrites the "removing" status back to "running", causing the delete verification to never complete. Add a "wait for running" poll between the list verification and delete steps, matching the pattern used by the stable "delete running workload" test. This closes the race window between proxy-runner startup and deletion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds a /deflake skill that analyzes GitHub Actions failures on main to discover, rank, and plan fixes for flaky tests. The skill includes a Python collection script that deterministically fetches failed run logs in parallel, extracts test names from Ginkgo and gotestfmt output, and aggregates failures into a ranked report. Used this skill to identify and fix the #1 flake (workload lifecycle E2E test, 12/147 runs) in #4745. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-04-10T20:12:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.76%. Comparing base (3d7b916) to head (394fe0b).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4745      +/-   ##
==========================================
+ Coverage   68.75%   68.76%   +0.01%     
==========================================
  Files         516      516              
  Lines       54229    54307      +78     
==========================================
+ Hits        37283    37343      +60     
- Misses      14099    14110      +11     
- Partials     2847     2854       +7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude

LGTM — the added poll correctly gates deletion on WorkloadStatusRunning, directly fixing the documented 8.2% flake.

Extended reasoning...

Overview

Single-file change to test/e2e/api_workloads_test.go: adds 12 lines inserting a "Step 3.5" Eventually poll that waits for WorkloadStatusRunning before the delete step in the lifecycle test.

Security risks

None — this is an E2E test file with no production code changes.

Level of scrutiny

Low. The change is mechanical, isolated to a test, and follows the exact same polling pattern used elsewhere in the same file (e.g., the "should successfully delete stopped workload" test). The fix is a textbook solution to the described race condition.

Other factors

The inline bug comment flags a pre-existing issue in a different test ("should successfully delete running workload") that was not introduced by this PR and is out of scope for this fix. The PR description already acknowledges the underlying production race as a separate tracked issue. The fix itself is correct and sufficient for its stated goal.

* Add deflake skill for finding and fixing flaky tests Adds a /deflake skill that analyzes GitHub Actions failures on main to discover, rank, and plan fixes for flaky tests. The skill includes a Python collection script that deterministically fetches failed run logs in parallel, extracts test names from Ginkgo and gotestfmt output, and aggregates failures into a ranked report. Used this skill to identify and fix the #1 flake (workload lifecycle E2E test, 12/147 runs) in #4745. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address review feedback on collect-flakes script - Extract per-test log context (50 lines before the failure marker) before classifying failure mode, so tests in the same run get accurate individual mode labels instead of all inheriting the first match from the full run log - Add try/except around future.result() so one failed run doesn't crash the script and lose all collected data - Fix misleading comment about MAX_PAGES covering 300 Main build runs — the API returns all workflows' runs, not just Main build Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix per-test failure mode extraction in collect-flakes The previous attempt to extract per-test failure context used a 50-line window before the [FAIL] summary line, but Ginkgo's [FAILED] reason line (e.g., "Timed out after 120s") can appear thousands of lines earlier. Also needed ANSI stripping when searching for [FAILED] markers. Now searches backwards from the [FAIL] summary to find all [FAILED] lines in the failure block, uses the earliest one (which has the actual failure reason), and extracts context spanning all of them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jerm-dro requested review from ChrisJBurns, JAORMX, amirejaz, jhrozek, lujunsan, rdimitrov and yrobla as code owners April 10, 2026 20:06

jerm-dro mentioned this pull request Apr 10, 2026

Add deflake skill for finding and fixing flaky tests #4746

Merged

2 tasks

github-actions Bot added the size/XS Extra small PR: < 100 lines changed label Apr 10, 2026

claude Bot reviewed Apr 10, 2026

View reviewed changes

Comment thread test/e2e/api_workloads_test.go

rdimitrov approved these changes Apr 12, 2026

View reviewed changes

JAORMX merged commit b79c6bf into main Apr 13, 2026
40 checks passed

JAORMX deleted the jdrouillard/deflake-workload-lifecycle branch April 13, 2026 05:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wait for running state before deleting in lifecycle E2E test#4745

Wait for running state before deleting in lifecycle E2E test#4745
JAORMX merged 1 commit into
mainfrom
jdrouillard/deflake-workload-lifecycle

jerm-dro commented Apr 10, 2026

Uh oh!

codecov Bot commented Apr 10, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jerm-dro commented Apr 10, 2026

Summary

Type of change

Test plan

Special notes for reviewers

Uh oh!

codecov Bot commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Apr 10, 2026 •

edited

Loading