Fix flaky oximeter-collector agent tests by mergeconflict · Pull Request #10040 · oxidecomputer/omicron

mergeconflict · 2026-03-12T00:38:49Z

Fixes #8636 (and likely the same root cause as #7255 and #7220).

Three agent tests have been reported as flaky on loaded CI machines: test_self_stat_collection_count (#8636, #4657), test_self_stat_error_counter (#7255), and verify_producer_details (#7220). A fourth, test_self_stat_unreachable_counter, uses the same pattern and was likely susceptible too.

They all shared the same approach: pause tokio time, advance by a fixed TEST_WAIT_PERIOD, then assert. On slow machines, the collection task couldn't keep up with the timer: its bounded channel (capacity 1) would overflow when the next tick fired before the previous collection finished, recording spurious CollectionsInProgress failures.

The fix introduces advance_n_collections(), which advances paused time in small increments (TICK_INTERVAL), using a watch channel to wait for each collection to complete before moving on to the next. This guarantees the channel is drained before the next timer tick fires, so collections never overlap.

Because collections can no longer overlap, test_self_stat_error_counter gets substantially simpler: the old code had to tolerate both CollectionsInProgress and 500 failure reasons (summing across them) and allowed +1 slack between the server-side and task-side counts. Now it can assert exactly N_COLLECTIONS errors of a single type and exact count equality.

test_updated_producer_is_still_collected_from also gets a more precise wait loop after re-registration: instead of advancing time by a fixed amount and hoping a collection happened, it watches for the server-side collection_count to confirm an actual collection from the new server (as opposed to just a producer info update from re-registration).

Also adds CollectionTaskHandle::details_watcher() (#[cfg(test)]) to expose the watch receiver.

bnaecker

I think this is an improvement, but I agree with John that it's still possible we'll trip the failed_collections.is_empty() assertion. I'm not sure what to do about that, and we might want to simply remove it.

Replace blind simulated-time advancement with condition-based waiting in four tests that were flaky on loaded CI machines (issue #8636). The root cause was that the collection task's bounded channel (capacity 1) would overflow when the timer fired before the previous collection completed, recording spurious FailedCollection entries. The new advance_until() helper advances paused tokio time in small increments, checking a condition each iteration, which gives the runtime a chance to process pending work between timer firings.

The previous advance_until approach still allowed the bounded timer channel to overflow, because it polled an external condition without knowing whether the collection task had actually drained the channel. Replace it with advance_n_collections, which subscribes to the collection task's watch channel and only advances to the next tick after the current collection signals completion. Also strengthen assertions in test_self_stat_error_counter to require exactly one failure type (500) with exactly N_COLLECTIONS count.

bnaecker

This is great! I like the new approach of taking one collection at a time.

Following on from #10040. Switch the test to being level-triggered, and don't rely on paused time because we do real I/O which would introduce a second time domain. Identified and fixed by Claude based on the flake-patterns doc I wrote in #10577. Fixes #7255. Fixes #8636.

mergeconflict requested review from bnaecker and jgallagher March 12, 2026 00:39

jgallagher reviewed Mar 13, 2026

View reviewed changes

Comment thread oximeter/collector/src/agent.rs Outdated

bnaecker approved these changes Mar 13, 2026

View reviewed changes

mergeconflict force-pushed the mergeconflict/deflake-oximeter-collector-agent-tests branch from a7e2342 to fb5c399 Compare March 13, 2026 18:35

mergeconflict force-pushed the mergeconflict/deflake-oximeter-collector-agent-tests branch from fb5c399 to 7c2eade Compare March 13, 2026 18:39

mergeconflict requested review from bnaecker and jgallagher March 13, 2026 18:42

bnaecker reviewed Mar 17, 2026

View reviewed changes

Address review feedback on oximeter-collector test deflake

2d2e237

mergeconflict enabled auto-merge (squash) March 17, 2026 18:52

mergeconflict merged commit 9b5d906 into main Mar 17, 2026
16 checks passed

mergeconflict deleted the mergeconflict/deflake-oximeter-collector-agent-tests branch March 17, 2026 19:27

jmcarp mentioned this pull request Mar 17, 2026

Add per-zone cpu metrics. #9968

Merged

sunshowers mentioned this pull request Jun 10, 2026

[oximeter] really fix race in test_self_stat_collection_count #10585

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky oximeter-collector agent tests#10040

Fix flaky oximeter-collector agent tests#10040
mergeconflict merged 3 commits into
mainfrom
mergeconflict/deflake-oximeter-collector-agent-tests

mergeconflict commented Mar 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

bnaecker left a comment

Uh oh!

bnaecker left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mergeconflict commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

bnaecker left a comment

Choose a reason for hiding this comment

Uh oh!

bnaecker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mergeconflict commented Mar 12, 2026 •

edited

Loading