Skip to content

Fix flaky oximeter-collector agent tests#10040

Merged
mergeconflict merged 3 commits into
mainfrom
mergeconflict/deflake-oximeter-collector-agent-tests
Mar 17, 2026
Merged

Fix flaky oximeter-collector agent tests#10040
mergeconflict merged 3 commits into
mainfrom
mergeconflict/deflake-oximeter-collector-agent-tests

Conversation

@mergeconflict

@mergeconflict mergeconflict commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Fixes #8636 (and likely the same root cause as #7255 and #7220).

Three agent tests have been reported as flaky on loaded CI machines: test_self_stat_collection_count (#8636, #4657), test_self_stat_error_counter (#7255), and verify_producer_details (#7220). A fourth, test_self_stat_unreachable_counter, uses the same pattern and was likely susceptible too.

They all shared the same approach: pause tokio time, advance by a fixed TEST_WAIT_PERIOD, then assert. On slow machines, the collection task couldn't keep up with the timer: its bounded channel (capacity 1) would overflow when the next tick fired before the previous collection finished, recording spurious CollectionsInProgress failures.

The fix introduces advance_n_collections(), which advances paused time in small increments (TICK_INTERVAL), using a watch channel to wait for each collection to complete before moving on to the next. This guarantees the channel is drained before the next timer tick fires, so collections never overlap.

Because collections can no longer overlap, test_self_stat_error_counter gets substantially simpler: the old code had to tolerate both CollectionsInProgress and 500 failure reasons (summing across them) and allowed +1 slack between the server-side and task-side counts. Now it can assert exactly N_COLLECTIONS errors of a single type and exact count equality.

test_updated_producer_is_still_collected_from also gets a more precise wait loop after re-registration: instead of advancing time by a fixed amount and hoping a collection happened, it watches for the server-side collection_count to confirm an actual collection from the new server (as opposed to just a producer info update from re-registration).

Also adds CollectionTaskHandle::details_watcher() (#[cfg(test)]) to expose the watch receiver.

Comment thread oximeter/collector/src/agent.rs Outdated

@bnaecker bnaecker left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an improvement, but I agree with John that it's still possible we'll trip the failed_collections.is_empty() assertion. I'm not sure what to do about that, and we might want to simply remove it.

Replace blind simulated-time advancement with condition-based waiting
in four tests that were flaky on loaded CI machines (issue #8636).

The root cause was that the collection task's bounded channel
(capacity 1) would overflow when the timer fired before the previous
collection completed, recording spurious FailedCollection entries.

The new advance_until() helper advances paused tokio time in small
increments, checking a condition each iteration, which gives the
runtime a chance to process pending work between timer firings.
@mergeconflict mergeconflict force-pushed the mergeconflict/deflake-oximeter-collector-agent-tests branch from a7e2342 to fb5c399 Compare March 13, 2026 18:35
The previous advance_until approach still allowed the bounded timer
channel to overflow, because it polled an external condition without
knowing whether the collection task had actually drained the channel.

Replace it with advance_n_collections, which subscribes to the
collection task's watch channel and only advances to the next tick
after the current collection signals completion. Also strengthen
assertions in test_self_stat_error_counter to require exactly one
failure type (500) with exactly N_COLLECTIONS count.
@mergeconflict mergeconflict force-pushed the mergeconflict/deflake-oximeter-collector-agent-tests branch from fb5c399 to 7c2eade Compare March 13, 2026 18:39

@bnaecker bnaecker left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I like the new approach of taking one collection at a time.

Comment thread oximeter/collector/src/agent.rs
Comment thread oximeter/collector/src/agent.rs Outdated
Comment thread oximeter/collector/src/agent.rs Outdated
Comment thread oximeter/collector/src/agent.rs Outdated
Comment thread oximeter/collector/src/agent.rs Outdated
Comment thread oximeter/collector/src/agent.rs Outdated
@mergeconflict mergeconflict enabled auto-merge (squash) March 17, 2026 18:52
@mergeconflict mergeconflict merged commit 9b5d906 into main Mar 17, 2026
16 checks passed
@mergeconflict mergeconflict deleted the mergeconflict/deflake-oximeter-collector-agent-tests branch March 17, 2026 19:27
sunshowers added a commit that referenced this pull request Jun 11, 2026
Following on from #10040.

Switch the test to being level-triggered, and don't rely on paused time
because we do real I/O which would introduce a second time domain.

Identified and fixed by Claude based on the flake-patterns doc I wrote
in #10577.

Fixes #7255.

Fixes #8636.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test failed in CI: test_self_stat_collection_count

3 participants