Fix flaky oximeter-collector agent tests #10040
Merged
mergeconflict merged 3 commits on Mar 17, 2026
jgallagher
reviewed
Mar 13, 2026
bnaecker
approved these changes
Mar 13, 2026
bnaecker
left a comment
Collaborator
I think this is an improvement, but I agree with John that it's still possible we'll trip the failed_collections.is_empty() assertion. I'm not sure what to do about that, and we might want to simply remove it.
Replace blind simulated-time advancement with condition-based waiting in four tests that were flaky on loaded CI machines (issue #8636). The root cause was that the collection task's bounded channel (capacity 1) would overflow when the timer fired before the previous collection completed, recording spurious FailedCollection entries. The new advance_until() helper advances paused tokio time in small increments, checking a condition each iteration, which gives the runtime a chance to process pending work between timer firings.
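The overflow described in this commit message can be reproduced in a small, single-threaded model. This is only a sketch under stated assumptions: it uses the standard library's std::sync::mpsc::sync_channel(1) in place of the collector's own channel, and count_overflows and drain_every are hypothetical names that do not appear in the oximeter code. A "tick" is a try_send, and a "completed collection" is a try_recv that drains the channel.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Fire `ticks` timer ticks into a capacity-1 channel and return how many
/// of them were rejected because the channel was still full.
///
/// `drain_every` models how slow the consumer is (hypothetical parameter):
/// the channel is drained only after every `drain_every`-th tick.
fn count_overflows(ticks: usize, drain_every: usize) -> usize {
    // Capacity 1: at most one pending tick can be buffered.
    let (tx, rx) = sync_channel::<usize>(1);
    let mut failed = 0;

    for tick in 0..ticks {
        match tx.try_send(tick) {
            Ok(()) => {}
            // The previous tick has not been consumed yet. In this model,
            // this is the event that would be recorded as a spurious failure.
            Err(TrySendError::Full(_)) => failed += 1,
            Err(TrySendError::Disconnected(_)) => unreachable!("receiver is alive"),
        }

        // The consumer only gets a chance to run on some ticks.
        if (tick + 1) % drain_every == 0 {
            let _ = rx.try_recv();
        }
    }

    failed
}
```

With drain_every set to 1 the consumer keeps up and no tick is rejected. With drain_every set to 2, every second tick finds the channel full, which mirrors the timer firing before the previous collection completed.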
a7e2342 to fb5c399
The previous advance_until approach still allowed the bounded timer channel to overflow, because it polled an external condition without knowing whether the collection task had actually drained the channel. Replace it with advance_n_collections, which subscribes to the collection task's watch channel and only advances to the next tick after the current collection signals completion. Also strengthen assertions in test_self_stat_error_counter to require exactly one failure type (500) with exactly N_COLLECTIONS count.
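The one-collection-at-a-time idea can be sketched with standard-library threads and channels. This is an illustrative model, not the code from this PR: run_lockstep is a hypothetical name, a plain mpsc channel stands in for the tokio watch channel, and sending a tick stands in for advancing paused time. The point it shows is that the driver sends the next tick only after the worker has signalled that the previous one completed, so the capacity-1 channel is always empty when the next tick arrives.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};
use std::thread;

/// Drive `n` ticks through a capacity-1 channel in lockstep with a worker.
/// Returns (last completion count reported by the worker, rejected ticks).
fn run_lockstep(n: usize) -> (usize, usize) {
    // Capacity-1 tick channel, standing in for the bounded timer channel.
    let (tick_tx, tick_rx) = sync_channel::<()>(1);
    // Completion signal, standing in for the watch channel (assumption).
    let (done_tx, done_rx) = channel::<usize>();

    let worker = thread::spawn(move || {
        let mut completed = 0;
        // Taking a tick off the channel drains it; only then is the
        // completion published.
        while tick_rx.recv().is_ok() {
            completed += 1;
            if done_tx.send(completed).is_err() {
                break;
            }
        }
    });

    let mut failed = 0;
    let mut last_seen = 0;
    for _ in 0..n {
        match tick_tx.try_send(()) {
            // Block until this collection is reported complete before
            // firing the next tick.
            Ok(()) => last_seen = done_rx.recv().expect("worker exited early"),
            Err(TrySendError::Full(_)) => failed += 1,
            Err(TrySendError::Disconnected(_)) => panic!("worker exited early"),
        }
    }

    // Closing the tick channel lets the worker loop end.
    drop(tick_tx);
    worker.join().expect("worker panicked");
    (last_seen, failed)
}
```

Because the worker removes the tick from the channel before it reports completion, the driver's next try_send always finds the channel empty. The rejected count therefore stays at zero however slowly the worker thread is scheduled.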
fb5c399 to 7c2eade
bnaecker
reviewed
Mar 17, 2026
bnaecker
left a comment
Collaborator
This is great! I like the new approach of taking one collection at a time.
Fixes #8636 (and likely the same root cause as #7255 and #7220).

Three agent tests have been reported as flaky on loaded CI machines: test_self_stat_collection_count (#8636, #4657), test_self_stat_error_counter (#7255), and verify_producer_details (#7220). A fourth, test_self_stat_unreachable_counter, uses the same pattern and was likely susceptible too.

They all shared the same approach: pause tokio time, advance by a fixed TEST_WAIT_PERIOD, then assert. On slow machines, the collection task couldn't keep up with the timer: its bounded channel (capacity 1) would overflow when the next tick fired before the previous collection finished, recording spurious CollectionsInProgress failures.

The fix introduces advance_n_collections(), which advances paused time in small increments (TICK_INTERVAL), using a watch channel to wait for each collection to complete before moving on to the next. This guarantees the channel is drained before the next timer tick fires, so collections never overlap.

Because collections can no longer overlap, test_self_stat_error_counter gets substantially simpler: the old code had to tolerate both CollectionsInProgress and 500 failure reasons (summing across them) and allowed +1 slack between the server-side and task-side counts. Now it can assert exactly N_COLLECTIONS errors of a single type and exact count equality.

test_updated_producer_is_still_collected_from also gets a more precise wait loop after re-registration: instead of advancing time by a fixed amount and hoping a collection happened, it watches for the server-side collection_count to confirm an actual collection from the new server (as opposed to just a producer info update from re-registration).

Also adds CollectionTaskHandle::details_watcher() (#[cfg(test)]) to expose the watch receiver.