
rls: Fix flaky test TestPickerUpdateOnDataCacheSizeDecrease#8856

Merged
easwars merged 2 commits into grpc:master from easwars:TestPickerUpdateOnDataCacheSizeDecrease
Jan 29, 2026

Conversation


@easwars easwars commented Jan 22, 2026

Fixes #7800

What was the test doing earlier?

  • The test set up a top-level LB policy that wrapped the RLS policy and appended state updates from the latter to a slice that the test could inspect.
  • The test performed some RPCs, which led to some cache entries being created.
  • The test then counted the number of state updates received from the RLS LB policy.
  • Finally, it updated the cache size in the config and expected exactly one more update from the RLS LB policy.

Why was it flaking earlier?

  • Every time the RLS LB policy sends an RLS request and receives a response, it defers a closure to send a new picker. This is by design.
  • Every time the RLS LB policy creates a new pick_first child and pushes its configuration, the latter sends a state update of Connecting.
  • Subsequently, when the subchannel created by the pick_first child moves to Ready, it sends a state update of Ready.
  • While the state updates from the child policy are pushed onto a channel and a separate goroutine handles them in order, they can happen concurrently with the deferred closure (run as part of handling the RLS response).
  • When the RLS LB policy creates its first child and successfully handles an RPC, we could see state updates of Connecting, Connecting, Ready or Connecting, Ready, Ready, depending on when the deferred closure runs.
  • When the RLS LB policy creates its second child and successfully handles an RPC, we would see updates of Ready, Ready, Ready. But if the deferred closure ran last, the dataplane RPC could have completed before it, so the test's logic of counting the state updates seen before a config change and then expecting exactly one more starts to flake.

What is the fix?

  • Instead of counting the number of state updates before a config change and expecting exactly one more, this PR changes the top-level LB policy to push state updates onto a channel, and changes the test logic to consume all the updates expected from each operation before moving on to the next.
  • And because the actual state updates seen are not deterministic (as explained above), the test logic handles them appropriately.

Verified with 100k runs on forge.

RELEASE NOTES: none

@easwars easwars added this to the 1.80 Release milestone Jan 22, 2026
@easwars easwars requested a review from arjan-bal January 22, 2026 07:28
@codecov

codecov bot commented Jan 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.38%. Comparing base (1c79a7c) to head (743b218).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8856      +/-   ##
==========================================
- Coverage   83.20%   82.38%   -0.83%     
==========================================
  Files         414      414              
  Lines       32720    32750      +30     
==========================================
- Hits        27226    26981     -245     
- Misses       4081     4106      +25     
- Partials     1413     1663     +250     

see 26 files with indirect coverage changes


Contributor

@arjan-bal arjan-bal left a comment


The fix looks good. I just have a few comments about simplifying the validation helper.

Comment on lines +812 to +855
"cacheSizeBytes": 2
"cacheSizeBytes": 1

nit: I believe the change in this line is unnecessary.

Contributor Author

@easwars easwars Jan 29, 2026


Yes, it is true that this change does not affect correctness or flakiness of the test. A picker update is sent by RLS whenever an entry with a backoff is being evicted from the cache. So, whether we set the size to 1 or 2, the entry corresponding to the first (failed) RPC will be evicted and it is this entry that has a backoff in progress. But I changed it to reflect the comment.

Thinking about it now, though, I think it is better to leave it at 2 and update the comment instead.

Comment on lines +1306 to +1309
```go
// verifyStateTransitions verifies that the given number of state updates are
// received on the channel and they match the expected sequence. Adjacent
// duplicates of the waiting state are ignored.
func verifyStateTransitions(ctx context.Context, t *testing.T, stateCh <-chan balancer.State, wantNum int, wantStates ...connectivity.State) {
```

As I understand it, this helper performs two jobs:

  1. Waiting for wantNum state updates.
  2. De-duplicating consecutive state updates.

This combination makes it a little hard to read and potentially reuse in other tests. In my opinion, it would be better if this only handled waiting for wantNum and returned the states as a slice to the caller. The caller can then do its own custom validation.

This would make the code easier to understand since the waiting and the de-duplication would happen serially. We could also use slices.Compact or CompactFunc for the de-duplication logic.

Contributor Author


Done. Thanks for the suggestion.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Jan 23, 2026
@easwars easwars merged commit 7985bb4 into grpc:master Jan 29, 2026
14 checks passed
@easwars easwars deleted the TestPickerUpdateOnDataCacheSizeDecrease branch January 29, 2026 19:10
mbissa pushed a commit to mbissa/grpc-go that referenced this pull request Feb 16, 2026

Development

Successfully merging this pull request may close these issues.

Flaky test: 24/100k: TestPickerUpdateOnDataCacheSizeDecrease