
rls: Fix flaky test TestPickerUpdateOnDataCacheSizeDecrease#8856

Merged
easwars merged 2 commits into grpc:master from easwars:TestPickerUpdateOnDataCacheSizeDecrease
Jan 29, 2026

Conversation


@easwars easwars commented Jan 22, 2026

Fixes #7800

What was the test doing earlier?

  • The test set up a top-level LB policy that wrapped the RLS policy and appended state updates from the latter to a slice that the test could inspect.
  • The test performed some RPCs, which led to some cache entries being created.
  • The test then counted the number of state updates received from the RLS LB policy.
  • Finally, it updated the cache size in the config and expected exactly one more update from the RLS LB policy.

Why was it flaking earlier?

  • Every time the RLS LB policy sends an RLS request and receives a response, it defers a closure to send a new picker. This is by design.
  • Every time the RLS LB policy creates a new pick_first child and pushes its configuration, the latter sends a state update of Connecting.
  • Subsequently, when the subchannel created by the pick_first child moves to Ready, it sends a state update of Ready.
  • While the state updates from the child policy are pushed onto a channel and a separate goroutine handles them in order, they can happen concurrently with the deferred closure (run as part of handling the RLS response).
  • When the RLS LB policy creates its first child and successfully handles an RPC, we could see state updates of Connecting, Connecting, Ready or Connecting, Ready, Ready, depending on when the deferred closure runs.
  • When the RLS LB policy creates its second child and successfully handles an RPC, we would see updates of Ready, Ready, Ready. But if the deferred closure ran last, the dataplane RPC could have completed before it, so the test's logic of counting the state updates seen before a config change and then expecting exactly one more starts to flake.

What is the fix?

  • Instead of counting the number of state updates before a config change and expecting exactly one more, this PR changes the top-level LB policy to push state updates onto a channel, and changes the test logic to consume all the updates expected from each operation before moving on to the next.
  • And because the actual state updates seen are not deterministic (as explained above), the test logic handles them appropriately.

Verified with 100k runs on forge.

RELEASE NOTES: none

@easwars easwars added this to the 1.80 Release milestone Jan 22, 2026
@easwars easwars requested a review from arjan-bal January 22, 2026 07:28
@codecov

codecov bot commented Jan 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.38%. Comparing base (1c79a7c) to head (743b218).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8856      +/-   ##
==========================================
- Coverage   83.20%   82.38%   -0.83%     
==========================================
  Files         414      414              
  Lines       32720    32750      +30     
==========================================
- Hits        27226    26981     -245     
- Misses       4081     4106      +25     
- Partials     1413     1663     +250     

see 26 files with indirect coverage changes


Contributor

@arjan-bal arjan-bal left a comment


The fix looks good. I just have a few comments about simplifying the validation helper.

Comment on lines +812 to +855
"cacheSizeBytes": 2
"cacheSizeBytes": 1

nit: I believe the change in this line is unnecessary.

Contributor Author

@easwars easwars Jan 29, 2026


Yes, it is true that this change does not affect correctness or flakiness of the test. A picker update is sent by RLS whenever an entry with a backoff is being evicted from the cache. So, whether we set the size to 1 or 2, the entry corresponding to the first (failed) RPC will be evicted and it is this entry that has a backoff in progress. But I changed it to reflect the comment.

Thinking about it now, though, I think it is better to leave it at 2 and update the comment instead.

Comment on lines +1306 to +1309
```go
// verifyStateTransitions verifies that the given number of state updates are
// received on the channel and they match the expected sequence. Adjacent
// duplicates of the waiting state are ignored.
func verifyStateTransitions(ctx context.Context, t *testing.T, stateCh <-chan balancer.State, wantNum int, wantStates ...connectivity.State) {
```

As I understand it, this helper performs two jobs:

  1. Waiting for wantNum state updates.
  2. De-duplicating consecutive state updates.

This combination makes it a little hard to read and potentially reuse in other tests. In my opinion, it would be better if this only handled waiting for wantNum and returned the states as a slice to the caller. The caller can then do its own custom validation.

This would make the code easier to understand since the waiting and the de-duplication would happen serially. We could also use slices.Compact or CompactFunc for the de-duplication logic.

Contributor Author


Done. Thanks for the suggestion.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Jan 23, 2026
@easwars easwars merged commit 7985bb4 into grpc:master Jan 29, 2026
14 checks passed
@easwars easwars deleted the TestPickerUpdateOnDataCacheSizeDecrease branch January 29, 2026 19:10
mbissa pushed a commit to mbissa/grpc-go that referenced this pull request Feb 16, 2026

Development

Successfully merging this pull request may close these issues.

Flaky test: 24/100k: TestPickerUpdateOnDataCacheSizeDecrease