rls: Fix flaky test TestPickerUpdateOnDataCacheSizeDecrease#8856
easwars merged 2 commits into grpc:master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##           master    #8856    +/-   ##
==========================================
- Coverage   83.20%   82.38%    -0.83%
==========================================
  Files         414      414
  Lines       32720    32750      +30
==========================================
- Hits        27226    26981     -245
- Misses       4081     4106      +25
- Partials     1413     1663     +250
arjan-bal left a comment:
The fix looks good. I just have a few comments about simplifying the validation helper.
balancer/rls/balancer_test.go (outdated)
-    "cacheSizeBytes": 2
+    "cacheSizeBytes": 1
nit: I believe the change in this line is unnecessary.
Yes, it is true that this change does not affect the correctness or flakiness of the test. RLS sends a picker update whenever an entry with a backoff in progress is evicted from the cache. So whether we set the size to 1 or 2, the entry corresponding to the first (failed) RPC, which is the one with a backoff in progress, will be evicted either way. I originally changed it to match the comment.
But thinking about it now, it is better to leave it at 2 and update the comment instead.
balancer/rls/balancer_test.go (outdated)
// verifyStateTransitions verifies that the given number of state updates are
// received on the channel and they match the expected sequence. Adjacent
// duplicates of the waiting state are ignored.
func verifyStateTransitions(ctx context.Context, t *testing.T, stateCh <-chan balancer.State, wantNum int, wantStates ...connectivity.State) {
As I understand it, this helper performs two jobs:
- Waiting for wantNum state updates.
- De-duplicating consecutive state updates.
This combination makes it a little hard to read and potentially to reuse in other tests. In my opinion, it would be better if this helper only handled waiting for wantNum updates and returned the states as a slice to the caller. The caller can then do its own custom validation.
This would make the code easier to understand, since the waiting and the de-duplication would happen serially. We could also use slices.Compact or slices.CompactFunc for the de-duplication logic.
Done. Thanks for the suggestion.
Fixes grpc#7800

#### What was the test doing earlier?
- The test was setting up a top-level LB policy that wrapped the RLS policy and appended state updates from the latter to a slice that could be inspected by the test.
- The test would perform some RPCs, which led to some cache entries being created.
- The test would then count the number of state updates received from the RLS LB policy.
- It would then update the cache size in the config and expect a single update from the RLS LB policy.

#### Why was it flaking earlier?
- Every time the RLS LB policy sends an RLS request and receives a response, it defers a closure to send a new picker. This is by design.
- Every time the RLS LB policy creates a new pick_first child and pushes its configuration, the latter sends a state update of Connecting.
- Subsequently, when the subchannel created by the pick_first child moves to Ready, it sends a state update of Ready.
- While the state updates from the child policy are pushed to a channel and a separate goroutine handles them in order, they can happen concurrently with the deferred closure (run as part of handling the RLS response).
- When the RLS LB policy creates its first child and successfully handles an RPC, we could see state updates of Connecting, Connecting, Ready or Connecting, Ready, Ready, depending on when the deferred closure runs.
- When the RLS LB policy creates its second child and successfully handles an RPC, we would see updates of Ready, Ready, Ready. But if the deferred closure ran last, the dataplane RPC could have completed before it, and therefore the logic used by the test, counting the number of state updates prior to making a config change and then expecting exactly one more, starts to flake.

#### What is the fix?
- Instead of counting the number of state updates prior to a config change and expecting exactly one more update, this PR changes the top-level LB policy to push state updates onto a channel, and changes the test logic to consume all the expected updates from any operation before moving on to the next.
- And because the actual state updates seen are not deterministic (as explained above), the test logic handles them appropriately.

Verified with 100k runs on forge.

RELEASE NOTES: none