xdsclient: Fix race in SetWatchExpiryTimeoutForTesting#8526
xdsclient: Fix race in SetWatchExpiryTimeoutForTesting#8526eshitachandwani merged 12 commits intogrpc:masterfrom
Conversation
|
Test will be added in #8483 along with the fix for LRS Client because the test fails for both race conditions. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #8526 +/- ##
==========================================
- Coverage 82.02% 81.98% -0.05%
==========================================
Files 412 413 +1
Lines 40465 40523 +58
==========================================
+ Hits 33191 33222 +31
- Misses 5887 5910 +23
- Partials 1387 1391 +4
🚀 New features to boost your workflow:
|
|
The expiry timeout seems like something that should only be set at the beginning of the test and not something that needs to be configured per xDS client. If so, we can have grpc-go/internal/xds/clients/xdsclient/internal/internal.go Lines 24 to 26 in fa0d658 Removing the setter would give the test control over when the variable is written, making it easier to avoid races. It also avoids surprises when the timeout being passed is not used because a cached client is returned. |
internal/xds/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Outdated
Show resolved
Hide resolved
internal/xds/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Outdated
Show resolved
Hide resolved
| func (c *XDSClient) SetWatchExpiryTimeoutForTesting(watchExpiryTimeout time.Duration) { | ||
| c.watchExpiryTimeout = watchExpiryTimeout | ||
| // with provided timeout value and returns a function to reset the timeout to the | ||
| // original value. |
There was a problem hiding this comment.
We should also mention that this function should not be called concurrently. It should be called before making any RPCs to avoid races.
| // | ||
| // Note: This function should not be called concurrently and must be called | ||
| // before any RPCs to avoid race conditions. | ||
| func SetWatchExpiryTimeoutForTesting(watchExpiryTimeout time.Duration) func() { |
There was a problem hiding this comment.
What I had in mind was to get rid of this function completely. Instead do the following:
- In
Pool.NewClientForTesting, set the value ofWatchExpiryTimeoutdefined ininternal/xds/clients/xdsclient/internal/internal.go, and revert the value as part of the returned cancel function.- Now, if you cannot do that because the value of defined in an internal package and you don't have access to it from the package where
Pool.NewClientForTestingis defined, please add aSetWatchExpiryTimeoutForTestingin the external xdsclient so that it is accessible from here. And in the docstring of that function, please specify that this method should be called before creating the xDS client.
- Now, if you cannot do that because the value of defined in an internal package and you don't have access to it from the package where
- If you end up adding the above specified function, you can even unexport the
WatchExpiryTimeoutvalue as you now have a way to write to it through a function.
With the above approach:
- Existing tests would just have to set the
WatchExpiryTimeoutfield inside ofOptionsForTestingwhen callingNewClientForTestingand will not have to deal with resetting the value, as calling the returned cancel func would end up doing that.
Please let me know if this sounds OK to you. Thanks.
There was a problem hiding this comment.
We actually did try that approach earlier, i.e. this function SetWatchExpiryTimeoutForTesting will do what it does now but will be called from Pool.NewClientForTesting and get reset in the cancel function but it is causing a race between reseting the variable and setting it in a new client impl.
There was a problem hiding this comment.
I see what you are saying. How about the following approach then:
- The generic xDS client currently takes a
Configstruct as part of itsNewfunction. See: https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/xdsconfig.go#L28- We should add a couple of fields in there:
- Watch expiry timeout. Clearly document that most users will not need to set a value for this and that we will use a default value of 15 seconds. And also maybe link to this: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist
- Stream backoff function that returns the duration to backoff when an ADS stream fails without recieving a single response from the server. Again clearly document that most users will not need to set a value for this and that we will use a default exponential backoff as specified here: https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md. Mentioning this in the docs might also be an OK thing to do, but I'm not certain of it: https://github.com/grpc/proposal/blob/master/A57-xds-client-failure-mode-behavior.md#handling-ads-stream-termination
- We should add a couple of fields in there:
- From
pool.NewClientandpool.NewClientForTesting, we need to pass the above two pieces of information topool.newRefCountedand from there tonewClientImpl - In
newClientImpl, we need to set the two new fields in the xds config passed to thexdsclient.New
With this approach, the watch expiry timeout will be truly a per-client setting and not something that will be shared with other clients within the same process.
Also, with this approach, we can get rid of the XDSClient.SetWatchExpiryTimeoutForTesting, and also the two variables defined in https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/internal/internal.go.
What do you think?
There was a problem hiding this comment.
This works , it wont cause any race since its specific for each client. I have made the change for the watchExpiryTimeout. Do we also make the change for Stream backoff , because it is being override only testwise and only in one test file and not causing any races?
| // WatchExpiryTimeout is the timeout for xDS resource watch expiry. If | ||
| // unspecified, uses the default value used in non-test code. | ||
| WatchExpiryTimeout time.Duration |
There was a problem hiding this comment.
I'd rather that we not delete this field and instead use this to set the value in the generic xDS client (either directly or through a setter provided by the latter).
db0719e to
fd6c6f8
Compare
|
Please update PR description as it becomes part of our git commit history. |
Fixes: #8525
There is a race in SetWatchExpiryTimeoutForTesting which is used to override the watch expiry timeout of XDSClient for testing. Currently it just sets the watchExpiryTimeout of the XDSClient to the provided value without a mutex each time we call NewClientForTesting which might of might not create a new XDSClient if one is already there.
Fix : Add a new field
WatchExpiryTimeoutto the xdsclient config which will now be used instead ofinternal.WatchExpiryTImeoutRELEASE NOTES: None