Adjust timeout handling in client-side operations to account for RTT variations #1793
…variations. JAVA-5375
```
# Conflicts:
#	driver-reactive-streams/src/test/functional/com/mongodb/reactivestreams/client/ClientSideOperationTimeoutProseTest.java
#	driver-sync/src/test/functional/com/mongodb/client/AbstractClientSideOperationsTimeoutProseTest.java
#	driver-sync/src/test/functional/com/mongodb/client/unified/UnifiedTestModifications.java
```
```java
List<CommandStartedEvent> commandStartedEvents = commandListener.getCommandStartedEvents();
assertCommandStartedEventsInOder(Arrays.asList("aggregate", "getMore", "getMore", "getMore", "killCursors"),
        commandStartedEvents);
assertOnlyOneCommandTimeoutFailure("getMore");
```
There is no guarantee that the last getMore command will fail with a timeout. If maxAwaitTimeMS is higher than the remaining overall timeout, the driver uses the remaining timeout as maxTimeMS.
This means a response can arrive right before expiration (especially when RTT is near 0), and the driver can throw a timeout on the next operation before sending the next getMore command.
Example from logs:
* [2025/12/30 19:25:05.019] STARTED: aggregate - {"aggregate": "ClientSideOperationTimeoutProseTest_42", "readConcern": {"level": "majority"}, "pipeline": [{"$changeStream": {}}], "cursor": {"batchSize": 2}, "maxTimeMS": 2495, "$db": "JavaDriverTest", "lsid": {"id": {"$binary": {"base64": "nSsnGd4rT3i2tTKN/Xo98A==", "subType": "04"}}}}
* [2025/12/30 19:25:05.019] SUCCEEDED: aggregate (elapsed: 1ms)
* [2025/12/30 19:25:05.019] STARTED: getMore - {"getMore": 5528154821143727891, "collection": "ClientSideOperationTimeoutProseTest_42", "batchSize": 2, "maxTimeMS": 1000, "$db": "JavaDriverTest", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1767151095, "i": 4}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "lsid": {"id": {"$binary": {"base64": "nSsnGd4rT3i2tTKN/Xo98A==", "subType": "04"}}}}
* [2025/12/30 19:25:05.019] SUCCEEDED: getMore (elapsed: 999ms)
* [2025/12/30 19:25:05.019] STARTED: getMore - {"getMore": 5528154821143727891, "collection": "ClientSideOperationTimeoutProseTest_42", "batchSize": 2, "maxTimeMS": 1000, "$db": "JavaDriverTest", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1767151095, "i": 4}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "lsid": {"id": {"$binary": {"base64": "nSsnGd4rT3i2tTKN/Xo98A==", "subType": "04"}}}}
* [2025/12/30 19:25:05.019] SUCCEEDED: getMore (elapsed: 999ms)
* [2025/12/30 19:25:05.019] STARTED: getMore - {"getMore": 5528154821143727891, "collection": "ClientSideOperationTimeoutProseTest_42", "batchSize": 2, "maxTimeMS": 499, "$db": "JavaDriverTest", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1767151097, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "lsid": {"id": {"$binary": {"base64": "nSsnGd4rT3i2tTKN/Xo98A==", "subType": "04"}}}}
* [2025/12/30 19:25:05.019] SUCCEEDED: getMore (elapsed: 498ms)
* [2025/12/30 19:25:05.019] STARTED: killCursors - {"killCursors": "ClientSideOperationTimeoutProseTest_42", "cursors": [5528154821143727891], "maxTimeMS": 1000, "$db": "JavaDriverTest", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1767151097, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "lsid": {"id": {"$binary": {"base64": "nSsnGd4rT3i2tTKN/Xo98A==", "subType": "04"}}}}
* [2025/12/30 19:25:05.019] SUCCEEDED: killCursors (elapsed: 0ms)
* [2025/12/30 19:25:05.019] STARTED: endSessions - {"endSessions": [{"id": {"$binary": {"base64": "nSsnGd4rT3i2tTKN/Xo98A==", "subType": "04"}}}], "$db": "admin", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1767151097, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "$readPreference": {"mode": "primaryPreferred"}}
* [2025/12/30 19:25:05.019] SUCCEEDED: endSessions (elapsed: 0ms)
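The capping behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions; the method and class names are hypothetical and do not reflect the driver's actual internals:

```java
// Hypothetical sketch: when maxAwaitTimeMS exceeds the remaining overall
// (CSOT) budget, the driver sends the remaining budget as maxTimeMS instead,
// so the server can reply just before the client-side deadline expires.
public class GetMoreTimeoutSketch {
    static long effectiveMaxTimeMs(long maxAwaitTimeMs, long remainingTimeoutMs) {
        // Use the smaller of the requested await time and what is left
        // of the overall operation timeout.
        return Math.min(maxAwaitTimeMs, remainingTimeoutMs);
    }

    public static void main(String[] args) {
        // Mirrors the log above: 2495ms of budget leaves the 1000ms await intact...
        System.out.println(effectiveMaxTimeMs(1000, 2495)); // 1000
        // ...but with only 499ms remaining, the getMore is capped to 499ms,
        // so a successful response can arrive right before expiration.
        System.out.println(effectiveMaxTimeMs(1000, 499));  // 499
    }
}
```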
This was one of the causes of flakiness in this test.
This PR removes the ClusterFixture.getPrimaryRTT()-based adjustment in the CSOT prose tests, because in practice the adjustment collapses to 0ms (after min-subtraction) and introduces unnecessary order-dependent noise and flakiness.
Problem
We rely on ClusterFixture.getPrimaryRTT() to adjust for RTT fluctuations across environments.
ClusterFixture.getPrimaryRTT() reports a running average over 10 samples from its monitor thread. Test execution order is not deterministic, so the CSOT tests may execute before the cluster has accumulated stable RTT samples. In that case, the first samples can be dominated by the initial handshake, which has a higher RTT because it spends more time on the server side, and that single outlier skews the running average.
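As a simplified illustration (a sketch, not the driver's actual monitor code; the window size and sampling details are assumptions), a running average over a small window lets one slow handshake sample dominate:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified sketch of a windowed running average like the one
// getPrimaryRTT() reports.
public class RttAverageSketch {
    private static final int WINDOW_SIZE = 10;
    private final Deque<Long> window = new ArrayDeque<>();

    void addSample(long rttMs) {
        if (window.size() == WINDOW_SIZE) {
            window.removeFirst(); // evict the oldest sample
        }
        window.addLast(rttMs);
    }

    long averageMs() {
        return (long) window.stream().mapToLong(Long::longValue).average().orElse(0);
    }

    public static void main(String[] args) {
        RttAverageSketch rtt = new RttAverageSketch();
        rtt.addSample(1020); // first-handshake outlier (as in the histogram below)
        for (int i = 0; i < 4; i++) {
            rtt.addSample(0); // steady-state samples are near zero
        }
        // The single outlier still dominates: (1020 + 0 + 0 + 0 + 0) / 5 = 204ms.
        System.out.println(rtt.averageMs()); // 204
    }
}
```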
Observed RTT distributions (examples)
```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 520
    height: 320
---
xychart-beta
  title "ClusterFixture.getPrimaryRTT() - Total samples: 15"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms","1-1ms","8-8ms","17-17ms","33-33ms","41-41ms","51-51ms"]
  y-axis "samples" 0 --> 7
  bar [7,2,2,1,1,1,1]
```

```mermaid
---
config:
  xyChart:
    width: 520
    height: 320
---
xychart-beta
  title "Getting min RTT in TimeoutContext - Total samples: 209"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms"]
  y-axis "samples" 0 --> 209
  bar [209]
```

Root cause 1:
A per-source histogram shows that “Opening connection” samples can include extreme outliers (up to ~1020ms), while subsequent samples immediately return to near-zero. Because getPrimaryRTT() is a running average over 10 samples, these early outliers can dominate the reported RTT when CSOT runs early.
```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 980
    height: 560
---
xychart-beta
  title "Histogram by source (bucket size: 10ms)"
  x-axis ["0-9ms","10-19ms","20-29ms","30-39ms","40-49ms","50-59ms","60-69ms","70-79ms","80-89ms","90-99ms","100-109ms","110-119ms","120-129ms","140-149ms","150-159ms","190-199ms","500-509ms","1010-1019ms","1020-1029ms"]
  y-axis "samples" 0 --> 7751
  bar "Total" [7751,52,22,8,10,7,5,2,5,1,4,2,3,1,1,1,1,1,1]
  bar "O=Opening connection" [4324,13,7,1,3,1,1,0,2,1,2,1,2,1,1,1,1,1,1]
  bar "H=Heartbeat" [2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
  bar "R=RTT monitor" [3425,39,15,7,7,6,4,2,3,0,2,1,1,0,0,0,0,0,0]
```

Root cause 2: unrelated test polluted cluster RTT monitor and cascaded into later tests
A separate source of skew was identified: shouldUseConnectTimeoutMsWhenEstablishingConnectionInBackground blocked "hello" / "isMaster" for ~500ms without using a uniquely scoped failpoint. This affected the shared ClusterFixture monitor, causing elevated RTT readings that cascaded into subsequent tests and increased flakiness.
This issue is independent from test ordering: even if CSOT runs later, a shared-monitor disruption can inflate RTT and affect any tests that rely on it.
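For reference, one way to scope such a blocking failpoint to a single test is the `appName` filter of the server's `failCommand` failpoint, so that only connections opened by the test's own client (configured with a matching application name) are delayed. The sketch below is illustrative; the `appName` value and the exact mode/timing are assumptions, not the fix actually applied:

```json
{
  "configureFailPoint": "failCommand",
  "mode": { "times": 1 },
  "data": {
    "failCommands": ["hello", "isMaster"],
    "blockConnection": true,
    "blockTimeMS": 500,
    "appName": "connectTimeoutBackgroundTest"
  }
}
```

With the `appName` filter in place, the shared ClusterFixture monitor's connections are unaffected, so its RTT samples stay clean.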
Post-fix observation: After addressing the failpoint / blocking behavior, ClusterFixture.getPrimaryRTT() distributions collapsed to all-zeros in the observed runs:
```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 520
    height: 260
---
xychart-beta
  title "ClusterFixture.getPrimaryRTT() - Total samples: 40"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms"]
  y-axis "samples" 0 --> 40
  bar [40]
```

```mermaid
---
config:
  useMaxWidth: false
  xyChart:
    width: 520
    height: 260
---
xychart-beta
  title "ClusterFixture.getPrimaryRTT() - Total samples: 49"
  x-axis "Histogram (bucket size: 1ms)" ["0-0ms"]
  y-axis "samples" 0 --> 49
  bar [49]
```

Change in this PR
JAVA-5375