stats/opentelemetry: Track retry counts per-call instead of per-attempt#8938
stats/opentelemetry: Track retry counts per-call instead of per-attempt#8938ulascansenturk wants to merge 8 commits intogrpc:masterfrom
Conversation
Fixes: grpc#8299 RELEASE NOTES: - stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are now recorded as span attributes for non-transparent client retries.
…tionDelay This commit fixes the flaky test TestTraceSpan_WithRetriesAndNameResolutionDelay which was introduced in the previous commit and caused PR grpc#8342 to be reverted. Root Cause: The test had race conditions related to timing: 1. The goroutine that updates resolver state could complete before or after the delayed resolution event was fully processed and recorded in spans 2. Span export timing was not synchronized with test validation, causing the test to sometimes check spans before they were fully exported Fix: 1. Added 'stateUpdated' event to synchronize between the resolver state update completing and span validation beginning 2. Added explicit wait for the stateUpdated event before validating spans 3. Added a 50ms sleep after RPC completion to give the span exporter time to process and export all spans before validation Testing: - Test now passes consistently (10+ consecutive runs) - Passes with race detector enabled (-race flag) - No data races detected Fixes grpc#8700
The previous fix removed time.Sleep based on review feedback that waitForTraceSpans should be sufficient. However, waitForTraceSpans only checked for span presence (name + kind), not span completion. This led to test flakiness where spans were captured before all their events and status were fully recorded. Fix: Enhanced waitForTraceSpans to also check that spans have ended (non-zero EndTime) before returning, ensuring all span data is complete before validation. Testing: - Passes 10+ consecutive runs without flakiness - Passes with -race flag - All opentelemetry tests pass
The blocking picker now always returns ErrNoSubConnAvailable and signals when Pick() is called. The ready picker is only produced after the blocking picker has been called, ensuring pickBlocked is always set in picker_wrapper.go regardless of timing. Verified with the time.Sleep repro described in the PR review.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8938 +/- ##
==========================================
- Coverage 83.31% 83.10% -0.21%
==========================================
Files 414 417 +3
Lines 32753 32972 +219
==========================================
+ Hits 27288 27402 +114
- Misses 4064 4134 +70
- Partials 1401 1436 +35
🚀 New features to boost your workflow:
|
|
Sorry about the confusion here. We already have a PR for this: #8923. Unfortunately, the issue that it linked to was the old closed one, which is probably why you didn't see it. I'm going to close this in favor of the existing one, as that one already has comments from a reviewer and progress is being made on that. Hope you can pick up something else from the "Help Wanted" issues. |
Sure, thanks for the heads up ! |
RELEASE NOTES: N/A
fixes #8700
Continuation of #8715 (which could not be reopened).
Changes:
TestTraceSpan_WithRetriesAndNameResolutionDelaytest by using a stub balancer that ensures the blocking picker is picked before the ready picker is produced, eliminating the race condition between LB policy updates and thepick()method