
stats/opentelemetry: Track retry counts per-call instead of per-attempt #8938

Closed

ulascansenturk wants to merge 8 commits into grpc:master from ulascansenturk:fix/8700-otel-retry-per-call-tracking

Conversation

@ulascansenturk
Contributor

RELEASE NOTES: N/A
Fixes #8700

Continuation of #8715 (which could not be reopened).

Changes:

vinothkumarr227 and others added 8 commits January 14, 2026 18:21
Fixes: grpc#8299

RELEASE NOTES:

- stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are
now recorded as span attributes for non-transparent client retries.
…tionDelay

This commit fixes the flaky test TestTraceSpan_WithRetriesAndNameResolutionDelay
which was introduced in the previous commit and caused PR grpc#8342 to be reverted.

Root Cause:
The test had race conditions related to timing:
1. The goroutine that updates resolver state could complete before or after
   the delayed resolution event was fully processed and recorded in spans
2. Span export timing was not synchronized with test validation, causing
   the test to sometimes check spans before they were fully exported

Fix:
1. Added 'stateUpdated' event to synchronize between the resolver state
   update completing and span validation beginning
2. Added explicit wait for the stateUpdated event before validating spans
3. Added a 50ms sleep after RPC completion to give the span exporter time
   to process and export all spans before validation
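The event-based synchronization in step 1 can be sketched as follows. This is a minimal, self-contained illustration; the `event` type and all names here are illustrative stand-ins, not grpc-go's actual internal helpers:

```go
package main

import (
	"fmt"
	"time"
)

// event is a one-shot signal, similar in spirit to an internal
// synchronization event type (names here are illustrative).
type event struct{ ch chan struct{} }

func newEvent() *event { return &event{ch: make(chan struct{})} }

// fire signals the event exactly once.
func (e *event) fire() { close(e.ch) }

// done returns a channel that is closed once the event has fired.
func (e *event) done() <-chan struct{} { return e.ch }

func main() {
	stateUpdated := newEvent()

	// Goroutine standing in for the delayed resolver-state update.
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated resolution delay
		// ... push the updated resolver state here ...
		stateUpdated.fire()
	}()

	// Test body: wait for the state update to complete before validating
	// spans, instead of racing against the goroutine.
	select {
	case <-stateUpdated.done():
		fmt.Println("state updated; safe to validate spans")
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for state update")
	}
}
```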

Testing:
- Test now passes consistently (10+ consecutive runs)
- Passes with race detector enabled (-race flag)
- No data races detected

Fixes grpc#8700
The previous fix removed time.Sleep based on review feedback that
waitForTraceSpans should be sufficient. However, waitForTraceSpans only
checked for span presence (name + kind), not span completion.

This led to test flakiness where spans were captured before all their
events and status were fully recorded.

Fix: Enhanced waitForTraceSpans to also check that spans have ended
(non-zero EndTime) before returning, ensuring all span data is complete
before validation.

Testing:
- Passes 10+ consecutive runs without flakiness
- Passes with -race flag
- All opentelemetry tests pass
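The completion check described above can be sketched like this. The sketch is self-contained and hypothetical: `spanStub` stands in for an exported span record (the real test inspects OpenTelemetry SDK span data), and the polling helper's name and signature are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// spanStub stands in for an exported trace span (illustrative only).
type spanStub struct {
	Name    string
	EndTime time.Time
}

// waitForTraceSpans polls until every wanted span is present AND has
// ended (non-zero EndTime), so validation never sees a half-recorded span.
func waitForTraceSpans(get func() []spanStub, want []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if allEnded(get(), want) {
			return nil
		}
		time.Sleep(time.Millisecond)
	}
	return errors.New("timed out waiting for completed spans")
}

func allEnded(spans []spanStub, want []string) bool {
	for _, name := range want {
		found := false
		for _, s := range spans {
			// Presence alone is not enough: the span must have ended.
			if s.Name == name && !s.EndTime.IsZero() {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	exported := []spanStub{{Name: "Attempt.grpc.testing.TestService/UnaryCall", EndTime: time.Now()}}
	err := waitForTraceSpans(
		func() []spanStub { return exported },
		[]string{"Attempt.grpc.testing.TestService/UnaryCall"},
		time.Second,
	)
	fmt.Println("err:", err)
}
```

Checking `EndTime` removes the need for a fixed sleep: the wait returns as soon as the spans are complete, and fails loudly on timeout instead of flaking.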
The blocking picker now always returns ErrNoSubConnAvailable and signals
when Pick() is called. The ready picker is only produced after the
blocking picker has been called, ensuring pickBlocked is always set in
picker_wrapper.go regardless of timing. Verified with the time.Sleep
repro described in the PR review.
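A minimal sketch of the blocking-picker behavior described above. Everything here is a self-contained stand-in: `errNoSubConnAvailable` mirrors `balancer.ErrNoSubConnAvailable` from grpc-go, and the struct and method shapes are simplified for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoSubConnAvailable mirrors balancer.ErrNoSubConnAvailable from
// google.golang.org/grpc/balancer (sketch only).
var errNoSubConnAvailable = errors.New("no SubConn is available")

// blockingPicker always fails picks with errNoSubConnAvailable and
// signals (once) that Pick was called, so the test can produce the
// ready picker only after a pick has actually blocked.
type blockingPicker struct {
	pickCalled chan struct{}
	once       sync.Once
}

func (p *blockingPicker) Pick() error {
	p.once.Do(func() { close(p.pickCalled) })
	return errNoSubConnAvailable
}

func main() {
	p := &blockingPicker{pickCalled: make(chan struct{})}

	err := p.Pick() // simulated RPC attempt hitting the blocking picker
	fmt.Println(errors.Is(err, errNoSubConnAvailable))

	select {
	case <-p.pickCalled:
		fmt.Println("pick observed; ready picker can now be produced")
	default:
		fmt.Println("pick not observed")
	}
}
```

Gating the ready picker on the signal means the test no longer depends on timing for `pickBlocked` to be set in picker_wrapper.go.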
@codecov

codecov bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.10%. Comparing base (572fdca) to head (abd6440).
⚠️ Report is 55 commits behind head on master.

Files with missing lines            | Patch % | Lines
stats/opentelemetry/client_tracing.go | 78.57% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8938      +/-   ##
==========================================
- Coverage   83.31%   83.10%   -0.21%     
==========================================
  Files         414      417       +3     
  Lines       32753    32972     +219     
==========================================
+ Hits        27288    27402     +114     
- Misses       4064     4134      +70     
- Partials     1401     1436      +35     
Files with missing lines              | Coverage Δ
stats/opentelemetry/opentelemetry.go  | 75.12% <ø> (ø)
stats/opentelemetry/trace.go          | 90.32% <ø> (-1.99%) ⬇️
stats/opentelemetry/client_tracing.go | 86.36% <78.57%> (-2.73%) ⬇️

... and 67 files with indirect coverage changes


@easwars
Contributor

easwars commented Feb 25, 2026

@ulascansenturk

Sorry about the confusion here. We already have a PR for this: #8923. Unfortunately, the issue that it linked to was the old closed one, which is probably why you didn't see it.

I'm going to close this in favor of the existing one, as that one already has comments from a reviewer and progress is being made on that.

Hope you can pick up something else from the "Help Wanted" issues.

@easwars closed this Feb 25, 2026
@ulascansenturk
Contributor Author


Sure, thanks for the heads-up!



Development

Successfully merging this pull request may close these issues.

otel: retries must be tracked per-call and not per-attempt

3 participants