Update transient errors retry timeout and retryable status codes #1842
Conversation
@@ -289,8 +286,6 @@ def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:
    if error.code() in [
        StatusCode.CANCELLED,
        StatusCode.DEADLINE_EXCEEDED,
Aren't cancelled and deadline exceeded errors raised by the client itself? Not 100% sure about deadline exceeded, but I think cancelled should not be retried, as it means (if I understand correctly) that the client decided to cancel the request, for example because it received some sort of shutdown/exit signal.
It is typically the client, but I believe this is allowed to cover the case when the server does it: https://grpc.io/docs/what-is-grpc/core-concepts/#cancelling-an-rpc.
And I found there is a table which lists what is retryable: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlpgrpc-response.
I see. Not a blocker for me then, but I wonder if there is a simple way for the exporter to cancel an in-flight request on shutdown (with or without a timeout)?
Ah, I understand your concern. In any case, we have some work to do in making shutdown spec-compliant; we can address this there.
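To make the discussion above concrete, here is a minimal sketch of the retryable-code check the diff is narrowing. It is not the actual exporter code: the stand-in `StatusCode` enum mirrors gRPC's numeric codes, and the `RETRYABLE` set and `is_retryable` helper are illustrative names, populated from the retryable-codes table in the OTLP spec linked above (note PERMISSION_DENIED and UNAUTHENTICATED are excluded).

```python
from enum import Enum


# Hypothetical stand-in for grpc.StatusCode; the real exporter uses
# grpc's own enum. Numeric values match the gRPC status code spec.
class StatusCode(Enum):
    CANCELLED = 1
    DEADLINE_EXCEEDED = 4
    PERMISSION_DENIED = 7
    RESOURCE_EXHAUSTED = 8
    ABORTED = 10
    OUT_OF_RANGE = 11
    UNAVAILABLE = 14
    DATA_LOSS = 15
    UNAUTHENTICATED = 16


# Retryable codes per the OTLP spec's OTLP/gRPC response table.
# PERMISSION_DENIED and UNAUTHENTICATED are deliberately absent:
# retrying them cannot succeed without new credentials.
RETRYABLE = {
    StatusCode.CANCELLED,
    StatusCode.DEADLINE_EXCEEDED,
    StatusCode.RESOURCE_EXHAUSTED,
    StatusCode.ABORTED,
    StatusCode.OUT_OF_RANGE,
    StatusCode.UNAVAILABLE,
    StatusCode.DATA_LOSS,
}


def is_retryable(code: StatusCode) -> bool:
    """Return True if a failed export with this status should be retried."""
    return code in RETRYABLE
```

CANCELLED stays in the set because, as noted above, the spec allows for server-side cancellation, even though it is typically client-initiated.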
def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:

max_value = 64
Would it be possible to make this configurable? For example, as an export_timeout (maximum duration for exporting spans) argument to the exporter?
My use case is CLI applications, where having them hang on exit for 64 seconds because a telemetry backend is down is unacceptable. For comparison, the Go implementation has two separate timeouts: ExportTimeout for the full operation and BatchTimeout for the direct publish.
For now I have a special subclass with a copy of this method to tweak the behavior, but that's a maintenance issue.
I will bring up the topic of supporting configurability in the next SIG meeting. The spec is not clear (at this point) in that regard, and anything we do might become a breaking change in the future. And speaking of ExportTimeout and BatchTimeout, Python also has them, as schedule_delay_millis and export_timeout_millis; I don't know why, but we are only enforcing the export timeout during force flush.
Talked about this in the SIG call. It is out of scope for this PR. The consensus was that this is not clear from the spec, so we might end up making a backwards-incompatible change by introducing support for configuring this. One thing we want to explore is making the retry async and non-blocking, and also looking at how the span processor timeout behaves. Please create another issue and we will track it there.
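For context on why the 64-second cap matters for the CLI use case above, here is a sketch of the delay schedule an exponential backoff with that cap produces. This is not the exporter's implementation; `backoff_delays`, its parameters, and the fixed attempt count are illustrative, and the real code sleeps between attempts rather than precomputing a list.

```python
def backoff_delays(max_value: float = 64, attempts: int = 8) -> list:
    """Return exponential backoff delays 1, 2, 4, ... seconds,
    each capped at max_value (the constant from the diff above)."""
    delay = 1
    delays = []
    for _ in range(attempts):
        delays.append(min(delay, max_value))
        delay *= 2
    return delays
```

With the default cap, the final attempts each wait the full 64 seconds, which is where a multi-minute hang on exit can come from; a configurable `max_value` (or an overall deadline) would bound that.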
Description
We agreed that the OTLP exporter's transient-error retry timeout is way too high, and this changes it to a more reasonable number. This also updates the retryable codes list for gRPC: PERMISSION_DENIED and UNAUTHENTICATED do not belong to that list. Fixes #1670