Be able to correlate timeouts in reverse-proxy layer in front of Synapse (tag specific request headers) #13786
Conversation
```python
# Tag any headers that we need to extract from the request. This
# is useful to specify any headers that a reverse-proxy in front
# of Synapse may be sending to correlate and match up something
# in that layer to a Synapse trace. ex. when Cloudflare times
# out it gives a `cf-ray` header which we can also tag here to
# find the trace.
for header_name in self.hs.config.tracing.request_headers_to_tag:
    headers = request.requestHeaders.getRawHeaders(header_name)
    if headers is not None:
        parent_span.set_tag(
            SynapseTags.REQUEST_HEADER_PREFIX + header_name,
            str(headers[0]),
        )
```
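To make the behavior concrete, here is a self-contained sketch of what the loop above does, using a stand-in span object. The `REQUEST_HEADER_PREFIX` value, the configured header list, and the ray ID are assumptions for illustration, not taken from the actual Synapse source:

```python
from twisted.web.http_headers import Headers

class FakeSpan:
    """Stand-in for an opentracing span; the real opentracing.Span.set_tag
    has the same (key, value) signature."""
    def __init__(self):
        self.tags = {}

    def set_tag(self, key, value):
        self.tags[key] = value

# Assumed values for illustration only.
REQUEST_HEADER_PREFIX = "request_header."
request_headers_to_tag = ["cf-ray"]

request_headers = Headers({b"CF-Ray": [b"230b030023ae2822-SJC"]})  # made-up ray ID
parent_span = FakeSpan()

for header_name in request_headers_to_tag:
    headers = request_headers.getRawHeaders(header_name)
    if headers is not None:
        parent_span.set_tag(REQUEST_HEADER_PREFIX + header_name, str(headers[0]))

print(parent_span.tags)  # {'request_header.cf-ray': '230b030023ae2822-SJC'}
```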
I've tested that this works locally, but I'm not sure if the `cf-ray` header that we want to use on matrix.org is actually a thing.
Based on the following documentation, and the fact that we already see it in our haproxy logs, I assume this will be viable:

> Add the `CF-Ray` header to your origin web server logs to match requests proxied to Cloudflare to requests in your server logs.
```python
if headers is not None:
    parent_span.set_tag(
        SynapseTags.REQUEST_HEADER_PREFIX + header_name,
        str(headers[0]),
```
Why the call to `str`? If `header_name` is a string, then `headers[0]` will be a string.
See:
```python
headers = request.requestHeaders.getRawHeaders(header_name)
if headers is not None:
    parent_span.set_tag(
        SynapseTags.REQUEST_HEADER_PREFIX + header_name,
```
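For reference, Twisted's `Headers.getRawHeaders` mirrors the type of the name you pass in, which supports the point above: a `str` name yields `str` values, a `bytes` name yields `bytes` values. A quick sketch (the ray ID is made up):

```python
from twisted.web.http_headers import Headers

headers = Headers({b"CF-Ray": [b"230b030023ae2822-SJC"]})  # made-up ray ID

# The value type mirrors the name type; missing headers return None.
assert headers.getRawHeaders("cf-ray") == ["230b030023ae2822-SJC"]
assert headers.getRawHeaders(b"cf-ray") == [b"230b030023ae2822-SJC"]
assert headers.getRawHeaders("x-missing") is None
```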
I'm not really familiar with opentracing tags. Is `request_header.cf-ray` an OK thing to do, or do we need to normalize it to something like `request_header.cf_ray` or whatever?
The tag can be whatever. And since it's the name of a header, I think it's probably easiest to follow with the actual header name. Similar to what we do with function arguments, but those happen to conform to Python's snake casing already.
These are the OpenTelemetry docs, but they lay out some naming conventions: https://opentelemetry.io/docs/reference/specification/common/attribute-naming/
> * `request_headers_to_tag`: A list of headers to extract from the request and
>   add to the top-level servlet tracing span as tags. Useful when you're using
>   a reverse proxy service like Cloudflare to protect your Synapse instance, in
>   order to correlate and match up requests that timed out at the Cloudflare
>   layer to the Synapse traces.
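For context, here is a hedged sketch of how the documented option might be set in `homeserver.yaml`. Only the option name `opentracing.request_headers_to_tag` is confirmed by this PR; the surrounding `opentracing` layout is an assumption, and the snippet parses the fragment with PyYAML so it is self-contained:

```python
import yaml

# Hypothetical homeserver.yaml fragment; only `request_headers_to_tag`
# under `opentracing` is confirmed by this PR.
config_text = """
opentracing:
  enabled: true
  request_headers_to_tag:
    - cf-ray
"""

config = yaml.safe_load(config_text)
print(config["opentracing"]["request_headers_to_tag"])  # ['cf-ray']
```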
Not to throw a wrench in these plans, but if other reverse proxies do something similar, I wonder if it would make more sense to have a header which could be the source for `SynapseRequest.request_id` (instead of generating a new one); see lines 174 to 175 in 9772e36.
Would that be clearer instead of having two separate IDs?
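A minimal sketch of that alternative, with hypothetical names (`compute_request_id` is not Synapse's actual implementation): derive the request ID from a reverse-proxy header when present, otherwise fall back to an incrementing counter:

```python
import itertools

_request_seq = itertools.count()  # stand-in for the existing incrementing integer

def compute_request_id(request, id_header: str = "cf-ray") -> str:
    """Use the reverse-proxy's correlation header as the request ID if
    present; otherwise fall back to a sequential ID as before."""
    headers = request.requestHeaders.getRawHeaders(id_header)
    if headers is not None:
        return headers[0]
    return str(next(_request_seq))
```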
👍 Seems reasonable to me and a little cleaner. I'll create a new PR with this solution ⏩
I don't know if this would make it harder to track things in the logs or anything, but seems like it would be a lot cleaner.
Tried this out in #13801
Losing the simplicity of the incrementing integer to reference in logs as you scroll kinda sucks though. I'm leaning back to this option I think 🤔
> Losing the simplicity of the incrementing integer to reference in logs as you scroll kinda sucks though. I'm leaning back to this option I think 🤔
Is that really used though? I usually just copy and paste for grep. Once the number gets beyond a few digits it is useless anyway IMO.
I'm not the one having to log dive and debug big Synapse instances, so I don't know. In tests, we will still have the same sequential request IDs since this config won't be defined. We're not changing any default either, so it's basically opt-in.
If anyone has a strong desire for the old way, we can always revert as well. If you're happy with #13801, we can move forward with that ⏩
Sounds like we should ask folks real quick then!
Thanks for asking in #synapse-dev, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$kMyUMCOU4fZpH6AOUPOIqGyZADn38ebt2kn2wo9cAXU?via=matrix.org&via=element.io&via=beeper.com ❤️
Discussion being tracked in #13801 (comment)
Closing as it seems we're settling on the alternative #13801
Add `opentracing.request_headers_to_tag` config to specify which request headers to extract and tag traces with, in order to correlate timeouts in reverse-proxy layers in front of Synapse with traces.

Fix #13685
Alternative solution: #13801