Add StackTraceLimit and SpanFramesMinDurationInMilliseconds configs #374
gregkalapos merged 28 commits into elastic:master from gregkalapos:ControlStackTrace
Plus rename CapturedException.Stacktrace to StackTrace to be in sync with Span.StackTrace
Fighting with CI - it seems that in CI we don't get line number info, and for that reason we have 1 failing test. Can't repro locally - trying to figure this out - I don't yet see how this PR changes anything in that part 🤷♂️
SergeyKleyman left a comment
LGTM other than a few minor issues and one major one: a possible performance issue, since the configuration is parsed on every check of whether a stack trace should be captured.
I see that the test fails only on Linux - is it possible that on Linux the assemblies don't have debugging info with file names and line numbers? It's not clear why this started to happen with this PR, but maybe the CI machines went through some changes? What would be the simplest way to re-run the tests on the current master on the CI machines?
I merged something today, and with that all tests were green: 0b66a32 (the only thing that made it red is that the test coverage is below target, but that's a known bug and the next build will make it good again). I think there is no change between this branch and master in terms of Jenkins and CI related things.
And spanFramesMinDurationInMilliseconds -1 for old tests to not change behaviour.
The With that constructor it's possible to create single stack frames instead of getting the whole stack trace with Which sounds very promising, since e.g. if we only need 10 frames we could just create the 10 So, something like this in I implemented this, but it does not seem to help. Test case: 100 Transactions each with 10 Spans, Original code (no manual Adding this "optimization": Clearly no improvement, won't commit 😔
Sooo.. I think the I'm thinking about dropping that test, since it causes more trouble than added value.
The question is why I would not recommend removing tests before understanding why they fail - to me it contradicts the whole purpose of the tests. Also I would not recommend deferring the investigation to a separate issue, since it's very risky to merge this PR before understanding why tests fail. But it's your decision to make.
Already talking to @v1v. We are trying to figure this out. Nevertheless the test is skipped if there are no
default: throw new ArgumentException("Unexpected TimeSuffix value", nameof(defaultSuffix));
default:
    valueInMilliseconds = -1;
    return false;
There was a problem hiding this comment.
Are you sure you want to just swallow incorrect input (which most likely means a bug in the agent's code)? For example, TryParse(String, NumberStyles, IFormatProvider, Int32) still throws:
ArgumentException
style is not a NumberStyles value.
There was a problem hiding this comment.
I reverted it, but if we throw an ArgumentException then we have to handle it - not being able to parse these values should not crash the agent even if it's an agent bug (I hope we agree at least on this).
If that's the case, then we should handle it in the caller method, which is basically the same as when the method returns false. So from a single case (not being able to parse and returning false) we made 2 cases: 1) returning false, 2) an exception. Now we have to handle both, but the handling logic is the same.
Here is the commit: cddbba5 - I think this code is very noisy and repetitive.
Nevertheless I think this is not that important - let's keep it this way.
There was a problem hiding this comment.
It's not the same - the default units passed as a separate parameter cannot be affected by the user's input, which means that if the units are invalid it's a bug in the agent's code (or maybe some mix of assemblies from different incompatible versions, etc.). Unlike an invalid configuration option value, which is just an error in logging terms, the bug with invalid units is a critical event. I think that in the case of a critical event there's no recovery and a fail-fast approach should be used, which for an application usually means: record the current state to help with the investigation and exit as soon as possible to prevent further, possibly much more disastrous, business data corruption. For the agent the second part is trickier - we can consider whether killing the whole application might be an option users want to have, but at the very least the agent should definitely stop doing ANYTHING.
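To make the distinction in this comment concrete, here is a minimal sketch, not the agent's actual code. The `TimeSuffix` members, the `DurationParser` class, and the `TryParseMilliseconds` name are assumptions made for illustration; only the exception message and the `defaultSuffix` / `valueInMilliseconds` names come from the diff above. Invalid user input yields `false`, while an undefined default suffix, which only the agent's own code can pass, throws.

```csharp
using System;
using System.Globalization;

// Hypothetical enum; the real TimeSuffix members are not shown in this thread.
public enum TimeSuffix { Ms, S, M }

public static class DurationParser
{
    // Maps a suffix to its multiplier in milliseconds.
    // An undefined enum value can only come from agent code, so it throws.
    private static double MultiplierFor(TimeSuffix suffix, string paramName)
    {
        switch (suffix)
        {
            case TimeSuffix.Ms: return 1;
            case TimeSuffix.S: return 1000;
            case TimeSuffix.M: return 60 * 1000;
            default: throw new ArgumentException("Unexpected TimeSuffix value", paramName);
        }
    }

    // Returns false for invalid user input; throws for an invalid defaultSuffix.
    public static bool TryParseMilliseconds(string input, TimeSuffix defaultSuffix,
        out double valueInMilliseconds)
    {
        // Validate the agent-supplied argument first, regardless of the user input.
        var multiplier = MultiplierFor(defaultSuffix, nameof(defaultSuffix));

        valueInMilliseconds = -1;
        if (string.IsNullOrWhiteSpace(input)) return false;

        var text = input.Trim();
        var number = text;

        // "ms" must be checked before "s" and "m".
        if (text.EndsWith("ms", StringComparison.OrdinalIgnoreCase))
        {
            multiplier = 1;
            number = text.Substring(0, text.Length - 2);
        }
        else if (text.EndsWith("s", StringComparison.OrdinalIgnoreCase))
        {
            multiplier = 1000;
            number = text.Substring(0, text.Length - 1);
        }
        else if (text.EndsWith("m", StringComparison.OrdinalIgnoreCase))
        {
            multiplier = 60 * 1000;
            number = text.Substring(0, text.Length - 1);
        }

        if (!double.TryParse(number, NumberStyles.Float, CultureInfo.InvariantCulture,
                out var parsed))
            return false; // user error: the caller logs it and falls back to the default

        valueInMilliseconds = parsed * multiplier;
        return true;
    }
}
```

With this shape the caller handles `false` as an ordinary configuration error, and the exception is left for whatever fail-fast policy the agent chooses.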
There was a problem hiding this comment.
> not being able to parse these values should not crash the agent even if it's an agent bug (I hope we agree at least on this).
Ok, we don't even agree on that one. 🤷♂️
I think it makes sense to not spend too much on it. I'll do whichever you want. Let me know if you want further changes.
There was a problem hiding this comment.
I think you just misread my note - I'm not talking about user's input which of course can contain invalid units, I'm talking about parameters passed by agent's code which should be valid unless there is a bug.
There was a problem hiding this comment.
Yes, I know that. I just think we overcomplicate things.
Adapting code to your suggestion.
There was a problem hiding this comment.
I'm not sure I understand what you mean - if there are no line numbers in production, then why do we need all the agent production code and the tests dealing with the case when line numbers are available?
Maybe I wrote too quickly. So, if there are no pdb files, even with
My suggestion is that the test also adapts to this and is skipped when there are no pdbs, instead of failing. The reasoning is that without pdbs a failing test would be a false positive, since there is no way to make the feature work without the pdbs... that's not a real failure. Still, I agree that we should investigate this and make sure we test this in CI - we are working on that.
No problem with having the test adapt, but then we need to make sure that CI builds separate variants with and without .pdb-s if we want to make sure that both work.
Also, the way the test adapts cannot be self-referential (namely, "if there are no line numbers, it means there are no .pdb-s"), because then any bug affecting line number capture will be swallowed. The indication of whether there are .pdb-s or not should come from CI, which knows which variant it built (for example via an env var).
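A minimal sketch of that suggestion, assuming CI exports an environment variable for the variant it built. The variable name `TESTS_EXPECT_PDB` and the `LineNumberExpectation` helper are hypothetical and not part of this PR. The expectation comes from CI and is then compared with what the stack trace actually contains, so a line-number bug cannot hide behind a skipped test.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class LineNumberExpectation
{
    // Hypothetical variable name: CI would set it to "true" only for the
    // build variant that ships .pdb files next to the assemblies.
    public const string EnvVarName = "TESTS_EXPECT_PDB";

    // The expectation is declared by CI, not inferred from the stack trace.
    public static bool PdbsExpected() =>
        string.Equals(Environment.GetEnvironmentVariable(EnvVarName), "true",
            StringComparison.OrdinalIgnoreCase);

    // What was actually observed: at least one frame with a real line number.
    public static bool AnyFrameHasLineNumber(StackTrace trace) =>
        (trace.GetFrames() ?? Array.Empty<StackFrame>())
            .Any(frame => frame.GetFileLineNumber() != 0);

    // Fails when CI promised .pdb files but no frame carries a line number.
    public static void Check(StackTrace trace)
    {
        if (PdbsExpected() && !AnyFrameHasLineNumber(trace))
            throw new InvalidOperationException(
                "CI declared .pdb files present, but no frame has a line number");
    }
}
```

A test would call `LineNumberExpectation.Check(new StackTrace(true))`, or the agent-side equivalent, in both CI variants.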
Yeah, ok. I give up on this, won't do it...
SergeyKleyman left a comment
I don't recommend merging this PR before understanding why some tests started to fail at CI.
In CI the HttpClient stack trace did not contain any frames with lineno != 0. We moved the stack trace capturing from the Span .ctor to the End method; since then the stack from the http methods has not contained line numbers, but only in CI. Extra logging showed that `frame?.GetFileLineNumber()` returns 0 for every frame in that case, so there is nothing we can do about it. Therefore simplified the test to a single manual sync span, which is still aligned with the purpose of the test: we only want to make sure we capture line numbers in at least 1 scenario, and in that case `frame?.GetFileLineNumber()` seems to return real line numbers. So we go with that test.
Codecov Report
@@ Coverage Diff @@
## master #374 +/- ##
==========================================
+ Coverage 78.32% 78.49% +0.17%
==========================================
Files 87 79 -8
Lines 3225 2637 -588
Branches 781 482 -299
==========================================
- Hits 2526 2070 -456
+ Misses 543 373 -170
- Partials 156 194 +38
It was a single test (
Please see #381 (comment)
Remove archiveArtifacts for pdb files (was only used for debugging)
Planning to merge this (#381 with more details about the tests). @v1v I removed the debugging stuff from the
Add better logging to also let users know that default is used
Solves #307
Added SpanFramesMinDurationInMilliseconds and StackTraceLimit. Those work the same as in other agents.
Main benefit:
The StackTraceLimit is implemented by manually trimming the whole stack trace, so there is no perf difference between capturing only 1 stack frame and capturing all frames (see discussion here). Nevertheless, with this we are still in sync with other agents in terms of configs, and with this PR we give users the option to control stack frame collection; they can turn it off by setting those configs to 0.
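The behaviour described above can be sketched roughly as follows. This is an illustration and not the code merged in this PR: the `StackTraceCapture` class and its method names are hypothetical. It assumes that a negative `StackTraceLimit` means "keep all frames", that a negative minimum duration means "always capture" (the old tests were set to -1 to keep their behaviour), and that a span exactly at the threshold is captured.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class StackTraceCapture
{
    // Decides whether a stack trace should be captured for a finished span.
    // 0 in either config turns collection off; a negative minimum duration
    // is assumed to mean "capture for every span".
    public static bool ShouldCapture(double spanDurationMs, double minDurationMs,
        int stackTraceLimit)
    {
        if (stackTraceLimit == 0 || minDurationMs == 0) return false;
        if (minDurationMs < 0) return true;
        return spanDurationMs >= minDurationMs;
    }

    // Trims an already captured stack trace to the configured limit.
    // The whole trace has been captured by this point, which is why the
    // limit does not reduce the capture cost.
    public static StackFrame[] Trim(StackFrame[] frames, int stackTraceLimit)
    {
        if (stackTraceLimit < 0) return frames; // assumed: negative means no limit
        return frames.Take(stackTraceLimit).ToArray();
    }
}
```

A caller would first check `ShouldCapture(...)` when the span ends and only then pass `new StackTrace(true).GetFrames()` to `Trim`.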