Add StackTraceLimit and SpanFramesMinDurationInMilliseconds configs#374

Merged
gregkalapos merged 28 commits into elastic:master from gregkalapos:ControlStackTrace
Jul 18, 2019

Conversation

@gregkalapos
Contributor

@gregkalapos gregkalapos commented Jul 16, 2019

Solves #307

Added SpanFramesMinDurationInMilliseconds and StackTraceLimit. Those work the same as in other agents.

Main benefit:

  • By setting either of those settings to 0, users can disable stack trace collection, which carries a performance overhead. For numbers see this comment.

The StackTraceLimit is implemented by capturing the whole stack trace and trimming it manually - so there is no performance difference between capturing only 1 stack frame and capturing all frames (see the discussion here). Nevertheless, this keeps us in sync with other agents in terms of configs, and with this PR users get the option to control stack frame collection and can turn it off by setting these configs to 0.
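To illustrate the semantics described above, here is a minimal sketch (not the agent's actual code - `CaptureFrames` and its parameter names are invented for illustration):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical sketch of how the two settings gate stack trace capture.
// CaptureFrames and its parameters are invented names, not the agent's API.
public static class StackTraceCaptureSketch
{
	public static StackFrame[] CaptureFrames(int stackTraceLimit, double spanFramesMinDurationMs, double spanDurationMs)
	{
		// Setting either config to 0 disables capture entirely - this is the
		// main perf win described above.
		if (stackTraceLimit == 0 || spanFramesMinDurationMs == 0)
			return Array.Empty<StackFrame>();

		// A negative min duration means "always capture"; a positive one means
		// "only capture for spans at least that slow".
		if (spanFramesMinDurationMs > 0 && spanDurationMs < spanFramesMinDurationMs)
			return Array.Empty<StackFrame>();

		// The whole trace is captured and then trimmed manually, so the limit
		// does not make the capture itself cheaper.
		var frames = new StackTrace(true).GetFrames() ?? Array.Empty<StackFrame>();
		return stackTraceLimit < 0 ? frames : frames.Take(stackTraceLimit).ToArray();
	}
}
```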

@gregkalapos gregkalapos self-assigned this Jul 16, 2019
@gregkalapos
Contributor Author

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.

Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

Contributor

@SergeyKleyman SergeyKleyman left a comment


LGTM other than a few minor issues and one major one - a possible performance issue with parsing the configuration on every check of whether a stack trace should be taken.

@SergeyKleyman
Contributor

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.

Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

I see that the test fails only on Linux - is it possible that on Linux assemblies don't have debugging info with file names and line numbers? It's not clear why this started to happen with this PR, but maybe the CI machines went through some changes? What would be the simplest way to re-run the tests on the current master on the CI machines?

@gregkalapos
Contributor Author

gregkalapos commented Jul 16, 2019

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.
Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

I see that the test fails only on Linux - is it possible that on Linux assemblies don't have debugging info with file names and line numbers? It's not clear why this started to happen with this PR, but maybe the CI machines went through some changes? What would be the simplest way to re-run the tests on the current master on the CI machines?

I merged something today, and with that all tests were green: 0b66a32 (the only thing that made it red is that the test coverage is below target, but that's a known bug and the next build will make it good again). I think there is no change between this branch and master in terms of Jenkins and CI related things.

And spanFramesMinDurationInMilliseconds -1 for old tests to not change behaviour.
@gregkalapos
Contributor Author

gregkalapos commented Jul 16, 2019

The StackFrame class has a very promising feature - see here

With that constructor it's possible to create single stack frames instead of getting the whole stack trace with new StackTrace(true).GetFrames().

Which sounds very promising, since e.g. if we only need 10 frames we could just create the 10 StackFrame instances.

So, something like this in Span.End():

if (_configurationReader.StackTraceLimit > 0)
{
	// not complete - but shows the point...
	var stackFrames = new StackFrame[_configurationReader.StackTraceLimit];
	for (var i = 0; i < _configurationReader.StackTraceLimit; i++)
	{
		stackFrames[i] = new StackFrame(i);
	}
	StackTrace = StacktraceHelper.GenerateApmStackTrace(stackFrames, _logger,
		_configurationReader, $"Span `{Name}'");
}
else
{
	StackTrace = StacktraceHelper.GenerateApmStackTrace(new StackTrace(true).GetFrames(), _logger,
		_configurationReader, $"Span `{Name}'");
}

I implemented this, but it does not seem to help.

Test case: 100 transactions, each with 10 spans; spanFramesMinDurationInMilliseconds: "-1ms", stackTraceLimit: "10".

Original code (no manual StackFrame creation, only new StackTrace(true).GetFrames()):

|                      Method |     Mean |    Error |   StdDev | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---------------------------- |---------:|---------:|---------:|------------:|------------:|------------:|--------------------:|
| Simple100Transaction10Spans | 65.61 ms | 1.300 ms | 1.216 ms |   5375.0000 |   1250.0000 |           - |            32.95 MB |

Adding this "optimization":

|                      Method |     Mean |    Error |   StdDev | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---------------------------- |---------:|---------:|---------:|------------:|------------:|------------:|--------------------:|
| Simple100Transaction10Spans | 132.0 ms | 1.231 ms | 1.092 ms |   6500.0000 |   1000.0000 |           - |             40.1 MB |

Clearly no improvement, won't commit 😔

@gregkalapos
Contributor Author

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.
Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

I see that the test fails only on Linux - is it possible that on Linux assemblies don't have debugging info with file names and line numbers? It's not clear why this started to happen with this PR, but maybe the CI machines went through some changes? What would be the simplest way to re-run the tests on the current master on the CI machines?

I merged something today, and with that all tests were green: 0b66a32 (the only thing that made it red is that the test coverage is below target, but that's a known bug and the next build will make it good again). I think there is no change between this branch and master in terms of Jenkins and CI related things.

Sooo.. I think the .pdbs are simply missing. Also that test is a little bit unfortunate, because it assumes that .pdbs are present, but that does not necessarily have to be the case.

I'm thinking about dropping that test, since it causes more trouble than added value.

@SergeyKleyman
Contributor

Sooo.. I think the .pdbs are simply missing. Also that test is a little bit unfortunate, because it assumes that .pdbs are present, but that does not necessarily have to be the case.

I'm thinking about dropping that test, since it causes more trouble than added value.

The question is: why are the .pdbs missing only after the changes in this PR?

I would not recommend removing tests before understanding why they fail - to me it contradicts the whole purpose of the tests. Also I would not recommend deferring investigation to a separate issue since it's very risky to merge this PR before understanding why tests fail. But it's your decision to make.

@gregkalapos
Contributor Author

Sooo.. I think the .pdbs are simply missing. Also that test is a little bit unfortunate, because it assumes that .pdbs are present, but that does not necessarily have to be the case.
I'm thinking about dropping that test, since it causes more trouble than added value.

The question is: why are the .pdbs missing only after the changes in this PR?

I would not recommend removing tests before understanding why they fail - to me it contradicts the whole purpose of the tests. Also I would not recommend deferring investigation to a separate issue since it's very risky to merge this PR before understanding why tests fail. But it's your decision to make.

Already talking to @v1v. We try to figure this out.

Nevertheless the test is skipped if there are no .pdb files - that's also the prod behaviour: if there is no pdb we can't show line numbers, so the test is only valid if there are .pdbs.

default: throw new ArgumentException("Unexpected TimeSuffix value", nameof(defaultSuffix));

default:
	valueInMilliseconds = -1;
	return false;
Contributor


Are you sure you want to just swallow incorrect input (which most likely means a bug in the agent's code)? For example, TryParse(String, NumberStyles, IFormatProvider, Int32) throws:

ArgumentException
style is not a NumberStyles value.

Contributor Author


I reverted it, but if we throw an ArgumentException then we have to handle it - not being able to parse these values should not crash the agent, even if it's an agent bug (I hope we agree at least on this).

If that's the case then we should handle it in the caller method, which is basically the same as when the method returns false - so from a single case (not being able to parse, returning false) we made 2 cases: 1) returning false, 2) an exception. Now we have to handle both, but the handling logic is the same.

Here is the commit: cddbba5 - I think this code is very noisy and repetitive.

Nevertheless I think this is not that important - let's keep it this way.
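To illustrate the two failure modes being debated, here is a hedged sketch (`TimeSuffix` and `TryParseMs` are invented names, not the agent's actual API): unparsable user input reports `false`, while an unexpected default suffix - only reachable through an agent bug - throws, as in the snippet under review.

```csharp
using System;

// Hypothetical sketch contrasting the two failure modes discussed: a value
// that fails to parse (user input) reports false, while an unexpected
// defaultSuffix (only reachable via an agent bug) throws, as in the snippet
// under review. TimeSuffix and TryParseMs are invented for illustration.
public enum TimeSuffix { Ms, S }

public static class TimeValueSketch
{
	public static bool TryParseMs(string input, TimeSuffix defaultSuffix, out double valueInMilliseconds)
	{
		double multiplier;
		switch (defaultSuffix)
		{
			case TimeSuffix.Ms: multiplier = 1; break;
			case TimeSuffix.S: multiplier = 1000; break;
			// Agent bug, not user input - fail loudly:
			default: throw new ArgumentException("Unexpected TimeSuffix value", nameof(defaultSuffix));
		}

		var s = input.Trim();
		if (s.EndsWith("ms")) { s = s.Substring(0, s.Length - 2); multiplier = 1; }
		else if (s.EndsWith("s")) { s = s.Substring(0, s.Length - 1); multiplier = 1000; }

		// Invalid user input - report failure instead of crashing the agent:
		if (!double.TryParse(s, out var number))
		{
			valueInMilliseconds = -1;
			return false;
		}

		valueInMilliseconds = number * multiplier;
		return true;
	}
}
```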

Contributor

@SergeyKleyman SergeyKleyman Jul 17, 2019


It's not the same - the default units are passed as a separate parameter and cannot be affected by user input, which means that if the units are invalid, it's a bug in the agent's code (or maybe some mix of assemblies from different incompatible versions, etc.). Unlike an invalid configuration option value, which is just an error in logging terms, a bug with invalid units is a critical event. I think in case of a critical event there's no recovery and a fail-fast approach should be used, which for an application usually means: record the current state to help with investigation and exit as soon as possible to prevent further, possibly much more disastrous, business data corruption. For the agent the second part is trickier - we can consider whether killing the whole application might be an option users want to have, but at the very least the agent should definitely stop doing ANYTHING.

Contributor Author


not being able to parse these values should not crash the agent even if it's an agent bug (I hope we agree at least on this).

Ok, we don't even agree on that one. 🤷‍♂️

I think it makes sense to not spend too much time on it. I'll do whichever you prefer. Let me know if you want further changes.

Contributor


I think you just misread my note - I'm not talking about user input, which of course can contain invalid units; I'm talking about parameters passed by the agent's code, which should be valid unless there is a bug.

Contributor Author


Yes, I know that. I just think we overcomplicate things.

Adapting code to your suggestion.

Contributor


Regarding the way to handle it - I think logging a critical event is okay for now. I added a link to this discussion to #208, which I think should include minimal infrastructure to handle agent bugs (a fail-fast approach or maybe something else). We can discuss it when we get to #208 (not 1.0).

@SergeyKleyman
Contributor

Nevertheless the test is skipped if there are no .pdb files - that's also the prod behaviour: if there is no pdb we can't show line numbers, so the test is only valid if there are .pdbs.

I'm not sure I understand what you mean - if there are no line numbers in prod, then why do we need all the agent production code and the tests dealing with the case when line numbers are available?

@gregkalapos
Contributor Author

Nevertheless the test is skipped if there are no .pdb files - that's also the prod behaviour: if there is no pdb we can't show line numbers, so the test is only valid if there are .pdbs.

I'm not sure I understand what you mean - if there are no line numbers in prod, then why do we need all the agent production code and the tests dealing with the case when line numbers are available?

Maybe I wrote too quickly.

So if there are no pdb files, even with new StackTrace(true) we won't get line number information.
There are 2 cases:

  • pdbs are available for the application binaries: in this case we deliver line numbers and I think we should test this.
  • no pdbs available: in this case we won't be able to deliver line numbers.

My suggestion is that the test also adapts to this and is skipped when there are no pdbs, instead of failing. The reasoning is that without pdbs, making the test fail would be a false positive, since there is no way to make the feature work without the pdbs... that's not a real failure.

Still I agree that we should investigate this and make sure we test this in CI - we are working on that.
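The symbol dependency described above can be demonstrated directly - a minimal, agent-independent sketch (`PdbProbe` is an invented name):

```csharp
using System;
using System.Diagnostics;

// Minimal sketch of the behaviour described above: when no .pdb is available
// for the assembly, StackFrame.GetFileLineNumber() returns 0 and
// GetFileName() returns null, so a test asserting on line numbers is only
// meaningful when symbols are present.
public static class PdbProbe
{
	public static bool CurrentFrameHasLineInfo()
	{
		var frame = new StackTrace(true).GetFrame(0);
		// 0 here is exactly the "no pdb" case - not a capture bug.
		return frame != null && frame.GetFileLineNumber() != 0;
	}
}
```

Note the result depends on whether symbols were emitted for the running build, which is why a fixed assertion on it would be flaky.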

@SergeyKleyman
Contributor

My suggestion is that the test also adapts to this and is skipped when there are no pdbs, instead of failing. The reasoning is that without pdbs, making the test fail would be a false positive, since there is no way to make the feature work without the pdbs... that's not a real failure.

No problem with having the test adapt, but then we need to make sure that CI builds separate variants with and without .pdbs if we want both to work.

@SergeyKleyman
Contributor

SergeyKleyman commented Jul 17, 2019

Also, the way the test adapts cannot be self-referential (namely, inferring that there are no .pdbs from there being no line numbers), because then any bug affecting line number capture will be swallowed. The indication of whether or not there are .pdbs should come from CI, which knows which variant it built (for example via an env var).

@gregkalapos
Contributor Author

Also, the way the test adapts cannot be self-referential (namely, inferring that there are no .pdbs from there being no line numbers), because then any bug affecting line number capture will be swallowed. The indication of whether or not there are .pdbs should come from CI, which knows which variant it built (for example via an env var).

Yeah, ok. I give up on this, won't do it...

Contributor

@SergeyKleyman SergeyKleyman left a comment


I don't recommend merging this PR before understanding why some tests started to fail at CI.

In CI the HttpClient stack trace did not contain any frames with a line number != 0. We moved the stack trace capturing from the Span .ctor to the End method; since then, the stack trace from HTTP methods did not contain line numbers - but only in CI. We turned on extra logging, which shows that `frame?.GetFileLineNumber()` returns 0 for every frame in that case, so there is nothing we can do about it. Therefore we simplified the test to a single manual sync span, which is still aligned with the purpose of the test: we only want to make sure we capture line numbers in at least 1 scenario - and in that case `frame?.GetFileLineNumber()` seems to return real line numbers. So we go with that test.
@codecov-io

Codecov Report

Merging #374 into master will increase coverage by 0.17%.
The diff coverage is 80.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #374      +/-   ##
==========================================
+ Coverage   78.32%   78.49%   +0.17%     
==========================================
  Files          87       79       -8     
  Lines        3225     2637     -588     
  Branches      781      482     -299     
==========================================
- Hits         2526     2070     -456     
+ Misses        543      373     -170     
- Partials      156      194      +38
Impacted Files Coverage Δ
src/Elastic.Apm/Config/ConfigConsts.cs 100% <ø> (ø) ⬆️
src/Elastic.Apm/Logging/IApmLoggingExtensions.cs 87.5% <0%> (+0.83%) ⬆️
src/Elastic.Apm/Model/Transaction.cs 96.55% <100%> (-2.37%) ⬇️
src/Elastic.Apm/Model/CapturedException.cs 33.33% <100%> (-6.67%) ⬇️
src/Elastic.Apm/Model/Span.cs 96.21% <100%> (-1.53%) ⬇️
src/Elastic.Apm/Api/Tracer.cs 95.83% <100%> (-3.48%) ⬇️
src/Elastic.Apm/Metrics/MetricsCollector.cs 77.27% <100%> (ø) ⬆️
...Apm.AspNetCore/Config/MicrosoftExtensionsConfig.cs 77.5% <100%> (-5.84%) ⬇️
src/Elastic.Apm/Model/ExecutionSegmentCommon.cs 95.65% <100%> (-0.6%) ⬇️
src/Elastic.Apm/Helpers/StacktraceHelper.cs 67.18% <65.21%> (-8.95%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b66a32...d8f761a. Read the comment docs.

@gregkalapos
Contributor Author

I don't recommend merging this PR before understanding why some tests started to fail at CI.

It was a single test (HttpClientStackTrace). Fixed now 🎉 - investigation with some details in #381.

@SergeyKleyman
Contributor

Please see #381 (comment)

Remove archiveArtifacts for pdb files (was only used for debugging)
@gregkalapos gregkalapos requested a review from v1v July 18, 2019 14:16
@gregkalapos
Contributor Author

Planning to merge this (#381 has more details about the tests).

@v1v I removed the debugging stuff from the Jenkinsfile - I left the 'find . -name *.pdb', which I think can be generally useful; the other things were just there to archive the pdbs, which gave me the info I needed. Added you as a reviewer - could you pls have a quick look at the Jenkinsfile and make sure we are ok? Thx!
