Add StackTraceLimit and SpanFramesMinDurationInMilliseconds configs#374

Merged
gregkalapos merged 28 commits into elastic:master from gregkalapos:ControlStackTrace
Jul 18, 2019

Conversation

@gregkalapos
Contributor

@gregkalapos gregkalapos commented Jul 16, 2019

Solves #307

Added SpanFramesMinDurationInMilliseconds and StackTraceLimit. Those work the same as in other agents.

Main benefit:

  • By setting either of those settings to 0, users can disable stack trace collection, which carries a performance overhead. For numbers see this comment.

The StackTraceLimit is implemented by capturing the whole stack trace and trimming it manually - so there is no performance difference between capturing only 1 stack frame and capturing all frames (see the discussion here). Nevertheless, this keeps us in sync with other agents in terms of configs, and with this PR users get the option to control stack frame collection and can turn it off by setting these configs to 0.
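To illustrate the semantics described above, here is a minimal sketch (not the agent's actual code - `CaptureFrames` and its parameter names are invented for illustration):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical sketch of how the two settings gate stack trace capture.
// CaptureFrames and its parameters are invented names, not the agent's API.
public static class StackTraceCaptureSketch
{
	public static StackFrame[] CaptureFrames(int stackTraceLimit, double spanFramesMinDurationMs, double spanDurationMs)
	{
		// Setting either config to 0 disables capture entirely - this is the
		// main perf win described above.
		if (stackTraceLimit == 0 || spanFramesMinDurationMs == 0)
			return Array.Empty<StackFrame>();

		// A negative min duration means "always capture"; a positive one means
		// "only capture for spans at least that slow".
		if (spanFramesMinDurationMs > 0 && spanDurationMs < spanFramesMinDurationMs)
			return Array.Empty<StackFrame>();

		// The whole trace is captured and then trimmed manually, so the limit
		// does not make the capture itself cheaper.
		var frames = new StackTrace(true).GetFrames() ?? Array.Empty<StackFrame>();
		return stackTraceLimit < 0 ? frames : frames.Take(stackTraceLimit).ToArray();
	}
}
```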

@gregkalapos gregkalapos self-assigned this Jul 16, 2019
@gregkalapos
Contributor Author

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.

Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

Contributor

@SergeyKleyman SergeyKleyman left a comment


LGTM other than a few minor issues and one major one - a possible performance issue with parsing the configuration on every check of whether a stack trace should be taken.

@SergeyKleyman
Contributor

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.

Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

I see that the test fails only on Linux - is it possible that on Linux assemblies don't have debugging info with file names and line numbers? It's not clear why this started to happen with this PR, but maybe the CI machines went through some changes? What would be the simplest way to re-run the tests on the current master on the CI machines?

@gregkalapos
Contributor Author

gregkalapos commented Jul 16, 2019

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.
Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

I see that the test fails only on Linux - is it possible that on Linux assemblies don't have debugging info with file names and line numbers? It's not clear why this started to happen with this PR, but maybe the CI machines went through some changes? What would be the simplest way to re-run the tests on the current master on the CI machines?

I merged something today, and with that all tests were green: 0b66a32 (the only thing that made it red is that the test coverage is below target, but that's a known bug and the next build will make it good again). I think there is no change between this branch and master in terms of Jenkins and CI related things.

And spanFramesMinDurationInMilliseconds -1 for old tests to not change behaviour.
@gregkalapos
Contributor Author

gregkalapos commented Jul 16, 2019

The StackFrame class has a very promising feature - see here

With that constructor it's possible to create single stack frames instead of getting the whole stack trace with new StackTrace(true).GetFrames().

Which sounds very promising, since e.g. if we only need 10 frames we could just create the 10 StackFrame instances.

So, something like this in Span.End():

if (_configurationReader.StackTraceLimit > 0)
{
	// not complete - but shows the point...
	var stackFrames = new StackFrame[_configurationReader.StackTraceLimit];
	for (var i = 0; i < _configurationReader.StackTraceLimit; i++)
	{
		stackFrames[i] = new StackFrame(i);
	}
	StackTrace = StacktraceHelper.GenerateApmStackTrace(stackFrames, _logger,
		_configurationReader, $"Span `{Name}'");
}
else
{
	StackTrace = StacktraceHelper.GenerateApmStackTrace(new StackTrace(true).GetFrames(), _logger,
		_configurationReader, $"Span `{Name}'");
}

I implemented this, but it does not seem to help.

Test case: 100 transactions, each with 10 spans; spanFramesMinDurationInMilliseconds: "-1ms", stackTraceLimit: "10".

Original code (no manual StackFrame creation, only new StackTrace(true).GetFrames()):

|                      Method |     Mean |    Error |   StdDev | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---------------------------- |---------:|---------:|---------:|------------:|------------:|------------:|--------------------:|
| Simple100Transaction10Spans | 65.61 ms | 1.300 ms | 1.216 ms |   5375.0000 |   1250.0000 |           - |            32.95 MB |

Adding this "optimization":

|                      Method |     Mean |    Error |   StdDev | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---------------------------- |---------:|---------:|---------:|------------:|------------:|------------:|--------------------:|
| Simple100Transaction10Spans | 132.0 ms | 1.231 ms | 1.092 ms |   6500.0000 |   1000.0000 |           - |             40.1 MB |

Clearly no improvement, won't commit 😔

@gregkalapos
Contributor Author

Fighting with CI - it seems that in CI we don't get line number info and for that reason we have 1 failing test.
Can't repro locally - trying to figure this out - I don't see yet how this PR changes anything in that area 🤷‍♂️

I see that the test fails only on Linux - is it possible that on Linux assemblies don't have debugging info with file names and line numbers? It's not clear why this started to happen with this PR, but maybe the CI machines went through some changes? What would be the simplest way to re-run the tests on the current master on the CI machines?

I merged something today, and with that all tests were green: 0b66a32 (the only thing that made it red is that the test coverage is below target, but that's a known bug and the next build will make it good again). I think there is no change between this branch and master in terms of Jenkins and CI related things.

Sooo.. I think the .pdbs are simply missing. Also that test is a little bit unfortunate, because it assumes that .pdbs are present, but that does not necessarily have to be the case.

I'm thinking about dropping that test, since it causes more trouble than added value.

@SergeyKleyman
Contributor

Sooo.. I think the .pdbs are simply missing. Also that test is a little bit unfortunate, because it assumes that .pdbs are present, but that does not necessarily have to be the case.

I'm thinking about dropping that test, since it causes more trouble than added value.

The question is: why are the .pdbs missing only after the changes in this PR?

I would not recommend removing tests before understanding why they fail - to me it contradicts the whole purpose of the tests. Also I would not recommend deferring investigation to a separate issue since it's very risky to merge this PR before understanding why tests fail. But it's your decision to make.

@gregkalapos
Contributor Author

Sooo.. I think the .pdbs are simply missing. Also that test is a little bit unfortunate, because it assumes that .pdbs are present, but that does not necessarily have to be the case.
I'm thinking about dropping that test, since it causes more trouble than added value.

The question is: why are the .pdbs missing only after the changes in this PR?

I would not recommend removing tests before understanding why they fail - to me it contradicts the whole purpose of the tests. Also I would not recommend deferring investigation to a separate issue since it's very risky to merge this PR before understanding why tests fail. But it's your decision to make.

Already talking to @v1v. We try to figure this out.

Nevertheless the test is skipped if there are no .pdb files - that's also the prod behaviour: if there is no pdb we can't show line numbers, so the test is only valid if there are .pdbs.

default: throw new ArgumentException("Unexpected TimeSuffix value", nameof(defaultSuffix));

default:
	valueInMilliseconds = -1;
	return false;
Contributor


Are you sure you want to just swallow incorrect input (which most likely means a bug in the agent's code)? For example, TryParse(String, NumberStyles, IFormatProvider, Int32) throws:

ArgumentException
style is not a NumberStyles value.

Contributor Author


I reverted it, but if we throw an ArgumentException then we have to handle it - not being able to parse these values should not crash the agent, even if it's an agent bug (I hope we agree at least on this).

If that's the case then we should handle it in the caller method, which is basically the same as when the method returns false - so from a single case (not being able to parse, returning false) we made 2 cases: 1) returning false, 2) an exception. Now we have to handle both, but the handling logic is the same.

Here is the commit: cddbba5 - I think this code is very noisy and repetitive.

Nevertheless I think this is not that important - let's keep it this way.
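To illustrate the two failure modes being debated, here is a hedged sketch (`TimeSuffix` and `TryParseMs` are invented names, not the agent's actual API): unparsable user input reports `false`, while an unexpected default suffix - only reachable through an agent bug - throws, as in the snippet under review.

```csharp
using System;

// Hypothetical sketch contrasting the two failure modes discussed: a value
// that fails to parse (user input) reports false, while an unexpected
// defaultSuffix (only reachable via an agent bug) throws, as in the snippet
// under review. TimeSuffix and TryParseMs are invented for illustration.
public enum TimeSuffix { Ms, S }

public static class TimeValueSketch
{
	public static bool TryParseMs(string input, TimeSuffix defaultSuffix, out double valueInMilliseconds)
	{
		double multiplier;
		switch (defaultSuffix)
		{
			case TimeSuffix.Ms: multiplier = 1; break;
			case TimeSuffix.S: multiplier = 1000; break;
			// Agent bug, not user input - fail loudly:
			default: throw new ArgumentException("Unexpected TimeSuffix value", nameof(defaultSuffix));
		}

		var s = input.Trim();
		if (s.EndsWith("ms")) { s = s.Substring(0, s.Length - 2); multiplier = 1; }
		else if (s.EndsWith("s")) { s = s.Substring(0, s.Length - 1); multiplier = 1000; }

		// Invalid user input - report failure instead of crashing the agent:
		if (!double.TryParse(s, out var number))
		{
			valueInMilliseconds = -1;
			return false;
		}

		valueInMilliseconds = number * multiplier;
		return true;
	}
}
```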

Contributor

@SergeyKleyman SergeyKleyman Jul 17, 2019


It's not the same - the default units are passed as a separate parameter and cannot be affected by user input, which means that if the units are invalid, it's a bug in the agent's code (or maybe some mix of assemblies from different incompatible versions, etc.). Unlike an invalid configuration option value, which is just an error in logging terms, a bug with invalid units is a critical event. I think in case of a critical event there's no recovery and a fail-fast approach should be used, which for an application usually means: record the current state to help with investigation and exit as soon as possible to prevent further, possibly much more disastrous, business data corruption. For the agent the second part is trickier - we can consider whether killing the whole application might be an option users want to have, but at the very least the agent should definitely stop doing ANYTHING.

Contributor Author


not being able to parse these values should not crash the agent even if it's an agent bug (I hope we agree at least on this).

Ok, we don't even agree on that one. 🤷‍♂️

I think it makes sense to not spend too much time on it. I'll do whichever you prefer. Let me know if you want further changes.

Contributor


I think you just misread my note - I'm not talking about user input, which of course can contain invalid units; I'm talking about parameters passed by the agent's code, which should be valid unless there is a bug.

Contributor Author


Yes, I know that. I just think we overcomplicate things.

Adapting code to your suggestion.

Contributor


Regarding the way to handle it - I think logging a critical event is okay for now. I added a link to this discussion to #208, which I think should include minimal infrastructure to handle agent bugs (a fail-fast approach or maybe something else). We can discuss it when we get to #208 (not 1.0).

@SergeyKleyman
Contributor

Nevertheless the test is skipped if there are no .pdb files - that's also the prod behaviour: if there is no pdb we can't show line numbers, so the test is only valid if there are .pdbs.

I'm not sure I understand what you mean - if there are no line numbers in prod, then why do we need all the agent production code and the tests dealing with the case when line numbers are available?

@gregkalapos
Contributor Author

Nevertheless the test is skipped if there are no .pdb files - that's also the prod behaviour: if there is no pdb we can't show line numbers, so the test is only valid if there are .pdbs.

I'm not sure I understand what you mean - if there are no line numbers in prod, then why do we need all the agent production code and the tests dealing with the case when line numbers are available?

Maybe I wrote too quickly.

So if there are no pdb files, even with new StackTrace(true) we won't get line number information.
There are 2 cases:

  • pdbs are available for the application binaries: in this case we deliver line numbers and I think we should test this.
  • no pdbs available: in this case we won't be able to deliver line numbers.

My suggestion is that the test also adapts to this and is skipped when there are no pdbs, instead of failing. The reasoning is that without pdbs, making the test fail would be a false positive, since there is no way to make the feature work without the pdbs... that's not a real failure.

Still I agree that we should investigate this and make sure we test this in CI - we are working on that.
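The symbol dependency described above can be demonstrated directly - a minimal, agent-independent sketch (`PdbProbe` is an invented name):

```csharp
using System;
using System.Diagnostics;

// Minimal sketch of the behaviour described above: when no .pdb is available
// for the assembly, StackFrame.GetFileLineNumber() returns 0 and
// GetFileName() returns null, so a test asserting on line numbers is only
// meaningful when symbols are present.
public static class PdbProbe
{
	public static bool CurrentFrameHasLineInfo()
	{
		var frame = new StackTrace(true).GetFrame(0);
		// 0 here is exactly the "no pdb" case - not a capture bug.
		return frame != null && frame.GetFileLineNumber() != 0;
	}
}
```

Note the result depends on whether symbols were emitted for the running build, which is why a fixed assertion on it would be flaky.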

@SergeyKleyman
Contributor

My suggestion is that the test also adapts to this and is skipped when there are no pdbs, instead of failing. The reasoning is that without pdbs, making the test fail would be a false positive, since there is no way to make the feature work without the pdbs... that's not a real failure.

No problem with having the test adapt, but then we need to make sure that CI builds separate variants with and without .pdbs if we want both to work.

@SergeyKleyman
Contributor

SergeyKleyman commented Jul 17, 2019

Also, the way the test adapts cannot be self-referential (namely, inferring that there are no .pdbs from there being no line numbers), because then any bug affecting line number capture will be swallowed. The indication of whether or not there are .pdbs should come from CI, which knows which variant it built (for example via an env var).

@gregkalapos
Contributor Author

Also, the way the test adapts cannot be self-referential (namely, inferring that there are no .pdbs from there being no line numbers), because then any bug affecting line number capture will be swallowed. The indication of whether or not there are .pdbs should come from CI, which knows which variant it built (for example via an env var).

Yeah, ok. I give up on this, won't do it...

Contributor

@SergeyKleyman SergeyKleyman left a comment


I don't recommend merging this PR before understanding why some tests started to fail at CI.

In CI the HttpClient stack trace did not contain any frames with a line number != 0. We moved the stack trace capturing from the Span .ctor to the End method; since then, the stack trace from HTTP methods did not contain line numbers - but only in CI. We turned on extra logging, which shows that `frame?.GetFileLineNumber()` returns 0 for every frame in that case, so there is nothing we can do about it. Therefore we simplified the test to a single manual sync span, which is still aligned with the purpose of the test: we only want to make sure we capture line numbers in at least 1 scenario - and in that case `frame?.GetFileLineNumber()` seems to return real line numbers. So we go with that test.
@codecov-io

Codecov Report

Merging #374 into master will increase coverage by 0.17%.
The diff coverage is 80.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #374      +/-   ##
==========================================
+ Coverage   78.32%   78.49%   +0.17%     
==========================================
  Files          87       79       -8     
  Lines        3225     2637     -588     
  Branches      781      482     -299     
==========================================
- Hits         2526     2070     -456     
+ Misses        543      373     -170     
- Partials      156      194      +38
Impacted Files Coverage Δ
src/Elastic.Apm/Config/ConfigConsts.cs 100% <ø> (ø) ⬆️
src/Elastic.Apm/Logging/IApmLoggingExtensions.cs 87.5% <0%> (+0.83%) ⬆️
src/Elastic.Apm/Model/Transaction.cs 96.55% <100%> (-2.37%) ⬇️
src/Elastic.Apm/Model/CapturedException.cs 33.33% <100%> (-6.67%) ⬇️
src/Elastic.Apm/Model/Span.cs 96.21% <100%> (-1.53%) ⬇️
src/Elastic.Apm/Api/Tracer.cs 95.83% <100%> (-3.48%) ⬇️
src/Elastic.Apm/Metrics/MetricsCollector.cs 77.27% <100%> (ø) ⬆️
...Apm.AspNetCore/Config/MicrosoftExtensionsConfig.cs 77.5% <100%> (-5.84%) ⬇️
src/Elastic.Apm/Model/ExecutionSegmentCommon.cs 95.65% <100%> (-0.6%) ⬇️
src/Elastic.Apm/Helpers/StacktraceHelper.cs 67.18% <65.21%> (-8.95%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b66a32...d8f761a. Read the comment docs.

@gregkalapos
Contributor Author

I don't recommend merging this PR before understanding why some tests started to fail at CI.

It was a single test (HttpClientStackTrace). Fixed now 🎉 - investigation with some details in #381.

@SergeyKleyman
Contributor

Please see #381 (comment)

Remove archiveArtifacts for pdb files (was only used for debugging)
@gregkalapos gregkalapos requested a review from v1v July 18, 2019 14:16
@gregkalapos
Contributor Author

Planning to merge this (#381 has more details about the tests).

@v1v I removed the debugging stuff from the Jenkinsfile - I left the 'find . -name *.pdb', which I think can be generally useful; the other things were just there to archive the pdbs, which gave me the info I needed. Added you as a reviewer - could you pls have a quick look at the Jenkinsfile and make sure we are ok? Thx!
