This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Regex: reduce allocation slightly, add tests, code cleanup, add parser comments #30632

Merged
ViktorHofer merged 17 commits into dotnet:master from ViktorHofer:RegAlloc on Jun 29, 2018

Conversation

@ViktorHofer
Member

@ViktorHofer ViktorHofer commented Jun 24, 2018

  • Make RegexParser a ref struct and use ValueListBuilder with Span for the options stack instead of a List.
  • Add tests for Group.Synchronized & regex comments #comment and other corner cases that weren't covered especially with right to left.
  • Replace some method entry condition checks with Debug Asserts where applicable.
  • Code cleanup & simplify code in a few cases (readonly fields, ToLower on a Span instead of on single chars, variable declarations moved from the top of methods to where they are used, some comments replaced with XML comments)
  • Add comments to better describe RegexParser behavior.
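
The first bullet can be illustrated with a small sketch. `ValueListBuilder<T>` is internal to corefx, so `SpanStack<T>` and `OptionsStackDemo` below are hypothetical, simplified stand-ins rather than the types used in this PR. The idea shown is a stack that starts on a caller-provided (for example stackalloc'd) span and only rents from `ArrayPool<T>` when that buffer overflows.

```csharp
using System;
using System.Buffers;
using System.Text.RegularExpressions;

// Hypothetical stand-in for corefx's internal ValueListBuilder<T> (assumption, not the real type).
public ref struct SpanStack<T>
{
    private Span<T> _span;   // current backing storage (stack memory or a rented array)
    private T[] _rented;     // non-null only after the initial buffer overflowed
    private int _count;

    public SpanStack(Span<T> initialBuffer)
    {
        _span = initialBuffer;
        _rented = null;
        _count = 0;
    }

    public int Count => _count;

    public void Push(T item)
    {
        if (_count == _span.Length)
            Grow();
        _span[_count++] = item;
    }

    public T Pop() => _span[--_count];

    private void Grow()
    {
        // Rent a larger array, copy the live items, and hand back any previously rented array.
        T[] bigger = ArrayPool<T>.Shared.Rent(Math.Max(4, _span.Length * 2));
        _span.Slice(0, _count).CopyTo(bigger);
        T[] toReturn = _rented;
        _span = _rented = bigger;
        if (toReturn != null)
            ArrayPool<T>.Shared.Return(toReturn);
    }

    public void Dispose()
    {
        if (_rented != null)
        {
            ArrayPool<T>.Shared.Return(_rented);
            _rented = null;
        }
    }
}

public static class OptionsStackDemo
{
    public static RegexOptions LastPushed()
    {
        // Two slots live on the stack; the third push spills into a pooled array.
        Span<RegexOptions> initial = stackalloc RegexOptions[2];
        var stack = new SpanStack<RegexOptions>(initial);
        stack.Push(RegexOptions.None);
        stack.Push(RegexOptions.IgnoreCase);
        stack.Push(RegexOptions.Multiline);
        RegexOptions top = stack.Pop();
        stack.Dispose();
        return top;
    }
}
```

Because the type is a `ref struct`, it cannot be boxed or stored on the heap, which is what makes holding a stackalloc'd `Span<T>` in a field legal. Common patterns that nest only a few option scopes would never leave the initial buffer in this sketch.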

Before (master)

| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---|---|---|---|---|
| RegexCtor | 101.8 ms | 0.5352 ms | 0.4179 ms | 56500.0000 | 113.07 MB |
| RegexCtorIgnoreCase | 98.15 ms | 1.913 ms | 2.278 ms | 66312.5000 | 132.64 MB |

After

| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---|---|---|---|---|
| RegexCtor | 97.78 ms | 0.5337 ms | 0.4731 ms | 53687.5000 | 107.44 MB |
| RegexCtorIgnoreCase | 94.40 ms | 1.030 ms | 0.8600 ms | 62312.5000 | 124.65 MB |

Only minor improvements, I didn't expect anything severe. 5-6% less allocation, ~2-3% better perf.
Contributes towards https://github.com/dotnet/corefx/issues/30507

{
pattern = string.Create(pattern.Length, (pattern, culture), (span, state) =>
{
// We do the ToLower character by character for consistency. With surrogate chars, doing
Member

Is the issue described in this comment no longer relevant?

Member

Do we have a test for it?

Member Author

I don't think the comment ever applied. I don't really follow the point about consistency, and I prefer to operate on the whole string, not just for convenience but also for integrity, for example with the surrogate pairs that the author mentions here.

Member Author

Yes we have several tests that hit these code paths. Our test bed is fortunately very rich.

@@ -43,15 +43,7 @@ public RegexBoyerMoore(string pattern, bool caseInsensitive, bool rightToLeft, C
if (caseInsensitive)
{
pattern = string.Create(pattern.Length, (pattern, culture), (span, state) =>
Member

You do not need to call string.Create at all here now. Just call pattern.ToLower(culture)

Member Author

right, missed that. thanks

Member Author

I was able to completely get rid of the call as it is already lowered by the RegexParser. I changed a few test cases to have more hits of the combination of a prefix and the IgnoreCase option and added an assert.
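
A minimal sketch of the kind of assertion described here, assuming the prefix really does arrive already lowered. `PrefixChecks`, `IsAlreadyLowered`, and `UsePrefix` are hypothetical names for illustration and are not taken from the PR.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

public static class PrefixChecks
{
    // Returns true when lowering every char under the given culture would leave the prefix unchanged.
    public static bool IsAlreadyLowered(string prefix, CultureInfo culture)
    {
        TextInfo textInfo = culture.TextInfo;
        foreach (char c in prefix)
        {
            if (textInfo.ToLower(c) != c)
                return false;
        }
        return true;
    }

    // Hypothetical consumer: instead of lowering again (and allocating a new string),
    // it only verifies the precondition in debug builds.
    public static void UsePrefix(string prefix, bool caseInsensitive, CultureInfo culture)
    {
        Debug.Assert(!caseInsensitive || IsAlreadyLowered(prefix, culture),
            "Case-insensitive prefixes are expected to be lowered by the caller.");
        // ... build the search tables from prefix here ...
    }
}
```

The check compiles away in release builds, so the precondition costs nothing there while still catching a caller that forgets to lower the prefix during testing.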


private RegexOptions _options;
private List<RegexOptions> _optionsStack;
private RegexOptions _option;
Member

Can this have more than one bit set? If it is the case, _options is a better name for it.

Member Author

yes it usually has multiple bits set. You are right, plural makes more sense then. I'll revert my change.

RegexParser p;
RegexNode root;
Span<RegexOptions> optionSpan = stackalloc RegexOptions[OptionStackDefaultSize];
var parser = new RegexParser(pattern, option, culture, optionSpan);
Member

Is the Regex parsing recursive?

If it is, is there a problem with consuming a lot more stack than before and running into stackoverflow?

Member Author

@ViktorHofer ViktorHofer Jun 24, 2018

> Is the Regex parsing recursive?

The whole parsing logic happens in ScanRegex, which then calls helper functions for different branches. There should not be any recursion involved in the parsing. Also, we have very good test coverage in S.T.RegularExpressions (> 90% code and branch coverage), and with #29178 we added a lot of test cases for the RegexParser.

{
ReadOnlySpan<char> input = state._pattern.AsSpan(pos, cch);

// We do the ToLower character by character for consistency. With surrogate chars, doing
Member

Same concern as in the other place

Member

> Yes we have several tests that hit these code paths. Our test bed is fortunately very rich.

I do not see any tests in the test bed that exercise surrogates that this comment is talking about.

Member Author

Because our engine operates on UTF16 code units and does not support surrogate pairs?

Member

@jkotas jkotas Jun 24, 2018

We should still have tests for it, even if it is documented as unsupported in the documentation. E.g. we should not be crashing.

The comment you are removing suggests that handling of surrogates showed up as a problem in the past and that somebody tried to make it better/more predictable at least.

Member Author

@ViktorHofer ViktorHofer Jun 24, 2018

Right, there weren't any tests that exercised our misinterpretation. I added a positive and a negative test for surrogate pairs. The positive one works around the code unit problem, and the negative one expects an ArgumentException because of the way we interpret the input.

> The comment you are removing suggests that handling of surrogates showed up as a problem in the past and that somebody tried to make it better/more predictable at least.

Right, and that's why I believe we should handle the lowering of the (sub-)pattern in its entirety and not on a character basis. If we ever switch from our current UTF-16 handling to something else, the result should still be valid. To check my sanity I just compared the results of two ToLower operations with surrogate values, one on an entire string and one per character, and the results are identical, of course.
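
The comparison described above can be sketched roughly as follows. `LoweringComparison` and its methods are hypothetical names, and the sample input is an assumption chosen for illustration; it is not the input used in the PR's tests.

```csharp
using System;
using System.Globalization;

public static class LoweringComparison
{
    // Lowers the string in one call.
    public static string LowerWhole(string s, CultureInfo culture) => s.ToLower(culture);

    // Lowers one UTF-16 code unit at a time, as the removed loop did.
    public static string LowerPerChar(string s, CultureInfo culture)
    {
        TextInfo textInfo = culture.TextInfo;
        char[] buffer = new char[s.Length];
        for (int i = 0; i < s.Length; i++)
            buffer[i] = textInfo.ToLower(s[i]);
        return new string(buffer);
    }

    public static bool SameResult(string s, CultureInfo culture) =>
        string.Equals(LowerWhole(s, culture), LowerPerChar(s, culture), StringComparison.Ordinal);
}
```

For an input such as `"A\uD83D\uDE00B"`, where the surrogate pair encodes a character without case mappings, both approaches leave the pair untouched and only lower the ASCII letters. Whether the two approaches agree for cased supplementary characters may depend on the runtime version and globalization backend, which this sketch does not try to settle.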

@jkotas
Member

jkotas commented Jun 24, 2018

What is the RegexCtor benchmark doing? I am not able to find it in https://github.com/dotnet/corefx/tree/master/src/System.Text.RegularExpressions/tests/Performance

@ViktorHofer
Member Author

ViktorHofer commented Jun 24, 2018

I use the following test code https://gist.github.com/ViktorHofer/2b8338fc164ea105f79a456856ff43cb and run it locally with this script:

.\CoreRun ..\..\..\tools\csc.dll /noconfig /optimize /r:System.Private.Corelib.dll /r:System.Runtime.dll /r:System.Runtime.Extensions.dll /r:System.Console.dll /r:System.Text.RegularExpressions.dll /r:System.Collections.dll /r:BenchmarkDotNet.dll /r:BenchmarkDotNet.Core.dll /out:regex-perf.dll regex-perf.cs
.\CoreRun regex-perf.dll

// (?...

// get the options if it's an option construct (?cimsx-cimsx...)
// get the options if it's an options construct (?cimsx-cimsx...)
Member

an options - accidental find&replace ?

Member Author

thanks

Prefix patterns which are passed to RegexBoyerMoore are already
lowercased by the parser. Remove the redundant ToLower() call and assert
the patterns lowercase state
@ViktorHofer
Member Author

@dotnet-bot test Linux arm64 Release Build
@dotnet-bot test Linux x64 Release Build
@dotnet-bot test Linux-musl x64 Debug Build

/// </summary>
public bool IsMatch(string text, int index, int beglimit, int endlimit)
{
// Length-cognizance optimization
Member

Are you sure that this optimization is valid for case-insensitive mode?

Member Author

that optimization was already in place, I did not change it: http://source.dot.net/#System.Text.RegularExpressions/System/Text/RegularExpressions/RegexBoyerMoore.cs,257

but yes you could be right, I'm currently debugging the code paths, let me see...

Member Author

Sorry, I discovered that the code I removed is indeed used, so I reverted my changes and your comment shouldn't apply anymore. I added tests to cover the hidden paths.

Add tests for rtl anchored patterns.
@ViktorHofer
Member Author

@dotnet-bot test Linux arm Release Build
@dotnet-bot test Tizen armel Debug Build

@danmoseley
Member

Any more feedback @jkotas?

@jkotas
Member

jkotas commented Jun 27, 2018

I do not have more feedback, however I have not reviewed this in detail.

@MarcoRossignoli
Member

> I use the following test code https://gist.github.com/ViktorHofer/2b8338fc164ea105f79a456856ff43cb and run it locally with this script:
> .\CoreRun ..\..\..\tools\csc.dll /noconfig /optimize /r:System.Private.Corelib.dll /r:System.Runtime.dll /r:System.Runtime.Extensions.dll /r:System.Console.dll /r:System.Text.RegularExpressions.dll /r:System.Collections.dll /r:BenchmarkDotNet.dll /r:BenchmarkDotNet.Core.dll /out:regex-perf.dll regex-perf.cs
> .\CoreRun regex-perf.dll

@ViktorHofer cool and lightweight approach! You could add it at the bottom of the "benchmarking with BDN" doc.

@ViktorHofer
Member Author

We don't recommend using this approach as it involves using internals (csc) and manually referencing the ref assemblies. That said, I documented it a few months ago under advanced inner loop in docs.

@MarcoRossignoli
Member

MarcoRossignoli commented Jun 28, 2018

> We don't recommend using this approach as it involves using internals (csc) and manually referencing the ref assemblies. That said, I documented it a few months ago under advanced inner loop in docs.

Understood. However, it's an "alternative" until BDN benchmarking is fixed. Found the doc, thanks a lot!

@ViktorHofer
Member Author

@danmosemsft @stephentoub do you mind reviewing / approving?

for (int i = 0; i < input.Length; i++)
span[i] = textInfo.ToLower(input[i]);
});
state._pattern.AsSpan(pos, cch).ToLower(span, state._culture));
Member

This is capturing pos and cch. You need to use the one that's in state.

Member

@stephentoub stephentoub left a comment

One comment that needs to be addressed, otherwise LGTM.

@ViktorHofer
Member Author

Thanks a lot!

@danmoseley
Member

Is there a missing test, based on your last commit here?

@ViktorHofer
Member Author

ViktorHofer commented Jun 29, 2018

No, that's an internal behavior that can't be captured by a test. Not using the locals from the transferred state didn't cause an error but probably has diminished the perf improvements.
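
To make the point about captured locals concrete, here is a minimal sketch of the `string.Create` state pattern. `SubstringLowering.LowerSlice` is a hypothetical helper and not the parser code from this PR. Everything the callback needs travels through the state tuple, so the lambda does not close over `pos`, `cch`, or the other locals.

```csharp
using System;
using System.Globalization;

public static class SubstringLowering
{
    // Lowers pattern[pos .. pos + cch) directly into the newly created string's buffer.
    public static string LowerSlice(string pattern, int pos, int cch, CultureInfo culture)
    {
        return string.Create(cch, (pattern, pos, cch, culture), (span, state) =>
            // Only 'state' is read here. Writing pattern.AsSpan(pos, cch) instead would
            // still compile and give the same string, but the lambda would capture the
            // locals and allocate a closure on every call.
            state.pattern.AsSpan(state.pos, state.cch).ToLower(span, state.culture));
    }
}
```

For example, `SubstringLowering.LowerSlice("ABCdef", 1, 3, CultureInfo.InvariantCulture)` returns `"bcd"`. On C# 9 or later, marking the lambda `static` makes the compiler reject accidental captures, which turns this class of mistake into a compile-time error.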

@ViktorHofer ViktorHofer merged commit 2f259de into dotnet:master Jun 29, 2018
@ViktorHofer ViktorHofer deleted the RegAlloc branch June 29, 2018 15:47
@karelz karelz added this to the 3.0 milestone Jul 8, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…r comments (dotnet/corefx#30632)

* RegexParser & optionsstack ref

* Add test coverage for Group.Synchronized

* Adjust options mode test case

* Add inline comment '#' test branch

* Add comments

* Replace manual ToLower calls by Span.ToLower

* Make applicable fields readonly in parser

* Change to Assert to reduce an if check in one branch

* Code formatting

* Avoid string allocation when IgnoreCase set

Prefix patterns which are passed to RegexBoyerMoore are already
lowercased by the parser. Remove the redundant ToLower() call and assert
the patterns lowercase state

* Add surrogate pair positive & negative tests

* Add test cases for rtl anchor

Commit migrated from dotnet/corefx@2f259de