Regex: reduce allocation slightly, add tests, code cleanup, add parser comments#30632
Conversation
| { | ||
| pattern = string.Create(pattern.Length, (pattern, culture), (span, state) => | ||
| { | ||
| // We do the ToLower character by character for consistency. With surrogate chars, doing |
There was a problem hiding this comment.
Is the issue described in this comment no longer relevant?
There was a problem hiding this comment.
I don't think the comment ever applied, I don't really get the part with the consistency and I prefer to operate on the whole string not just because of convenience but also because of integrity, like the surrogate pairs that the author mentioned here.
There was a problem hiding this comment.
Yes we have several tests that hit these code paths. Our test bed is fortunately very rich.
| @@ -43,15 +43,7 @@ public RegexBoyerMoore(string pattern, bool caseInsensitive, bool rightToLeft, C | |||
| if (caseInsensitive) | |||
| { | |||
| pattern = string.Create(pattern.Length, (pattern, culture), (span, state) => | |||
There was a problem hiding this comment.
You do not need to call string.Create at all here now. Just call pattern.ToLower(culture)
There was a problem hiding this comment.
right, missed that. thanks
There was a problem hiding this comment.
I was able to completely get rid of the call as it is already lowered by the RegexParser. I changed a few test cases to have more hits of the combination of a prefix and the IgnoreCase option and added an assert.
|
|
||
| private RegexOptions _options; | ||
| private List<RegexOptions> _optionsStack; | ||
| private RegexOptions _option; |
There was a problem hiding this comment.
Can this have more than one bit set? If it is the case, _options is a better name for it.
There was a problem hiding this comment.
yes it usually has multiple bits set. You are right, plural makes more sense then. I'll revert my change.
| RegexParser p; | ||
| RegexNode root; | ||
| Span<RegexOptions> optionSpan = stackalloc RegexOptions[OptionStackDefaultSize]; | ||
| var parser = new RegexParser(pattern, option, culture, optionSpan); |
There was a problem hiding this comment.
Is the Regex parsing recursive?
If it is, is there a problem with consuming a lot more stack than before and running into stackoverflow?
There was a problem hiding this comment.
Is the Regex parsing recursive?
The whole parsing logic happens in ScanRegex which then calls helper functions for different branches. There should not be any recursion involved in the parsing. Also we have a very good test coverage in S.T.RegularExpressions (> 90% code and branch coverage) and with #29178 we added a lot of tests cases for the RegexParser.
| { | ||
| ReadOnlySpan<char> input = state._pattern.AsSpan(pos, cch); | ||
|
|
||
| // We do the ToLower character by character for consistency. With surrogate chars, doing |
There was a problem hiding this comment.
Same concern as in the other place
There was a problem hiding this comment.
Yes we have several tests that hit these code paths. Our test bed is fortunately very rich.
I do not see any tests in the test bed that exercise surrogates that this comment is talking about.
There was a problem hiding this comment.
Because our engine operates on UTF16 code units and does not support surrogate pairs?
There was a problem hiding this comment.
There was a problem hiding this comment.
We should still have tests for it, even if it is documented as unsupported in the documentation. E.g. we should not be crashing.
The comment you are removing suggests that handling of surrogates showed up as a problem in the past and that somebody tried to make it better/more predictable at least.
There was a problem hiding this comment.
Right, there weren't any tests that exercised our misinterpretation. I added a positive and negative test for surrogate pairs. The positive one is working around the code unit problem and the negative one is expecting the ArgumentException because of the way interpret the input.
The comment you are removing suggests that handling of surrogates showed up as a problem in the past and that somebody tried to make it better/more predictable at least.
Right and that's why I believe we should handle the lowering of the (sub-)pattern in its entirety and not on a character basis. If we ever switch from our current UTF-16 handling to something else the result should still be valid. To check my sanity I just compared the results of two ToLower operations - one on an entire string and one per character - with surrogate values and the results are the identical, of course.
|
What is the |
|
I use the following test code https://gist.github.com/ViktorHofer/2b8338fc164ea105f79a456856ff43cb and run it locally with this script: |
| // (?... | ||
|
|
||
| // get the options if it's an option construct (?cimsx-cimsx...) | ||
| // get the options if it's an options construct (?cimsx-cimsx...) |
There was a problem hiding this comment.
an options - accidental find&replace ?
Prefix patterns which are passed to RegexBoyerMoore are already lowercased by the parser. Remove the redundant ToLower() call and assert the patterns lowercase state
|
@dotnet-bot test Linux arm64 Release Build |
| /// </summary> | ||
| public bool IsMatch(string text, int index, int beglimit, int endlimit) | ||
| { | ||
| // Length-cognizance optimization |
There was a problem hiding this comment.
Are you sure that this optimization is valid for case-insensitive mode?
There was a problem hiding this comment.
that optimization was already in place, I did not change it: http://source.dot.net/#System.Text.RegularExpressions/System/Text/RegularExpressions/RegexBoyerMoore.cs,257
but yes you could be right, I'm currently debugging the code paths, let me see...
There was a problem hiding this comment.
sorry I discovered that the code I removed is indeed used therefore I reverted my changes and your comment shouldn't apply anymore. I added tests to cover the hidden paths.
Add tests for rtl anchored patterns.
|
@dotnet-bot test Linux arm Release Build |
|
Any more feedback @jkotas? |
|
I do not have more feedback, however I have not reviewed this in detail. |
@ViktorHofer cool and light way!You could add this way at the botton of benchmarking with BDN doc. |
|
We don't recommend using this approach as it involves using internals (csc) and manually referencing the ref assemblies. That said, I documented it a few months ago under advanced inner loop in docs. |
Understood, however it's an "alternative" until BDN benchmarking will be fixed. Found doc thank's a lot! |
|
@danmosemsft @stephentoub do you mind reviewing / approving? |
| for (int i = 0; i < input.Length; i++) | ||
| span[i] = textInfo.ToLower(input[i]); | ||
| }); | ||
| state._pattern.AsSpan(pos, cch).ToLower(span, state._culture)); |
There was a problem hiding this comment.
This is capturing pos and cch. You need to use the one that's in state.
stephentoub
left a comment
There was a problem hiding this comment.
One comment that needs to be addressed, otherwise LGTM.
|
Thanks a lot! |
|
Is there a missing test, based on your last commit here |
|
No, that's an internal behavior that can't be captured by a test. Not using the locals from the transferred state didn't cause an error but probably has diminished the perf improvements. |
…r comments (dotnet/corefx#30632) * RegexParser & optionsstack ref * Add test coverage for Group.Synchronized * Adjust options mode test case * Add inline comment '#' test branch * Add comments * Replace manual ToLower calls by Span.ToLower * Make applicable fields readonly in parser * Change to Assert to reduce an if check in one branch * Code formatting * Avoid string allocation when IgnoreCase set Prefix patterns which are passed to RegexBoyerMoore are already lowercased by the parser. Remove the redundant ToLower() call and assert the patterns lowercase state * Add surrogate pair positive & negative tests * Add test cases for rtl anchor Commit migrated from dotnet/corefx@2f259de
#commentand other corner cases that weren't covered especially with right to left.Before (master)
After
Only minor improvements, I didn't expect anything severe. 5-6% less allocation, ~2-3% better perf.
Contributes towards https://github.com/dotnet/corefx/issues/30507