<regex>: Emit complete _N_str nodes only during NFA construction #6289
Open
muellerj3 wants to merge 2 commits into
Conversation
Author
@microsoft-github-policy-service agree
Contributor
Pull request overview
Updates the <regex> NFA builder to defer creation of _N_str nodes until the builder must commit pending literal characters, preventing incorrect pooling of characters across group boundaries when _N_group nodes are no longer generated (per #5962 / GH-6289).
Changes:
- Reworked the regex NFA builder to buffer literal characters in _Builder3::_Chars and only emit _N_str nodes at commit points during construction.
- Renamed internal parser/builder types (_Parser2/_Builder2 → _Parser3/_Builder3) to avoid ABI breaks after layout changes, and simplified related method naming.
- Added targeted regression tests covering anchors, word boundaries, groups, lookahead, alternation, and captures.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/std/tests/VSO_0000000_regex_use/test.cpp | Adds GH-6289 regression tests for string-node emission behavior across various regex constructs. |
| stl/inc/regex | Introduces _Builder3 buffering/emission logic for _N_str nodes and renames _Parser2/_Builder2 to _Parser3/_Builder3. |
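
For illustration only, the kind of behavior such regression tests guard can be sketched with the public std::regex interface. This is not the test code from the PR; the helper names full_match and first_capture are hypothetical, and the patterns are chosen only to show literal characters sitting directly before and inside a group.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the whole text matches the ECMAScript pattern.
bool full_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    return std::regex_match(text, re);
}

// Hypothetical helper: returns the first capture group of a full match,
// or an empty string if the text does not match.
std::string first_capture(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_match(text, m, re) ? m[1].str() : std::string{};
}
```

If the characters "cd" in ab(?:cd)* were pooled into the preceding literal "ab", the quantifier would no longer apply to them, so a check such as full_match("ab(?:cd)*", "ab") would fail. Checks of this shape catch that class of miscompilation.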
  template <class _FwdIt, class _Elem, class _RxTraits>
- _Node_base* _Builder2<_FwdIt, _Elem, _RxTraits>::_Begin_group() { // add group node
+ _Node_base* _Builder3<_FwdIt, _Elem, _RxTraits>::_Begin_group() { // add group node
+     _Emit_str_node();
Member
/azp run STL-CI
Azure Pipelines successfully started running 1 pipeline(s).
muellerj2
reviewed
May 18, 2026
  _Node_base* _Current;
  regex_constants::syntax_option_type _Flags;
  const _RxTraits& _Traits;
+ vector<_Elem> _Chars;
In hindsight, a string is the more appropriate container here.
This is a side quest to #5962: If the parser no longer generates _N_group nodes, the current node in the NFA is the node just before the group when parsing of the group contents begins. This might be an _N_str node, and if the group starts with some character sequence, the current logic will attach these characters to the _N_str node just before the group. Without the _N_group node, this would effectively move these characters outside of the non-capturing group, thus miscompiling the regex.

For this reason, this PR changes the character sequence pooling logic a bit: Characters now get added to a vector _Chars internal to the _Builder3 class until the following subexpression in the regex results in the attachment of some other kind of node to the NFA, or until the parser must remember the current position in the NFA. At that point, the builder emits the string node and then either appends the other kind of node to the string node or returns a pointer to the string node. This means that string nodes are no longer modified after they have been generated.

Another nice bonus for the future: The parser does not really need the geometric growth logic inside the _Buf class anymore for _N_str nodes (although it is still used for now). I plan to rework the parsing logic for character classes as well and intend to remove all of that resize logic inside the _Buf class at that time.

The _Builder2 and _Parser2 classes are renamed because the layout of the builder class changed, so these classes must be renamed to avoid an ABI break. The version numbers of _Parser3::_Alternative() and _Builder3::_Else_if() are removed because the class renaming renders them unnecessary.
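
The deferred emission described above can be sketched with a toy builder. This is a simplified model under stated assumptions, not the actual <regex> implementation: ToyBuilder, Node, NodeKind, mark, wrap_loop, and describe are all hypothetical names, the NFA is flattened into a vector, and a pending std::string stands in for _Chars.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node kinds for the toy model (not the real _N_* node types).
enum class NodeKind { Str, LoopBegin, LoopEnd };

struct Node {
    NodeKind kind;
    std::string text; // only used by Str nodes
};

class ToyBuilder {
public:
    // Literal characters are only buffered; no node is created yet.
    void add_char(char ch) { pending_.push_back(ch); }

    // The parser needs to remember the current position (e.g. at the start
    // of a group), so pending characters must be committed first.
    std::size_t mark() {
        emit_str_node();
        return nodes_.size();
    }

    // Wrap everything built since `start` in a loop; attaching another kind
    // of node is also a commit point.
    void wrap_loop(std::size_t start) {
        emit_str_node();
        nodes_.insert(nodes_.begin() + static_cast<std::ptrdiff_t>(start),
                      Node{NodeKind::LoopBegin, {}});
        nodes_.push_back(Node{NodeKind::LoopEnd, {}});
    }

    const std::vector<Node>& finish() {
        emit_str_node();
        return nodes_;
    }

private:
    // Emit one complete string node; it is never modified afterwards.
    void emit_str_node() {
        if (!pending_.empty()) {
            nodes_.push_back(Node{NodeKind::Str, std::move(pending_)});
            pending_.clear();
        }
    }

    std::string pending_;
    std::vector<Node> nodes_;
};

// Render the node sequence as text so it can be inspected.
std::string describe(const std::vector<Node>& nodes) {
    std::string out;
    for (const Node& n : nodes) {
        if (!out.empty()) out += ' ';
        switch (n.kind) {
        case NodeKind::Str: out += "str(" + n.text + ")"; break;
        case NodeKind::LoopBegin: out += "loop["; break;
        case NodeKind::LoopEnd: out += "]"; break;
        }
    }
    return out;
}

// Models building ab(?:cd)*e without any group node.
std::string build_example() {
    ToyBuilder b;
    b.add_char('a');
    b.add_char('b');
    const std::size_t group_start = b.mark(); // commits "ab"
    b.add_char('c');
    b.add_char('d');
    b.wrap_loop(group_start); // commits "cd" inside the loop
    b.add_char('e');
    return describe(b.finish());
}
```

In this model, build_example() yields str(ab) loop[ str(cd) ] str(e): because mark() commits "ab" before the group contents are parsed, "cd" cannot be appended to the earlier string node and stays inside the loop.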