Skip to content

<regex>: Emit complete _N_str nodes only during NFA construction#6289

Open
muellerj3 wants to merge 2 commits into
microsoft:mainfrom
muellerj2:regex-emit-complete-string-nodes-only
Open

<regex>: Emit complete _N_str nodes only during NFA construction#6289
muellerj3 wants to merge 2 commits into
microsoft:mainfrom
muellerj2:regex-emit-complete-string-nodes-only

Conversation

@muellerj3
Copy link
Copy Markdown

This is a side quest to #5962: If the parser no longer generates _N_group nodes, the current node in the NFA is the node just before the group when parsing of the group contents begins. This might be an _N_str node and if the groups starts with some character sequence, the current logic will attach these characters to the _N_str node just before the group. This would effectively move these characters outside of the non-capturing group without the _N_group node, thus miscompiling the regex.

For this reason, this PR changes this character sequence pooling logic a bit: Characters now get added to a vector _Chars internal to the _Builder3 class, until the following subexpression in the regex results in the attachment of some other kind of node to the NFA or the parser must remember the current position in the NFA. At that point, the builder will emit the string node and then append the other kind of node to the string node or return a pointer to the string node. This means that string nodes are no longer modified after they have been generated.

Another nice bonus for the future: The parser does not really need the geometric growth logic inside the _Buf class anymore for _N_str nodes (although it is still used for now). I plan to rework the parsing logic for character classes as well and intend to remove all of that resize logic inside the _Buf class at that time.

The _Builder2 and _Parser2 classes are renamed because the layout of the builder class changed, so these classes must be renamed to avoid an ABI break. The version numbers of _Parser3::_Alternative() and _Builder3::_Else_if() are removed because the class renaming renders them unnecessary.

Copilot AI review requested due to automatic review settings May 17, 2026 20:48
@muellerj3 muellerj3 requested a review from a team as a code owner May 17, 2026 20:48
@github-project-automation github-project-automation Bot moved this to Initial Review in STL Code Reviews May 17, 2026
@muellerj3
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the <regex> NFA builder to defer creation of _N_str nodes until the builder must commit pending literal characters, preventing incorrect pooling of characters across group boundaries when _N_group nodes are no longer generated (per #5962 / GH-6289).

Changes:

  • Reworked the regex NFA builder to buffer literal characters in _Builder3::_Chars and only emit _N_str nodes at commit points during construction.
  • Renamed internal parser/builder types (_Parser2/_Builder2_Parser3/_Builder3) to avoid ABI breaks after layout changes, and simplified related method naming.
  • Added targeted regression tests covering anchors, word boundaries, groups, lookahead, alternation, and captures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
tests/std/tests/VSO_0000000_regex_use/test.cpp Adds GH-6289 regression tests for string-node emission behavior across various regex constructs.
stl/inc/regex Introduces _Builder3 buffering/emission logic for _N_str nodes and renames _Parser2/_Builder2 to _Parser3/_Builder3.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread stl/inc/regex Outdated
template <class _FwdIt, class _Elem, class _RxTraits>
_Node_base* _Builder2<_FwdIt, _Elem, _RxTraits>::_Begin_group() { // add group node
_Node_base* _Builder3<_FwdIt, _Elem, _RxTraits>::_Begin_group() { // add group node
_Emit_str_node();
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

@StephanTLavavej StephanTLavavej added performance Must go faster regex meow is a substring of homeowner labels May 17, 2026
@StephanTLavavej StephanTLavavej self-assigned this May 17, 2026
@StephanTLavavej
Copy link
Copy Markdown
Member

/azp run STL-CI

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread stl/inc/regex Outdated
_Node_base* _Current;
regex_constants::syntax_option_type _Flags;
const _RxTraits& _Traits;
vector<_Elem> _Chars;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In hindsight, a string is the more appropriate container here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster regex meow is a substring of homeowner

Projects

Status: Initial Review

Development

Successfully merging this pull request may close these issues.

4 participants