`<regex>`: Emit complete _N_str nodes only during NFA construction by muellerj3 · Pull Request #6289 · microsoft/STL

muellerj3 · 2026-05-17T20:48:53Z

This is a side quest to #5962: If the parser no longer generates _N_group nodes, the current node in the NFA is the node just before the group when parsing of the group contents begins. This might be an _N_str node and if the groups starts with some character sequence, the current logic will attach these characters to the _N_str node just before the group. This would effectively move these characters outside of the non-capturing group without the _N_group node, thus miscompiling the regex.

For this reason, this PR changes this character sequence pooling logic a bit: Characters now get added to a vector _Chars internal to the _Builder3 class, until the following subexpression in the regex results in the attachment of some other kind of node to the NFA or the parser must remember the current position in the NFA. At that point, the builder will emit the string node and then append the other kind of node to the string node or return a pointer to the string node. This means that string nodes are no longer modified after they have been generated.

Another nice bonus for the future: The parser does not really need the geometric growth logic inside the _Buf class anymore for _N_str nodes (although it is still used for now). I plan to rework the parsing logic for character classes as well and intend to remove all of that resize logic inside the _Buf class at that time.

The _Builder2 and _Parser2 classes are renamed because the layout of the builder class changed, so these classes must be renamed to avoid an ABI break. The version numbers of _Parser3::_Alternative() and _Builder3::_Else_if() are removed because the class renaming renders them unnecessary.

muellerj3 · 2026-05-17T20:49:31Z

@microsoft-github-policy-service agree

Copilot

Pull request overview

Updates the <regex> NFA builder to defer creation of _N_str nodes until the builder must commit pending literal characters, preventing incorrect pooling of characters across group boundaries when _N_group nodes are no longer generated (per #5962 / GH-6289).

Changes:

Reworked the regex NFA builder to buffer literal characters in _Builder3::_Chars and only emit _N_str nodes at commit points during construction.
Renamed internal parser/builder types (_Parser2/_Builder2 → _Parser3/_Builder3) to avoid ABI breaks after layout changes, and simplified related method naming.
Added targeted regression tests covering anchors, word boundaries, groups, lookahead, alternation, and captures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
tests/std/tests/VSO_0000000_regex_use/test.cpp	Adds GH-6289 regression tests for string-node emission behavior across various regex constructs.
stl/inc/regex	Introduces `_Builder3` buffering/emission logic for `_N_str` nodes and renames `_Parser2/_Builder2` to `_Parser3/_Builder3`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

muellerj3 · 2026-05-17T21:31:21Z

 template <class _FwdIt, class _Elem, class _RxTraits>
-_Node_base* _Builder2<_FwdIt, _Elem, _RxTraits>::_Begin_group() { // add group node
+_Node_base* _Builder3<_FwdIt, _Elem, _RxTraits>::_Begin_group() { // add group node
+    _Emit_str_node();


Good catch.

StephanTLavavej · 2026-05-17T21:10:06Z

/azp run STL-CI

azure-pipelines · 2026-05-17T21:10:20Z

Azure Pipelines successfully started running 1 pipeline(s).

muellerj2 · 2026-05-18T19:04:32Z

    _Node_base* _Current;
    regex_constants::syntax_option_type _Flags;
    const _RxTraits& _Traits;
+    vector<_Elem> _Chars;


In hindsight, a string is the more appropriate container here.

<regex>: Emit complete _N_str nodes only during NFA construction

22ed25b

Copilot AI review requested due to automatic review settings May 17, 2026 20:48

muellerj3 requested a review from a team as a code owner May 17, 2026 20:48

github-project-automation Bot added this to STL Code Reviews May 17, 2026

github-project-automation Bot moved this to Initial Review in STL Code Reviews May 17, 2026

Copilot started reviewing on behalf of muellerj3 May 17, 2026 20:49 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

StephanTLavavej added performance Must go faster regex meow is a substring of homeowner labels May 17, 2026

StephanTLavavej self-assigned this May 17, 2026

muellerj2 reviewed May 18, 2026

View reviewed changes

use string, address copilot feedback

a961d71

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`<regex>`: Emit complete _N_str nodes only during NFA construction#6289

`<regex>`: Emit complete _N_str nodes only during NFA construction#6289
muellerj3 wants to merge 2 commits into
microsoft:mainfrom
muellerj2:regex-emit-complete-string-nodes-only

muellerj3 commented May 17, 2026

Uh oh!

muellerj3 commented May 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

muellerj3 May 17, 2026

Uh oh!

StephanTLavavej commented May 17, 2026

Uh oh!

azure-pipelines Bot commented May 17, 2026

Uh oh!

muellerj2 May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

muellerj3 commented May 17, 2026

Uh oh!

muellerj3 commented May 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

muellerj3 May 17, 2026

Choose a reason for hiding this comment

Uh oh!

StephanTLavavej commented May 17, 2026

Uh oh!

azure-pipelines Bot commented May 17, 2026

Uh oh!

muellerj2 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants