Skip to content

Rework RegexGenerator incremental source model to support value equality#125438

Open
eiriktsarpalis wants to merge 24 commits intomainfrom
fix/regex-generator-incrementality
Open

Rework RegexGenerator incremental source model to support value equality#125438
eiriktsarpalis wants to merge 24 commits intomainfrom
fix/regex-generator-incrementality

Conversation

@eiriktsarpalis
Copy link
Copy Markdown
Member

@eiriktsarpalis eiriktsarpalis commented Mar 11, 2026

Fixes #125409

Discovered while working on #125404 — the RegexGenerator's incremental source model lacks proper value equality, causing the source output callback to fire on every compilation change regardless of whether regex patterns have actually changed.

Approach

Refactors the generator following the STJ/Configuration Binder pattern with distinct Parser/Emitter files and deeply equatable model types (*Spec records using ImmutableEquatableArray, ImmutableEquatableSet, and ImmutableEquatableDictionary).

The emitter re-parses the regex from Pattern + Options + CultureName during emission to reconstruct the mutable RegexTree/RegexNode objects rather than porting the 5800-line emitter to use *Spec types. This re-parse is fast (~1ms) and only runs on cache misses, while the expensive emission (10-100ms) is fully skipped on cache hits.

19 new incremental compilation tests verify caching behavior via TrackedOutputSteps.

eiriktsarpalis and others added 4 commits March 11, 2026 13:05
Add RegexGeneratorIncrementalTests.cs with 18 tests covering:
- Same compilation caching (SameInput_DoesNotRegenerate)
- Unrelated code changes (UnrelatedCodeChange_DoesNotRegenerate)
- Pattern/options/timeout/culture changes triggering regeneration
- Declaring type and member name changes
- Method add/remove scenarios
- Property vs method support
- Complex patterns with unrelated changes (Theory with 5 patterns)
- Limited support regex (NonBacktracking) caching
- Multiple regex methods with partial changes

These tests establish the baseline behavior before refactoring the
generator's incremental pipeline for proper value equality.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gex generator

Add ImmutableEquatableDictionary<TKey, TValue> to Common/src/SourceGenerators/
following the same pattern as the existing ImmutableEquatableArray<T>.

Link ImmutableEquatableArray, ImmutableEquatableDictionary, and HashHelpers
to the Regex generator project for use in the upcoming equatable model types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…line

Add gen/Model/ directory with immutable, structurally equatable record types:
- RegexGenerationSpec: top-level envelope (incremental cache boundary)
- RegexMethodSpec: per-method data with declaring type, pattern, options, tree
- RegexTreeSpec: equatable regex tree with baked-in analysis results
- RegexNodeSpec: equatable tree node with analysis flags per node
- FindOptimizationsSpec: equatable find optimizations
- FixedDistanceSetSpec: equatable fixed-distance set
- RegexTypeSpec: declaring type hierarchy

Also add required member polyfills for netstandard2.0 compatibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ssion

Rewrites the RegexGenerator Initialize() pipeline to:
1. Collect per-method results into a single batch
2. Run a Parser that builds an equatable RegexGenerationSpec model
3. Split into source model (with value equality) and diagnostics providers
4. Register separate source outputs for model and diagnostics

The equatable model (RegexGenerationSpec -> RegexMethodSpec -> RegexTreeSpec ->
RegexNodeSpec) enables Roslyn to skip the emitter when regex patterns/options
haven't changed.

During emission, the spec is converted back to the mutable RegexMethod/RegexTree
types by re-parsing the regex pattern. This re-parse is fast (~1ms) and only
occurs on incremental cache misses, while the expensive emission (10-100ms) is
fully skipped on cache hits.

Also fixes CultureChange_Regenerates test to check incremental step reason
rather than output string equality, since culture changes may not always
produce different generated code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 12:55
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request refactors the RegexGenerator incremental source generator pipeline to use deeply equatable model types, enabling Roslyn incremental caching to skip expensive emission when regex inputs haven’t changed.

Changes:

  • Replaced the heterogeneous ImmutableArray<object> incremental model with equatable *Spec model records (using ImmutableEquatableArray / new ImmutableEquatableDictionary).
  • Reworked the pipeline to split source emission from diagnostic reporting and re-parse the regex during emission only on cache misses.
  • Added a new incremental-caching-focused test suite and wired it into the functional test project.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj Adds the new incremental generator tests to the test project.
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorIncrementalTests.cs New tests validating incremental caching behavior via tracked generator steps.
src/libraries/System.Text.RegularExpressions/gen/System.Text.RegularExpressions.Generator.csproj Links shared equatable collection helpers and required-member polyfills into the generator project.
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.cs Pipeline rewrite to produce a structurally equatable model and re-parse on emission.
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Conversion.cs Converts RegexTree / analysis into equatable *Spec snapshots.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexGenerationSpec.cs New top-level equatable incremental cache boundary model.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexMethodSpec.cs New per-member equatable model for generation inputs.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexTreeSpec.cs New equatable representation of regex trees and baked-in analysis flags.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexNodeSpec.cs New equatable representation of regex parse nodes.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexTypeSpec.cs New equatable representation of declaring type hierarchy.
src/libraries/System.Text.RegularExpressions/gen/Model/FindOptimizationsSpec.cs New equatable representation of find-optimization state.
src/libraries/System.Text.RegularExpressions/gen/Model/FixedDistanceSetSpec.cs New equatable representation of fixed-distance set optimization data.
src/libraries/Common/src/SourceGenerators/ImmutableEquatableDictionary.cs New shared equatable dictionary helper for structural equality in generator models.

You can also share your feedback on Copilot code review. Take the survey.

eiriktsarpalis and others added 2 commits March 11, 2026 15:29
…mitter

- F1: Extract LiteralAfterLoopSpec record from nested tuple in
  FindOptimizationsSpec for clarity.
- F2+F3: Rewrite RegexGenerator.Parser.cs with unified parsing flow.
  Parse() is the outer method accepting GeneratorAttributeSyntaxContext
  batch; ParseMethod() handles each member end-to-end (attribute
  extraction + regex parsing + spec conversion). Eliminates the
  intermediate RegexPatternAndSyntax type and ImmutableArray<object>
  boxing.
- F4: Move Emit() orchestration and ConvertRegexTypeSpecToRegexType to
  RegexGenerator.Emitter.cs. RegexGenerator.cs now only contains
  Initialize() pipeline wiring, constants, and CompilationData.
- Fix parent chain construction for immutable RegexTypeSpec records
  (build outside-in instead of mutating).
- Remove unused usings from slimmed RegexGenerator.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 13:58
Add ToImmutableEquatableDictionary<TKey, TValue>(this Hashtable)
extension method following the existing factory pattern, and simplify
the ConvertHashtable caller in Conversion.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.


You can also share your feedback on Copilot code review. Take the survey.

- Add ImmutableEquatableSet<T> (from PolyType) providing order-
  independent structural equality via HashSet<T>. Use it for
  RegexGenerationSpec.RegexMethods to prevent spurious cache misses
  from non-deterministic ordering of GeneratorAttributeSyntaxContext.
- Fix null-safety bug in ImmutableEquatableDictionary.Equals: use
  EqualityComparer<TValue>.Default.Equals instead of direct call
  that would NRE on null values.
- Remove unused keySelector/valueSelector and IEnumerable<KVP>
  overloads from ImmutableEquatableDictionary factory.
- Inline ConvertHashtable calls in Conversion.cs.
- Extract AddDiagnostic helper to reduce noise in Parser.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 15:29
…culture

- Replace var with explicit types throughout changed code
- Change RegexMethod/RegexType from internal sealed record to private sealed class
- Add using var to runnerSw/runnerWriter for proper disposal
- Include CultureName in dedup key for emitted expressions
- Remove unused using SourceGenerators from Emitter.cs
- Add using System.Text to incremental tests
- Strengthen SameInput test with cached/unchanged assertion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eiriktsarpalis eiriktsarpalis force-pushed the fix/regex-generator-incrementality branch from 53e38f5 to 93dd4a4 Compare March 11, 2026 15:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Parser.cs:150

  • The spec currently doesn’t preserve the cultureName argument from [GeneratedRegex]. CreateRegexTreeSpec later relies on tree.Culture?.Name, but RegexParser.Parse sets RegexTree.Culture to non-null only for ignore-case backreferences; for most ignore-case regexes the tree culture will be null. That means the emitter may re-parse using InvariantCulture even when a culture was specified, potentially generating incorrect code for culture-sensitive case folding. Store the requested/selected culture name directly on RegexMethodSpec (or alongside Tree) and use that for emission and de-duping.
            string? pattern = items[0].Value as string;
            int? options = null;
            int? matchTimeout = null;
            string? cultureName = string.Empty;
            if (items.Length >= 2)
            {
                options = items[1].Value as int?;
                if (items.Length == 4)
                {
                    matchTimeout = items[2].Value as int?;
                    cultureName = items[3].Value as string;
                }
                // If there are 3 parameters, we need to check if the third argument is
                // int matchTimeoutMilliseconds, or string cultureName.
                else if (items.Length == 3)
                {
                    if (items[2].Type?.SpecialType == SpecialType.System_Int32)
                    {
                        matchTimeout = items[2].Value as int?;
                    }
                    else
                    {
                        cultureName = items[2].Value as string;
                    }

You can also share your feedback on Copilot code review. Take the survey.

eiriktsarpalis and others added 2 commits March 11, 2026 17:48
Sort method specs by fully-qualified type name before ID assignment
and emission to ensure deterministic output regardless of HashSet
iteration order (which varies with string hash randomization).

Add test for CompilationData (AllowUnsafe) change triggering regeneration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move CompilationData record struct to gen/Model/CompilationData.cs
- Change Parse() accumulators from HashSet/ImmutableArray.Builder to List
- Reorder declarations so methods come before diagnostics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 16:05
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.


You can also share your feedback on Copilot code review. Take the survey.

Inline diagnostic accumulation using (diagnostics ??= []).Add(...).
Remove sort in Emit() since ImmutableEquatableSet iteration order
is deterministic within a process and the emitter only runs on
cache misses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eiriktsarpalis eiriktsarpalis force-pushed the fix/regex-generator-incrementality branch from 6421f6a to 4740968 Compare March 11, 2026 16:14
Copilot AI review requested due to automatic review settings March 12, 2026 15:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.


You can also share your feedback on Copilot code review. Take the survey.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

eiriktsarpalis and others added 3 commits March 27, 2026 14:06
…ame on spec

- Rework ImmutableEquatableSet to use a sorted T[] array with Array.BinarySearch
  for lookups, ensuring deterministic enumeration order. Add IComparable<T> constraint.

- Rework ImmutableEquatableDictionary to use sorted TKey[]/TValue[] arrays with
  Array.BinarySearch for lookups, ensuring deterministic enumeration order.
  Add IComparable<TKey> constraint.

- Add CultureName property directly to RegexMethodSpec (not derived from
  Tree?.CultureName) so limited-support regexes correctly include culture
  in dedup keys and re-parse.

- Implement IComparable<RegexMethodSpec> for sorted set ordering by
  namespace/type/member/pattern/options/timeout/culture.

- Use methodSpec.CultureName in emitter for re-parse culture and dedup key
  instead of methodSpec.Tree?.CultureName.

- Fix test assertions: relax hard-coded suffix numbers, handle empty
  TrackedSteps when pipeline is fully cached.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert ImmutableEquatableSet and ImmutableEquatableDictionary to their
original HashSet/Dictionary-backed implementations (no IComparable<T>
constraint). Remove IComparable<RegexMethodSpec> implementation.

Instead, sort RegexMethodSpec entries deterministically in the emitter
(by namespace/type/member/pattern/options/timeout/culture) before
assigning incremental IDs, ensuring stable generated output without
adding constraints to the shared collection types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 27, 2026 12:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

@github-actions

This comment has been minimized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🤖 Copilot Code Review — PR #125438

Note

This review was generated by GitHub Copilot (Claude Opus 4.6).

Holistic Assessment

Motivation: Justified and important. The Regex source generator previously used object-typed pipeline values without structural equality, causing unnecessary re-emission on every keystroke in IDEs. Source generators that "murder the IDE" is a known, high-severity issue in this codebase.

Approach: Sound. The design follows the established Roslyn incremental generator best practices: equatable model records → ImmutableEquatable* wrappers → separated source/diagnostic pipelines. The re-parse-on-emission strategy is pragmatic — it avoids modifying the 6000-line emitter while keeping the equatable model clean of mutable RegexTree/AnalysisResults types.

Summary: ⚠️ Needs Human Review. The overall architecture is correct and well-tested, but the OrderBy in the emitter is missing CultureName, which was present in the prior commit's sort but dropped during the last simplification. A human reviewer should confirm whether this is a real non-determinism risk or is mitigated by the dedup key. Two additional minor items are flagged below.


Detailed Findings

⚠️ Non-deterministic sort — CultureName missing from OrderBy

In RegexGenerator.Emitter.cs line 60, the emitter sorts specs for stable ID assignment:

.OrderBy(m => (m.DeclaringType.Namespace, m.DeclaringType.Name, m.MemberName, 
               m.Pattern, (int)m.Options, m.MatchTimeout))

However, the deduplication key on line 74-75 correctly includes CultureName:

(string Pattern, RegexOptions Options, int? Timeout, string? CultureName) key =
    (regexMethod.Pattern, regexMethod.Options, regexMethod.MatchTimeout, methodSpec.CultureName);

This appears to be a regression introduced in commit 86693f5c8 ("Simplify emitter sort to single OrderBy with tuple key"), which consolidated separate OrderBy/ThenBy calls into a single tuple but dropped CultureName. The prior commit a2200482e explicitly described sorting "by namespace/type/member/pattern/options/timeout/culture".

Impact: Two regex methods with the same (Namespace, Name, MemberName, Pattern, Options, MatchTimeout) but different CultureName values (e.g., en-US vs tr-TR) will have undefined relative order. Since ImmutableEquatableSet uses a HashSet internally, iteration order varies across runs. The emitter's OrderBy is the sole determinism guarantee, but with CultureName missing, ID assignment (MemberName_{id++}) could flip between builds.

The SamePatternDifferentCultures_NotDeduped test verifies they're not deduped, but doesn't verify stable ordering.

Suggested fix — add CultureName (with null coalesce for ordering) to the sort tuple:

.OrderBy(m => (m.DeclaringType.Namespace, m.DeclaringType.Name, m.MemberName, 
               m.Pattern, (int)m.Options, m.MatchTimeout, m.CultureName ?? ""))

💡 CompilationData could be readonly record struct

CompilationData in Model/CompilationData.cs is declared as record struct, which is mutable. Since it's used as an immutable value in equatable model records, readonly record struct would communicate intent better and prevent accidental mutation:

internal readonly record struct CompilationData(bool AllowUnsafe, bool CheckOverflow, LanguageVersion LanguageVersion);

This is a minor style point — record struct equality is unaffected by the readonly modifier.


💡 Test name MultipleRegexes_OnlyChangedOneRegenerated is misleading

This test (line 466) checks that generated output contains the right patterns after a change, but it doesn't call AssertSourceModelCached or AssertSourceModelModified. Given the batched architecture (all methods collected into a single RegexGenerationSpec), changing any regex invalidates the entire model — there is no per-method incrementality. The test name suggests otherwise. Consider renaming to something like MultipleRegexes_ChangedOneProducesCorrectOutput.


✅ Equatable model design — correct and well-structured

The model records (RegexMethodSpec, RegexNodeSpec, RegexTreeSpec, etc.) correctly implement deep structural equality via C# records and ImmutableEquatable* collection wrappers. Value tuples like (char Char, string? String, int Distance) in FindOptimizationsSpec correctly participate in record equality. The ImmutableEquatableDictionary and ImmutableEquatableSet use XOR-based hashing, which is order-independent — correct for unordered collections.

✅ No Roslyn symbols stored in pipeline

The pipeline correctly avoids storing ISymbol, Compilation, SyntaxTree, or Location objects in the equatable model. All Roslyn-specific data is extracted in the Parse step and converted to plain data. This follows the Roslyn incremental generator best practices exactly.

✅ Diagnostic pipeline separation

Splitting source generation and diagnostics into separate RegisterSourceOutput calls correctly preserves #pragma warning disable support (fixing the issue described in #92509). Diagnostic objects retain their SourceLocation for pragma checking, while the equatable model remains free of Location references.

✅ Conversion layer and re-parse strategy

RegexGenerator.Conversion.cs cleanly converts mutable RegexTree/AnalysisResults to immutable specs. The re-parse in the emitter (RegexParser.Parse + RegexTreeAnalyzer.Analyze) correctly reconstructs the same tree using the same culture, pattern, and options. The CultureName null-handling is correct: nullCultureInfo.InvariantCulture.

✅ New collection types follow established patterns

ImmutableEquatableDictionary and ImmutableEquatableSet follow the same design patterns as the existing ImmutableEquatableArray: sealed class, IEquatable<T> implementation, UnsafeCreateFrom* factory, extension methods for LINQ-style creation, and HashHelpers.Combine for hashing.

✅ Test coverage is comprehensive

16 tests cover all key incrementality scenarios: same-input caching, unrelated changes, pattern/options/timeout/culture changes, type changes, method add/remove, property support, complex patterns, limited-support patterns, culture dedup, and compilation options. The AssertSourceModelCached/AssertSourceModelModified helpers use WithTrackingName for precise pipeline step verification.

✅ No dead code left behind

Old pipeline types (RegexPatternAndSyntax, GetRegexMethodDataOrFailureDiagnostic, DiagnosticLocation) are fully removed. The RegexMethod/RegexType classes in Parser.cs are still actively used by the emitter — they serve as the emitter's internal representation, avoiding modifications to the massive 6000-line emitter.

Generated by Code Review for issue #125438 ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Text.RegularExpressions source-generator Indicates an issue with a source generator feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RegexGenerator source model lacks value equality, preventing effective incremental caching

3 participants