Integrate SimdUnicode UTF-8 Validation for AdvSimd #122090

Open
ylpoonlg wants to merge 8 commits into dotnet:main from ylpoonlg:github-utf8-validation
@ylpoonlg (Contributor) commented Dec 2, 2025

Contributes to #103781, based on #104199.

Includes some changes to reduce code duplication and to reuse existing code, such as VectorContainsNonAsciiChar, so that the validator fits in with the library. Most intrinsics are replaced with Vector128 APIs. Where platform-specific intrinsics remain, I tried to encapsulate them so that it will be easier to extend to other Vector128 platforms. That would just require replacing AdvSimd.Arm64.VectorTableLookup with Vector128.Shuffle and finding a Vector128 replacement for AdvSimd.ExtractVector128, which can be done in a later PR.
The unit tests are also modified to improve the coverage and test cases.
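
As a rough illustration of the encapsulation idea above (a sketch, not the code in this PR), a table lookup could be wrapped in a single helper so that only one method needs to change when other platforms are enabled. The class and method names `LookupHelper.TableLookup` are hypothetical. The sketch assumes that both `AdvSimd.Arm64.VectorTableLookup` and `Vector128.Shuffle` return zero for out-of-range indices, which is my understanding of the documented behaviour.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class LookupHelper
{
    // Hypothetical wrapper (name not taken from the PR): the only place
    // that needs to know which platform-specific lookup is in use.
    public static Vector128<byte> TableLookup(Vector128<byte> table, Vector128<byte> indices)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            // Arm64 TBL instruction: each index lane selects a byte from the table.
            return AdvSimd.Arm64.VectorTableLookup(table, indices);
        }

        // Portable fallback for other Vector128 platforms.
        return Vector128.Shuffle(table, indices);
    }
}
```

Callers would then use `LookupHelper.TableLookup(table, indices)` everywhere instead of calling the Arm64 intrinsic directly.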

Benchmark results

Neoverse-V2:

| Method | Input | Version | Mean | Error | Ratio |
|---|---|---|---:|---:|---:|
| GetCharCount | EnglishAllAscii | Before | 3.113 us | 0.0026 us | 1.000 |
| GetCharCount | EnglishAllAscii | After | 3.114 us | 0.0022 us | 1.000 |
| GetCharCount | EnglishMostlyAscii | Before | 31.269 us | 0.0757 us | 1.000 |
| GetCharCount | EnglishMostlyAscii | After | 9.970 us | 0.0094 us | 0.319 |
| GetCharCount | Chinese | Before | 34.602 us | 0.1955 us | 1.000 |
| GetCharCount | Chinese | After | 22.637 us | 1.1302 us | 0.654 |
| GetCharCount | Cyrillic | Before | 26.021 us | 0.2613 us | 1.000 |
| GetCharCount | Cyrillic | After | 11.644 us | 0.0243 us | 0.447 |
| GetCharCount | Greek | Before | 66.024 us | 0.3077 us | 1.000 |
| GetCharCount | Greek | After | 15.824 us | 0.8816 us | 0.240 |

Neoverse-N2:

| Method | Input | Version | Mean | Error | Ratio |
|---|---|---|---:|---:|---:|
| GetCharCount | EnglishAllAscii | Before | 10.390 us | 0.0060 us | 1.000 |
| GetCharCount | EnglishAllAscii | After | 3.522 us | 0.0031 us | 0.339 |
| GetCharCount | EnglishMostlyAscii | Before | 32.390 us | 0.0140 us | 1.000 |
| GetCharCount | EnglishMostlyAscii | After | 16.587 us | 0.0501 us | 0.512 |
| GetCharCount | Chinese | Before | 46.510 us | 0.2420 us | 1.000 |
| GetCharCount | Chinese | After | 38.449 us | 0.0489 us | 0.827 |
| GetCharCount | Cyrillic | Before | 39.520 us | 0.2690 us | 1.000 |
| GetCharCount | Cyrillic | After | 19.827 us | 0.0017 us | 0.502 |
| GetCharCount | Greek | Before | 85.660 us | 0.4650 us | 1.000 |
| GetCharCount | Greek | After | 26.648 us | 0.0026 us | 0.311 |

cc @dotnet/arm64-contrib @lemire @EgorBo @a74nh @SwapnilGaikwad

@dotnet-policy-service (Contributor)

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@ylpoonlg marked this pull request as ready for review December 17, 2025 11:54
contv and n4v are Vector128<sbyte> so the largest positive value is 127.
@dhartglassMSFT (Contributor)

pinging @tannergooding FYI for libraries change

@jeffhandley (Member)

@tannergooding This PR needs your review this coming week please.

Comment on lines +991 to +995
    else
    {
        numContinuationBytes += Vector128.CountWhereAllBitsSet(byte2High);
        numFourByteSequences += Vector128.CountWhereAllBitsSet(Vector128.SubtractSaturate(currentBlock, fourthByte));
    }

Member

This looks like currently dead code? Any reason it can't just be the code used everywhere?

Contributor Author

This should be the default algorithm for other architectures, but it is particularly slow on AdvSimd (due to the lack of ExtractMostSignificantBits), where using AddAcross is much faster. Since only AdvSimd is enabled currently, this will be dead code. Should I comment it out so that it can be added back when other architectures are eventually enabled?

Member

I think we want to just avoid dead code in general and we want to otherwise prioritize the JIT (or libraries implementation) optimizing CountWhereAllBitsSet for Arm64

Member

A viable way to handle this, for example, would be to just use CountWhereAllBitsSet and then update the managed implementation to use AddAcross on Arm64

Contributor Author

I'm not sure if it fits in this case. AdvSimd requires a vector accumulator outside the loop that is only reduced when it is about to overflow; otherwise, doing AddAcross on every iteration would be very slow. This seems to be a very different approach from using CountWhereAllBitsSet, which is what the AdvSimd path is trying to bypass.
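
To make the accumulator pattern concrete, here is a minimal sketch of the idea, not the code from this PR. It counts UTF-8 continuation bytes by adding each comparison mask into a per-lane vector accumulator and only reducing that accumulator to a scalar occasionally. Everything here is an assumption for illustration: the names `ContinuationByteCounter`, `Count`, and `Flush` are hypothetical, the accumulator is `Vector128<byte>` flushed every 255 blocks (the PR uses `Vector128<sbyte>` accumulators, so its limit is 127), and the flush uses the portable `Vector128.Widen` and `Vector128.Sum` where an Arm64-specific path would use AddAcross.

```csharp
using System;
using System.Runtime.Intrinsics;

static class ContinuationByteCounter
{
    // Hypothetical helper: counts bytes of the form 0b10xxxxxx.
    public static int Count(ReadOnlySpan<byte> data)
    {
        int total = 0;
        int i = 0;
        int pendingBlocks = 0;                       // blocks added since the last flush
        Vector128<byte> acc = Vector128<byte>.Zero;  // per-lane match counters

        for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
        {
            Vector128<byte> block = Vector128.Create(data.Slice(i, Vector128<byte>.Count));

            // Viewed as sbyte, continuation bytes 0x80..0xBF are -128..-65, i.e. < -64.
            Vector128<byte> mask =
                Vector128.LessThan(block.AsSByte(), Vector128.Create((sbyte)-64)).AsByte();

            // A matching lane is 0xFF (-1), so subtracting the mask adds 1 to that lane.
            acc -= mask;

            // Each lane grows by at most 1 per block, so reduce before it can wrap.
            if (++pendingBlocks == 255)
            {
                total += Flush(acc);
                acc = Vector128<byte>.Zero;
                pendingBlocks = 0;
            }
        }

        total += Flush(acc);

        // Scalar tail for the last partial block.
        for (; i < data.Length; i++)
        {
            if ((sbyte)data[i] < -64) total++;
        }

        return total;
    }

    // Horizontal reduction, done rarely; widened to ushort so the sum cannot overflow.
    private static int Flush(Vector128<byte> acc)
    {
        (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(acc);
        return Vector128.Sum(lower + upper);
    }
}
```

The point of the design is that the expensive across-lanes reduction runs once per 255 blocks instead of once per block, which is the cost the comment above is concerned about.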

Copilot AI (Contributor) left a comment

Pull request overview

This PR integrates a SimdUnicode-inspired UTF-8 validation fast path for Arm64 AdvSimd into Utf8Utility.GetPointerToFirstInvalidByte, aiming to improve throughput on mixed / non-ASCII inputs while reusing existing ASCII-vector helpers and expanding unit test coverage.

Changes:

  • Add an Arm64 Vector128-based UTF-8 validation path (SimdUnicode “lookup” algorithm) to Utf8Utility.GetPointerToFirstInvalidByte.
  • Expose Ascii.VectorContainsNonAsciiChar(Vector128<byte>) for reuse by the new validator.
  • Expand UTF-8 validation tests to exercise more insertion positions and add additional out-of-range 4-byte start byte coverage; add SimdUnicode to third-party notices.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| `src/libraries/System.Runtime/tests/System.Runtime.Tests/System/Text/Unicode/Utf8UtilityTests.ValidateBytes.cs` | Broaden invalid-sequence test coverage by inserting invalid sequences at more positions; adds F5..FF coverage. |
| `src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs` | Adds the Arm64 Vector128 SimdUnicode-style validator and routes to it for sufficiently large inputs. |
| `src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs` | Makes VectorContainsNonAsciiChar(Vector128<byte>) internal so it can be reused by UTF-8 validation. |
| `THIRD-PARTY-NOTICES.TXT` | Adds SimdUnicode MIT license notice. |

Comment on lines +292 to +307
    byte[] byteVector = Utf8Tests.DecodeHex(E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE);

    // Run the same tests but with extra data at the beginning so that we're inside one of
    // the 2-byte processing "hot loop" code paths.
    for (int pos = 0; pos <= 16; pos++)
    {
        ArrayList testList = new ArrayList(byteVector);

    toTest = knownGoodBytes.Concat(knownGoodBytes).Concat(invalidSequence).Concat(knownGoodBytes).ToArray(); // at start of next DWORD
    GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, 4, 2, 0);
        if (pos % 2 != 0)
        {
            // Replace bytes with valid ASCII characters so they can be broken up.
            testList.SetRange(pos - pos % 2, new byte[2] {0x20, 0x21});
        }

    toTest = knownGoodBytes.Concat(knownGoodBytes).Concat(knownGoodBytes).Concat(invalidSequence).Concat(knownGoodBytes).ToArray(); // at end of next DWORD
    GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, 6, 3, 0);
        testList.InsertRange(pos, invalidSequence);
        byte[] toTest = (byte[])testList.ToArray(typeof(byte));
        GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, pos, (pos + 1) / 2, 0);
    }

Copilot AI commented Mar 23, 2026

These helpers allocate heavily: each invalid-sequence case creates a fresh byteVector via Utf8Tests.DecodeHex(...) and then, for each pos, builds an ArrayList which boxes every byte and allocates again on ToArray. Given the large outer loops (thousands of invalid sequences), this can significantly slow the test suite and increase GC pressure. Consider caching the known-good vectors (static readonly byte[]) and building toTest using byte[]/Span<byte> copies (or List<byte> at minimum) to avoid boxing.
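
One possible shape for the span-based construction suggested above, as a sketch only: a small helper that builds the test input with plain `byte[]`/`Span<byte>` copies, so no byte is boxed. The names `TestInputBuilder` and `InsertAt` are hypothetical and do not exist in the test project.

```csharp
using System;

static class TestInputBuilder
{
    // Hypothetical helper: returns a new array equal to baseBytes with
    // `inserted` spliced in at index `pos`. One allocation, no boxing.
    public static byte[] InsertAt(ReadOnlySpan<byte> baseBytes, int pos, ReadOnlySpan<byte> inserted)
    {
        byte[] result = new byte[baseBytes.Length + inserted.Length];

        baseBytes.Slice(0, pos).CopyTo(result);                             // prefix
        inserted.CopyTo(result.AsSpan(pos));                                // spliced sequence
        baseBytes.Slice(pos).CopyTo(result.AsSpan(pos + inserted.Length));  // suffix

        return result;
    }
}
```

In the loop quoted above, the `InsertRange` and `ToArray` steps could then become something like `byte[] toTest = TestInputBuilder.InsertAt(byteVector, pos, invalidSequence);`, with the known-good vector cached in a `static readonly byte[]`. The ASCII replacement for odd positions would still need to be applied to a copy of the base array first.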

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Labels: area-System.Runtime.Intrinsics, community-contribution (indicates that the PR has been added by a community member)
