Integrate SimdUnicode UTF-8 Validation for AdvSimd by ylpoonlg · Pull Request #122090 · dotnet/runtime

ylpoonlg · 2025-12-02T12:12:53Z

Contributes to #103781, based on #104199.

Includes some changes to reduce code duplication and reuse existing code, such as VectorContainsNonAsciiChar, to fit in with the library. Most intrinsics are replaced with Vector128 APIs. Where platform-specific intrinsics are still needed, I tried to encapsulate them so that it will be easier to extend to other Vector128 platforms. That would just require replacing AdvSimd.Arm64.VectorTableLookup with Vector128.Shuffle and finding a Vector128 replacement for AdvSimd.ExtractVector128, which can be done in a later PR.
The unit tests are also modified to improve coverage and add test cases.
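
As a rough illustration of the portable table lookup mentioned above, here is a minimal sketch that classifies bytes by their high nibble using Vector128.Shuffle. The NibbleLookup class and its table are hypothetical and exist only for this example; they are not the tables or helpers used in the PR.

```csharp
using System;
using System.Runtime.Intrinsics;

static class NibbleLookup
{
    // Hypothetical table, for illustration only (not one of the PR's tables):
    // maps a byte's high nibble to the length of the UTF-8 sequence that a
    // byte with that nibble would start, with 0 meaning "continuation byte".
    public static readonly Vector128<byte> LengthByHighNibble = Vector128.Create(
        (byte)1, 1, 1, 1, 1, 1, 1, 1, // 0x0_..0x7_: ASCII
        0, 0, 0, 0,                   // 0x8_..0xB_: continuation bytes
        2, 2,                         // 0xC_..0xD_: 2-byte lead
        3,                            // 0xE_: 3-byte lead
        4);                           // 0xF_: 4-byte lead (range not checked here)

    public static Vector128<byte> Classify(Vector128<byte> input)
    {
        // Per-element logical shift leaves only the high nibble (0..15) in each lane.
        Vector128<byte> highNibbles = Vector128.ShiftRightLogical(input, 4);

        // Each lane of the result is table[index]; the indices are always in
        // range here, so out-of-range behaviour does not come into play.
        return Vector128.Shuffle(LengthByHighNibble, highNibbles);
    }
}
```

Calling NibbleLookup.Classify on a block that starts with 0x41, 0xC3, 0xA9 gives 1, 2, 0 in the first three lanes. This only shows the shape of a single lookup; it is not a validator.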

Benchmark results

Neoverse-V2:

Method	Input	Version	Mean	Error	Ratio
GetCharCount	EnglishAllAscii	Before	3.113 us	0.0026 us	1.000
GetCharCount	EnglishAllAscii	After	3.114 us	0.0022 us	1.000
GetCharCount	EnglishMostlyAscii	Before	31.269 us	0.0757 us	1.000
GetCharCount	EnglishMostlyAscii	After	9.970 us	0.0094 us	0.319
GetCharCount	Chinese	Before	34.602 us	0.1955 us	1.000
GetCharCount	Chinese	After	22.637 us	1.1302 us	0.654
GetCharCount	Cyrillic	Before	26.021 us	0.2613 us	1.000
GetCharCount	Cyrillic	After	11.644 us	0.0243 us	0.447
GetCharCount	Greek	Before	66.024 us	0.3077 us	1.000
GetCharCount	Greek	After	15.824 us	0.8816 us	0.240

Neoverse-N2:

Method	Input	Version	Mean	Error	Ratio
GetCharCount	EnglishAllAscii	Before	10.390 us	0.0060 us	1.000
GetCharCount	EnglishAllAscii	After	3.522 us	0.0031 us	0.339
GetCharCount	EnglishMostlyAscii	Before	32.390 us	0.0140 us	1.000
GetCharCount	EnglishMostlyAscii	After	16.587 us	0.0501 us	0.512
GetCharCount	Chinese	Before	46.510 us	0.2420 us	1.000
GetCharCount	Chinese	After	38.449 us	0.0489 us	0.827
GetCharCount	Cyrillic	Before	39.520 us	0.2690 us	1.000
GetCharCount	Cyrillic	After	19.827 us	0.0017 us	0.502
GetCharCount	Greek	Before	85.660 us	0.4650 us	1.000
GetCharCount	Greek	After	26.648 us	0.0026 us	0.311

cc @dotnet/arm64-contrib @lemire @EgorBo @a74nh @SwapnilGaikwad

dotnet-policy-service · 2025-12-09T15:06:17Z

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

dhartglassMSFT · 2026-01-26T21:31:51Z

pinging @tannergooding FYI for libraries change

jeffhandley · 2026-03-08T21:18:23Z

@tannergooding This PR needs your review this coming week please.

tannergooding · 2026-03-23T19:53:07Z

src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs

+                    else
+                    {
+                        numContinuationBytes += Vector128.CountWhereAllBitsSet(byte2High);
+                        numFourByteSequences += Vector128.CountWhereAllBitsSet(Vector128.SubtractSaturate(currentBlock, fourthByte));
+                    }


This looks like currently dead code? Any reason it can't just be the code used everywhere?

This should be the default algorithm for other architectures, but it is particularly slow on AdvSimd (due to the lack of ExtractMostSignificantBits), where it is much faster to use AddAcross. Since only AdvSimd is enabled currently, this will be dead code. Should I comment it out so that it can be added back when other architectures are eventually enabled?

I think we want to just avoid dead code in general, and we otherwise want to prioritize the JIT (or the libraries implementation) optimizing CountWhereAllBitsSet for Arm64.

A viable way to handle this, for example, would be to just use CountWhereAllBitsSet and then update the managed implementation to use AddAcross on Arm64.

I'm not sure that fits in this case. AdvSimd requires a vector accumulator, kept outside the loop, that is only flushed when it is about to overflow; otherwise, doing AddAcross on every iteration would be very slow. This seems to be a very different approach from using CountWhereAllBitsSet, which is what the AdvSimd path is trying to bypass.
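
For readers following the thread, the deferred-accumulation pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: ContinuationCounter is a hypothetical helper, it uses the portable Vector128.Widen and Vector128.Sum in place of AdvSimd.Arm64.AddAcross, and it keeps byte lanes flushed every 255 blocks (the commit note in this PR says its own accumulators are Vector128<sbyte>, so the limit there is 127).

```csharp
using System;
using System.Runtime.Intrinsics;

static class ContinuationCounter
{
    // Assumption for this sketch: byte lanes, so each lane can absorb 255 increments.
    private const int MaxBlocksBeforeFlush = byte.MaxValue;

    // Counts UTF-8 continuation bytes (0x80..0xBF) without reducing on every iteration.
    public static int CountContinuationBytes(ReadOnlySpan<byte> data)
    {
        int total = 0;
        int i = 0;
        int pendingBlocks = 0;
        Vector128<byte> accumulator = Vector128<byte>.Zero;

        for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
        {
            Vector128<sbyte> block = Vector128.Create(data.Slice(i, Vector128<byte>.Count)).AsSByte();

            // Viewed as signed bytes, 0x80..0xBF are exactly the values below -64.
            Vector128<sbyte> mask = Vector128.LessThan(block, Vector128.Create((sbyte)-64));

            // Matching lanes hold 0xFF; subtracting 0xFF adds 1 modulo 256.
            accumulator -= mask.AsByte();

            if (++pendingBlocks == MaxBlocksBeforeFlush)
            {
                total += Flush(accumulator);
                accumulator = Vector128<byte>.Zero;
                pendingBlocks = 0;
            }
        }

        total += Flush(accumulator);

        // Scalar tail for the last few bytes.
        for (; i < data.Length; i++)
        {
            if ((sbyte)data[i] < -64) total++;
        }

        return total;
    }

    // Horizontal reduction, done rarely. Widening first keeps the sum from wrapping:
    // at most 16 * 255 = 4080, which fits in a ushort.
    private static int Flush(Vector128<byte> accumulator)
    {
        (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(accumulator);
        return Vector128.Sum(lower + upper);
    }
}
```

The point of the shape is that the per-iteration work is a compare and a subtract, while the cross-lane reduction runs once per 255 blocks. A per-iteration CountWhereAllBitsSet would instead reduce to a scalar every time.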

Copilot

Pull request overview

This PR integrates a SimdUnicode-inspired UTF-8 validation fast path for Arm64 AdvSimd into Utf8Utility.GetPointerToFirstInvalidByte, aiming to improve throughput on mixed / non-ASCII inputs while reusing existing ASCII-vector helpers and expanding unit test coverage.

Changes:

Add an Arm64 Vector128-based UTF-8 validation path (SimdUnicode “lookup” algorithm) to Utf8Utility.GetPointerToFirstInvalidByte.
Expose Ascii.VectorContainsNonAsciiChar(Vector128<byte>) for reuse by the new validator.
Expand UTF-8 validation tests to exercise more insertion positions and add additional out-of-range 4-byte start byte coverage; add SimdUnicode to third-party notices.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File	Description
src/libraries/System.Runtime/tests/System.Runtime.Tests/System/Text/Unicode/Utf8UtilityTests.ValidateBytes.cs	Broaden invalid-sequence test coverage by inserting invalid sequences at more positions; adds F5..FF coverage.
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs	Adds the Arm64 Vector128 SimdUnicode-style validator and routes to it for sufficiently large inputs.
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs	Makes `VectorContainsNonAsciiChar(Vector128<byte>)` `internal` so it can be reused by UTF-8 validation.
THIRD-PARTY-NOTICES.TXT	Adds SimdUnicode MIT license notice.

Copilot · 2026-03-23T20:00:18Z

...tem.Runtime/tests/System.Runtime.Tests/System/Text/Unicode/Utf8UtilityTests.ValidateBytes.cs

+            byte[] byteVector = Utf8Tests.DecodeHex(E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE);

-            // Run the same tests but with extra data at the beginning so that we're inside one of
-            // the 2-byte processing "hot loop" code paths.
+            for (int pos = 0; pos <= 16; pos++)
+            {
+                ArrayList testList = new ArrayList(byteVector);

-            toTest = knownGoodBytes.Concat(knownGoodBytes).Concat(invalidSequence).Concat(knownGoodBytes).ToArray(); // at start of next DWORD
-            GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, 4, 2, 0);
+                if (pos % 2 != 0)
+                {
+                    // Replace bytes with valid ASCII characters so they can be broken up.
+                    testList.SetRange(pos - pos % 2, new byte[2] {0x20, 0x21});
+                }

-            toTest = knownGoodBytes.Concat(knownGoodBytes).Concat(knownGoodBytes).Concat(invalidSequence).Concat(knownGoodBytes).ToArray(); // at end of next DWORD
-            GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, 6, 3, 0);
+                testList.InsertRange(pos, invalidSequence);
+                byte[] toTest = (byte[])testList.ToArray(typeof(byte));
+                GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, pos, (pos + 1) / 2, 0);
+            }


These helpers allocate heavily: each invalid-sequence case creates a fresh byteVector via Utf8Tests.DecodeHex(...) and then, for each pos, builds an ArrayList which boxes every byte and allocates again on ToArray. Given the large outer loops (thousands of invalid sequences), this can significantly slow the test suite and increase GC pressure. Consider caching the known-good vectors (static readonly byte[]) and building toTest using byte[]/Span<byte> copies (or List<byte> at minimum) to avoid boxing.
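
One possible shape for the span-based construction suggested above, as a minimal sketch: TestInputBuilder.InsertAt is a hypothetical helper that is not part of the test file, and the odd-position ASCII replacement from the diff would still need to be applied to a mutable copy of the baseline before calling it.

```csharp
using System;

static class TestInputBuilder
{
    // Returns baseline with `inserted` spliced in at index `pos`.
    // Allocates only the result array; no boxing, no intermediate lists.
    public static byte[] InsertAt(ReadOnlySpan<byte> baseline, int pos, ReadOnlySpan<byte> inserted)
    {
        byte[] result = new byte[baseline.Length + inserted.Length];

        baseline.Slice(0, pos).CopyTo(result);                            // bytes before pos
        inserted.CopyTo(result.AsSpan(pos));                              // the inserted sequence
        baseline.Slice(pos).CopyTo(result.AsSpan(pos + inserted.Length)); // bytes after pos

        return result;
    }
}
```

With the known-good vector cached in a static readonly byte[], each iteration would then cost a single array allocation.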

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Integrate SimdUnicode UTF-8 Validation for AdvSimd

ca8f6ad

github-actions bot added the needs-area-label label Dec 2, 2025

dotnet-policy-service bot added the community-contribution label Dec 2, 2025

a74nh added area-System.Runtime.Intrinsics and removed needs-area-label labels Dec 9, 2025

ylpoonlg marked this pull request as ready for review December 17, 2025 11:54

ylpoonlg added 2 commits December 18, 2025 10:09

Merge branch 'main' into 'github-utf8-validation'

f7cffa7

Fix overflow counter

a711ed0

contv and n4v are Vector128<sbyte> so the largest positive value is 127.

lemire reviewed Dec 18, 2025

jeffhandley assigned tannergooding Feb 1, 2026

jeffhandley requested a review from tannergooding February 1, 2026 20:27

EgorBo mentioned this pull request Mar 13, 2026

Integrate SimdUnicode for AVX-512 #104199

Closed