Integrate SimdUnicode UTF-8 Validation for AdvSimd #122090
ylpoonlg wants to merge 8 commits into dotnet:main from
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
contv and n4v are Vector128<sbyte> so the largest positive value is 127.
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
Pinging @tannergooding FYI for the libraries change.
@tannergooding This PR needs your review this coming week, please.
    else
    {
        numContinuationBytes += Vector128.CountWhereAllBitsSet(byte2High);
        numFourByteSequences += Vector128.CountWhereAllBitsSet(Vector128.SubtractSaturate(currentBlock, fourthByte));
    }
This currently looks like dead code. Is there any reason it can't just be the code used everywhere?
This should be the default algorithm for other architectures, but it is particularly slow on AdvSimd, which lacks ExtractMostSignificantBits; using AddAcross there is much faster. Since only AdvSimd is currently enabled, this is dead code. Should I comment it out so that it can be added back when other architectures are eventually enabled?
I think we want to avoid dead code in general, and otherwise prioritize having the JIT (or the libraries implementation) optimize CountWhereAllBitsSet for Arm64.
A viable way to handle this, for example, would be to just use CountWhereAllBitsSet and then update the managed implementation to use AddAcross on Arm64.
I'm not sure that fits in this case. The AdvSimd path needs a vector accumulator outside the loop that is only flushed when it would otherwise overflow; doing AddAcross on every iteration would be very slow. This is a very different approach from using CountWhereAllBitsSet, which is what the AdvSimd path is trying to bypass.
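To make the accumulator pattern discussed above concrete, here is a minimal sketch written with portable Vector128 APIs rather than AdvSimd intrinsics. It is not the code from this PR: the class name, the method names, and the choice of an unsigned byte accumulator (flushed every 255 blocks) are all assumptions made for illustration. On Arm64 the horizontal sum would presumably be the step that maps to AddAcross.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical illustration only; names are not taken from the PR.
public static class LaneAccumulatorSketch
{
    // Counts UTF-8 continuation bytes (bit pattern 10xxxxxx) in the input.
    public static int CountContinuationBytes(ReadOnlySpan<byte> data)
    {
        Vector128<byte> topTwoBits = Vector128.Create((byte)0xC0);
        Vector128<byte> continuationTag = Vector128.Create((byte)0x80);

        Vector128<byte> acc = Vector128<byte>.Zero; // per-lane counters kept across iterations
        int pending = 0;                            // blocks added since the last flush
        int total = 0;
        int i = 0;

        for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
        {
            Vector128<byte> block = Vector128.Create(data.Slice(i, Vector128<byte>.Count));
            Vector128<byte> mask = Vector128.Equals(block & topTwoBits, continuationTag);

            // A matching lane is 0xFF, which is -1 as a wrapping byte,
            // so subtracting the mask adds 1 to every matching lane.
            acc -= mask;

            // Each lane grows by at most 1 per block, so flush before 256 blocks.
            if (++pending == byte.MaxValue)
            {
                total += HorizontalSum(acc);
                acc = Vector128<byte>.Zero;
                pending = 0;
            }
        }

        total += HorizontalSum(acc);

        // Scalar tail for the last partial block.
        for (; i < data.Length; i++)
        {
            if ((data[i] & 0xC0) == 0x80)
            {
                total++;
            }
        }

        return total;
    }

    // The expensive cross-lane reduction, done once per flush instead of once per block.
    private static int HorizontalSum(Vector128<byte> v)
    {
        // Widen to 16-bit lanes first so the sum (at most 16 * 255) cannot wrap.
        Vector128<ushort> sum16 = Vector128.WidenLower(v) + Vector128.WidenUpper(v);
        return Vector128.Sum(sum16);
    }
}
```

The point of the design is that the per-block work is a compare and a subtract, while the cross-lane reduction runs once every 255 blocks. A signed accumulator, as mentioned earlier in the review, would need to flush at 127 instead.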
Pull request overview
This PR integrates a SimdUnicode-inspired UTF-8 validation fast path for Arm64 AdvSimd into Utf8Utility.GetPointerToFirstInvalidByte, aiming to improve throughput on mixed / non-ASCII inputs while reusing existing ASCII-vector helpers and expanding unit test coverage.
Changes:
- Add an Arm64 Vector128-based UTF-8 validation path (SimdUnicode “lookup” algorithm) to Utf8Utility.GetPointerToFirstInvalidByte.
- Expose Ascii.VectorContainsNonAsciiChar(Vector128<byte>) for reuse by the new validator.
- Expand UTF-8 validation tests to exercise more insertion positions and add additional out-of-range 4-byte start byte coverage; add SimdUnicode to third-party notices.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/libraries/System.Runtime/tests/System.Runtime.Tests/System/Text/Unicode/Utf8UtilityTests.ValidateBytes.cs | Broaden invalid-sequence test coverage by inserting invalid sequences at more positions; adds F5..FF coverage. |
| src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs | Adds the Arm64 Vector128 SimdUnicode-style validator and routes to it for sufficiently large inputs. |
| src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs | Makes VectorContainsNonAsciiChar(Vector128<byte>) internal so it can be reused by UTF-8 validation. |
| THIRD-PARTY-NOTICES.TXT | Adds SimdUnicode MIT license notice. |
    byte[] byteVector = Utf8Tests.DecodeHex(E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE + E_ACUTE);

    // Run the same tests but with extra data at the beginning so that we're inside one of
    // the 2-byte processing "hot loop" code paths.
    for (int pos = 0; pos <= 16; pos++)
    {
        ArrayList testList = new ArrayList(byteVector);

        toTest = knownGoodBytes.Concat(knownGoodBytes).Concat(invalidSequence).Concat(knownGoodBytes).ToArray(); // at start of next DWORD
        GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, 4, 2, 0);
        if (pos % 2 != 0)
        {
            // Replace bytes with valid ASCII characters so they can be broken up.
            testList.SetRange(pos - pos % 2, new byte[2] {0x20, 0x21});
        }

        toTest = knownGoodBytes.Concat(knownGoodBytes).Concat(knownGoodBytes).Concat(invalidSequence).Concat(knownGoodBytes).ToArray(); // at end of next DWORD
        GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, 6, 3, 0);
        testList.InsertRange(pos, invalidSequence);
        byte[] toTest = (byte[])testList.ToArray(typeof(byte));
        GetIndexOfFirstInvalidUtf8Sequence_Test_Core(toTest, pos, (pos + 1) / 2, 0);
    }
These helpers allocate heavily: each invalid-sequence case creates a fresh byteVector via Utf8Tests.DecodeHex(...) and then, for each pos, builds an ArrayList which boxes every byte and allocates again on ToArray. Given the large outer loops (thousands of invalid sequences), this can significantly slow the test suite and increase GC pressure. Consider caching the known-good vectors (static readonly byte[]) and building toTest using byte[]/Span<byte> copies (or List<byte> at minimum) to avoid boxing.
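One way to act on this suggestion is sketched below: a small helper that builds each test input with span copies, so no byte is boxed and only the final array is allocated. The class name TestInputBuilder and the method InsertAt are hypothetical and do not exist in the test project; this is one possible shape, not the change that was made in the PR.

```csharp
using System;

// Hypothetical helper for illustration; not part of the PR or the existing tests.
public static class TestInputBuilder
{
    // Returns a new array equal to 'baseline' with 'inserted' spliced in at index 'pos'.
    public static byte[] InsertAt(ReadOnlySpan<byte> baseline, int pos, ReadOnlySpan<byte> inserted)
    {
        byte[] result = new byte[baseline.Length + inserted.Length];

        baseline.Slice(0, pos).CopyTo(result);                            // bytes before the insertion point
        inserted.CopyTo(result.AsSpan(pos));                              // the inserted sequence
        baseline.Slice(pos).CopyTo(result.AsSpan(pos + inserted.Length)); // bytes after the insertion point

        return result;
    }
}
```

In the loop quoted above, a call such as `byte[] toTest = TestInputBuilder.InsertAt(byteVector, pos, invalidSequence);` could stand in for the ArrayList round trip, assuming the known-good vector is decoded once and cached. The odd-position ASCII patch would then have to be applied to a copy of the cached vector before the call, so the shared array is never mutated.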
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributes to #103781, based on #104199.
Includes some changes to reduce code duplication and reuse existing code such as VectorContainsNonAsciiChar to fit in with the library. Most intrinsics are replaced with Vector128 APIs. Otherwise, I tried to encapsulate where platform-specific intrinsics are used so that it will be easier to extend to other Vector128 platforms. That would just require replacing AdvSimd.Arm64.VectorTableLookup with Vector128.Shuffle and some Vector128 replacement for AdvSimd.ExtractVector128, which can be done in a later PR.
The unit tests are also modified to improve the coverage and test cases.
Benchmark results
Neoverse-V2:
Neoverse-N2:
cc @dotnet/arm64-contrib @lemire @EgorBo @a74nh @SwapnilGaikwad