Tagging subscribers to this area: @dotnet/area-system-text-encoding |
|
@EgorBot -intel -amd
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Unicode;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>(args: args);

public class Bench
{
    // Inputs dominated by 3-byte UTF-8 sequences, sliced to fixed byte lengths.
    public static IEnumerable<byte[]> GetUtf8BytesData()
    {
        // Chinese "Lorem Ipsum"
        var utf8 = "唐聞球方五保査禁答近確掲著協世好知長。育乗江校上価話戒宏口自森特室堂討。陸迎奔必秋最量注好枚挑周。間父癒曲在近真権幕覧超持樹件芸保展島船点。齢度約治末価埼坂内辞千故資接藤雨約宿県。定戻業担伸立発告敗家響意球禎。呼真局験善体続得新税知群孫大場。変省創与毎容開拡作北経眺間。樹野市現館開分供同南費海。投以画露両装知全茨済力上速田弘変掲材保内。王野嗅結択芸合験覧託委致就近資。励意親者著識連愚戦親能精球信相準大避一。民覧過走最国転開社加砲者度座図。提著学月牟止百県意能宝質約投分記加。中長塚相選暇版経田経問下訟全報府。要事集細両体要特義点必周優載治山集摘。手機掛果題銀料新政庁分堀住画禁信。味表柄読必望著後入協攻末源安 案志検江水口宿言京並属需就一生断導。通崎楽大最放新属健戦維議本金部兜素定市船"u8.ToArray();
        yield return utf8.AsSpan(0, 1000).ToArray();
        yield return utf8.AsSpan(0, 500).ToArray();
        yield return utf8.AsSpan(0, 250).ToArray();
        yield return utf8.AsSpan(0, 100).ToArray();
    }

    // Despite its name, this benchmark measures Encoding.UTF8.GetCharCount.
    [Benchmark]
    [ArgumentsSource(nameof(GetUtf8BytesData))]
    public int GetUtf8Bytes(byte[] str) => Encoding.UTF8.GetCharCount(str);

    // Inputs that mix 2-byte (Cyrillic) and 1-byte (ASCII) UTF-8 sequences.
    public static IEnumerable<byte[]> ValidateUtf8Data()
    {
        // ru-RU "Lorem Ipsum"
        var utf8 = "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит, дицант глориатур персецути при еу. При еяуидем пхаедрум рецусабо ех, не вим ерант вертерем Ехерци семпер те нец. Ид нолуиссе детерруиссет нам, яуо ан адхуц дицит пертинациа, мел тота цлита цомпрехенсам ид? Ид аугуе граецис еффициенди вис, ат анимал фиерент инструцтиор пер, не виде еффициенди при!"u8.ToArray();
        yield return utf8.AsSpan(0, 1000).ToArray();
        yield return utf8.AsSpan(0, 500).ToArray();
        yield return utf8.AsSpan(0, 250).ToArray();
        yield return utf8.AsSpan(0, 100).ToArray();
    }

    [Benchmark]
    [ArgumentsSource(nameof(ValidateUtf8Data))]
    public bool ValidateUtf8(byte[] str) => Utf8.IsValid(str);
}
|
Benchmark results on Intel
|
Benchmark results on AMD
|
gfoidl
left a comment
Is this better for the registers?
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
Co-authored-by: Günther Foidl <gue@korporal.at>
|
Question regarding the PR title: the code seems to use AVX2 (256-bit), not AVX-512. |
|
@huoyaoyuan I think that's a good point. This does seem to be AVX2 (which is not a bad idea). |
|
@SwapnilGaikwad @ylpoonlg Note that SimdUnicode already has a good implementation for ARM processors. |
Thanks for providing the implementation. Our plan is similar to this PR: port from SimdUnicode, with some cleanups so that it fits into the code using existing functions and the Vector128 APIs. We hope that this won't overlap too much with the work done in this PR. |
Hi @lemire, we are essentially porting the implementation you referred to (hope you're ok with it 🙂). It just needs a few changes to fit into the library routine. We will benchmark with the Utf8Encoding benchmarks on newer machines to see how well it performs. |
|
@SwapnilGaikwad Yes. Please do. |
|
Closing in favor of #122090 |
Contributes to #103781. This covers only AVX-512; other ISAs can be added if/once this is approved/merged.
I did some cleanup, such as replacing some SIMD APIs with cross-platform ones/operators. Btw, I don't believe that ISimdVector can be used here. Also, I removed the initial "skip ASCII data" part since we already have a workhorse for that.
cc @lemire, @Nick-Nuon let me know if you want to change something (including credits in THIRD-PARTY-NOTICES.TXT)
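To illustrate what replacing platform-specific SIMD calls with cross-platform ones/operators can look like, here is a minimal sketch. It is not code from this PR: the class `CrossPlatformSimdSketch` and the helper `ContainsNonAscii` are hypothetical names, and the task (detecting any byte with the high bit set) was chosen only because it is small. It relies on the `Vector128` APIs available since .NET 7.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical illustration, not part of the PR.
public static class CrossPlatformSimdSketch
{
    // Returns true if any byte in 'data' has its high bit set (i.e. is non-ASCII).
    public static bool ContainsNonAscii(ReadOnlySpan<byte> data)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && data.Length >= Vector128<byte>.Count)
        {
            ref byte start = ref MemoryMarshal.GetReference(data);
            Vector128<byte> acc = Vector128<byte>.Zero;

            for (; i <= data.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
            {
                // Cross-platform load and '|' operator; an x86-only version
                // would call Sse2.LoadVector128 and Sse2.Or instead.
                acc |= Vector128.LoadUnsafe(ref start, (nuint)i);
            }

            // Collects the top bit of every byte lane into an integer mask.
            if (acc.ExtractMostSignificantBits() != 0)
            {
                return true;
            }
        }

        // Scalar tail for the remaining bytes (or the whole input without SIMD).
        for (; i < data.Length; i++)
        {
            if (data[i] >= 0x80)
            {
                return true;
            }
        }

        return false;
    }
}
```

The same source compiles for x64 and Arm64, and the JIT picks the matching instructions; whether that fits a given routine as well as hand-written intrinsics has to be measured.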
TODO: do some ad-hoc testing, make sure test coverage is good
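One possible shape for the ad-hoc testing mentioned in the TODO is a differential check: run `Utf8.IsValid` and a strict `UTF8Encoding` decoder over the same inputs and count disagreements. The sketch below is an assumption about how such a check could be written, not the test plan used for this PR; `Utf8AdHocCheck`, `ReferenceIsValid`, and `CountMismatches` are hypothetical names, and the input generator is deliberately simple.

```csharp
using System;
using System.Text;
using System.Text.Unicode;

// Hypothetical ad-hoc check, not part of the PR or its test suite.
public static class Utf8AdHocCheck
{
    // Strict decoder: throws DecoderFallbackException on any invalid byte sequence.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Reference answer obtained through the decoder rather than the validator.
    public static bool ReferenceIsValid(byte[] data)
    {
        try
        {
            Strict.GetCharCount(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Returns how many generated inputs make Utf8.IsValid disagree with the reference.
    public static int CountMismatches(int iterations, int maxLength, int seed)
    {
        var rng = new Random(seed);
        byte[] sample = Encoding.UTF8.GetBytes("Лорем ипсум 唐聞球方 abc");
        int mismatches = 0;

        for (int n = 0; n < iterations; n++)
        {
            // Repeat the sample up to a random length; the cut may split a sequence.
            byte[] data = new byte[rng.Next(0, maxLength + 1)];
            for (int i = 0; i < data.Length; i++)
            {
                data[i] = sample[i % sample.Length];
            }

            // Corrupt one random byte in roughly half of the inputs.
            if (data.Length > 0 && rng.Next(2) == 0)
            {
                data[rng.Next(data.Length)] = (byte)rng.Next(256);
            }

            if (Utf8.IsValid(data) != ReferenceIsValid(data))
            {
                mismatches++;
            }
        }

        return mismatches;
    }
}
```

A call such as `Utf8AdHocCheck.CountMismatches(10_000, 300, 42)` is expected to return 0 when the validator and the decoder agree; a fixed seed keeps any failure reproducible.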