Improves performance by about 4% on large files.
bytecount uses vector operations to speed up line counting. At least on x86 with AVX2 support, the vectors are 256 bits (32 bytes) wide, and operations are much faster if the data is aligned. This improves overall performance by about 4%, matching GNU wc's performance.
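To make the idea concrete, here is a minimal sketch of a 32-byte-aligned, 256 KiB read buffer in Rust. It is not the code from this PR: `AlignedBuf`, `BUF_SIZE`, and `count_lines` are hypothetical names, and a plain iterator count stands in for `bytecount` so the example needs no external crate.

```rust
use std::io::{self, Read};

/// Buffer size for this sketch: 256 KiB (assumed to mirror the PR's new size).
const BUF_SIZE: usize = 256 * 1024;

/// Hypothetical wrapper type. `repr(align(32))` forces the array to start on
/// a 32-byte boundary, which is the width of one 256-bit AVX2 register.
#[repr(align(32))]
struct AlignedBuf([u8; BUF_SIZE]);

/// Count `\n` bytes from `reader`, reading through the aligned buffer.
fn count_lines<R: Read>(mut reader: R) -> io::Result<usize> {
    // Boxed so the 256 KiB array lives on the heap; Box honours the type's alignment.
    let mut buf = Box::new(AlignedBuf([0u8; BUF_SIZE]));
    let mut lines = 0;
    loop {
        match reader.read(&mut buf.0) {
            // End of input.
            Ok(0) => return Ok(lines),
            // A vectorized counter would be called here on the aligned slice;
            // this naive count is only a stand-in.
            Ok(n) => lines += buf.0[..n].iter().filter(|&&b| b == b'\n').count(),
            // Retry reads interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Calling `count_lines(std::io::stdin().lock())` would count lines on standard input. Changing `align(32)` to `align(64)` is one way to try a wider alignment on other architectures.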

GNU testsuite comparison:

Nice, same result on my machine, which is a 7040 series Ryzen laptop chip.

well done!
It's actually a lot better than what I see. Interesting! Thanks for testing.

Previously the coreutils implementation was about 10x faster on my M1 Mac; now it's about 13x faster!
I wonder how that's possible :) Is wc the BSD Apple implementation?

Yes, I am using the default Apple implementation.

@willshuttleworth oh nice, thanks! I'm not completely sure about the aarch64 code, but I see some loading operations on 8x16x4, which sounds like 512 bits/64 bytes (the underlying registers are 128 bits wide though). Mind trying to see if 64-byte alignment improves performance? (you can Thanks!

@drinkcat I'm not seeing a difference with 64-byte alignment:

Neat, thanks for trying!
Fixes #7929.
wc: Align buffer to 32-byte boundary
bytecount uses vector operations to speed up line counting.
At least on x86 with AVX2 support, the vectors are 256 bits wide,
and operations are much faster if the data is aligned.
Saves about 4% of total performance, matching wc's performance.
wc: Increase buffer size to 256kb
Improves performance by about 4% on large files.
This gets us close to, or better than, GNU's version:
And on the 1brc dataset from the original report: