fix(sort): Enable locale-aware collation for UTF-8 locales #9176
fix(sort): Enable locale-aware collation for UTF-8 locales #9176sylvestre merged 11 commits intouutils:mainfrom
Conversation
CodSpeed Performance ReportMerging this PR will degrade performance by 3.67%Comparing Summary
Performance Changes
Footnotes
|
|
The performance regressions are
in du benchmarks, not sort. This is because the fix
modifies uucore's i18n module, which is a shared
dependency.
Root Cause
The i18n-common feature is now enabled in sort's
Cargo.toml, which pulls in the i18n module for all
uucore users. However, the actual performance impact
should be minimal because:
1. The locale check is O(1) - just reading a cached
environment variable
2. C/POSIX locales use the fast path - zero overhead
3. Only UTF-8 locales trigger ICU - which is the correct
behavior
Investigation Needed
The 2-3.5% regression in du seems unexpectedly high for
just adding locale awareness. This could be:
1. Compilation artifact - More code in uucore binary
2. Initialization overhead - ICU collator being
initialized even when not needed
3. False positive - Need to run benchmarks multiple
times to confirm
Request
Can we get:
1. Confirmation that the regression is consistent across
multiple runs
2. Sort-specific benchmarks to verify there's no
regression in C/POSIX locales
3. Guidance on whether this tradeoff is acceptable for
the project
I'm happy to investigate and implement whichever
approach the maintainers prefer!
El vie, 7 nov 2025 a las 19:02, codspeed-hq[bot] ***@***.***>)
escribió:
… *codspeed-hq[bot]* left a comment (uutils/coreutils#9176)
<#9176 (comment)>
CodSpeed Performance Report
<https://codspeed.io/uutils/coreutils/branches/quantum-encoding%3Afix-sort-locale-issue-9148?utm_source=github&utm_medium=comment&utm_content=header> Merging
#9176 <#9176> will *degrade
performances by 3.51%*
Comparing quantum-encoding:fix-sort-locale-issue-9148 (e539bdc
<e539bdc>)
with main (5b79ae4
<5b79ae4>
)
Summary
⚡ 1 improvement
❌ 4 regressions
✅ 120 untouched
|
|
please don't copy and paste AI content. it is human reading the comments and we would like to get only the relevant information. the du change is unrelated |
|
and please fix the failing jobs |
|
PROBLEM: SOLUTION: |
src/uu/timeout/src/timeout.rs
Outdated
| use crate::status::ExitStatus; | ||
| use clap::{Arg, ArgAction, Command}; | ||
| use std::io::ErrorKind; | ||
| use nix::errno::Errno; |
There was a problem hiding this comment.
why are you touching this file in a change for sort ?
There was a problem hiding this comment.
Sorry, I accidentally branched from a commit that included my timeout work. I've rebased and force-pushed - the sort PR now only contains sort-related changes
2cf888c to
a1d0437
Compare
|
GNU testsuite comparison: |
src/uu/sort/src/sort.rs
Outdated
| // Check if we need locale-aware collation and initialize collator if needed | ||
| // This MUST happen before init_precomputed() to avoid the performance regression | ||
| #[cfg(feature = "i18n-collator")] | ||
| let needs_locale_collation = { |
There was a problem hiding this comment.
i think it can be moved to
src/uucore/src/lib/mods/locale.rs
into a new function
i think we need it elsewhere?
efd2027 to
7153e28
Compare
|
GNU testsuite comparison: |
6d14180 to
3c7c661
Compare
|
GNU testsuite comparison: |
|
could you please rebase it ? |
locales Fixes uutils#9148 The sort implementation had locale support infrastructure (ICU collator) but it was never being used due to the fast_lexicographic optimization bypassing all locale-aware code. Changes: - Modified can_use_fast_lexicographic() to check locale encoding - For UTF-8 locales, disable fast path to use locale_cmp() - Initialize ICU collator with AlternateHandling::Shifted to match GNU - Enable i18n-common and i18n-collator features in sort's Cargo.toml Result: Perfect match with GNU sort for C, POSIX, and UTF-8 locales. No performance impact for C/POSIX locales (still use fast path).
Move locale collation initialization logic from sort.rs to uucore/i18n/collator.rs as suggested by maintainer. - Add init_locale_collation() function in collator.rs - Can be reused by ls, uniq, comm, join, and other utilities - Simplifies sort.rs by ~15 lines - No functional changes, just code reorganization
1305454 to
af7aced
Compare
|
rebased |
|
could you please add tests ? |
17f54cd to
f1bcf0e
Compare
tests/by-util/test_sort.rs
Outdated
| #[test] | ||
| fn test_locale_collation_utf8() { | ||
| // Skip if UTF-8 locale is not available | ||
| let Ok(locale) = env::var("LOCALE_FR_UTF8") else { |
There was a problem hiding this comment.
It's an existing pattern in the same file - see line 1636. Used by CI to specify available locale.
There was a problem hiding this comment.
are you sure it works ?
i don't see where it is set
There was a problem hiding this comment.
you're right, switched to en_US.UTF-8. test handles both with/without i18n-collator now.
There was a problem hiding this comment.
sorry, i would prefer to use french
we have locales in the CI
There was a problem hiding this comment.
switched to french
|
please fix |
|
GNU testsuite comparison: |
* fix(sort): Enable locale-aware collation for UTF-8 locales Fixes uutils#9148 The sort implementation had locale support infrastructure (ICU collator) but it was never being used due to the fast_lexicographic optimization bypassing all locale-aware code.
Fixes #9148 - sort is now a true drop-in replacement for
GNU sort with full locale support.
The sort implementation had all the infrastructure for
locale-aware collation (ICU collator) but it was never
being used due to a bug in the
fast_lexicographicoptimization that bypassed locale-aware code even in
UTF-8 locales.
Root Cause
The
can_use_fast_lexicographic()function determinedwhen to use a "fast path" (simple byte comparison) but
never checked the locale. Even in UTF-8 locales like
en_US.UTF-8, it would returntrueand skip alllocale-aware collation.
Changes
Modified
can_use_fast_lexicographic()to checklocale encoding
falsefor UTF-8 locales to forcelocale_cmp()usageInitialize ICU collator with
AlternateHandling::Shiftedbehavior exactly
Enable i18n features in sort's Cargo.toml
i18n-commonandi18n-collatortodependencies
[features]section to propagate featureflag
Test Results
Perfect match with GNU sort for all locales:
Before (broken):