fmt : handle invalid UTF-8 input by replacing malformed sequences by mattsu2020 · Pull Request #9329 · uutils/coreutils

mattsu2020 · 2025-11-19T08:47:55Z

Description
This PR makes fmt tolerant of invalid UTF-8 without dropping lines, while preserving the original bytes.

Previously, fmt relied on BufRead::lines(), which errors on invalid UTF-8 and could stop iteration or skip lines.

This change reads lines with read_until into a byte buffer and processes them directly. UTF-8 decoding is only used where needed for display width; invalid bytes are treated as width 1, and the raw input bytes are passed through without replacement.
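A minimal sketch of the byte-oriented reading described above, for illustration only; it is not the code from this PR. The helper names `read_raw_line`, `raw_lines`, and `lines_reports_error` are hypothetical, and stripping only a trailing `\n` is an assumption.

```rust
use std::io::{self, BufRead};

/// Read one line as raw bytes, stripping a trailing `\n` if present.
/// Returns `Ok(None)` at end of input.
fn read_raw_line<R: BufRead>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut buf = Vec::new();
    // `read_until` copies bytes verbatim, so invalid UTF-8 is not an error here.
    if reader.read_until(b'\n', &mut buf)? == 0 {
        return Ok(None);
    }
    if buf.last() == Some(&b'\n') {
        buf.pop();
    }
    Ok(Some(buf))
}

/// Collect every line of `input` as raw bytes (hypothetical helper).
fn raw_lines(input: &[u8]) -> io::Result<Vec<Vec<u8>>> {
    let mut reader = io::Cursor::new(input);
    let mut lines = Vec::new();
    while let Some(line) = read_raw_line(&mut reader)? {
        lines.push(line);
    }
    Ok(lines)
}

/// Returns true if `BufRead::lines()` yields an error for any line of `input`.
fn lines_reports_error(input: &[u8]) -> bool {
    io::Cursor::new(input).lines().any(|line| line.is_err())
}
```

With input such as `=\xA0=`, `lines_reports_error` returns true, while `raw_lines` returns the three bytes unchanged.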

…med sequences instead of dropping lines.

mattsu2020 · 2025-11-19T08:57:10Z

fix
fmt non-space
#9127

github-actions · 2025-11-19T09:07:11Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

cakebaker · 2025-11-19T10:49:41Z

src/uu/fmt/src/parasplit.rs

+                buf.pop();
+            }
+        }
+        let n = String::from_utf8_lossy(&buf).into_owned();


I think using from_utf8_lossy is incorrect.

If you look at the output of GNU fmt, you will see that they don't do a lossy conversion:

$ printf "=\xA0=" | fmt -s -w1 | hexdump -X
0000000 3d a0 3d 0a
0000004

And our output:

$ printf "=\xA0=" | cargo run -q fmt -s -w1 | hexdump -X
0000000 3d ef bf bd 3d 0a
0000006
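The difference between the two outputs can be reproduced with the standard library alone. This is a sketch; the function names `lossy_bytes` and `passthrough_bytes` are hypothetical and do not appear in the PR.

```rust
/// Lossy path: each invalid sequence becomes U+FFFD, which encodes as EF BF BD.
fn lossy_bytes(input: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(input).into_owned().into_bytes()
}

/// Passthrough path: the input bytes are forwarded unchanged.
fn passthrough_bytes(input: &[u8]) -> Vec<u8> {
    input.to_vec()
}
```

For the input `=\xA0=`, the lossy path yields `3d ef bf bd 3d` and the passthrough path yields `3d a0 3d`, matching the two hexdumps above (minus the trailing newline).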

- Changed `indent_str` field in `BreakArgs` to `indent: &[u8]` to avoid repeated UTF-8 conversions.
- Updated `write_all` calls to pass `&s` instead of `s.as_bytes()` in fmt.rs and similar string/byteslicing in linebreak.rs.
- Modified method signatures in parasplit.rs to accept `&[u8]` instead of `&str` for prefix matching, ensuring consistent byte-level operations without assuming valid UTF-8.
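As a rough illustration of byte-level prefix matching, the following sketch works on `&[u8]` and therefore never needs the line to be valid UTF-8. The function name `strip_prefix_bytes` is hypothetical; the actual signatures in parasplit.rs are not shown in this conversation.

```rust
/// Return the remainder of `line` after `prefix`, or `None` if `line`
/// does not start with `prefix`. Both arguments are raw bytes.
fn strip_prefix_bytes<'a>(line: &'a [u8], prefix: &[u8]) -> Option<&'a [u8]> {
    line.strip_prefix(prefix)
}
```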

- Updated indentation calculation in FileLines to use is_some_and for tab and character checks, avoiding unnecessary computations and improving code flow.
- Changed punctuation checks in WordSplit iterator to use is_some_and for cleaner, more idiomatic Rust code.
- This refactor enhances readability and leverages short-circuiting behavior.
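For readers unfamiliar with `Option::is_some_and`, a small sketch of the pattern on byte slices follows. The function names are hypothetical; only the punctuation set `!`, `.`, `?` is taken from the diff in this PR.

```rust
/// True if the first byte of `line` is a tab.
fn starts_with_tab(line: &[u8]) -> bool {
    // `first()` is `None` for an empty slice, so the closure is skipped.
    line.first().is_some_and(|&b| b == b'\t')
}

/// True if the last byte of `word` is sentence-ending punctuation.
fn ends_with_sentence_punctuation(word: &[u8]) -> bool {
    word.last().is_some_and(|&b| matches!(b, b'!' | b'.' | b'?'))
}
```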

…line Refactored the is_whitespace assignment by combining chained method calls on one line for improved conciseness and readability.

codspeed-hq · 2025-11-19T11:47:33Z

Merging this PR will not alter performance

✅ 141 untouched benchmarks
⏩ 38 skipped benchmarks¹

Comparing mattsu2020:fmt_compatibility (9182df4) with main (0d403a4)

¹ 38 benchmarks were skipped, so the baseline results were used instead.

github-actions · 2025-11-19T11:53:19Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

…hrough Updated test_fmt_invalid_utf8 to expect raw byte (\xA0) passthrough instead of replacement character (\u{FFFD}) for invalid UTF-8 input, ensuring GNU-compatible behavior in fmt. This fixes the test expectation to match actual output, avoiding lossy conversion.

github-actions · 2025-11-19T12:15:24Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

github-actions · 2025-12-15T12:27:01Z

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/overlay-headers (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

github-actions · 2025-12-24T12:16:06Z

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

github-actions · 2025-12-27T07:58:36Z

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

sylvestre · 2025-12-27T10:19:09Z

src/uu/fmt/src/parasplit.rs

    }
 }

+fn utf8_char_width(byte: u8) -> Option<usize> {


please add a rustdoc comment for this function

When I read this I can see what the function is "doing", in that it gets the UTF-8 character width, but I think the missing explanation is why that byte table lookup corresponds to the UTF-8 character width.
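For context, the reason a lookup on the first byte is enough: in UTF-8 (RFC 3629) the high bits of the leading byte encode the sequence length. The sketch below keeps the signature quoted above, but the body is an assumption and may differ from the PR's implementation.

```rust
/// Length in bytes of the UTF-8 sequence that starts with `byte`,
/// or `None` if `byte` cannot start a sequence.
fn utf8_char_width(byte: u8) -> Option<usize> {
    match byte {
        0x00..=0x7F => Some(1), // 0xxxxxxx: ASCII, one byte
        0xC2..=0xDF => Some(2), // 110xxxxx: two-byte lead (C0/C1 would be overlong)
        0xE0..=0xEF => Some(3), // 1110xxxx: three-byte lead
        0xF0..=0xF4 => Some(4), // 11110xxx: four-byte lead (F5..FF exceed U+10FFFF)
        _ => None,              // 10xxxxxx continuation bytes and invalid leads
    }
}
```

Note that this only classifies the leading byte; the continuation bytes still have to be validated separately.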

sylvestre · 2025-12-27T10:19:21Z

src/uu/fmt/src/parasplit.rs

+    }
+}
+
+fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {


sylvestre · 2025-12-27T10:21:40Z

src/uu/fmt/src/parasplit.rs

-            if is_fmt_whitespace(x) {
-                true
+        let mut idx = word_start;
+        let mut last_ascii = None;


Maybe move that block into a function?

…uplication Extracted the logic for scanning the end of a word into a new `scan_word_end` method within the `WordSplit` implementation. This refactoring removes duplicated code from the `Iterator` implementation, improving maintainability and reducing redundancy. Additionally, added documentation comments to `utf8_char_width` and `decode_char` functions for better clarity.

github-actions · 2025-12-27T10:47:12Z

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

sylvestre · 2025-12-27T11:06:41Z

src/uu/fmt/src/parasplit.rs

+    }
+}
+
+fn byte_display_width(bytes: &[u8]) -> usize {


Please document it too

…lculation
- Added a new function to compute display width for UTF-8 byte slices.
- Treats invalid bytes as width 1 to handle malformed input gracefully.
- Improves text formatting accuracy in the fmt utility.
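A standard-library-only sketch of the idea in this commit message: valid characters contribute their display width and every invalid byte contributes 1. The body is an assumption, not the PR's code. In particular, the hypothetical `char_width` helper returns 1 for every character, whereas a real implementation would presumably consult a Unicode width table.

```rust
/// Placeholder width function (assumption: every valid char has width 1).
fn char_width(_c: char) -> usize {
    1
}

/// Display width of `bytes`, counting each invalid byte as width 1.
fn byte_display_width(bytes: &[u8]) -> usize {
    let mut width = 0;
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                width += valid.chars().map(char_width).sum::<usize>();
                break;
            }
            Err(err) => {
                let (valid, after) = rest.split_at(err.valid_up_to());
                // The prefix up to `valid_up_to()` is valid UTF-8 by construction.
                let valid = std::str::from_utf8(valid).unwrap_or("");
                width += valid.chars().map(char_width).sum::<usize>();
                // `error_len()` is `None` when the input ends mid-sequence;
                // in that case all remaining bytes are treated as invalid.
                let bad = err.error_len().unwrap_or(after.len());
                width += bad;
                rest = &after[bad..];
            }
        }
    }
    width
}
```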

github-actions · 2025-12-29T02:09:51Z

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!
Note: The gnu test tests/id/smack was skipped on 'main' but is now failing.
Note: The gnu test tests/mkdir/smack-no-root was skipped on 'main' but is now failing.
Note: The gnu test tests/mkdir/smack-root was skipped on 'main' but is now failing.

github-actions · 2026-01-05T11:17:42Z

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

ChrisDryden · 2026-01-07T00:25:12Z

src/uu/fmt/src/parasplit.rs

+}
+
+/// Decode a UTF-8 character starting at `start`, returning the char and bytes consumed.
+fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {


We already import the bstr library here, and it has all of these functions built in: `bstr::decode_utf8` returns an `(Option<char>, usize)`, so we wouldn't need all of this additional code.

Whoops, it's actually a workspace dependency, not yet a dependency of this utility, so that would be a choice for the maintainers.

It seems like it could be simplified with bstr.
Since it's regularly maintained, it should be safe to use.

ChrisDryden · 2026-01-07T00:33:03Z

Mind updating the PR description now, since the PR is mainly doing a passthrough instead of replacements?

sylvestre · 2026-01-09T20:07:35Z

src/uu/fmt/src/parasplit.rs

+
+/// Decode a UTF-8 character starting at `start`, returning the char and bytes consumed.
+fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
+    let first = bytes[start];


Potential out-of-bounds access: bytes[start] is accessed without bounds checking

just for safety
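One way to address this, sketched with the signature quoted above: use `get` so an out-of-range `start` yields `(None, 0)` instead of a panic. The body, including the choice to report 1 consumed byte for invalid input and 0 when `start` is at or past the end, is an assumption rather than the PR's implementation.

```rust
/// Decode a UTF-8 character starting at `start`, returning the char and bytes consumed.
fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
    // `get` returns `None` instead of panicking when `start` is past the end.
    let Some(rest) = bytes.get(start..) else {
        return (None, 0);
    };
    let Some(&first) = rest.first() else {
        return (None, 0);
    };
    // Expected sequence length, derived from the leading byte.
    let len = match first {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => return (None, 1),
    };
    // Validate the whole sequence; a truncated or malformed one consumes 1 byte.
    match rest.get(..len).and_then(|seq| std::str::from_utf8(seq).ok()) {
        Some(s) => (s.chars().next(), len),
        None => (None, 1),
    }
}
```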

sylvestre · 2026-01-09T20:09:37Z

src/uu/fmt/src/parasplit.rs

+        matches!(b, b'!' | b'.' | b'?')
+    }
+
+    fn scan_word_end(&self, word_start: usize) -> (usize, usize, Option<u8>) {


it duplicates some logic from decode_char, no?

… UTF-8 char handling
- Add DecodedCharInfo struct to bundle char, consumed bytes, width, and ASCII flag
- Add decode_char_info() function to compute and return DecodedCharInfo
- Refactor byte_display_width(), FileLines::parse(), and WordSplit::next_word() to use decode_char_info() instead of direct decode_char() calls, improving code organization and reducing duplication
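A rough sketch of what such a bundle could look like. Only the struct and function names come from the commit message; the field names, the `Option` return type, and the fixed width of 1 are assumptions made for illustration.

```rust
/// Everything a caller needs to know about one decoded position (field names assumed).
#[derive(Debug, PartialEq)]
struct DecodedCharInfo {
    ch: Option<char>, // `None` for an invalid byte
    consumed: usize,  // bytes to advance
    width: usize,     // display width (assumption: always 1 in this sketch)
    is_ascii: bool,
}

/// Decode at `start`; returns `None` when `start` is at or past the end of `bytes`.
fn decode_char_info(bytes: &[u8], start: usize) -> Option<DecodedCharInfo> {
    let rest = bytes.get(start..).filter(|r| !r.is_empty())?;
    // A UTF-8 scalar value is at most 4 bytes long.
    let window = &rest[..rest.len().min(4)];
    let valid = match std::str::from_utf8(window) {
        Ok(s) => s,
        // Keep only the valid prefix; it is empty if the first byte is invalid.
        Err(e) => std::str::from_utf8(&window[..e.valid_up_to()]).unwrap_or(""),
    };
    Some(match valid.chars().next() {
        Some(c) => DecodedCharInfo {
            ch: Some(c),
            consumed: c.len_utf8(),
            width: 1,
            is_ascii: c.is_ascii(),
        },
        // Invalid byte: pass it through as a single unit of width 1.
        None => DecodedCharInfo { ch: None, consumed: 1, width: 1, is_ascii: false },
    })
}
```

Returning one value per position lets callers such as a width calculation or a word scanner share a single decoding path instead of repeating it.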

github-actions · 2026-01-10T05:47:48Z

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

…r clarity Replace match-based byte range checks with explicit constants and if-else logic, adding Unicode Standard and RFC 3629 references to improve readability and maintainability of UTF-8 sequence length detection.

Removes unnecessary blank lines after the utf8_char_width function for cleaner code formatting.

github-actions · 2026-01-12T01:52:32Z

GNU testsuite comparison:

Skip an intermittent issue tests/tty/tty-eof (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

…tils#9329) --------- Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>

mattsu2020 added 3 commits November 19, 2025 16:02

test: add word joiner and cyrillic kha character tests for fmt

073f7fc

feat: Enhance fmt to handle invalid UTF-8 input by replacing malfor…

36a01a1

…med sequences instead of dropping lines.

chore: add FFFD to spell-checker ignore list in fmt test.

2c617d4

cakebaker reviewed Nov 19, 2025


mattsu2020 added 3 commits November 19, 2025 20:16

style(fmt): compact whitespace check in WordSplit iterator to single …

c59f1bc

…line Refactored the is_whitespace assignment by combining chained method calls on one line for improved conciseness and readability.

Merge branch 'main' into fmt_compatibility

fee09e6

Merge branch 'main' into fmt_compatibility

e36cf9d

ChrisDryden mentioned this pull request Dec 24, 2025

GNU coreutils 9.9: Detailed Test report 12/19 #9729

Closed

Merge branch 'main' into fmt_compatibility

94efbb4

sylvestre reviewed Dec 27, 2025


feat(fmt): add byte_display_width function for UTF-8 display width ca…

6348929

…lculation
- Added a new function to compute display width for UTF-8 byte slices.
- Treats invalid bytes as width 1 to handle malformed input gracefully.
- Improves text formatting accuracy in the fmt utility.

sylvestre and others added 2 commits December 29, 2025 13:40

Merge branch 'main' into fmt_compatibility

8ece9f5

Merge branch 'main' into fmt_compatibility

d2fa979

Merge branch 'main' into fmt_compatibility

637bf76

mattsu2020 requested review from cakebaker and sylvestre January 5, 2026 10:52

ChrisDryden reviewed Jan 7, 2026


sylvestre reviewed Jan 9, 2026


mattsu2020 added 2 commits January 12, 2026 10:41

style(fmt): remove extra blank lines in parasplit.rs

9182df4

Removes unnecessary blank lines after the utf8_char_width function for cleaner code formatting.

sylvestre merged commit 574f6ba into uutils:main Jan 13, 2026
148 of 149 checks passed

mattsu2020 deleted the fmt_compatibility branch January 13, 2026 23:08

mattsu2020 added a commit to mattsu2020/coreutils that referenced this pull request Jan 23, 2026

fmt : handle invalid UTF-8 input by replacing malformed sequences (uu…

87a4534

…tils#9329) --------- Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>

moonfruit mentioned this pull request Feb 3, 2026

uutils-selected 0.6.0 moonfruit/homebrew-tap#453

Closed
