Skip to content

fmt : handle invalid UTF-8 input by replacing malformed sequences#9329

Merged
sylvestre merged 18 commits intouutils:mainfrom
mattsu2020:fmt_compatibility
Jan 13, 2026
Merged

fmt : handle invalid UTF-8 input by replacing malformed sequences#9329
sylvestre merged 18 commits intouutils:mainfrom
mattsu2020:fmt_compatibility

Conversation

@mattsu2020
Copy link
Contributor

@mattsu2020 mattsu2020 commented Nov 19, 2025

escription
This PR makes fmt tolerant of invalid UTF-8 without dropping lines, while preserving the original bytes.

Previously, fmt relied on BufRead::lines(), which errors on invalid UTF-8 and could stop iteration or skip lines.

This change reads lines with read_until into a byte buffer and processes them directly. UTF-8 decoding is only used where needed for display width; invalid bytes are treated as width 1, and the raw input bytes are passed through without replacement.

@mattsu2020
Copy link
Contributor Author

fix
fmt non-space
#9127

@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

buf.pop();
}
}
let n = String::from_utf8_lossy(&buf).into_owned();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using from_utf8_lossy is incorrect.

If you look at the output of GNU fmt, you will see that they don't do a lossy conversion:

$ printf "=\xA0=" | fmt -s -w1 | hexdump -X
0000000  3d  a0  3d  0a                                                
0000004

And our output:

 printf "=\xA0=" | cargo run -q fmt -s -w1 | hexdump -X
0000000  3d  ef  bf  bd  3d  0a                                        
0000006

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix it

- Changed `indent_str` field in `BreakArgs` to `indent: &[u8]` to avoid repeated UTF-8 conversions.
- Updated `write_all` calls to pass `&s` instead of `s.as_bytes()` in fmt.rs and similar string/byteslicing in linebreak.rs.
- Modified method signatures in parasplit.rs to accept `&[u8]` instead of `&str` for prefix matching, ensuring consistent byte-level operations without assuming valid UTF-8.
- Updated indentation calculation in FileLines to use is_some_and for tab and character checks,
  avoiding unnecessary computations and improving code flow.
- Changed punctuation checks in WordSplit iterator to use is_some_and for cleaner,
  more idiomatic Rust code.
- This refactor enhances readability and leverages short-circuiting behavior.
…line

Refactored the is_whitespace assignment by combining chained method calls on one line for improved conciseness and readability.
@codspeed-hq
Copy link

codspeed-hq bot commented Nov 19, 2025

Merging this PR will not alter performance

✅ 141 untouched benchmarks
⏩ 38 skipped benchmarks1


Comparing mattsu2020:fmt_compatibility (9182df4) with main (0d403a4)

Open in CodSpeed

Footnotes

  1. 38 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

…hrough

Updated test_fmt_invalid_utf8 to expect raw byte (\xA0) passthrough instead of replacement character (\u{FFFD}) for invalid UTF-8 input, ensuring GNU-compatible behavior in fmt. This fixes the test expectation to match actual output, avoiding lossy conversion.
@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/overlay-headers (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

}
}

fn utf8_char_width(byte: u8) -> Option<usize> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a rustdoc comment for this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I read this I can see what the function is "doing" in that it gets the utf8 character character width, but I think the explanation that's missing is why that byte table lookup corresponds to the utf8 character width.

}
}

fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add

if is_fmt_whitespace(x) {
true
let mut idx = word_start;
let mut last_ascii = None;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move that block into a function ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

…uplication

Extracted the logic for scanning the end of a word into a new `scan_word_end` method within the `WordSplit` implementation. This refactoring removes duplicated code from the `Iterator` implementation, improving maintainability and reducing redundancy. Additionally, added documentation comments to `utf8_char_width` and `decode_char` functions for better clarity.
@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

}
}

fn byte_display_width(bytes: &[u8]) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document it too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add

…lculation

- Added a new function to compute display width for UTF-8 byte slices.
- Treats invalid bytes as width 1 to handle malformed input gracefully.
- Improves text formatting accuracy in the fmt utility.
@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!
Note: The gnu test tests/id/smack was skipped on 'main' but is now failing.
Note: The gnu test tests/mkdir/smack-no-root was skipped on 'main' but is now failing.
Note: The gnu test tests/mkdir/smack-root was skipped on 'main' but is now failing.

@github-actions
Copy link

github-actions bot commented Jan 5, 2026

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

}

/// Decode a UTF-8 character starting at `start`, returning the char and bytes consumed.
fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already import the bstr library here that has all of these functions that are being added as a built in bstr::decode will return a (Option, usize) and then we don't need all of this additional code

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, its actually a workspace dependency not a package of this utility already, that would be.a choice up to the maintainers then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it could be simplified with bstr.
Since it's regularly maintained, it should be safe to use.

@ChrisDryden
Copy link
Collaborator

Mind updating the PR description now since the PR is mainly doing a pass through instead of doing replacements.


/// Decode a UTF-8 character starting at `start`, returning the char and bytes consumed.
fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
let first = bytes[start];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential out-of-bounds access: bytes[start] is accessed without bounds checking

just for safety

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

matches!(b, b'!' | b'.' | b'?')
}

fn scan_word_end(&self, word_start: usize) -> (usize, usize, Option<u8>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it duplicates some logic from decode_char, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

… UTF-8 char handling

- Add DecodedCharInfo struct to bundle char, consumed bytes, width, and ASCII flag
- Add decode_char_info() function to compute and return DecodedCharInfo
- Refactor byte_display_width(), FileLines::parse(), and WordSplit::next_word() to use decode_char_info() instead of direct decode_char() calls, improving code organization and reducing duplication
@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/fmt/non-space is no longer failing!

…r clarity

Replace match-based byte range checks with explicit constants and if-else logic, adding Unicode Standard and RFC 3629 references to improve readability and maintainability of UTF-8 sequence length detection.
Removes unnecessary blank lines after the utf8_char_width function for cleaner code formatting.
@github-actions
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tty/tty-eof (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/fmt/non-space is no longer failing!

@sylvestre sylvestre merged commit 574f6ba into uutils:main Jan 13, 2026
148 of 149 checks passed
@mattsu2020 mattsu2020 deleted the fmt_compatibility branch January 13, 2026 23:08
mattsu2020 added a commit to mattsu2020/coreutils that referenced this pull request Jan 23, 2026
…tils#9329)

---------

Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants