fmt: handle invalid UTF-8 input by replacing malformed sequences #9329
sylvestre merged 18 commits into uutils:main
src/uu/fmt/src/parasplit.rs
Outdated
    buf.pop();
    }
    }
    let n = String::from_utf8_lossy(&buf).into_owned();
I think using from_utf8_lossy is incorrect.
If you look at the output of GNU fmt, you will see that they don't do a lossy conversion:
$ printf "=\xA0=" | fmt -s -w1 | hexdump -X
0000000 3d a0 3d 0a
0000004
And our output:
$ printf "=\xA0=" | cargo run -q fmt -s -w1 | hexdump -X
0000000 3d ef bf bd 3d 0a
0000006
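The difference between the two hexdumps can be reproduced with the standard library alone. This is a minimal sketch, not code from the PR: `lossy_roundtrip` and `raw_roundtrip` are hypothetical helper names used only for illustration.

```rust
use std::io::Write;

/// Lossy path: each invalid byte is replaced by U+FFFD,
/// which is encoded as EF BF BD in UTF-8.
fn lossy_roundtrip(input: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(input).into_owned().into_bytes()
}

/// Byte path: the input is written out unchanged, valid UTF-8 or not.
fn raw_roundtrip(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    out.write_all(input).expect("writing to a Vec cannot fail");
    out
}
```

Feeding `b"=\xA0="` to both helpers shows the lossy path growing the output from three bytes to five, while the byte path returns the input as-is.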
- Changed `indent_str` field in `BreakArgs` to `indent: &[u8]` to avoid repeated UTF-8 conversions.
- Updated `write_all` calls to pass `&s` instead of `s.as_bytes()` in fmt.rs and similar string/byte slicing in linebreak.rs.
- Modified method signatures in parasplit.rs to accept `&[u8]` instead of `&str` for prefix matching, ensuring consistent byte-level operations without assuming valid UTF-8.
- Updated indentation calculation in `FileLines` to use `is_some_and` for tab and character checks, avoiding unnecessary computations and improving code flow.
- Changed punctuation checks in `WordSplit` iterator to use `is_some_and` for cleaner, more idiomatic Rust code.
- This refactor enhances readability and leverages short-circuiting behavior.
…line Refactored the is_whitespace assignment by combining chained method calls on one line for improved conciseness and readability.
Merging this PR will not alter performance
…hrough
Updated test_fmt_invalid_utf8 to expect raw byte (\xA0) passthrough instead of replacement character (\u{FFFD}) for invalid UTF-8 input, ensuring GNU-compatible behavior in fmt. This fixes the test expectation to match actual output, avoiding lossy conversion.
    }
    }

    fn utf8_char_width(byte: u8) -> Option<usize> {
please add a rustdoc comment for this function
When I read this I can see what the function is "doing", in that it gets the UTF-8 character width, but I think the missing explanation is why that byte table lookup corresponds to the UTF-8 character width.
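For reference, the kind of rustdoc being asked for might look like the following. This is a sketch written from RFC 3629, not the code that landed in the PR; the byte ranges are the standard UTF-8 lead-byte ranges, and the `match` form is just one way to express them.

```rust
/// Returns the expected length in bytes of a UTF-8 sequence, judged from its
/// first byte alone, or `None` if the byte cannot start a sequence.
///
/// UTF-8 (RFC 3629) encodes the sequence length in the high bits of the
/// lead byte:
///
/// - `0xxxxxxx` (0x00..=0x7F): 1 byte (ASCII)
/// - `110xxxxx` (0xC2..=0xDF): 2 bytes (0xC0 and 0xC1 would be overlong)
/// - `1110xxxx` (0xE0..=0xEF): 3 bytes
/// - `11110xxx` (0xF0..=0xF4): 4 bytes (0xF5 and above exceed U+10FFFF)
///
/// Bytes 0x80..=0xBF have the form `10xxxxxx`; they are continuation bytes
/// and never start a sequence.
fn utf8_char_width(byte: u8) -> Option<usize> {
    match byte {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}
```

The lead byte only gives the expected length; the continuation bytes still have to be validated before the sequence can be treated as a character.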
    }
    }

    fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
src/uu/fmt/src/parasplit.rs
Outdated
    if is_fmt_whitespace(x) {
    true
    let mut idx = word_start;
    let mut last_ascii = None;
Maybe move that block into a function?
…uplication Extracted the logic for scanning the end of a word into a new `scan_word_end` method within the `WordSplit` implementation. This refactoring removes duplicated code from the `Iterator` implementation, improving maintainability and reducing redundancy. Additionally, added documentation comments to `utf8_char_width` and `decode_char` functions for better clarity.
    }
    }

    fn byte_display_width(bytes: &[u8]) -> usize {
…lculation
- Added a new function to compute display width for UTF-8 byte slices.
- Treats invalid bytes as width 1 to handle malformed input gracefully.
- Improves text formatting accuracy in the fmt utility.
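The "invalid bytes count as width 1" rule can be sketched with `std::str::from_utf8` alone. This is an illustration, not the PR's implementation. It assumes every successfully decoded character occupies one column; a real implementation would consult a proper width table (for example via the `unicode-width` crate) for wide and zero-width characters.

```rust
/// Illustrative stand-in for a byte-oriented display width function.
/// Assumption: each valid char is one column wide; each invalid byte is
/// also counted as one column.
fn byte_display_width(bytes: &[u8]) -> usize {
    let mut width = 0;
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                width += valid.chars().count();
                break;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is valid UTF-8.
                let good = err.valid_up_to();
                width += std::str::from_utf8(&rest[..good])
                    .map_or(0, |s| s.chars().count());
                // Count the offending byte as one column and resume after it.
                width += 1;
                rest = &rest[good + 1..];
            }
        }
    }
    width
}
```

With this rule, `b"=\xA0="` measures 3 columns, the same as three ASCII characters, so line breaking stays predictable on malformed input.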
    }

    /// Decode a UTF-8 character starting at `start`, returning the char and bytes consumed.
    fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
We already import the bstr library here, which has all of the functions that are being added. Its built-in decoder (`bstr::decode_utf8`) returns an `(Option<char>, usize)`, so we wouldn't need all of this additional code.
Whoops, it's actually a workspace dependency, not yet a dependency of this utility, so that would be a choice for the maintainers.
It seems like it could be simplified with bstr.
Since it's regularly maintained, it should be safe to use.
Mind updating the PR description now, since the PR mainly passes bytes through instead of doing replacements?
src/uu/fmt/src/parasplit.rs
Outdated
    /// Decode a UTF-8 character starting at `start`, returning the char and bytes consumed.
    fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
    let first = bytes[start];
Potential out-of-bounds access: bytes[start] is accessed without bounds checking
just for safety
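One way to avoid the panic is to go through `slice::get` instead of indexing. The sketch below keeps the `decode_char` signature from the diff but is otherwise an assumption: it validates candidates with `std::str::from_utf8` rather than the PR's own lead-byte logic, and it returns `(None, 0)` for an out-of-range `start`, which is a hypothetical convention.

```rust
/// Decode one UTF-8 character starting at `start`.
/// Returns the char (or `None` for an invalid byte) and the bytes consumed.
/// An out-of-range `start` yields `(None, 0)` instead of panicking.
fn decode_char(bytes: &[u8], start: usize) -> (Option<char>, usize) {
    // `get` returns `None` when `start` is past the end, so nothing panics.
    let rest = match bytes.get(start..) {
        Some(rest) if !rest.is_empty() => rest,
        _ => return (None, 0),
    };
    // A UTF-8 sequence is 1 to 4 bytes long. Any proper prefix of a valid
    // sequence is incomplete, so the first window that validates holds
    // exactly one character.
    for len in 1..=rest.len().min(4) {
        if let Ok(s) = std::str::from_utf8(&rest[..len]) {
            return (s.chars().next(), len);
        }
    }
    // No window validated: consume a single invalid byte.
    (None, 1)
}
```

Callers can then treat a consumed length of 0 as "end of input" and a `None` character with length 1 as "pass this byte through unchanged".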
    matches!(b, b'!' | b'.' | b'?')
    }

    fn scan_word_end(&self, word_start: usize) -> (usize, usize, Option<u8>) {
it duplicates some logic from decode_char, no?
… UTF-8 char handling
- Add `DecodedCharInfo` struct to bundle char, consumed bytes, width, and ASCII flag
- Add `decode_char_info()` function to compute and return `DecodedCharInfo`
- Refactor `byte_display_width()`, `FileLines::parse()`, and `WordSplit::next_word()` to use `decode_char_info()` instead of direct `decode_char()` calls, improving code organization and reducing duplication
…r clarity Replace match-based byte range checks with explicit constants and if-else logic, adding Unicode Standard and RFC 3629 references to improve readability and maintainability of UTF-8 sequence length detection.
Removes unnecessary blank lines after the utf8_char_width function for cleaner code formatting.
…tils#9329)

Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>
Description
This PR makes fmt tolerant of invalid UTF-8 without dropping lines, while preserving the original bytes.
Previously, fmt relied on BufRead::lines(), which errors on invalid UTF-8 and could stop iteration or skip lines.
This change reads lines with read_until into a byte buffer and processes them directly. UTF-8 decoding is only used where needed for display width; invalid bytes are treated as width 1, and the raw input bytes are passed through without replacement.
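The reading strategy described above can be sketched as follows. `read_byte_lines` is a hypothetical helper for illustration, not a function from the PR; it only shows `BufRead::read_until` collecting lines as raw bytes.

```rust
use std::io::{self, BufRead};

/// Read newline-terminated lines as raw bytes, stripping the trailing `\n`.
/// Unlike `BufRead::lines()`, this never fails on invalid UTF-8.
fn read_byte_lines<R: BufRead>(mut reader: R) -> io::Result<Vec<Vec<u8>>> {
    let mut lines = Vec::new();
    loop {
        let mut buf = Vec::new();
        // `read_until` returns Ok(0) at end of input and does not
        // inspect the encoding of the bytes it reads.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        if buf.last() == Some(&b'\n') {
            buf.pop();
        }
        lines.push(buf);
    }
    Ok(lines)
}
```

Calling it with `std::io::stdin().lock()` yields every input line, including ones such as `=\xA0=` for which `lines()` would return an error.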