Skip to content

fix JSON decoder error checking for UTF16 / surrogate parsing panic#7721

Merged
alamb merged 2 commits into
apache:mainfrom
nicklan:fix-json-surrogate-parsing-panic
Jun 22, 2025
Merged

fix JSON decoder error checking for UTF16 / surrogate parsing panic#7721
alamb merged 2 commits into
apache:mainfrom
nicklan:fix-json-surrogate-parsing-panic

Conversation

@nicklan

@nicklan nicklan commented Jun 20, 2025

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Shouldn't panic, especially in a fallible function.

What changes are included in this PR?

Validate that the high and low surrogates are in the expected range, which guarantees that the subtractions won't overflow.

Are there any user-facing changes?

No (well, things that used to panic now won't, but I don't think that counts)

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Jun 20, 2025
Comment on lines +708 to +717
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a pair of checked_sub work?

Suggested change
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}
char_from_surrogate_pair_opt(low, high).ok_or_else(|| {
ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair: {lo:#02X}, {high:#02X}"))
})
}
fn char_from_surrogate_pair_opt(low: u16, high: u16) -> Option<char> {
let high = high.checked_sub(0xD800)? as u32;
let low = low.checked_sub(0xDC00)? as u32;
char::from_u32((high << 10) | (low + 0x1_0000))

@nicklan nicklan Jun 20, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider this, and it does have the benefit of being more succinct. I opted for the range check because if one of the arguments was too high the checked sub would not catch it and I wasn't able to convince myself that only valid inputs would allow char::from_u32 to succeed.

I could dig more into the calling layers to check if that's prevented there, but it still felt like more of a foot-gun than the simple range check.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it is worth, I think the current way this is encoded is quite understandable and will generate clear error messages. I double checked and it conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16

@alamb

alamb commented Jun 21, 2025

Copy link
Copy Markdown
Contributor

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fix-json-surrogate-parsing-panic (30170ce) to fbaf7ce diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench json_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=fix-json-surrogate-parsing-panic
Results will be posted here when complete

@alamb

alamb commented Jun 21, 2025

Copy link
Copy Markdown
Contributor

Thank you @nicklan and @scovich -- I started the CI and some benchmarks on this PR

@alamb

alamb commented Jun 21, 2025

Copy link
Copy Markdown
Contributor

🤖: Benchmark completed

Details

group                                  fix-json-surrogate-parsing-panic       main
-----                                  --------------------------------       ----
large_bench_primitive                  1.00      2.3±0.01ms        ? ?/sec    1.04      2.4±0.00ms        ? ?/sec
small_bench_list                       1.00     12.9±0.03µs        ? ?/sec    1.03     13.2±0.07µs        ? ?/sec
small_bench_primitive                  1.01      6.6±0.04µs        ? ?/sec    1.00      6.5±0.04µs        ? ?/sec
small_bench_primitive_with_utf8view    1.00      6.2±0.04µs        ? ?/sec    1.04      6.5±0.09µs        ? ?/sec

@alamb

alamb commented Jun 21, 2025

Copy link
Copy Markdown
Contributor

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fix-json-surrogate-parsing-panic (30170ce) to fbaf7ce diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench json_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=fix-json-surrogate-parsing-panic
Results will be posted here when complete

@alamb

alamb commented Jun 21, 2025

Copy link
Copy Markdown
Contributor

🤖: Benchmark completed

Details

group                                  fix-json-surrogate-parsing-panic       main
-----                                  --------------------------------       ----
large_bench_primitive                  1.00      2.3±0.01ms        ? ?/sec    1.05      2.4±0.01ms        ? ?/sec
small_bench_list                       1.00     13.1±0.05µs        ? ?/sec    1.01     13.2±0.04µs        ? ?/sec
small_bench_primitive                  1.00      6.5±0.03µs        ? ?/sec    1.02      6.6±0.03µs        ? ?/sec
small_bench_primitive_with_utf8view    1.00      6.2±0.02µs        ? ?/sec    1.05      6.5±0.04µs        ? ?/sec

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nicklan and @scovich - this looks like a great change to me

Comment on lines +708 to +717
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it is worth, I think the current way this is encoded is quite understandable and will generate clear error messages. I double checked and it conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16

}

#[test]
fn test_invalid_surrogates() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this test panic's on main:

attempt to subtract with overflow
thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:708:49:
attempt to subtract with overflow

And the test actually fails to detect an error in release mode


---- reader::tape::tests::test_invalid_surrogates stdout ----

thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:959:9:
assertion failed: res.is_err()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@alamb alamb changed the title fix surrogate parsing panic fix JSON decoder error checking for surrogate parsing panic Jun 21, 2025
@alamb alamb changed the title fix JSON decoder error checking for surrogate parsing panic fix JSON decoder error checking for UTF16 / surrogate parsing panic Jun 21, 2025
@alamb

alamb commented Jun 22, 2025

Copy link
Copy Markdown
Contributor

If anything this seems faster than what is on main so it has a added bonus

@alamb alamb merged commit 2788762 into apache:main Jun 22, 2025
27 checks passed
@alamb

alamb commented Jun 22, 2025

Copy link
Copy Markdown
Contributor

Thanks again @nicklan and @scovich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

In arrow_json, Decoder::decode can panic if it encounters two high surrogates in a row.

3 participants