Float parsing corrupted when a long decimal number extends ≥13 bytes past a 64-byte chunk boundary #455

@raman-39k

Description

simd_json::from_slice silently produces a wrong f64 when a decimal number starts near the end of a 64-byte SIMD chunk and enough of the number's tail (≥ 13 bytes) spills into the next chunk. No parse error is returned.


Minimal reproducer

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Level { price: f64, amount: f64 }

#[derive(Debug, Deserialize)]
struct Book { asks: Vec<Level> }

fn main() {
    // "3077999.0000000000000000" (24 bytes) starts at byte offset 56.
    // Bytes 56-63 (8 bytes) are in chunk 0; the remaining 16 bytes spill into chunk 1.
    let json: &[u8] =
        br#"{"asks":[{"price":3077990.0,"amount":0.111111},{"price":3077999.0000000000000000,"amount":0.5}]}"#;

    assert_eq!(json[56], b'3');
    assert_eq!(&json[56..80], b"3077999.0000000000000000");

    let mut buf = json.to_vec();
    let book: Book = simd_json::from_slice(&mut buf).unwrap();

    let simd_price = book.asks[1].price;
    // Named `reference` rather than `serde` to avoid confusion with the `serde` crate.
    let reference: Book = serde_json::from_slice(json).unwrap();
    let serde_price = reference.asks[1].price;

    println!("simd-json : {simd_price}");  // 1082.0885052467904  ← WRONG
    println!("serde_json: {serde_price}"); // 3077999             ← correct
    assert_eq!(simd_price, serde_price, "simd-json parsed the wrong value");
}

Output (v0.14.3 and v0.17.0):

simd-json : 1082.0885052467904
serde_json: 3077999
thread 'main' panicked: simd-json parsed the wrong value

Exact trigger condition

The bug fires when a number starts at offset N within a 64-byte chunk and
(N + number_length) - 64 >= 13 — i.e. at least 13 bytes of the number extend
into the next chunk. Sweeping number lengths at offset 56 (8 bytes remaining in chunk):

| Length | Bytes past boundary | simd-json result |
|---|---|---|
| 9–20 | 1–12 | ✅ correct |
| 21 | 13 | 1233324.59 (wrong) |
| 22 | 14 | 126519.95 (wrong) |
| 23 | 15 | 15839.48 (wrong) |
| 24 | 16 | 1082.09 (wrong) |
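
The arithmetic behind this sweep can be checked with a small helper. This is only a sketch of the condition as stated in this report, not of simd-json's internals: `bytes_past_boundary` and `meets_trigger_condition` are hypothetical names, and the threshold of 13 is the value observed in the sweep, not something taken from the library source.

```rust
/// Chunk size described in this report.
const CHUNK: usize = 64;

/// Bytes of a token that lie beyond the first chunk boundary after its start.
/// `start` is the absolute byte offset of the token, `len` its length in bytes.
fn bytes_past_boundary(start: usize, len: usize) -> usize {
    (start % CHUNK + len).saturating_sub(CHUNK)
}

/// Necessary (not sufficient) condition from this report.
/// The threshold 13 is an observation from the sweep, not a library constant.
fn meets_trigger_condition(start: usize, len: usize) -> bool {
    bytes_past_boundary(start, len) >= 13
}

fn main() {
    // Same sweep as the table: number lengths 9..=24 starting at offset 56.
    for len in 9..=24 {
        let spill = bytes_past_boundary(56, len);
        let flagged = meets_trigger_condition(56, len);
        println!("length {len:2}  past boundary {spill:2}  flagged: {flagged}");
    }
}
```

With a start offset of 56, length 21 is the first one flagged, because 56 + 21 - 64 = 13.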

Layout-dependent: the bug does not fire at every boundary crossing

The trigger condition above is necessary but not always sufficient. In a multi-level JSON
payload (dozens of entries after the affected number), the same number at mod-64 = 57
(the 7 bytes of the integer part visible before the boundary) sometimes parses correctly even
though 17 bytes extend past the chunk boundary (17 ≥ 13).

This was observed in a reproducer with 8 levels where two entries landed at mod-64 = 57
(offset % 64 == 57). The snippet below shows this side by side with mod-64 = 58,
which fails consistently:

use serde::Deserialize;
#[derive(Debug, Deserialize)]
struct Level { price: f64, amount: f64 }
#[derive(Debug, Deserialize)]
struct Book { asks: Vec<Level> }

fn check(label: &str, json: &[u8], idx: usize, expected: f64) {
    let mut buf = json.to_vec();
    let book: Book = simd_json::from_slice(&mut buf).unwrap();
    let got = book.asks[idx].price;
    let status = if got == expected { "✅ correct" } else { "❌ WRONG" };
    println!("{label}: got {got}  ({status})");
}

fn main() {
    // mod-64 = 57: complete integer part "3077999" (7 bytes) is in chunk 0,
    // the dot is the first byte of chunk 1 — 17 bytes spill (≥ 13), yet
    // simd-json parses correctly when the payload is long enough.
    //
    // Prefix is 57 bytes: {"asks":[{"price":3077990.0,"amount":0.1111111},{"price":
    //                      ^------- 57 bytes --------^
    let json_mod57: &[u8] =
        br#"{"asks":[{"price":3077990.0,"amount":0.1111111},{"price":3077999.0000000000000000,"amount":0.5},{"price":3077990.0,"amount":0.1},{"price":3077990.0,"amount":0.1},{"price":3077990.0,"amount":0.1}]}"#;
    assert_eq!(json_mod57[57], b'3', "number must start at offset 57");
    assert_eq!(57 % 64, 57);
    check("mod-64=57 (self-corrects)", json_mod57, 1, 3077999.0);

    // mod-64 = 58: only 6 bytes visible ("307799", truncated), 18 bytes spill —
    // fails reliably regardless of surrounding payload length.
    //
    // Prefix is 58 bytes: {"asks":[{"price":3077990.0,"amount":0.11111111},{"price":
    //                      ^-------- 58 bytes ---------^
    let json_mod58: &[u8] =
        br#"{"asks":[{"price":3077990.0,"amount":0.11111111},{"price":3077999.0000000000000000,"amount":0.5},{"price":3077990.0,"amount":0.1},{"price":3077990.0,"amount":0.1},{"price":3077990.0,"amount":0.1}]}"#;
    assert_eq!(json_mod58[58], b'3', "number must start at offset 58");
    assert_eq!(58 % 64, 58);
    check("mod-64=58 (always wrong) ", json_mod58, 1, 3077999.0);
}

Output:

mod-64=57 (self-corrects): got 3077999  (✅ correct)
mod-64=58 (always wrong) : got 1082.0885052467904  (❌ WRONG)
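
In the 8-level reproducer, both entries that landed at mod-64 = 57 parsed correctly:

| Level index | mod-64 offset | Bytes visible before boundary | Bytes past boundary | simd-json result |
|---|---|---|---|---|
| 4 | 57 | 3077999 (7, complete integer part) | 17 | ✅ correct |
| 7 | 57 | 3077999 (7, complete integer part) | 17 | ✅ correct |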

Yet in an isolated single-level JSON at offset 57, the same 24-byte number fails.
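
To probe other offsets without hand-counting prefixes, the first level's amount can be padded so the second price starts at a chosen byte, which is what the snippets above do manually. The sketch below uses only the standard library; `payload_with_number_at` is a hypothetical helper, and its output is meant to be fed to `simd_json::from_slice` in the same way as in the reproducers.

```rust
/// The 24-byte number used throughout this report.
const NUMBER: &str = "3077999.0000000000000000";

/// Build a two-level payload whose second price starts at byte `offset`.
/// The first level's amount is padded with '1' digits to shift the number.
/// Returns None when `offset` is too small for the fixed framing.
fn payload_with_number_at(offset: usize) -> Option<Vec<u8>> {
    let head = r#"{"asks":[{"price":3077990.0,"amount":0."#;
    let mid = r#"},{"price":"#;
    let tail = r#","amount":0.5}]}"#;
    let fixed = head.len() + mid.len();
    // At least one digit is required after "0." in the padded amount.
    if offset <= fixed {
        return None;
    }
    let pad = "1".repeat(offset - fixed);
    Some(format!("{}{}{}{}{}", head, pad, mid, NUMBER, tail).into_bytes())
}

fn main() {
    for offset in 56..=58 {
        let payload = payload_with_number_at(offset).expect("offset too small");
        // The number must start exactly where requested.
        assert_eq!(&payload[offset..offset + NUMBER.len()], NUMBER.as_bytes());
        println!("offset {offset}: {}", String::from_utf8_lossy(&payload));
    }
}
```

For offset 56 this reproduces the payload of the minimal reproducer byte for byte.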

Why this matters:

  • At mod-64 = 57 the complete integer part (3077999) is visible in chunk 0; the dot
    is the first byte of chunk 1. When the surrounding JSON is long enough that simd-json
    has already loaded the next chunk as part of structural analysis, the number parser
    may land on a different internal code path and get the right answer.
  • At mod-64 = 56 the integer part plus the dot are visible (8 bytes = 3077999.),
    but 16 bytes still spill — so the bug fires reliably regardless.
  • At mod-64 = 58 only 6 bytes are visible (307799, truncated integer); 18 bytes spill
    and the bug fires reliably.

The net effect is that whether a given payload triggers the bug depends on:

  1. The start offset of the number within its 64-byte chunk.
  2. How many bytes extend past the boundary.
  3. The total size and layout of the surrounding JSON, which appears to affect whether
    simd-json's structural analysis has already loaded the next chunk.

This makes the bug latent and intermittent in production: a payload that has always
worked can silently break after an innocuous change to an earlier field (longer symbol
name, extra field, whitespace change) shifts a price string to a different offset.
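
Until a fix is available, one way to spot payloads at risk is to scan for number tokens that satisfy the necessary condition described above. The following is a rough sketch under stated assumptions: `at_risk_numbers` is a hypothetical helper, it uses a simplified tokenizer (numbers are runs of digits, sign, dot and exponent characters outside string literals), and it checks only the offset and spill criteria, so a hit does not guarantee a wrong parse.

```rust
const CHUNK: usize = 64;
/// Threshold observed in this report's sweep (an assumption, not a library constant).
const MIN_SPILL: usize = 13;

/// Return (start, len) for every number token outside string literals whose
/// tail extends at least MIN_SPILL bytes past the next 64-byte boundary.
fn at_risk_numbers(json: &[u8]) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    let mut in_string = false;
    let mut i = 0;
    while i < json.len() {
        let b = json[i];
        if in_string {
            match b {
                b'\\' => i += 1, // also skip the escaped byte
                b'"' => in_string = false,
                _ => {}
            }
            i += 1;
        } else if b == b'"' {
            in_string = true;
            i += 1;
        } else if b == b'-' || b.is_ascii_digit() {
            let start = i;
            while i < json.len()
                && matches!(json[i], b'0'..=b'9' | b'-' | b'+' | b'.' | b'e' | b'E')
            {
                i += 1;
            }
            let len = i - start;
            if (start % CHUNK + len).saturating_sub(CHUNK) >= MIN_SPILL {
                hits.push((start, len));
            }
        } else {
            i += 1;
        }
    }
    hits
}

fn main() {
    let json: &[u8] =
        br#"{"asks":[{"price":3077990.0,"amount":0.111111},{"price":3077999.0000000000000000,"amount":0.5}]}"#;
    for (start, len) in at_risk_numbers(json) {
        println!("number at offset {start} (mod-64 = {}), length {len}", start % CHUNK);
    }
}
```

A check like this could run in a test suite or as a guard that routes flagged payloads to `serde_json`; whether that trade-off is worthwhile depends on the application.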


Visual

chunk 0 (bytes 0–63):
  {"asks":[{"price":3077990.0,"amount":0.111111},{"price":3077999.
                                                          ^^^^^^^^
                                                  only 8 bytes of the number visible here

chunk 1 (bytes 64–127):
  0000000000000000,"amount":0.5}]}
  ^^^^^^^^^^^^^^^^
  remaining 16 bytes, which the parser reads incorrectly
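
The chunk layout can be reproduced for any payload by splitting it at 64-byte boundaries. This is a minimal sketch using only the standard library; `split_chunks` is a hypothetical helper, and it mirrors the 64-byte chunking described in this report, not simd-json's actual loading code. For the minimal reproducer, the split agrees with the comment there that bytes 56-63 (`3077999.`) fall in chunk 0.

```rust
/// Split a payload into the 64-byte chunks described in this report.
fn split_chunks(json: &[u8]) -> Vec<&[u8]> {
    json.chunks(64).collect()
}

fn main() {
    let json: &[u8] =
        br#"{"asks":[{"price":3077990.0,"amount":0.111111},{"price":3077999.0000000000000000,"amount":0.5}]}"#;
    for (i, chunk) in split_chunks(json).iter().enumerate() {
        let first = i * 64;
        let last = first + chunk.len() - 1;
        println!("chunk {i} (bytes {first}-{last}):");
        println!("  {}", String::from_utf8_lossy(chunk));
    }
}
```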

Environment

  • simd-json: 0.14.3 and 0.17.0
  • OS: Linux x86_64
  • Rust: stable (1.96.0)
