Skip to content

discussion: bucket[N] / truncate[N] upper bound diverges from Java reference impl #2486

@laskoviymishka

Description

@laskoviymishka

Follow-up from apache/iceberg-go#1110.

iceberg-rust uses Bucket(u32) / Truncate(u32), accepting N up to MaxUint32 (4,294,967,295). Java's reference impl uses int, capping at Integer.MAX_VALUE (2,147,483,647). iceberg-go just landed apache/iceberg-go#1110 which validates n <= math.MaxInt32 at parse time, aligning with Java. iceberg-cpp uses int32_t, so the upper bound matches Java via the type. PyIceberg uses PositiveInt with no explicit ceiling.

So among the five clients:

client type upper bound on N
Java int MaxInt32 (2,147,483,647)
iceberg-go (post-#1110) int + parse check MaxInt32
iceberg-cpp int32_t INT32_MAX
iceberg-rust u32 MaxUint32 (4,294,967,295)
PyIceberg PositiveInt unbounded

The spec doesn't bound N explicitly — Integer.MAX_VALUE only appears as a mask operand in the bucket formula (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N, not as a ceiling. So N > Integer.MAX_VALUE is undefined per-spec rather than disallowed, but Java is the de facto reference and three of the four other clients match.

Worth noting: iceberg-cpp's TransformFromString parses via ParseNumber<int32_t> (rejects overflow at parse time via the type), but the Bucket / Truncate constructors don't validate > 0, so bucket[0] parses cleanly and only fails later at runtime — same shape as pre-#1110 iceberg-go. Probably a separate small issue worth filing on cpp's side.

Doesn't bite today — no real table is plausibly partitioned with 2–4 billion buckets. But rust is currently the only client that could author a spec in (MaxInt32, MaxUint32], and such a spec would throw NumberFormatException in Java and fail to parse in iceberg-go post-#1110.

If u32 is intentional, a doc comment near Bucket / Truncate capturing the divergence would help future readers not read it as a parity gap. If it's incidental, narrowing to i32 (or a NonZero newtype) would converge with the reference impl.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions