Skip to content

Fix precision loss in large integral string conversions#3405

Open
fallintoplace wants to merge 4 commits into
apache:mainfrom
fallintoplace:fix-integral-string-precision
Open

Fix precision loss in large integral string conversions#3405
fallintoplace wants to merge 4 commits into
apache:mainfrom
fallintoplace:fix-integral-string-precision

Conversation

@fallintoplace
Copy link
Copy Markdown
Contributor

@fallintoplace fallintoplace commented May 22, 2026

Summary

Fixes precision loss when converting large integral strings in two runtime paths:

  • StringLiteral.to(IntegerType/LongType)
  • partition_to_py(...) for integral and time-based partition values backed by integers

Root cause

Both paths were converting through float before converting to int, which loses precision for values outside the IEEE-754 exact integer range.

That caused valid 64-bit integers like LongType.max and 9007199254740993 to be corrupted.

What changed

  • Replaced int(float(...)) with exact integer parsing in partition_to_py
  • For StringLiteral.to(IntegerType/LongType), exact integral strings now use exact integer parsing while fractional numeric strings retain the existing truncation behavior
  • Added regression tests for LongType.max and 9007199254740993

Validation

  • uv run pytest tests/expressions/test_literals.py tests/test_conversions.py

Closes #3404.

@fallintoplace fallintoplace marked this pull request as ready for review May 22, 2026 23:54
@ndrluis
Copy link
Copy Markdown
Collaborator

ndrluis commented May 24, 2026

@fallintoplace Thank you for your contribution.

Can we also handle and add tests for large integral-looking strings that are not plain integer literals, e.g. "9007199254740993.0", "9007199254740993e0", and f"{LongType.max}.0"? These still go through the float path today and either lose precision or overflow incorrectly. A test for "1e3" would also catch the scientific-notation regression from the previous behavior.

@fallintoplace
Copy link
Copy Markdown
Contributor Author

Thanks, pushed a follow-up in d443f37 to handle the remaining integral-looking string cases without going through float, and added regressions for "9007199254740993.0", "9007199254740993e0", f"{LongType.max}.0", and "1e3".

Comment thread pyiceberg/expressions/literals.py Outdated


def _truncate_numeric_string_to_int(value: str) -> int:
return int(Decimal(value))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int(Decimal(value)) fixes the precision issue, but it also means we materialize the full integer before doing the bounds check. For example, literal("1e1000000").to(IntegerType()) eventually returns IntAboveMax, but only after constructing a very large Python int first. In my local check this took ~17s.

Can we compare the parsed Decimal against IntegerType.max/min or LongType.max/min before converting to int?

number = Decimal(self.value)

if number > IntegerType.max:
    return IntAboveMax()
elif number < IntegerType.min:
    return IntBelowMin()

return LongLiteral(int(number))

That should preserve the precision fix while avoiding excessive CPU/memory use for obviously out-of-range values.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, pushed a follow-up in 29f7d36.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Large integral string values lose precision in literal and partition conversion

2 participants