GH-36388: [C++][Python] Return error from MakeArrayFromScalar on offset overflow #50024

Open
Sriniketh24 wants to merge 1 commit into apache:main from Sriniketh24:fix/repeat-offset-overflow
Sriniketh24 commented May 23, 2026

Rationale

pyarrow.repeat (backed by MakeArrayFromScalar in C++) silently created an invalid array with negative offsets when the total data size (value_size * repetition_count) exceeded INT32_MAX for 32-bit offset types (StringType, BinaryType). Array creation succeeded without error, but a later validation failed with a cryptic "Negative offsets in binary array" or "non-monotonic offset" message.
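To make the failure mode concrete, here is a minimal sketch of how a running offset kept in 32 bits can go negative. It is a pure-Python model, not Arrow code: `wrap_int32` and `repeated_offsets_int32` are hypothetical helpers, and the model assumes two's-complement wraparound, which matches the negative offsets described above.

```python
import ctypes

INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an integer to a signed 32-bit value (two's-complement wraparound)."""
    # ctypes integer types do no overflow checking; they truncate to 32 bits.
    return ctypes.c_int32(value).value


def repeated_offsets_int32(value_length: int, count: int) -> list[int]:
    """Model the offsets for `count` copies of one value, accumulated in 32 bits."""
    offsets = [0]
    for _ in range(count):
        offsets.append(wrap_int32(offsets[-1] + value_length))
    return offsets


# Three copies of a 1 GiB value need 3 GiB of data, which exceeds INT32_MAX bytes.
offsets = repeated_offsets_int32(value_length=2**30, count=3)
print(offsets)  # [0, 1073741824, -2147483648, -1073741824]
```

The third offset is smaller than the second and negative, which corresponds to the two validation messages quoted above.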

What changed

Added an early overflow check in RepeatedArrayFactory::CreateOffsetsBuffer that computes the total data size in int64_t and returns Status::Invalid with an actionable error message when it would exceed the offset type's maximum. The error message suggests using large_* types (e.g. large_string, large_binary) for data exceeding 2 GB.
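The shape of that check can be sketched in Python as follows. This is an illustration under assumptions, not the code in util.cc: `check_repeat_fits` and its error text are hypothetical, and Python's arbitrary-precision integers stand in for the int64_t computation used in the C++ change.

```python
INT32_MAX = 2**31 - 1  # maximum offset for string / binary
INT64_MAX = 2**63 - 1  # maximum offset for large_string / large_binary


def check_repeat_fits(value_length: int, count: int, offset_max: int = INT32_MAX) -> int:
    """Return the total data size, or raise if it exceeds the offset type's maximum."""
    # Compute the product in a wider type before any offsets are written.
    total_size = value_length * count
    if value_length > 0 and count > 0 and total_size > offset_max:
        # Hypothetical wording; the real message is produced by Status::Invalid.
        raise ValueError(
            f"total data size ({total_size}) exceeds the maximum offset ({offset_max}); "
            "consider a large_* type"
        )
    return total_size


# A total of exactly INT32_MAX bytes still fits; one byte more does not.
print(check_repeat_fits(1, INT32_MAX))
# The same 3 GiB request that fails for 32-bit offsets fits with 64-bit offsets.
print(check_repeat_fits(2**30, 3, offset_max=INT64_MAX))
```

Doing the comparison before any offsets are generated means the caller gets the error at creation time and no partially built array needs to be cleaned up.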

Are these changes tested?

Yes.

  • C++ test: TestMakeArrayFromScalarOffsetOverflow in array_test.cc — tests string, binary, and large_string scalars
  • Python test: test_repeat_offset_overflow in test_array.py — verifies pa.repeat raises ArrowInvalid on overflow

Are there any user-facing changes?

Yes. MakeArrayFromScalar now returns Status::Invalid early, which pyarrow.repeat surfaces as ArrowInvalid, with a clear error message instead of silently returning a corrupt array. Callers see the failure at creation time rather than at validation.
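This PR only reports the error and points to large_* types; it does not chunk the result. The linked issue title mentions chunked arrays, so as a rough illustration of the arithmetic involved, the sketch below splits a repetition count into pieces that each fit within 32-bit offsets. `plan_chunks` is a hypothetical helper, not a pyarrow API.

```python
INT32_MAX = 2**31 - 1


def plan_chunks(value_length: int, count: int, offset_max: int = INT32_MAX) -> list[int]:
    """Split `count` repetitions into chunk sizes whose data fits in `offset_max` bytes."""
    if value_length < 0 or count < 0:
        raise ValueError("value_length and count must be non-negative")
    if value_length > offset_max:
        raise ValueError("a single value does not fit in the offset type")
    if count == 0:
        return []
    if value_length == 0:
        return [count]  # empty values never advance the offsets
    per_chunk = offset_max // value_length  # most repetitions one chunk can hold
    full, rest = divmod(count, per_chunk)
    return [per_chunk] * full + ([rest] if rest else [])


# Three copies of a 1 GiB value: each chunk can hold only one repetition.
print(plan_chunks(2**30, 3))  # [1, 1, 1]
# A small request fits in a single chunk.
print(plan_chunks(10, 5))  # [5]
```

Each planned chunk could then be built separately and combined into a chunked array by the caller; whether that is preferable to a large_* type depends on the consumer.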

Closes: #36388


This is AI-assisted work by Claude.

…n offset overflow

MakeArrayFromScalar silently created an invalid array with negative
offsets when the total data size (value_size * repetition_count)
exceeded the maximum value of the offset type. For 32-bit offset types
like StringType and BinaryType, this threshold is INT32_MAX (~2 GB).

The root cause was in CreateOffsetsBuffer where the running offset
accumulated via OffsetType addition without checking for overflow,
wrapping around to negative values.

Added an early overflow check in CreateOffsetsBuffer that computes the
total size in int64_t and compares against the offset type's maximum.
On overflow, a Status::Invalid error is returned with a message
suggesting the use of large_* types.

This is AI-assisted work by Claude.

Copilot AI left a comment

Pull request overview

This PR addresses GH-36388 by preventing MakeArrayFromScalar (and therefore pyarrow.repeat) from silently constructing invalid binary/string arrays when the repeated total byte size would exceed the maximum representable value of 32-bit offsets, returning an Invalid status with a clearer error instead.

Changes:

  • Added an offset overflow check in RepeatedArrayFactory::CreateOffsetsBuffer for variable-size offset types.
  • Added a C++ regression test covering string/binary offset overflow cases.
  • Added a Python regression test verifying pa.repeat raises ArrowInvalid on overflow.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| python/pyarrow/tests/test_array.py | Adds Python coverage ensuring pa.repeat raises on 32-bit offset overflow. |
| cpp/src/arrow/array/util.cc | Introduces early offset overflow detection when creating offsets for repeated variable-size values. |
| cpp/src/arrow/array/array_test.cc | Adds a C++ regression test for MakeArrayFromScalar offset overflow behavior. |

Comment on lines +860 to +864

    if (value_length > 0 && length_ > 0) {
      int64_t total_size = static_cast<int64_t>(value_length) * length_;
      if (total_size > static_cast<int64_t>(std::numeric_limits<OffsetType>::max())) {
        return Status::Invalid(
            "Cannot create array: total data size (", total_size,

Comment on lines +856 to +860

    // Check that the total data size does not overflow the offset type.
    // For 32-bit offset types (e.g. StringType, BinaryType), value_length * length_
    // must fit in int32_t, otherwise the offsets wrap around and produce an invalid
    // array with negative offsets.
    if (value_length > 0 && length_ > 0) {

Development

Successfully merging this pull request may close these issues.

[Python][C++] pyarrow.repeat returns an invalid array when a chunked array is required.

2 participants