Deprecate RecordBatchOptions::with_match_field_names #7406

Open
tustvold wants to merge 4 commits into apache:main from tustvold:deprecate-with-match-field-names

@tustvold tustvold commented Apr 11, 2025

Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Noticed whilst working on #7405, this option is potentially unsound.

I did a quick scan of downstream projects and couldn't see any usage of this feature and so I think it is fine to just deprecate it. This will also potentially allow removing the rather cumbersome RecordBatchOptions.

The other option would be to make this method unsafe, but given the other checks within the various arrays I struggle to see how this would be usable reliably.

What changes are included in this PR?

Are there any user-facing changes?

Tagging @nevi-me as I think this API was last touched by you 3 or so years ago 😅

@github-actions github-actions Bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels Apr 11, 2025
Comment thread arrow-array/src/record_batch.rs Outdated
#[non_exhaustive]
pub struct RecordBatchOptions {
/// Match field names of structs and lists. If set to `true`, the names must match.
#[deprecated(note = "match_field_names is unsound")]

Contributor Author

This is marked non_exhaustive so the churn should be fairly minimal

Contributor

If we mark it unsound, I think it would help to provide a link or reference explaining why it is unsound.

For example, it is not clear to me why mismatched names are unsound 🤔

RecordBatch::try_new_with_options(
schema,
columns,
&RecordBatchOptions::new().with_match_field_names(false),

Contributor Author

I'm not sure why this was here, as the code below will always create the correct types AFAICT. Perhaps it was a workaround for a since-fixed bug?

Comment thread arrow-array/src/record_batch.rs Outdated
}

/// Sets the `match_field_names` of `RecordBatchOptions` and returns this [`RecordBatch`]
#[deprecated(note = "match_field_names is unsound")]

Contributor

Should there be a stated reason why it is unsound, so that users can understand the exact risks they accept when using it?

@tustvold tustvold Apr 11, 2025

Contributor Author

Tbh I've not sat down and worked out a precise exploit chain; if I had, this would have been flagged as a security vulnerability. However, it breaks a pretty fundamental invariant that is assumed in a number of places. The worst it is probably going to do is cause something to panic or produce invalid output, but the potential is there and I'd sleep happier not having it used 😆

Contributor

I had the same question above -- if we are going to claim something is unsound I think we should justify why and provide some hints for an alternative

tustvold commented Apr 11, 2025

Contributor Author

Fair, I have softened the wording. Here is a simple example of the unpredictable behaviour of this option:

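use std::sync::Arc;

use arrow_array::{Array, ListArray, RecordBatch, RecordBatchOptions, StructArray};
use arrow_schema::{DataType, Field, Schema};

// The schema declares a list column whose item field is named "item"
let schema = Arc::new(Schema::new(vec![Field::new_list(
    "a",
    Field::new("item", DataType::Boolean, true),
    true,
)]));
// The array is a list whose item field is named "bananas"
let col = Arc::new(ListArray::new_null(
    Arc::new(Field::new("bananas", DataType::Boolean, true)),
    2,
));

// With the default options the mismatch is rejected
RecordBatch::try_new(schema.clone(), vec![col.clone()]).unwrap_err();

// With match_field_names disabled the inconsistent batch is accepted
let options = RecordBatchOptions::default().with_match_field_names(false);
let batch =
    RecordBatch::try_new_with_options(schema.clone(), vec![col.clone()], &options).unwrap();

// This panics
batch.project(&[0]).unwrap();

// This panics
StructArray::from(batch).to_data().validate().unwrap();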

If one extends this to IPC, it gets wilder:

let mut buf = Vec::new();
let mut writer = crate::writer::FileWriter::try_new(&mut buf, batch.schema_ref()).unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();

let mut reader = FileReader::try_new(std::io::Cursor::new(buf), None).unwrap();
let out = reader.next().unwrap().unwrap();
assert_eq!(batch, out);

This fails with an incomprehensible assertion failure, as the display implementation assumes the field is consistent.

assertion `left == right` failed
  left: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: List(Field { name: "item", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  null,
  null,
]], row_count: 2 }
 right: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: List(Field { name: "item", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  null,
  null,
]], row_count: 2 }
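
For what it's worth, the identical-looking left and right sides can be reproduced in miniature. The sketch below is a toy model using only the standard library, not arrow's actual types: `ToyListColumn`, `item_field_name`, and `compare_toy_columns` are hypothetical names, and the assumption (one plausible reading of the output above) is that the two batches differ only in a nested field name that the debug output never prints.

```rust
use std::fmt;

/// Toy stand-in for a list column. This is NOT arrow's `ListArray`;
/// `item_field_name` models the nested field name carried by the array.
#[derive(Clone, PartialEq)]
struct ToyListColumn {
    item_field_name: String,
    values: Vec<Option<bool>>,
}

impl fmt::Debug for ToyListColumn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumption for illustration: the debug output prints only the
        // values and omits the nested field name entirely.
        f.debug_list().entries(self.values.iter()).finish()
    }
}

/// Compares two columns that differ only in their nested field name.
/// Returns `(columns are equal, debug strings are identical)`.
fn compare_toy_columns() -> (bool, bool) {
    let written = ToyListColumn {
        item_field_name: "bananas".to_string(),
        values: vec![None, None],
    };
    let read_back = ToyListColumn {
        item_field_name: "item".to_string(),
        ..written.clone()
    };

    // Equality sees the field name, the debug output does not.
    let equal = written == read_back;
    let same_debug = format!("{:?}", written) == format!("{:?}", read_back);
    (equal, same_debug)
}
```

In this toy model `compare_toy_columns()` reports the columns as unequal even though their debug strings match, which is the same shape of failure as an `assert_eq!` whose two sides print identically.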

Ultimately lots of places make assumptions about schema consistency, and I struggle to come up with a coherent way to use this API.

provide some hints for an alternative

I'm not sure there is a viable alternative; this is a fundamental property of arrow that we can't really fudge around, as appealing as that might be were it possible.

comphead commented Apr 11, 2025

Contributor

Thanks @tustvold for experimenting with it. Attaching a link to your detailed comment above to the deprecation notice would probably be explanation enough.

@github-actions


Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale label Jun 26, 2026