Deprecate RecordBatchOptions::with_match_field_names#7406
Conversation
| #[non_exhaustive] | ||
| pub struct RecordBatchOptions { | ||
| /// Match field names of structs and lists. If set to `true`, the names must match. | ||
| #[deprecated(note = "match_field_names is unsound")] |
There was a problem hiding this comment.
This is marked non_exhaustive so the churn should be fairly minimal
There was a problem hiding this comment.
if we mark it unsound, I think it would help to provide a link / reference with an explanation about why it is unsound
For example, it is not clear to me why mismatched names is unsound 🤔
| RecordBatch::try_new_with_options( | ||
| schema, | ||
| columns, | ||
| &RecordBatchOptions::new().with_match_field_names(false), |
There was a problem hiding this comment.
I'm not sure why this was here, as the below code will always create the correct types AFAICT - perhaps a workaround for a since fixed bug?
| } | ||
|
|
||
| /// Sets the `match_field_names` of `RecordBatchOptions` and returns this [`RecordBatch`] | ||
| #[deprecated(note = "match_field_names is unsound")] |
There was a problem hiding this comment.
Should be there any reason it is unsound so the user can accept exact risks when using it?
There was a problem hiding this comment.
Tbh I've not sat down and worked out a precise exploit chain, if I had this would flagged as a security vulnerability. However, it is breaking a pretty fundamental invariant that is assumed in a number of places. The worst it is probably going to do is cause something to panic, or produce invalid output, but the potential is there and I'd sleep happier not having it being used 😆
There was a problem hiding this comment.
I had the same question above -- if we are going to claim something is unsound I think we should justify why and provide some hints for an alternative
|
Fair, I have softened the wording. For a simple example of the unpredictable behaviour of this If one extends this to IPC it gets wilder This fails with an incomprehensible assertion failure, as the display implementation assumes the field is consistent. Ultimately lots of places make assumptions about schema consistency, and I struggle to come up with a coherent way to use this API.
I'm not sure there is a viable alternative, this is fundamental property of arrow that we can't really fudge around as appealing as that might be were it possible. |
|
Thanks @tustvold for experimenting with it. Probably having attached a link to your detailed comment above to the deprecation notice would be explanatory |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #.
Rationale for this change
Noticed whilst working on #7405, this option is potentially unsound.
I did a quick scan of downstream projects and couldn't see any usage of this feature and so I think it is fine to just deprecate it. This will also potentially allow removing the rather cumbersome RecordBatchOptions.
The other option would be to make this method unsafe, but given the other checks within the various arrays I struggle to see how this would be usable reliably.
What changes are included in this PR?
Are there any user-facing changes?
Tagging @nevi-me as I think this API was last touched by you 3 or so years ago 😅