Pyarrow filter pushdowns #735

Merged
andygrove merged 2 commits into apache:main from Michael-J-Ward:pyarrow-filter-pushdowns on Jun 19, 2024
@Michael-J-Ward

Contributor

Which issue does this PR close?

Closes #703.

Rationale for this change

The conversion for IsNull had a bug.

datafusion-python users requested pyarrow predicate pushdown support for temporal types.

What changes are included in this PR?

IsNull bug
The conversion incorrectly passed the column expression as an argument to the pyarrow method `is_null`. The conversion then failed silently, and the predicate was excluded from the plan.

The only argument `is_null` accepts is a boolean for `nan_is_null`. I do not currently have a way for users to pass that in, so please suggest how I might do so.

Temporal Scalars
Similar to #731, I used ScalarValue::to_pyarrow for the scalar conversion. pyarrow filters can now accept anything that already has an upstream conversion.

Are there any user-facing changes?

A bugfix and expanded functionality.

Additional Context

I tested the predicate pushdown in two separate ways.

  1. Ensuring that explain plan contains the appropriate string.
  2. Ensuring that a query on a partitioned dataset doesn't touch the file.

Both of these seem non-ideal. If you have a suggestion for more efficiently testing this, please share!

The conversion was incorrectly passing in the expression itself as the `nan_is_null` argument. This caused the pushdown to silently fail.

@andygrove left a comment

Member

LGTM. Checking the explain plan is a good approach IMO. We do this extensively in DataFusion and Comet.

Development

Successfully merging this pull request may close these issues.

Missing pushdowns for pyarrow datasets