Skip to content

WIP: ARROW-1796: [Python/Parquet] RowGroup filtering on file level#2623

Closed
xhochy wants to merge 1 commit into
apache:masterfrom
xhochy:ARROW-1796
Closed

WIP: ARROW-1796: [Python/Parquet] RowGroup filtering on file level#2623
xhochy wants to merge 1 commit into
apache:masterfrom
xhochy:ARROW-1796

Conversation

@xhochy

@xhochy xhochy commented Sep 25, 2018

Copy link
Copy Markdown
Member

No description provided.

@xhochy

xhochy commented Sep 25, 2018

Copy link
Copy Markdown
Member Author

Just as a heads-up. This is missing some more granular tests and the exact filtering converts to Pandas and back again. Changing that to work on pure Arrow tables will be a lot more work.

I will separate out some things into smaller pull requests.

@xhochy

xhochy commented Sep 25, 2018

Copy link
Copy Markdown
Member Author

cc @rgruener I think you were one of the interested people in this topic.

@rgruener rgruener left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, thanks for this!

Comment thread python/pyarrow/parquet.py
if isinstance(filters[0][0], six.string_types):
# We have encountered the situation where we have one nesting level too few:
# We have [(,,), ..] instead of [[(,,), ..]]
filters = [filters]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This automatic fix isn't my favorite though I see that it is necessary to not have a braking change in the api for ParquetDataset (with the filters argument). Perhaps though it would be better to throw an error here and have this fix in that specific case instead of allowing a wrong nesting level in all cases.

Comment thread python/pyarrow/parquet.py
inner_indexer = (ser < value) & inner_indexer
elif op == ">":
inner_indexer = (ser > value) & inner_indexer
elif op == "in":

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the operator "not in"?

Comment thread python/pyarrow/parquet.py
return min_value < val
elif op == ">":
return max_value > val
elif op == "in":

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about "not in"

Comment thread python/pyarrow/parquet.py
self.validate_schemas()

if filters:
if filters is not None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filters can now be either List[Tuple] or List[List[Tuple]] and can either filter on partitions or row groups depending on if the dataset is partitioned. Either way the current state should be mentioned in the docstring (unless that will be worked out before this PR is merged)

@emkornfield

Copy link
Copy Markdown
Contributor

@xhochy is this still WIP? Anything I can help with?

@xhochy

xhochy commented Mar 3, 2019

Copy link
Copy Markdown
Member Author

@emkornfield I'm not actively working on this at the moment. Feel free to pick this up or just to integrate parts of the PR into arrow.

@wesm

wesm commented Jun 14, 2019

Copy link
Copy Markdown
Member

Closing this for now. I'll refer to this PR in the course of implementing Scanner logic in the Datasets project

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants