Skip to content

ARROW-9108: [C++][Dataset] Add supports for missing type in Statistics to Scalar conversion#7623

Closed
fsaintjacques wants to merge 8 commits into
apache:masterfrom
fsaintjacques:ARROW-9108-parquet-timestamp-statistics
Closed

ARROW-9108: [C++][Dataset] Add supports for missing type in Statistics to Scalar conversion#7623
fsaintjacques wants to merge 8 commits into
apache:masterfrom
fsaintjacques:ARROW-9108-parquet-timestamp-statistics

Conversation

@fsaintjacques

Copy link
Copy Markdown
Contributor

No description provided.

Comment thread python/pyarrow/tests/test_dataset.py Outdated

@fsaintjacques fsaintjacques Jul 2, 2020

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's to be enabled once #7519 is merged, i.e. it is missing python Scalar classes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it should work since I rebased this branch on top of #7519 and uncommented.

@github-actions

github-actions Bot commented Jul 2, 2020

Copy link
Copy Markdown

@fsaintjacques fsaintjacques force-pushed the ARROW-9108-parquet-timestamp-statistics branch from 5a904d5 to ac58dec Compare July 2, 2020 21:09

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
(I can also look into rebasing this once the scalar PR is merged)

What happens if a certain row group column has only nulls? Is the min/max then also null, or is it not defined? (might be good to add a test case for this)

Comment thread python/pyarrow/_dataset.pyx Outdated
@nealrichardson nealrichardson force-pushed the ARROW-9108-parquet-timestamp-statistics branch from b961296 to b277e3b Compare July 6, 2020 22:32
@nealrichardson

Copy link
Copy Markdown
Member

@jorisvandenbossche #7519 has merged so I just rebased this

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-9108-parquet-timestamp-statistics branch from b277e3b to c037e0f Compare July 7, 2020 08:41
@jorisvandenbossche

Copy link
Copy Markdown
Member

OK, I now also enabled the commented-out tests

@jorisvandenbossche

Copy link
Copy Markdown
Member

What happens if a certain row group column has only nulls? Is the min/max then also null, or is it not defined? (might be good to add a test case for this)

It seems we don't write min/max statistics in such a case (has_min_max is False), leading to an empty statistics struct, and thus empty dict in python.
Added a test for this.

@nealrichardson

Copy link
Copy Markdown
Member

CI is green. @jorisvandenbossche is this good to merge?

@jorisvandenbossche

Copy link
Copy Markdown
Member

I think @bkietz is still taking a look today?

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one question:

Comment thread python/pyarrow/_dataset.pyx Outdated
Comment on lines +890 to +891
# Don't treat failure to parse/convert a single Scalar as a
# failure. The min/max will simply be missing for this field.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test case for this? IIUC this will only arise when a parquet file has statistics of a type which pyarrow_wrap_scalar doesn't support; are we sure we want to hide that error?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also related question above: #7623 (comment)

I am not sure we can test it right now, as all scalars are supported in Python.

I suppose the rationale is that as a user experience it can be better to still return the statistics of all other columns, instead of raising due to one column not being recognized.
On the other hand, it indeed suppresses an error that would be useful to see (for us developers).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it'd be better to raise an exception. If conversion fails in the future for a statistics scalar which we don't yet support then recognizing that issue will be easier than figuring out why the statistics disappear.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should support Scalars of all types, so it should not be possible for this to error, and if it does, it's a bug. IMO we should remove the try/except and move on.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okido, removed!

@jorisvandenbossche

Copy link
Copy Markdown
Member

Thanks @fsaintjacques !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants