Skip to content

ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation#7000

Closed
fsaintjacques wants to merge 4 commits into
apache:masterfrom
fsaintjacques:ARROW-8065
Closed

ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation#7000
fsaintjacques wants to merge 4 commits into
apache:masterfrom
fsaintjacques:ARROW-8065

Conversation

@fsaintjacques

@fsaintjacques fsaintjacques commented Apr 21, 2020

Copy link
Copy Markdown
Contributor

This is the first part of a refactor to make Fragment accessible without a Scan operation instance. This is a breaking change. It introduces the concept of a physical schema and read schema, these concepts are analogous to Avro writer and reader schema.

  • Move ScanOptions at Fragment::Scan instead of a property
  • Refactor Dataset::GetFragments(ScanOptions) to Dataset::GetFragments(Expression)
  • Add Fragment::ReadPhysicalSchema
  • Normalize Python's {Dataset,Fragment,Scanner}.{scan,to_table,to_batches}()

@github-actions

Copy link
Copy Markdown

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Browsed through the C++ code, looked a bit more in detail at the python code.

The skipped python tests are because the reconstruction of a fragment is not fully working yet? (didn't fully understand the relation to ARROW-8318)

Just to make sure I correctly understood it:

  • A Fragment has only a physical schema, and is not aware of the dataset schema
  • But when scanning the Fragment (or through to_table), you can still specify this dataset schema (so scanning dataset vs fragments can still give the same result, i.e. basically dataset.to_table() and pa.concat([f.to_table(schema=dataset.schema) for f in dataset.get_fragments()]))

Comment thread cpp/src/arrow/dataset/dataset.h Outdated
Comment thread cpp/src/arrow/dataset/file_parquet.cc Outdated
Comment thread python/pyarrow/_dataset.pyx Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using Fragment.scan, it uses the Fragment's physical schema for the resulting table? (since the Fragment is not aware of the dataset "read" schema?)
If so, we should note that here in the docstring I think

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can still pass schema (which is not documented. I'll add this.

Comment thread python/pyarrow/tests/test_dataset.py Outdated
Comment thread python/pyarrow/tests/test_dataset.py Outdated

@jorisvandenbossche jorisvandenbossche Apr 21, 2020

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this but where the columns selection is passed to to_table ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Via kwargs to self._scanner then forwarding kwargs to Scanner.from_fragment

@fsaintjacques fsaintjacques force-pushed the ARROW-8065 branch 2 times, most recently from 9e9e553 to 0c8f006 Compare April 21, 2020 18:17

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm impressed by how surgical this change was, looking great

Comment thread cpp/src/arrow/dataset/dataset.h Outdated
Comment thread cpp/src/arrow/dataset/dataset.h Outdated
Comment thread cpp/src/arrow/dataset/dataset.h Outdated
Comment thread cpp/src/arrow/dataset/dataset.cc Outdated
Comment thread cpp/src/arrow/dataset/dataset.cc Outdated
Comment thread cpp/src/arrow/dataset/file_base.cc Outdated
Comment thread cpp/src/arrow/dataset/scanner_internal.h Outdated
Comment thread python/pyarrow/_dataset.pyx Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This coercion is common, should we have _unwrap_expression_default_true()?

Comment thread python/pyarrow/_dataset.pyx Outdated
Comment on lines 113 to 115

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Return fragments matching the optional filter, either using the
partition_expression or internal information like Parquet's
statistics.
Return fragments matching the optional filter, using explicit
partition expressions and/or embedded information like Parquet's
statistics.

Comment thread python/pyarrow/_dataset.pyx Outdated
Comment on lines 144 to 147

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It produces a stream of ScanTasks which is meant to be a unit of work
to be dispatched. The tasks are not executed automatically, the user is
responsible to execute and dispatch the individual tasks, so custom
local task scheduling can be implemented.
It produces a stream of ScanTasks. Each task is meant to be a unit of work
to be dispatched by the user; they are not executed automatically. This allows
customization of local task scheduling and execution.

@kszucs

kszucs commented Apr 22, 2020

Copy link
Copy Markdown
Member

@ursabot build

@fsaintjacques

Copy link
Copy Markdown
Contributor Author

Addressed most comments and updated followup ticket with what's missing. PTAL and merge quickly so we can unblock the blocked tickets :)

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, defer to @bkietz to approve the C++ side ;)

Comment thread cpp/src/arrow/dataset/file_parquet.cc Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants