-
Notifications
You must be signed in to change notification settings - Fork 147
Description
The Scan::execute function currently combines the Scan::scan_metadata() (and the needed visiting, etc.) to get the ScanFiles, then for each scan file, performs the corresponding DV and parquet read before applying transforms/schemas and eventually handing back 'final data' to the user (in the form of a ScanResult which is just filtered engine data).
The problem lies in the application of the DV to the data in the presence of predicates. currently everything is positional, that is, if we read a DV that indicates rows 1, 5, and 10 are deleted from a record batch, then Scan::execute will simply apply those positional deletions after reading the parquet file. This precludes predicate pushdown into the parquet read since the predicate effectively reshuffles the data, thereby invalidating the DV indicies.
In order to support predicate pushdown we need to
- Implement row index support within the kernel such that we can consume row indexes from parquet readers
- It's up to the parquet reader to properly implement row indexes, they can either
a. Support it - we pushdown the predicate
b. Emulate it - and depending on emulation may or may not push down predicate (?)
(note the changes need to be mirrored in the CDF path too)