[C++] Acero/dataset relies on ExecBatch::ToRecordBatch truncating excess columns #33240

Description

@asfimport

As found while working on ARROW-18004: the dataset scanner and the Acero engine rely on ExecBatch::ToRecordBatch returning successfully when the given schema has fewer fields than the ExecBatch has columns.

This apparently allows the dataset-added columns (kAugmentedFields in arrow/dataset/scanner.cc) to be dropped implicitly from a scan's final result.

However, it seems wrong and brittle to do this implicitly at the ExecBatch::ToRecordBatch level, since silently truncating columns can hide genuine errors. Instead, it should probably be done explicitly inside Acero/dataset.
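
To illustrate the distinction, here is a minimal, self-contained sketch that does not use the Arrow API. All names in it (ToySchema, ToyBatch, ToRecordBatchStrict, DropTrailingColumns) are hypothetical stand-ins invented for this example. It assumes the extra columns sit at the end of the batch, which is what truncation relies on. The conversion rejects any column-count mismatch, and the caller drops the extra columns in a separate, visible step:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-ins, NOT the Arrow types: a schema is a list of field names,
// a batch is a list of columns.
using ToySchema = std::vector<std::string>;
using ToyColumn = std::vector<int>;
using ToyBatch = std::vector<ToyColumn>;

struct ToyRecordBatch {
  ToySchema schema;
  ToyBatch columns;
};

// Strict conversion: returns std::nullopt unless the number of columns
// matches the number of schema fields exactly (no silent truncation).
std::optional<ToyRecordBatch> ToRecordBatchStrict(const ToySchema& schema,
                                                  const ToyBatch& batch) {
  if (batch.size() != schema.size()) {
    return std::nullopt;
  }
  return ToyRecordBatch{schema, batch};
}

// Explicit step performed by the caller: keep only the first `num_to_keep`
// columns, dropping any trailing (e.g. dataset-added) columns.
ToyBatch DropTrailingColumns(const ToyBatch& batch, std::size_t num_to_keep) {
  assert(num_to_keep <= batch.size());
  return ToyBatch(batch.begin(),
                  batch.begin() + static_cast<std::ptrdiff_t>(num_to_keep));
}
```

With this split, passing a three-column batch and a two-field schema directly to ToRecordBatchStrict fails, while calling DropTrailingColumns(batch, schema.size()) first makes the conversion succeed. A genuine mismatch elsewhere in the pipeline would then surface as an error instead of being truncated away.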

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-18037. Please see the migration documentation for further details.
