GH-36593: [Python] Add rename_columns method to pyarrow datasets by JonatanMartens · Pull Request #48289 · apache/arrow

JonatanMartens · 2025-11-29T12:42:08Z

Rationale for this change

See #36593
In particular, this change is convenient when the column names stored in a file differ from the logical names of the columns (see the Delta Lake column mapping feature for an example).

What changes are included in this PR?

Adds the rename_columns method to datasets in pyarrow.
This method allows a user to rename columns in the data returned from a scan before actually creating a scanner object.

Are these changes tested?

This PR also adds a test for the new rename_columns method using an InMemoryDataset.

Are there any user-facing changes?

Adds the rename_columns method to pyarrow datasets.

GitHub Issue: [Python] Add rename_columns to DataSet #36593

github-actions · 2025-11-29T12:42:36Z

⚠️ GitHub issue #36593 has been automatically assigned in GitHub to PR creator.

raulcd

I don't think the examples were bad. You just have to add the imports for doctest to work:
>>> import pyarrow.dataset as ds
as seen here:

arrow/python/pyarrow/_dataset.pyx

Lines 371 to 391 in b2e8f25

    
    >>> import pyarrow as pa
    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
    ...                              "Brittle stars", "Centipede"]})
    >>>
    >>> import pyarrow.parquet as pq
    >>> pq.write_table(table, "dataset_scanner.parquet")

    >>> import pyarrow.dataset as ds
    >>> dataset = ds.dataset("dataset_scanner.parquet")

    Selecting a subset of the columns:

    >>> dataset.scanner(columns=["year", "n_legs"]).to_table()
    pyarrow.Table
    year: int64
    n_legs: int64
    ----
    year: [[2020,2022,2021,2022,2019,2021]]
    n_legs: [[2,2,4,4,5,100]]

JonatanMartens · 2025-12-22T12:04:51Z

@rok @raulcd @AlenkaF
It's ready to merge now, could you take a look?

AlenkaF · 2025-12-23T09:01:31Z

I am not sure about the changes in this PR, mainly because I am not very knowledgeable when it comes to Acero and datasets. The functionality seems great to have, but modifying _scan_options to change column names on read feels a bit hacky.

What do you think @rok ?

rok · 2025-12-23T09:36:37Z

The change looks good to me in principle.
I do agree with @AlenkaF that changing _scan_options seems a bit forced and could have unexpected consequences elsewhere. Can you check if there is a nicer way?

JonatanMartens · 2025-12-23T11:03:55Z

Sounds good. I'm now using a new attribute called _columns instead of relying on _scan_options.
@rok @AlenkaF

JonatanMartens · 2026-01-07T10:51:52Z

@rok @AlenkaF Could you review this again?

Add rename_columns method

a2c19f3

JonatanMartens requested review from AlenkaF, raulcd and rok as code owners November 29, 2025 12:42

github-actions bot added the Component: Python and awaiting review labels Nov 29, 2025

Remove examples from docstring

04969a5

raulcd reviewed Dec 4, 2025

github-actions bot added the awaiting changes label and removed the awaiting review label Dec 4, 2025

Add docstring with proper imports

f48a738

github-actions bot added the awaiting change review label and removed the awaiting changes label Dec 4, 2025

JonatanMartens mentioned this pull request Dec 4, 2025

feat: add column mapping support when reading tables delta-io/delta-rs#3954

Closed

Jonatan Martens added 3 commits December 4, 2025 19:52

Add outputs to examples

70e520d

Fix columns not in rename dict being ignored

5e6abfd

Fix lint errors

dad7b89

JonatanMartens requested a review from raulcd December 14, 2025 07:52

Use new columns attribute instead of scan_options

8982c9b

Remove unnecessary variable assignment

dcc5407

JonatanMartens wants to merge 8 commits into apache:main from JonatanMartens:python-rename-columns

4 participants
