Implement support for reading CSV files with charsets other than UTF-8 by Rafferty97 · Pull Request #9468 · apache/arrow-rs

Rafferty97 · 2026-02-23T14:11:32Z

Implement support for reading CSV files with charsets other than UTF-8, via an optional dependency on encoding_rs and a corresponding configuration option.

Which issue does this PR close?

Closes #9465

What changes are included in this PR?

Add an optional dependency on encoding_rs
Add a configuration option called "encoding" to the CSV reader
When an encoding is set, input data is transcoded to UTF-8 before being handed to csv-core
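To illustrate the pre-processing idea, here is a minimal sketch that transcodes bytes to UTF-8 before any CSV splitting happens. It is not the PR's code: it uses neither encoding_rs nor csv-core, it handles only ISO-8859-1 (chosen because each byte maps directly to the Unicode code point of the same value, so the standard library suffices), and the names `latin1_to_utf8` and `read_record` are hypothetical.

```rust
/// Transcodes ISO-8859-1 (Latin-1) bytes to a UTF-8 `String`.
/// Assumption: this stands in for the charset decoder; the PR itself
/// relies on encoding_rs, which supports many more encodings.
fn latin1_to_utf8(input: &[u8]) -> String {
    // A `u8` cast to `char` yields the code point U+0000..=U+00FF,
    // which is exactly the Latin-1 mapping.
    input.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: transcode first, then split one record into fields.
/// A real reader would pass the transcoded bytes to a proper CSV parser
/// that understands quoting and escapes; a plain comma split does not.
fn read_record(raw_line: &[u8]) -> Vec<String> {
    let utf8 = latin1_to_utf8(raw_line);
    utf8.split(',').map(str::to_owned).collect()
}
```

The byte string `b"caf\xE9,1"` is not valid UTF-8, so a UTF-8-only reader would reject it, whereas `read_record` yields the fields `café` and `1`.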

Are these changes tested?

I have written tests that exercise the decoder on windows-1252 and Shift-JIS encoded CSV files, with various batch and buffer sizes to ensure that the various buffering mechanisms are working. I'm fairly confident in the test coverage, but open to suggestions for making the tests more resilient.
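The testing strategy described above can be sketched as follows: feed the same input to a stateful decoder in chunks of every possible size and check that the output never changes. This is an illustration under stated assumptions, not the PR's test suite. The standard library has no windows-1252 or Shift-JIS tables, so UTF-16LE is swapped in as the multi-byte encoding, and `Utf16LeDecoder` and `decode_in_chunks` are hypothetical names.

```rust
/// Hypothetical stand-in for a stateful charset decoder: converts UTF-16LE
/// bytes to UTF-8 while tolerating chunk boundaries that fall mid-character.
#[derive(Default)]
struct Utf16LeDecoder {
    /// A dangling first byte of a code unit left over from the previous chunk.
    pending_byte: Option<u8>,
    /// A high surrogate still waiting for its low surrogate.
    pending_high: Option<u16>,
}

impl Utf16LeDecoder {
    /// Decodes one chunk, appending every completed character to `out`.
    /// Incomplete trailing input is kept in the decoder for the next call.
    fn feed(&mut self, chunk: &[u8], out: &mut String) {
        for &byte in chunk {
            let Some(low) = self.pending_byte.take() else {
                // First half of a code unit: remember it and move on.
                self.pending_byte = Some(byte);
                continue;
            };
            self.push_unit(u16::from_le_bytes([low, byte]), out);
        }
    }

    /// Handles one complete 16-bit code unit, pairing surrogates as needed.
    fn push_unit(&mut self, unit: u16, out: &mut String) {
        match (self.pending_high.take(), unit) {
            // High surrogate followed by low surrogate: combine them.
            (Some(high), 0xDC00..=0xDFFF) => {
                let code = 0x10000
                    + ((u32::from(high) - 0xD800) << 10)
                    + (u32::from(unit) - 0xDC00);
                out.push(char::from_u32(code).unwrap_or('\u{FFFD}'));
            }
            // Unpaired high surrogate: emit U+FFFD, then reprocess `unit`.
            (Some(_), _) => {
                out.push('\u{FFFD}');
                self.push_unit(unit, out);
            }
            // A new high surrogate: wait for its partner.
            (None, 0xD800..=0xDBFF) => self.pending_high = Some(unit),
            // A lone low surrogate is malformed input.
            (None, 0xDC00..=0xDFFF) => out.push('\u{FFFD}'),
            // Any other unit is a complete character on its own.
            (None, _) => out.push(char::from_u32(u32::from(unit)).unwrap_or('\u{FFFD}')),
        }
    }
}

/// Decodes `input` by feeding it in chunks of `chunk_size` bytes.
/// `chunk_size` must be non-zero, because `slice::chunks` panics on zero.
fn decode_in_chunks(input: &[u8], chunk_size: usize) -> String {
    let mut decoder = Utf16LeDecoder::default();
    let mut out = String::new();
    for chunk in input.chunks(chunk_size) {
        decoder.feed(chunk, &mut out);
    }
    out
}
```

Sweeping `chunk_size` from 1 up to the input length forces boundaries to land inside code units and between surrogate pairs, which is the kind of edge case a buffering bug would expose.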

Are there any user-facing changes?

The public API is only changed when the new optional feature is enabled, and even then, it's just a new optional configuration parameter.


github-actions bot added the "arrow" label (Changes to the arrow crate) on Feb 23, 2026

Rafferty97 force-pushed the non-utf8-csv2 branch from f03a9fa to 648da5a on February 24, 2026 14:19

Rafferty97 marked this pull request as ready for review February 24, 2026 14:25

Rafferty97 mentioned this pull request on Feb 24, 2026: Support non-UTF-8 encoded CSV files apache/datafusion#20473 (Open)

Rafferty97 added 7 commits February 25, 2026 14:22

Implement support for reading CSV files with charsets other than UTF-…

7895370

…8, via an optional dependency on `encoding_rs` and a corresponding configuration option.

Remove broken debug assertion

ec3f7db

Improve tests and fix bugs

24c6ada

Add Shift-JIS tests and fix another bug

e7fbc92

Factor out the decoder buffering logic into the encoding module

ef4ff28

make imports more explicit

a5c1db3

remove debug print

24aa46b

Rafferty97 force-pushed the non-utf8-csv2 branch from 29fd008 to 24aa46b on February 25, 2026 03:23

Fix clippy error

9b76497

Rafferty97 wants to merge 8 commits into apache:main from Rafferty97:non-utf8-csv2