[Feature Request]: Reduce number of byte[] copies in TextSource #23193

@lukecwik

Description

What would you like to happen?

The current TextSource implementation spends a lot of time copying byte[]s:
[Image: TextSource old implementation performance in pipeline]

The Hadoop LineReader.java implementation is significantly faster (~2x) on typical files because it reduces how many byte[]s are copied. A simple benchmark reading 10 million lines (60-120 characters long) shows that Hadoop's LineReader takes ~2.05 seconds to process such a file, while the Apache Beam TextSource takes ~4.03 seconds.
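
To illustrate the general idea of cutting down on byte[] copies, here is a minimal sketch. It is not the Hadoop LineReader code and not the Beam TextSource code; the class name `LineScanSketch` and the method `readLines` are hypothetical and exist only for this example. It assumes UTF-8 input and `\n` or `\r\n` line endings. It scans a reusable read buffer for delimiters and decodes a line straight out of that buffer when the whole line is present. Bytes are copied into a side buffer only when a line crosses a buffer refill.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the Beam TextSource or the Hadoop LineReader implementation. */
public class LineScanSketch {

  /** Splits the stream on '\n' (dropping a preceding '\r') using one reusable read buffer. */
  public static List<String> readLines(InputStream in, int bufferSize) throws IOException {
    if (bufferSize <= 0) {
      throw new IllegalArgumentException("bufferSize must be positive");
    }
    List<String> lines = new ArrayList<>();
    byte[] buf = new byte[bufferSize]; // reused for every read, never reallocated
    // Side buffer that is only touched when a line crosses a refill boundary.
    ByteArrayOutputStream spill = new ByteArrayOutputStream();
    int n;
    while ((n = in.read(buf)) != -1) {
      int start = 0; // index in buf where the current line begins
      for (int i = 0; i < n; i++) {
        if (buf[i] != '\n') {
          continue;
        }
        if (spill.size() == 0) {
          // Fast path: the whole line sits in buf, so decode it in place.
          int len = i - start;
          if (len > 0 && buf[start + len - 1] == '\r') {
            len--;
          }
          lines.add(new String(buf, start, len, StandardCharsets.UTF_8));
        } else {
          // Slow path: the line began in an earlier read, so finish it in the side buffer.
          spill.write(buf, start, i - start);
          lines.add(decode(spill));
          spill.reset();
        }
        start = i + 1;
      }
      // Carry any unterminated tail over to the next read.
      spill.write(buf, start, n - start);
    }
    if (spill.size() > 0) {
      lines.add(decode(spill)); // final line without a trailing newline
    }
    return lines;
  }

  private static String decode(ByteArrayOutputStream spill) {
    byte[] bytes = spill.toByteArray();
    int len = bytes.length;
    if (len > 0 && bytes[len - 1] == '\r') {
      len--;
    }
    return new String(bytes, 0, len, StandardCharsets.UTF_8);
  }
}
```

With a read buffer much larger than a typical 60-120 character line, nearly every line takes the fast path, so the only per-line copy is the one made when constructing the `String`. Whether this matches the speedup measured above would need to be benchmarked.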

Issue Priority

Priority: 2

Issue Component

Component: io-java-text
