Description
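Reading a Feather file with an explicit columns argument via feather.read_table is drastically slower in Arrow 7.0.0 through 9.0.0 than it was in 6.0.0: about 34 ms versus 224 µs in the benchmarks below, roughly a 150x slowdown. Reading the same file without the columns argument takes about the same time in both versions.
Profiling the code below shows that the bottleneck is somewhere in the read_names method of pyarrow._feather.FeatherReader.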
Example
Setup code:
import pandas as pd
from pyarrow import feather
rows, cols = (1_000_000, 10)
data = {f'c{c}': range(rows) for c in range(cols)}
df = pd.DataFrame(data=data)
feather.write_feather(df, 'test.feather', compression="uncompressed")
Benchmarks Arrow 9.0.0:
%timeit feather.read_table('test.feather', memory_map=True)
> 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Benchmarks Arrow 6.0.0:
%timeit feather.read_table('test.feather', memory_map=True)
> 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Environment: python 3.9, ubuntu 20.04
Reporter: Håkon Magne Holmen
Related issues:
Note: This issue was originally created as ARROW-17913. Please see the migration documentation for further details.