Method pyarrow.parquet.read_table has memory spikes from version 0.14 #22753

Description

@asfimport

Method pyarrow.parquet.read_table is very slow and causes RAM spikes from version 0.14.0 onwards.

Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.

The slowdown is easy to measure. However, there is a second problem that I could only detect by watching htop: while opening a 40MB parquet file, the process briefly occupies almost 16GB, for a few milliseconds. The resulting pyarrow table takes around 300MB in the Python process (measured with memory-profiler). This does not happen in version 0.13 or earlier.
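A minimal sketch of how both numbers could be captured in one run, for anyone trying to reproduce this. It is not the script used for the figures above. The helper name `measure` is hypothetical, and it relies on the standard-library `resource` module, so it only works on Unix-like systems. It reads the process peak RSS (a high-water mark kept by the kernel), which should register a short spike that a sampling profiler can miss.

```python
import resource
import sys
import time


def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed seconds, growth of peak RSS in MiB)."""
    # ru_maxrss is the process-wide peak RSS: reported in KiB on Linux, bytes on macOS.
    unit = 1 if sys.platform == "darwin" else 1024
    peak_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * unit
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    peak_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * unit
    return result, elapsed, (peak_after - peak_before) / 2**20


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without pyarrow installed.
    _, seconds, peak_growth_mib = measure(lambda: bytearray(50 * 2**20))
    print(f"{seconds:.3f} s, peak RSS grew by {peak_growth_mib:.1f} MiB")
```

To apply it to this report, the stand-in workload would be replaced with `lambda: pyarrow.parquet.read_table(path)`, where `path` points at the parquet file, and the script run once under each pyarrow version. Because the peak is a high-water mark for the whole process, each measurement should be taken in a fresh interpreter.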

Environment: ubuntu 18, 16GB ram, 4 cpus
Reporter: Renan Alves Fonseca

Note: This issue was originally created as ARROW-6380. Please see the migration documentation for further details.
