parquet pushdown predicate dataset.field.isin() much slower than or '|' #36283

@daddywantssugar

Describe the bug, including details regarding any error messages, version, and platform.

Apologies if this isn't a bug, but I'd argue it is still surprising behaviour. When pushing predicates down to Parquet reads, a filter written with the "or" (|) syntax returns quickly, as expected, in about 5s. The read with the equivalent isin() predicate takes over an order of magnitude longer, about 140s, which is close to the roughly 150s it takes to read the file with no filter at all.


import pandas as pd
import pyarrow.dataset as ds
import s3fs
from contexttimer import Timer  # pip install contexttimer, or use your own timer

fs = s3fs.S3FileSystem(anon=True)  # doesn't actually require s3; a network share exhibits this as well
rawpath = 'nyc-taxi-test/weather_sorted.parquet'  # sorted by longitude to accentuate the issue
filters = [
    (ds.field("longitude") == -10.8) | (ds.field("longitude") == -11.4), # 5s
    ds.field("longitude").isin([-11.4, -10.8]), # 143s
    
    (ds.field("longitude") == 10.2) | (ds.field("longitude") == 10.5), # 9s
    ds.field("longitude").isin([10.2, 10.5]), # 135s
    
    None,  # no-filter baseline: 150s
]
for flt in filters:  # renamed from `filter` to avoid shadowing the builtin
    with Timer() as t:
        with fs.open(rawpath, 'rb') as f:
            df = pd.read_parquet(f, filters=flt)
    print('time: ', t, 'size: ', len(df))

"""
time:  5.372 size:  26353
time:  143.137 size:  26353
time:  9.685 size:  59565
time:  135.809 size:  59565 
time:  153.935 size:  12736802  
"""
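
A possible workaround in the meantime is to expand the value list into a chain of == comparisons joined with |, which is the form that was fast above. The following is a minimal sketch. `any_of` is a hypothetical helper name, not part of pyarrow, and it only assumes that the field object supports == and | the way ds.field(...) expressions do.

```python
from functools import reduce
import operator


def any_of(field, values):
    """Build (field == v1) | (field == v2) | ... from a list of values.

    `field` is expected to be an expression such as ds.field("longitude").
    `any_of` is a hypothetical helper, not a pyarrow function.
    """
    values = list(values)
    if not values:
        raise ValueError("any_of() needs at least one value")
    # Join the individual equality expressions with the | operator.
    return reduce(operator.or_, (field == v for v in values))
```

It would be used as pd.read_parquet(f, filters=any_of(ds.field("longitude"), [-11.4, -10.8])). How this scales to long value lists has not been measured here.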

tested on Windows 10
python 3.10
pandas 2.0.2
pyarrow 12.0.1
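
One reading of the timings above is that the | form lets the reader skip row groups using their min/max statistics, while the isin() form ends up reading nearly everything, since it takes about as long as the unfiltered baseline. This has not been confirmed against the pyarrow source, so the sketch below is only a toy model of that hypothesis, not pyarrow's implementation. `RowGroup`, `scan_with_pruning` and `scan_without_pruning` are made-up names, and integers stand in for the longitude values.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group with min/max column statistics."""
    lo: int
    hi: int
    values: list


def scan_with_pruning(row_groups, wanted):
    """Read only row groups whose [lo, hi] range could contain a wanted value."""
    groups_read, rows = 0, []
    for rg in row_groups:
        if not any(rg.lo <= w <= rg.hi for w in wanted):
            continue  # statistics rule out a match: skip without reading
        groups_read += 1
        rows.extend(v for v in rg.values if v in wanted)
    return groups_read, rows


def scan_without_pruning(row_groups, wanted):
    """Read every row group and filter afterwards (predicate treated as opaque)."""
    rows = [v for rg in row_groups for v in rg.values if v in wanted]
    return len(row_groups), rows


# Sorted data: 100 row groups, each covering a narrow, non-overlapping range.
groups = [RowGroup(i * 10, i * 10 + 9, list(range(i * 10, i * 10 + 10)))
          for i in range(100)]
wanted = {102, 105}

print(scan_with_pruning(groups, wanted))     # (1, [102, 105])
print(scan_without_pruning(groups, wanted))  # (100, [102, 105])
```

Both scans return the same rows, but the pruned scan touches 1 row group instead of 100. On a file sorted by the filtered column, a gap of that kind would be consistent with the 5s versus 140s difference reported above.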

Component(s)

Python
