Description
Describe the bug, including details regarding any error messages, version, and platform.
Apologies if this isn't a bug, but I think the behaviour is still surprising. When pushing predicates down to Parquet reads, a filter built with the "or" (|) operator returns quickly, as expected, in about 5s. The same read with the equivalent isin() predicate takes over an order of magnitude longer, about 140s, which is close to the unfiltered baseline of about 150s.
import pandas as pd
import pyarrow.dataset as ds
import s3fs
from contexttimer import Timer #pip install or use your own timer
fs = s3fs.S3FileSystem(anon=True) # doesn't actually require s3, network share will exhibit this as well
rawpath = f'nyc-taxi-test/weather_sorted.parquet' # sorted by longitude to accentuate the issue
filters = [
(ds.field("longitude") == -10.8) | (ds.field("longitude") == -11.4), # 5s
ds.field("longitude").isin([-11.4, -10.8]), # 143s
(ds.field("longitude") == 10.2) | (ds.field("longitude") == 10.5), # 9s
ds.field("longitude").isin([10.2, 10.5]), # 135s
None, # no filter baseline #150s
]
for filter in filters:
    with Timer() as t:
        with fs.open(rawpath, 'rb') as f:
            df = pd.read_parquet(f, filters=filter)
    print('time: ', t, 'size: ', len(df))
"""
time: 5.372 size: 26353
time: 143.137 size: 26353
time: 9.685 size: 59565
time: 135.809 size: 59565
time: 153.935 size: 12736802
"""
Tested on Windows 10 with:
python 3.10
pandas 2.0.2
pyarrow 12.0.1
Component(s)
Python