MajorTOM "images-in-parquet" vs Rasteret collection bench discussion #18
print-sid8 started this conversation in General
Cool! The benchmark code looks all good to me too :)
Hi @lhoestq, as discussed on Twitter/X, I've put up a sample Rasteret collection on HF: https://huggingface.co/datasets/terrafloww/rasteret-collection-major-tom-style
It also has a copy of the scripts used to generate the Rasteret collection parquets and the actual benchmark code that was run.
The original dataset I am comparing against is here - https://huggingface.co/datasets/Major-TOM/Core-S2L2A
The perf numbers still hold: Rasteret is ~4x faster for 100 patches and 7x faster for 1000 patches of Sentinel-2 satellite imagery.
I've added a `random_seed` param to the benchmark script, to make sure I'm not hitting the same parquet row groups again.
When the same `random_seed` is run twice, the perf difference drops from 4x/7x to 1.8x/2.8x faster for Rasteret, which I am guessing is explained by the XET storage doing smart caching, and of course the CDN and other factors.
I made some corrections in the benchmark script to how HF was being queried. It was already using `streaming` before, but not all pyarrow filters and pushdowns were used, and it also wasn't skipping columns I don't need.
My view is that the difference in speed of getting satellite image patches comes down to the difference in design of the solutions used.
I see no bottlenecks from HF itself, nor are there bottlenecks in the image decoding done after the original data has arrived over the network and is converted to numpy using rasterio MemoryFile.
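To illustrate why the `random_seed` param matters for the cold vs. warm numbers above, here is a minimal sketch of seeded patch selection. It is not the actual benchmark script: the function name `sample_patch_ids` and the total patch count are hypothetical, and only the Python standard library is used.

```python
import random


def sample_patch_ids(num_patches_total, num_patches, random_seed):
    """Pick a reproducible random subset of patch indices (hypothetical helper)."""
    # A private generator keeps the selection independent of global random state.
    rng = random.Random(random_seed)
    return sorted(rng.sample(range(num_patches_total), num_patches))


# Assumed collection size, for illustration only.
TOTAL = 1_000_000

first = sample_patch_ids(TOTAL, 100, random_seed=42)
repeat = sample_patch_ids(TOTAL, 100, random_seed=42)
fresh = sample_patch_ids(TOTAL, 100, random_seed=43)

# Same seed: identical patches, so a second run can be served from warm caches.
print(first == repeat)
# Different seed: a (mostly) disjoint set of patches, closer to a cold read.
print(len(set(first) & set(fresh)))
```

Re-running with the same seed requests exactly the same patches, so any caching between the client and storage can help the second run. Changing the seed between runs is what keeps the comparison closer to a cold read.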
I am happy to hear your thoughts.
Please let me know if you find any discrepancies in how I am using the HF datasets library.