MajorTOM "images-in-parquet" vs Rasteret collection bench discussion #18

print-sid8 · 2026-03-05T10:25:56Z

print-sid8
Mar 5, 2026
Maintainer

Hi @lhoestq, as discussed on Twitter/X, I've put up a sample Rasteret collection on HF: https://huggingface.co/datasets/terrafloww/rasteret-collection-major-tom-style
It also has a copy of the scripts used to generate the Rasteret collection Parquet files and the actual benchmark code that was run.

The original dataset I am comparing against is here: https://huggingface.co/datasets/Major-TOM/Core-S2L2A

The perf numbers still hold at ~4x for 100 patches and 7x for 1000 patches of Sentinel-2 satellite imagery.
I've added a random_seed param to the benchmark script to make sure I'm not hitting the same Parquet row groups again.
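
For readers following along, here is a minimal sketch of how a seed can control which patches a run samples. It is not the actual benchmark script: the function name `sample_patch_indices` and the sizes are assumptions made for illustration, and only the `random_seed` parameter name comes from the post.

```python
import random


def sample_patch_indices(total_patches: int, n: int, random_seed: int) -> list[int]:
    """Pick n distinct patch indices, reproducibly for a given seed."""
    # A local RNG keeps the draw independent of any global random state.
    rng = random.Random(random_seed)
    return sorted(rng.sample(range(total_patches), n))


# Hypothetical sizes, for illustration only.
first = sample_patch_indices(total_patches=100_000, n=100, random_seed=1)
again = sample_patch_indices(total_patches=100_000, n=100, random_seed=1)
other = sample_patch_indices(total_patches=100_000, n=100, random_seed=2)

print(first == again)  # same seed: same patches, so a rerun may hit warm caches
print(first != other)  # different seed: a different set of patches
```

Changing the seed between runs is what keeps a repeat measurement from reading the same row groups as the previous one.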

When the same random_seed is run twice, the perf difference drops from 4x/7x to 1.8x/2.8x faster for Rasteret, which I am guessing is explained by the Xet storage doing smart caching, and of course the CDN and other factors.

I made some corrections to how the benchmark script queries HF. It was already using streaming before, but not all PyArrow filters and pushdowns were used, and it wasn't skipping the columns I don't need.

My view is that the difference in speed of getting satellite image patches comes down to the difference in design between the two solutions.
I see no bottlenecks from HF itself, nor are there bottlenecks in the image decoding done after the original data has arrived over the network and is converted to NumPy using rasterio MemoryFile.

I am happy to hear your thoughts.

Please inform me if you find any discrepancies in how I am using HF datasets library.

lhoestq · 2026-03-05T20:05:26Z

lhoestq
Mar 5, 2026

Cool! The benchmark code looks all good to me too :)

1 reply

print-sid8 Mar 6, 2026
Maintainer Author

Thanks for reviewing! 😊
