Commit 54513ef

release(v0.3.4): unify IO engine, add HF loading, harden TorchGeo
Unify RasterAccessor tile reads across get_xarray/get_numpy/get_gdf/TorchGeo.
Simplify COGReader fetch->decompress->crop path and remove duplicate read paths.
Add HuggingFace collection loading via hf://datasets/<org>/<repo> parquet shards.
Harden TorchGeo edge chips with positive-overlap filtering and nodata fallback.
Improve error surfacing for partial reads and unsupported geometry inputs.
Fix sample_points tile-validation edge case and get_xarray xr_combine plumbing.
Expand tests for execution paths, HF integration, TorchGeo error propagation, and API surface.

Signed-off-by: print-sid8 <sidsub94@gmail.com>
1 parent d81a50a commit 54513ef
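
The commit message mentions "positive-overlap filtering" for TorchGeo edge chips but this page does not show the implementation. As an illustration only, here is a minimal sketch of the general idea, assuming axis-aligned bounding boxes: a tile is kept for a chip only when their intersection has strictly positive area, so tiles that merely touch the chip along an edge are dropped. The `Bounds` type and both function names are hypothetical and are not Rasteret APIs.

```python
from typing import Dict, List, NamedTuple


class Bounds(NamedTuple):
    """Axis-aligned bounding box in a projected CRS (hypothetical helper type)."""
    minx: float
    miny: float
    maxx: float
    maxy: float


def overlap_area(a: Bounds, b: Bounds) -> float:
    """Return the intersection area of two boxes, or 0.0 if they only touch or are disjoint."""
    width = min(a.maxx, b.maxx) - max(a.minx, b.minx)
    height = min(a.maxy, b.maxy) - max(a.miny, b.miny)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def tiles_with_positive_overlap(chip: Bounds, tiles: Dict[str, Bounds]) -> List[str]:
    """Keep only tiles whose intersection with the chip has positive area."""
    return [name for name, bounds in tiles.items() if overlap_area(chip, bounds) > 0]


chip = Bounds(90, 0, 110, 20)
tiles = {
    "west": Bounds(0, 0, 100, 100),       # overlaps the chip in a 10 x 20 strip
    "east": Bounds(110, 0, 210, 100),     # shares only the edge x = 110
    "far": Bounds(300, 300, 400, 400),    # disjoint
}
selected = tiles_with_positive_overlap(chip, tiles)
```

In a reader built this way, a chip for which `selected` is empty would have no pixels to fetch; filling such a chip with the nodata value is one plausible reading of the "nodata fallback" in the commit message.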

19 files changed: +2180 -280 lines changed

README.md

Lines changed: 18 additions & 18 deletions
@@ -38,7 +38,6 @@ Key Features -
 - **Zero data downloads** - work with terabytes of imagery while storing only megabytes of metadata.
 - **No STAC at training time** - query once at setup; zero API calls during training with Collection you can extend.
 - **Reproducible** - same Parquet index = same records = same results
-- **Point sampling built in** - `sample_points()` returns Arrow-native tables for feature pipelines and large point sets
 - **Native dtypes** - integer imagery stays integer; missing/edge coverage is represented via fill values (nodata or 0) instead of NaNs
 - **Shareable cache** - enrich our Collection with your ML splits, patch geometries, custom data points for ML, and share it, don't write folders of image chips!
 
@@ -188,35 +187,26 @@ ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
 arr = collection.get_numpy(
 geometries=(77.55, 13.01, 77.58, 13.08),
 bands=["B04", "B08"],
-all_touched=False, # rasterio default masking semantics
 )
 # shape: [N, C, H, W] for multi-band, [N, H, W] for single-band
 ```
 
-### Point values (Arrow-native)
+### Point sampling
 
 ```python
-import duckdb
-
-# Keep your table in DuckDB and pass coordinate columns directly
-points = duckdb.sql("""
-SELECT lon, lat
-FROM read_parquet('points.parquet')
-""").arrow().read_all()
+from shapely.geometry import Point
 
 samples = collection.sample_points(
-points=points, # query input table/array of point locations
-x_column="lon",
-y_column="lat",
+points=[Point(77.56, 13.03), Point(77.57, 13.04)],
 bands=["B04", "B08"],
 geometry_crs=4326,
-match="latest", # or "all" for full time series matches
 )
-# pyarrow.Table with point_index, record_id, datetime, band, value, point_crs, raster_crs
+# PyArrow Table — one row per (point, band, record)
 ```
 
-Collection-centric loop:
-`build/load/as_collection -> subset/where -> get_xarray/get_numpy/sample_points/to_torchgeo_dataset`.
+Reads only the tiles containing your points. Works with Shapely points,
+or pass a PyArrow table with coordinate columns for millions of points.
+No extras needed — available in the base install.
 
 <details>
 <summary><strong>Going further</strong></summary>
@@ -225,18 +215,21 @@ Collection-centric loop:
 |---|---|
 | Datasets not in the catalog | [`build_from_stac()`](https://terrafloww.github.io/rasteret/how-to/collection-management/) |
 | Parquet with COG URLs (Source Cooperative, STAC GeoParquet, custom) | [`build_from_table(path, name=...)`](https://terrafloww.github.io/rasteret/how-to/build-from-parquet/) |
+| Sample values at many points (Arrow-native) | [`sample_points()`](https://terrafloww.github.io/rasteret/how-to/point-sampling-and-masking/) |
 | Multi-band COGs (AEF embeddings, etc.) | [AEF Embeddings guide](https://terrafloww.github.io/rasteret/how-to/aef-embeddings/) |
 | Authenticated sources (PC, requester-pays, Earthdata, etc.) | [Custom Cloud Provider](https://terrafloww.github.io/rasteret/how-to/custom-cloud-provider/) |
 | Share a Collection | `collection.export("path/")` then `rasteret.load("path/")` |
 | Filter by cloud cover, date, bbox | [`collection.subset()`](https://terrafloww.github.io/rasteret/how-to/collection-management/) |
-| Sample values for large point sets | [`collection.sample_points()`](https://terrafloww.github.io/rasteret/how-to/point-sampling-and-masking/) |
 
 </details>
 
 ---
 
 ## Benchmarks
 
+<details>
+<summary><strong>Single request performance (time series query)</strong></summary>
+
 ### Single request performance
 
 Processing pipeline: Filter 450,000 scenes -> 22 matches -> Read 44 COG files
@@ -253,6 +246,8 @@ Run on AWS t3.xlarge (4 CPU) —
 | **Rasteret** | 3 s | 3 s |
 | **Google Earth Engine** | 10–30 s | 3–5 s |
 
+</details>
+
 ### Cold-start comparison with TorchGeo
 
 Same AOIs, same scenes, same sampler, same DataLoader. Both paths output
@@ -274,6 +269,9 @@ for full methodology.
 ![Processing time comparison](./assets/benchmark_results.png)
 ![Speedup breakdown](./assets/benchmark_breakdown.png)
 
+<details>
+<summary><strong>HF baseline (payload-Parquet patches)</strong></summary>
+
 ### HF `datasets` baseline (Major TOM keyed patches)
 
 Baseline method: `datasets.load_dataset(..., streaming=True, filters=...)` with
@@ -302,6 +300,8 @@ Notebook: [`05_torchgeo_comparison.ipynb`](docs/tutorials/05_torchgeo_comparison
 > [GitHub Discussions](https://github.com/terrafloww/rasteret/discussions/categories/show-and-tell)
 > or [Discord](https://discord.gg/V5vvuEBc).
 
+</details>
+
 ---
 
 ## Scope and stability
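
The point-sampling text added to the README in this diff states that only the tiles containing the requested points are read. A minimal sketch of how points can be mapped to internal raster tiles, assuming a north-up raster with square pixels and a fixed square tile size. The function names, raster origin, pixel size, and tile size below are hypothetical and are not taken from Rasteret.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def tile_index(
    x: float,
    y: float,
    origin_x: float,
    origin_y: float,
    pixel_size: float,
    tile_size: int,
) -> Tuple[int, int]:
    """Return the (tile_row, tile_col) containing a point in raster CRS coordinates."""
    col = math.floor((x - origin_x) / pixel_size)   # pixel column, counted from the left edge
    row = math.floor((origin_y - y) / pixel_size)   # pixel row, counted from the top edge
    return row // tile_size, col // tile_size


def group_points_by_tile(
    points: Sequence[Tuple[float, float]],
    origin_x: float,
    origin_y: float,
    pixel_size: float,
    tile_size: int,
) -> Dict[Tuple[int, int], List[int]]:
    """Group point indices by tile so each tile needs to be fetched only once."""
    groups: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for i, (x, y) in enumerate(points):
        key = tile_index(x, y, origin_x, origin_y, pixel_size, tile_size)
        groups[key].append(i)
    return dict(groups)


# Hypothetical raster: top-left corner at (500000, 1500000), 10 m pixels, 512 x 512 tiles.
points = [(500100.0, 1499900.0), (500200.0, 1499000.0), (506000.0, 1499900.0)]
groups = group_points_by_tile(points, 500000.0, 1500000.0, 10.0, 512)
```

Here three points fall into two tiles, so a reader following this scheme would issue two tile requests instead of three.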

notebooks/05_torchgeo_comparison.ipynb

Lines changed: 127 additions & 7 deletions
@@ -4,7 +4,21 @@
 "cell_type": "markdown",
 "id": "intro",
 "metadata": {},
-"source": "# TorchGeo Benchmark\n\nThis notebook runs an **identical time-series workload** through two paths:\n\n- **Path A**: TorchGeo 0.9 `time_series=True`: sequential per-file reads via GDAL vsicurl\n- **Path B**: Rasteret `to_torchgeo_dataset(time_series=True)` with default backend routing\n\n**Controlled variables** (same across all paths):\n- AOI, date range, band (B04), scene count, chip size (256), batch size (2)\n- Same `RandomGeoSampler` → `DataLoader` → batch flow\n- Same output shape: `[batch, T, C, H, W]`, all timesteps stacked\n\nThe only difference is **how pixel data reaches the DataLoader**."
+"source": [
+"# TorchGeo Benchmark\n",
+"\n",
+"This notebook runs an **identical time-series workload** through two paths:\n",
+"\n",
+"- **Path A**: TorchGeo 0.9 `time_series=True`: sequential per-file reads via GDAL vsicurl\n",
+"- **Path B**: Rasteret `to_torchgeo_dataset(time_series=True)` with default backend routing\n",
+"\n",
+"**Controlled variables** (same across all paths):\n",
+"- AOI, date range, band (B04), scene count, chip size (256), batch size (2)\n",
+"- Same `RandomGeoSampler` → `DataLoader` → batch flow\n",
+"- Same output shape: `[batch, T, C, H, W]`, all timesteps stacked\n",
+"\n",
+"The only difference is **how pixel data reaches the DataLoader**."
+]
 },
 {
 "cell_type": "code",
@@ -177,7 +191,19 @@
 "cell_type": "markdown",
 "id": "path-b-header",
 "metadata": {},
-"source": "---\n\n## Path B: Rasteret `time_series=True`\n\nRasteret caches COG header metadata (IFD offsets, byte counts, transforms) in\na local GeoParquet index. At read time, it skips IFD parsing entirely and\nfires **concurrent** HTTP range requests for pixel data across ALL timesteps.\n\n`to_torchgeo_dataset(time_series=True)` returns a standard `GeoDataset`,\nsame samplers, same DataLoader, same `stack_samples` collate function as Path A.\nEach sample returns `[T, C, H, W]` with all timesteps stacked."
+"source": [
+"---\n",
+"\n",
+"## Path B: Rasteret `time_series=True`\n",
+"\n",
+"Rasteret caches COG header metadata (IFD offsets, byte counts, transforms) in\n",
+"a local GeoParquet index. At read time, it skips IFD parsing entirely and\n",
+"fires **concurrent** HTTP range requests for pixel data across ALL timesteps.\n",
+"\n",
+"`to_torchgeo_dataset(time_series=True)` returns a standard `GeoDataset`,\n",
+"same samplers, same DataLoader, same `stack_samples` collate function as Path A.\n",
+"Each sample returns `[T, C, H, W]` with all timesteps stacked."
+]
 },
 {
 "cell_type": "code",
@@ -303,19 +329,77 @@
 "cell_type": "markdown",
 "id": "comparison",
 "metadata": {},
-"source": "---\n\n## What's different under the hood\n\nBoth paths produce `[batch, T, C, H, W]`, all timesteps stacked per chip.\n\n| | Path A: TorchGeo | Path B: Rasteret |\n|---|---|---|\n| **Index build** | `rasterio.open()` per COG over HTTP | Pre-built GeoParquet (read from disk) |\n| **Time series read** | Sequential: one `rasterio.merge()` per timestep | All T timesteps fired concurrently |\n| **HTTP overhead per timestep** | HEAD + IFD ranges + pixel ranges | Pixel ranges only (headers cached) |\n| **Concurrency** | None; GDAL reads are serial | asyncio.gather across all T × C reads |\n\n### Where the bottleneck is\n\nTorchGeo's `_merge_or_stack` with `time_series=True`:\n```python\ndest = np.stack([rasterio.merge.merge([fh], **kwargs)[0] for fh in vrt_fhs])\n```\nEach `fh` is a `WarpedVRT` wrapping a `rasterio.open(\"/vsicurl/...\")`.\nFor cloud COGs, each `rasterio.open()` triggers HTTP HEAD + 1-3 range requests\nfor IFD headers, **all sequential, no concurrency**.\n\nFor T=15 timesteps × ~3 HTTP requests each = **45 round trips at ~100ms = 4.5s\nof pure header overhead**, before any pixel data flows.\n\nRasteret pre-caches all IFD metadata in the GeoParquet index, then fires\nT × C `read_cog()` calls via `asyncio.gather`, all concurrent."
+"source": [
+"---\n",
+"\n",
+"## What's different under the hood\n",
+"\n",
+"Both paths produce `[batch, T, C, H, W]`, all timesteps stacked per chip.\n",
+"\n",
+"| | Path A: TorchGeo | Path B: Rasteret |\n",
+"|---|---|---|\n",
+"| **Index build** | `rasterio.open()` per COG over HTTP | Pre-built GeoParquet (read from disk) |\n",
+"| **Time series read** | Sequential: one `rasterio.merge()` per timestep | All T timesteps fired concurrently |\n",
+"| **HTTP overhead per timestep** | HEAD + IFD ranges + pixel ranges | Pixel ranges only (headers cached) |\n",
+"| **Concurrency** | None; GDAL reads are serial | asyncio.gather across all T × C reads |\n",
+"\n",
+"### Where the bottleneck is\n",
+"\n",
+"TorchGeo's `_merge_or_stack` with `time_series=True`:\n",
+"```python\n",
+"dest = np.stack([rasterio.merge.merge([fh], **kwargs)[0] for fh in vrt_fhs])\n",
+"```\n",
+"Each `fh` is a `WarpedVRT` wrapping a `rasterio.open(\"/vsicurl/...\")`.\n",
+"For cloud COGs, each `rasterio.open()` triggers HTTP HEAD + 1-3 range requests\n",
+"for IFD headers, **all sequential, no concurrency**.\n",
+"\n",
+"For T=15 timesteps × ~3 HTTP requests each = **45 round trips at ~100ms = 4.5s\n",
+"of pure header overhead**, before any pixel data flows.\n",
+"\n",
+"Rasteret pre-caches all IFD metadata in the GeoParquet index, then fires\n",
+"T × C `read_cog()` calls via `asyncio.gather`, all concurrent."
+]
 },
 {
 "cell_type": "markdown",
 "id": "when-to-use",
 "metadata": {},
-"source": "## When to use which\n\n| Scenario | Recommendation |\n|----------|---------------|\n| Cloud-hosted tiled GeoTIFFs (COGs) | **Rasteret** (over 20x faster) |\n| Local tiled GeoTIFFs | Rasteret works; speedup is smaller, but the index is still useful for filtering and sharing |\n| Non-tiled GeoTIFFs (striped layout) | TorchGeo / rasterio |\n| Non-TIFF formats (NetCDF, HDF5, GRIB) | TorchGeo / rasterio |\n\nRasteret does not replace TorchGeo - it accelerates the data loading underneath.\nFor the full ecosystem picture, see [Ecosystem Comparison](https://terrafloww.github.io/rasteret/explanation/interop/)."
+"source": [
+"## When to use which\n",
+"\n",
+"| Scenario | Recommendation |\n",
+"|----------|---------------|\n",
+"| Cloud-hosted tiled GeoTIFFs (COGs) | **Rasteret** (over 20x faster) |\n",
+"| Local tiled GeoTIFFs | Rasteret works; speedup is smaller, but the index is still useful for filtering and sharing |\n",
+"| Non-tiled GeoTIFFs (striped layout) | TorchGeo / rasterio |\n",
+"| Non-TIFF formats (NetCDF, HDF5, GRIB) | TorchGeo / rasterio |\n",
+"\n",
+"Rasteret does not replace TorchGeo - it accelerates the data loading underneath.\n",
+"For the full ecosystem picture, see [Ecosystem Comparison](https://terrafloww.github.io/rasteret/explanation/interop/)."
+]
 },
 {
 "cell_type": "markdown",
 "id": "fhisivmz2xw",
 "metadata": {},
-"source": "---\n\n## Section 2: Multi-AOI Scaling\n\nThe single-AOI comparison above uses 1 region and 15 scenes. Real training\npipelines cover **multiple regions** across a full year of imagery.\n\nThis section tests: **does the speedup hold (or grow) when we scale up?**\n\n- 5 AOIs across southern India (~180 km spread)\n- Full-year date range → 30 scenes\n- Larger batch (4 chips × 16 samples)\n- CRS auto-detected from the data (no hardcoded EPSG)\n\nBoth paths use the same `RandomGeoSampler`, no `roi` constraint, so the\nsampler weights by scene area and draws chips from anywhere in the index."
+"source": [
+"---\n",
+"\n",
+"## Section 2: Multi-AOI Scaling\n",
+"\n",
+"The single-AOI comparison above uses 1 region and 15 scenes. Real training\n",
+"pipelines cover **multiple regions** across a full year of imagery.\n",
+"\n",
+"This section tests: **does the speedup hold (or grow) when we scale up?**\n",
+"\n",
+"- 5 AOIs across southern India (~180 km spread)\n",
+"- Full-year date range → 30 scenes\n",
+"- Larger batch (4 chips × 16 samples)\n",
+"- CRS auto-detected from the data (no hardcoded EPSG)\n",
+"\n",
+"Both paths use the same `RandomGeoSampler`, no `roi` constraint, so the\n",
+"sampler weights by scene area and draws chips from anywhere in the index."
+]
 },
 {
 "cell_type": "code",
@@ -547,13 +631,49 @@
 "cell_type": "markdown",
 "id": "w7g93r0rfgn",
 "metadata": {},
-"source": "### Multi-AOI takeaways\n\n- **CRS auto-detection**: TorchGeo infers CRS from the first file (`crs=None`).\n Rasteret derives it from the collection's `proj:epsg` metadata.\n Both expose the result via `dataset.crs`: standard TorchGeo interop.\n\n- **Geometries as array**: Rasteret's `geometries=[aoi1, aoi2, ...]` accepts\n multiple polygons in WGS84 (or any CRS via `geometries_crs=`). Internally\n each polygon is reprojected to the dataset's native CRS and unioned for\n spatial filtering. TorchGeo's `roi=` only accepts a single polygon.\n\n- **Scaling**: As T (timesteps) and the number of scenes grow, TorchGeo's\n sequential `rasterio.open()` + `rasterio.merge()` loop scales linearly.\n Rasteret's `asyncio.gather` fires all reads concurrently, bounded only\n by network bandwidth and `max_concurrent`."
+"source": [
+"### Multi-AOI takeaways\n",
+"\n",
+"- **CRS auto-detection**: TorchGeo infers CRS from the first file (`crs=None`).\n",
+" Rasteret derives it from the collection's `proj:epsg` metadata.\n",
+" Both expose the result via `dataset.crs`: standard TorchGeo interop.\n",
+"\n",
+"- **Geometries as array**: Rasteret's `geometries=[aoi1, aoi2, ...]` accepts\n",
+" multiple polygons in WGS84 (or any CRS via `geometries_crs=`). Internally\n",
+" each polygon is reprojected to the dataset's native CRS and unioned for\n",
+" spatial filtering. TorchGeo's `roi=` only accepts a single polygon.\n",
+"\n",
+"- **Scaling**: As T (timesteps) and the number of scenes grow, TorchGeo's\n",
+" sequential `rasterio.open()` + `rasterio.merge()` loop scales linearly.\n",
+" Rasteret's `asyncio.gather` fires all reads concurrently, bounded only\n",
+" by network bandwidth and `max_concurrent`.\n",
+"\n",
+"- **AOI-only sampling**: `geometries=[...]` filters which records/tiles are included in the dataset, but samplers still sample over the dataset index bounds.\n",
+" To restrict chips to an AOI (for example a county polygon), pass `roi=<AOI polygon in dataset CRS>` to `GridGeoSampler` / `RandomGeoSampler`.\n"
+]
 },
 {
 "cell_type": "markdown",
 "id": "q4r60eba4x",
 "metadata": {},
-"source": "---\n\n## Section 3: Cross-CRS Boundary (Multi-Zone Reprojection)\n\nSections 1-2 stayed within a single UTM zone (EPSG:32643). Real workflows\noften span **UTM zone boundaries**, the 78°E meridian separates zone 43N\nfrom 44N, and Sentinel-2 tiles from each zone use different CRS.\n\nThis section places an AOI in the **overlap zone** east of Hyderabad where\ntiles `43QHV` (EPSG:32643) and `44QKE` (EPSG:32644) both provide coverage.\n\n- **TorchGeo**: Uses `WarpedVRT` to reproject each file to a common CRS on read\n- **Rasteret**: Uses `target_crs=32643` to keep all scenes and reprojects via\n `rasterio.warp.reproject()` after its concurrent fetch\n\nBoth paths end at the same `[batch, T, C, H, W]` tensor in EPSG:32643."
+"source": [
+"---\n",
+"\n",
+"## Section 3: Cross-CRS Boundary (Multi-Zone Reprojection)\n",
+"\n",
+"Sections 1-2 stayed within a single UTM zone (EPSG:32643). Real workflows\n",
+"often span **UTM zone boundaries**, the 78°E meridian separates zone 43N\n",
+"from 44N, and Sentinel-2 tiles from each zone use different CRS.\n",
+"\n",
+"This section places an AOI in the **overlap zone** east of Hyderabad where\n",
+"tiles `43QHV` (EPSG:32643) and `44QKE` (EPSG:32644) both provide coverage.\n",
+"\n",
+"- **TorchGeo**: Uses `WarpedVRT` to reproject each file to a common CRS on read\n",
+"- **Rasteret**: Uses `target_crs=32643` to keep all scenes and reprojects via\n",
+" `rasterio.warp.reproject()` after its concurrent fetch\n",
+"\n",
+"Both paths end at the same `[batch, T, C, H, W]` tensor in EPSG:32643."
+]
 },
 {
 "cell_type": "code",
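
The notebook's "Where the bottleneck is" cell estimates header overhead as T=15 timesteps × ~3 HTTP requests × ~100 ms = 4.5 s when requests run one after another. The sketch below reproduces that arithmetic and simulates the sequential versus concurrent pattern with `asyncio.sleep` standing in for a network round trip. It makes no real HTTP requests and calls neither TorchGeo nor Rasteret; `fake_request`, `run_sequential`, `run_concurrent`, and the scaled-down 10 ms latency are all assumptions made for illustration.

```python
import asyncio
import time

T = 15                        # timesteps, as in the notebook's estimate
REQUESTS_PER_TIMESTEP = 3     # ~3 header requests per file, as in the notebook's estimate
ROUND_TRIP_S = 0.100          # ~100 ms per round trip, as in the notebook's estimate
SIMULATED_ROUND_TRIP_S = 0.01 # scaled down so the simulation finishes quickly

n_requests = T * REQUESTS_PER_TIMESTEP
estimated_overhead_s = n_requests * ROUND_TRIP_S  # 45 round trips at 100 ms each


async def fake_request() -> None:
    """Stand-in for one HTTP round trip; it only sleeps."""
    await asyncio.sleep(SIMULATED_ROUND_TRIP_S)


async def run_sequential(n: int) -> float:
    """Await each request before starting the next one."""
    start = time.perf_counter()
    for _ in range(n):
        await fake_request()
    return time.perf_counter() - start


async def run_concurrent(n: int) -> float:
    """Start all requests at once and wait for them together."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request() for _ in range(n)))
    return time.perf_counter() - start


sequential_s = asyncio.run(run_sequential(n_requests))
concurrent_s = asyncio.run(run_concurrent(n_requests))
print(f"estimated sequential header overhead: {estimated_overhead_s:.1f} s")
print(f"simulated sequential: {sequential_s:.3f} s, concurrent: {concurrent_s:.3f} s")
```

Run this as a script. Inside a Jupyter notebook an event loop is already running, so `asyncio.run` raises an error there; use `await run_sequential(n_requests)` directly instead. The simulated sequential time grows with the number of requests, while the concurrent time stays near one round trip, which is the scaling behavior the notebook describes.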
