|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "id": "intro", |
6 | 6 | "metadata": {}, |
7 | | - "source": "# TorchGeo Benchmark\n\nThis notebook runs an **identical time-series workload** through two paths:\n\n- **Path A**: TorchGeo 0.9 `time_series=True`: sequential per-file reads via GDAL vsicurl\n- **Path B**: Rasteret `to_torchgeo_dataset(time_series=True)` with default backend routing\n\n**Controlled variables** (same across all paths):\n- AOI, date range, band (B04), scene count, chip size (256), batch size (2)\n- Same `RandomGeoSampler` → `DataLoader` → batch flow\n- Same output shape: `[batch, T, C, H, W]`, all timesteps stacked\n\nThe only difference is **how pixel data reaches the DataLoader**." |
| 7 | + "source": [ |
| 8 | + "# TorchGeo Benchmark\n", |
| 9 | + "\n", |
| 10 | + "This notebook runs an **identical time-series workload** through two paths:\n", |
| 11 | + "\n", |
| 12 | + "- **Path A**: TorchGeo 0.9 `time_series=True`: sequential per-file reads via GDAL vsicurl\n", |
| 13 | + "- **Path B**: Rasteret `to_torchgeo_dataset(time_series=True)` with default backend routing\n", |
| 14 | + "\n", |
| 15 | + "**Controlled variables** (same across both paths):\n", |
| 16 | + "- AOI, date range, band (B04), scene count, chip size (256), batch size (2)\n", |
| 17 | + "- Same `RandomGeoSampler` → `DataLoader` → batch flow\n", |
| 18 | + "- Same output shape: `[batch, T, C, H, W]`, all timesteps stacked\n", |
| 19 | + "\n", |
| 20 | + "The only difference is **how pixel data reaches the DataLoader**." |
| 21 | + ] |
8 | 22 | }, |
9 | 23 | { |
10 | 24 | "cell_type": "code", |
|
177 | 191 | "cell_type": "markdown", |
178 | 192 | "id": "path-b-header", |
179 | 193 | "metadata": {}, |
180 | | - "source": "---\n\n## Path B: Rasteret `time_series=True`\n\nRasteret caches COG header metadata (IFD offsets, byte counts, transforms) in\na local GeoParquet index. At read time, it skips IFD parsing entirely and\nfires **concurrent** HTTP range requests for pixel data across ALL timesteps.\n\n`to_torchgeo_dataset(time_series=True)` returns a standard `GeoDataset`,\nsame samplers, same DataLoader, same `stack_samples` collate function as Path A.\nEach sample returns `[T, C, H, W]` with all timesteps stacked." |
| 194 | + "source": [ |
| 195 | + "---\n", |
| 196 | + "\n", |
| 197 | + "## Path B: Rasteret `time_series=True`\n", |
| 198 | + "\n", |
| 199 | + "Rasteret caches COG header metadata (IFD offsets, byte counts, transforms) in\n", |
| 200 | + "a local GeoParquet index. At read time, it skips IFD parsing entirely and\n", |
| 201 | + "fires **concurrent** HTTP range requests for pixel data across ALL timesteps.\n", |
| 202 | + "\n", |
| 203 | + "`to_torchgeo_dataset(time_series=True)` returns a standard `GeoDataset`,\n", |
| 204 | + "same samplers, same DataLoader, same `stack_samples` collate function as Path A.\n", |
| 205 | + "Each sample returns `[T, C, H, W]` with all timesteps stacked." |
| 206 | + ] |
181 | 207 | }, |
182 | 208 | { |
183 | 209 | "cell_type": "code", |
|
303 | 329 | "cell_type": "markdown", |
304 | 330 | "id": "comparison", |
305 | 331 | "metadata": {}, |
306 | | - "source": "---\n\n## What's different under the hood\n\nBoth paths produce `[batch, T, C, H, W]`, all timesteps stacked per chip.\n\n| | Path A: TorchGeo | Path B: Rasteret |\n|---|---|---|\n| **Index build** | `rasterio.open()` per COG over HTTP | Pre-built GeoParquet (read from disk) |\n| **Time series read** | Sequential: one `rasterio.merge()` per timestep | All T timesteps fired concurrently |\n| **HTTP overhead per timestep** | HEAD + IFD ranges + pixel ranges | Pixel ranges only (headers cached) |\n| **Concurrency** | None; GDAL reads are serial | asyncio.gather across all T × C reads |\n\n### Where the bottleneck is\n\nTorchGeo's `_merge_or_stack` with `time_series=True`:\n```python\ndest = np.stack([rasterio.merge.merge([fh], **kwargs)[0] for fh in vrt_fhs])\n```\nEach `fh` is a `WarpedVRT` wrapping a `rasterio.open(\"/vsicurl/...\")`.\nFor cloud COGs, each `rasterio.open()` triggers HTTP HEAD + 1-3 range requests\nfor IFD headers, **all sequential, no concurrency**.\n\nFor T=15 timesteps × ~3 HTTP requests each = **45 round trips at ~100ms = 4.5s\nof pure header overhead**, before any pixel data flows.\n\nRasteret pre-caches all IFD metadata in the GeoParquet index, then fires\nT × C `read_cog()` calls via `asyncio.gather`, all concurrent." |
| 332 | + "source": [ |
| 333 | + "---\n", |
| 334 | + "\n", |
| 335 | + "## What's different under the hood\n", |
| 336 | + "\n", |
| 337 | + "Both paths produce `[batch, T, C, H, W]`, all timesteps stacked per chip.\n", |
| 338 | + "\n", |
| 339 | + "| | Path A: TorchGeo | Path B: Rasteret |\n", |
| 340 | + "|---|---|---|\n", |
| 341 | + "| **Index build** | `rasterio.open()` per COG over HTTP | Pre-built GeoParquet (read from disk) |\n", |
| 342 | + "| **Time series read** | Sequential: one `rasterio.merge()` per timestep | All T timesteps fired concurrently |\n", |
| 343 | + "| **HTTP overhead per timestep** | HEAD + IFD ranges + pixel ranges | Pixel ranges only (headers cached) |\n", |
| 344 | + "| **Concurrency** | None; GDAL reads are serial | asyncio.gather across all T × C reads |\n", |
| 345 | + "\n", |
| 346 | + "### Where the bottleneck is\n", |
| 347 | + "\n", |
| 348 | + "TorchGeo's `_merge_or_stack` with `time_series=True`:\n", |
| 349 | + "```python\n", |
| 350 | + "dest = np.stack([rasterio.merge.merge([fh], **kwargs)[0] for fh in vrt_fhs])\n", |
| 351 | + "```\n", |
| 352 | + "Each `fh` is a `WarpedVRT` wrapping a `rasterio.open(\"/vsicurl/...\")`.\n", |
| 353 | + "For cloud COGs, each `rasterio.open()` triggers HTTP HEAD + 1-3 range requests\n", |
| 354 | + "for IFD headers, **all sequential, no concurrency**.\n", |
| 355 | + "\n", |
| 356 | + "At T=15 timesteps × ~3 HTTP requests each, that is **45 round trips; at ~100 ms apiece, roughly 4.5 s\n", |
| 357 | + "of pure header overhead**, before any pixel data flows.\n", |
| 358 | + "\n", |
| 359 | + "Rasteret pre-caches all IFD metadata in the GeoParquet index, then fires\n", |
| 360 | + "T × C `read_cog()` calls via `asyncio.gather`, all concurrent." |
| 361 | + ] |
307 | 362 | }, |
308 | 363 | { |
309 | 364 | "cell_type": "markdown", |
310 | 365 | "id": "when-to-use", |
311 | 366 | "metadata": {}, |
312 | | - "source": "## When to use which\n\n| Scenario | Recommendation |\n|----------|---------------|\n| Cloud-hosted tiled GeoTIFFs (COGs) | **Rasteret** (over 20x faster) |\n| Local tiled GeoTIFFs | Rasteret works; speedup is smaller, but the index is still useful for filtering and sharing |\n| Non-tiled GeoTIFFs (striped layout) | TorchGeo / rasterio |\n| Non-TIFF formats (NetCDF, HDF5, GRIB) | TorchGeo / rasterio |\n\nRasteret does not replace TorchGeo - it accelerates the data loading underneath.\nFor the full ecosystem picture, see [Ecosystem Comparison](https://terrafloww.github.io/rasteret/explanation/interop/)." |
| 367 | + "source": [ |
| 368 | + "## When to use which\n", |
| 369 | + "\n", |
| 370 | + "| Scenario | Recommendation |\n", |
| 371 | + "|----------|---------------|\n", |
| 372 | + "| Cloud-hosted tiled GeoTIFFs (COGs) | **Rasteret** (over 20x faster) |\n", |
| 373 | + "| Local tiled GeoTIFFs | Rasteret works; speedup is smaller, but the index is still useful for filtering and sharing |\n", |
| 374 | + "| Non-tiled GeoTIFFs (striped layout) | TorchGeo / rasterio |\n", |
| 375 | + "| Non-TIFF formats (NetCDF, HDF5, GRIB) | TorchGeo / rasterio |\n", |
| 376 | + "\n", |
| 377 | + "Rasteret does not replace TorchGeo; it accelerates the data loading underneath.\n", |
| 378 | + "For the full ecosystem picture, see [Ecosystem Comparison](https://terrafloww.github.io/rasteret/explanation/interop/)." |
| 379 | + ] |
313 | 380 | }, |
314 | 381 | { |
315 | 382 | "cell_type": "markdown", |
316 | 383 | "id": "fhisivmz2xw", |
317 | 384 | "metadata": {}, |
318 | | - "source": "---\n\n## Section 2: Multi-AOI Scaling\n\nThe single-AOI comparison above uses 1 region and 15 scenes. Real training\npipelines cover **multiple regions** across a full year of imagery.\n\nThis section tests: **does the speedup hold (or grow) when we scale up?**\n\n- 5 AOIs across southern India (~180 km spread)\n- Full-year date range → 30 scenes\n- Larger batch (4 chips × 16 samples)\n- CRS auto-detected from the data (no hardcoded EPSG)\n\nBoth paths use the same `RandomGeoSampler`, no `roi` constraint, so the\nsampler weights by scene area and draws chips from anywhere in the index." |
| 385 | + "source": [ |
| 386 | + "---\n", |
| 387 | + "\n", |
| 388 | + "## Section 2: Multi-AOI Scaling\n", |
| 389 | + "\n", |
| 390 | + "The single-AOI comparison above uses 1 region and 15 scenes. Real training\n", |
| 391 | + "pipelines cover **multiple regions** across a full year of imagery.\n", |
| 392 | + "\n", |
| 393 | + "This section tests: **does the speedup hold (or grow) when we scale up?**\n", |
| 394 | + "\n", |
| 395 | + "- 5 AOIs across southern India (~180 km spread)\n", |
| 396 | + "- Full-year date range → 30 scenes\n", |
| 397 | + "- Larger batch (4 chips × 16 samples)\n", |
| 398 | + "- CRS auto-detected from the data (no hardcoded EPSG)\n", |
| 399 | + "\n", |
| 400 | + "Both paths use the same `RandomGeoSampler`, no `roi` constraint, so the\n", |
| 401 | + "sampler weights by scene area and draws chips from anywhere in the index." |
| 402 | + ] |
319 | 403 | }, |
320 | 404 | { |
321 | 405 | "cell_type": "code", |
|
547 | 631 | "cell_type": "markdown", |
548 | 632 | "id": "w7g93r0rfgn", |
549 | 633 | "metadata": {}, |
550 | | - "source": "### Multi-AOI takeaways\n\n- **CRS auto-detection**: TorchGeo infers CRS from the first file (`crs=None`).\n Rasteret derives it from the collection's `proj:epsg` metadata.\n Both expose the result via `dataset.crs`: standard TorchGeo interop.\n\n- **Geometries as array**: Rasteret's `geometries=[aoi1, aoi2, ...]` accepts\n multiple polygons in WGS84 (or any CRS via `geometries_crs=`). Internally\n each polygon is reprojected to the dataset's native CRS and unioned for\n spatial filtering. TorchGeo's `roi=` only accepts a single polygon.\n\n- **Scaling**: As T (timesteps) and the number of scenes grow, TorchGeo's\n sequential `rasterio.open()` + `rasterio.merge()` loop scales linearly.\n Rasteret's `asyncio.gather` fires all reads concurrently, bounded only\n by network bandwidth and `max_concurrent`." |
| 634 | + "source": [ |
| 635 | + "### Multi-AOI takeaways\n", |
| 636 | + "\n", |
| 637 | + "- **CRS auto-detection**: TorchGeo infers CRS from the first file (`crs=None`).\n", |
| 638 | + " Rasteret derives it from the collection's `proj:epsg` metadata.\n", |
| 639 | + " Both expose the result via `dataset.crs`: standard TorchGeo interop.\n", |
| 640 | + "\n", |
| 641 | + "- **Geometries as array**: Rasteret's `geometries=[aoi1, aoi2, ...]` accepts\n", |
| 642 | + " multiple polygons in WGS84 (or any CRS via `geometries_crs=`). Internally\n", |
| 643 | + " each polygon is reprojected to the dataset's native CRS and unioned for\n", |
| 644 | + " spatial filtering. TorchGeo's `roi=` only accepts a single polygon.\n", |
| 645 | + "\n", |
| 646 | + "- **Scaling**: As T (timesteps) and the number of scenes grow, TorchGeo's\n", |
| 647 | + " sequential `rasterio.open()` + `rasterio.merge()` loop scales linearly.\n", |
| 648 | + " Rasteret's `asyncio.gather` fires all reads concurrently, bounded only\n", |
| 649 | + " by network bandwidth and `max_concurrent`.\n", |
| 650 | + "\n", |
| 651 | + "- **AOI-only sampling**: `geometries=[...]` filters which records/tiles are included in the dataset, but samplers still sample over the dataset index bounds.\n", |
| 652 | + " To restrict chips to an AOI (for example a county polygon), pass `roi=<AOI polygon in dataset CRS>` to `GridGeoSampler` / `RandomGeoSampler`.\n" |
| 653 | + ] |
551 | 654 | }, |
552 | 655 | { |
553 | 656 | "cell_type": "markdown", |
554 | 657 | "id": "q4r60eba4x", |
555 | 658 | "metadata": {}, |
556 | | - "source": "---\n\n## Section 3: Cross-CRS Boundary (Multi-Zone Reprojection)\n\nSections 1-2 stayed within a single UTM zone (EPSG:32643). Real workflows\noften span **UTM zone boundaries**, the 78°E meridian separates zone 43N\nfrom 44N, and Sentinel-2 tiles from each zone use different CRS.\n\nThis section places an AOI in the **overlap zone** east of Hyderabad where\ntiles `43QHV` (EPSG:32643) and `44QKE` (EPSG:32644) both provide coverage.\n\n- **TorchGeo**: Uses `WarpedVRT` to reproject each file to a common CRS on read\n- **Rasteret**: Uses `target_crs=32643` to keep all scenes and reprojects via\n `rasterio.warp.reproject()` after its concurrent fetch\n\nBoth paths end at the same `[batch, T, C, H, W]` tensor in EPSG:32643." |
| 659 | + "source": [ |
| 660 | + "---\n", |
| 661 | + "\n", |
| 662 | + "## Section 3: Cross-CRS Boundary (Multi-Zone Reprojection)\n", |
| 663 | + "\n", |
| 664 | + "Sections 1-2 stayed within a single UTM zone (EPSG:32643). Real workflows\n", |
| 665 | + "often span **UTM zone boundaries**: the 78°E meridian separates zone 43N\n", |
| 666 | + "from 44N, and Sentinel-2 tiles from each zone use a different CRS.\n", |
| 667 | + "\n", |
| 668 | + "This section places an AOI in the **overlap zone** east of Hyderabad where\n", |
| 669 | + "tiles `43QHV` (EPSG:32643) and `44QKE` (EPSG:32644) both provide coverage.\n", |
| 670 | + "\n", |
| 671 | + "- **TorchGeo**: Uses `WarpedVRT` to reproject each file to a common CRS on read\n", |
| 672 | + "- **Rasteret**: Uses `target_crs=32643` to keep all scenes and reprojects via\n", |
| 673 | + " `rasterio.warp.reproject()` after its concurrent fetch\n", |
| 674 | + "\n", |
| 675 | + "Both paths end at the same `[batch, T, C, H, W]` tensor in EPSG:32643." |
| 676 | + ] |
557 | 677 | }, |
558 | 678 | { |
559 | 679 | "cell_type": "code", |
|
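The header-overhead estimate in the `comparison` cell (15 timesteps × ~3 requests × ~100 ms) and its sequential-versus-concurrent contrast can be sanity-checked with a small simulation. The sketch below is not TorchGeo or Rasteret code. `fake_request`, `run_sequential`, `run_concurrent`, and `header_overhead_s` are hypothetical names, and `asyncio.sleep` stands in for an HTTP round trip. It only illustrates how latency accumulates when requests are awaited one at a time versus gathered together.

```python
import asyncio
import time


def header_overhead_s(timesteps: int, requests_per_open: int, rtt_s: float) -> float:
    """Back-of-envelope cost of serial header requests (all inputs are assumptions)."""
    return timesteps * requests_per_open * rtt_s


async def fake_request(latency_s: float) -> None:
    """Hypothetical stand-in for one HTTP round trip; performs no network I/O."""
    await asyncio.sleep(latency_s)


async def run_sequential(n_requests: int, latency_s: float) -> float:
    """Await each simulated request before starting the next one."""
    start = time.perf_counter()
    for _ in range(n_requests):
        await fake_request(latency_s)
    return time.perf_counter() - start


async def run_concurrent(n_requests: int, latency_s: float) -> float:
    """Start all simulated requests at once and wait for them together."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(latency_s) for _ in range(n_requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    # Figures quoted in the notebook text: T=15, ~3 requests per open, ~100 ms RTT.
    print(f"estimated serial header overhead: {header_overhead_s(15, 3, 0.100):.1f} s")

    # Scaled-down simulation (10 ms per request) so the demo finishes quickly.
    n, latency = 15 * 3, 0.010
    seq = asyncio.run(run_sequential(n, latency))
    con = asyncio.run(run_concurrent(n, latency))
    print(f"sequential: {seq:.3f} s, concurrent: {con:.3f} s")
```

In the simulation the sequential total grows with the number of requests while the concurrent total stays close to a single round trip. Real reads are also limited by bandwidth, server behaviour, and any concurrency cap, so the gap measured in the notebook will differ from this idealised one.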