terrafloww
diff --git a/README.md b/README.md
Lines changed: 18 additions & 18 deletions
diff --git a/notebooks/05_torchgeo_comparison.ipynb b/notebooks/05_torchgeo_comparison.ipynb
Lines changed: 127 additions & 7 deletions
@@ -38,7 +38,6 @@ Key Features -
 - **Zero data downloads** - work with terabytes of imagery while storing only megabytes of metadata.
 - **No STAC at training time** - query once at setup; zero API calls during training with Collection you can extend.
 - **Reproducible** - same Parquet index = same records = same results
-- **Point sampling built in** - `sample_points()` returns Arrow-native tables for feature pipelines and large point sets
 - **Native dtypes** - integer imagery stays integer; missing/edge coverage is represented via fill values (nodata or 0) instead of NaNs
 - **Shareable cache** - enrich our Collection with your ML splits, patch geometries, custom data points for ML, and share it, don't write folders of image chips!
 
@@ -188,35 +187,26 @@ ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
 arr = collection.get_numpy(
     geometries=(77.55, 13.01, 77.58, 13.08),
     bands=["B04", "B08"],
-    all_touched=False,  # rasterio default masking semantics
 )
 # shape: [N, C, H, W] for multi-band, [N, H, W] for single-band
 ```
 
-### Point values (Arrow-native)
+### Point sampling
 
 ```python
-import duckdb
-
-# Keep your table in DuckDB and pass coordinate columns directly
-points = duckdb.sql("""
-    SELECT lon, lat
-    FROM read_parquet('points.parquet')
-""").arrow().read_all()
+from shapely.geometry import Point
 
 samples = collection.sample_points(
-    points=points,     # query input table/array of point locations
-    x_column="lon",
-    y_column="lat",
+    points=[Point(77.56, 13.03), Point(77.57, 13.04)],
     bands=["B04", "B08"],
     geometry_crs=4326,
-    match="latest",        # or "all" for full time series matches
 )
-# pyarrow.Table with point_index, record_id, datetime, band, value, point_crs, raster_crs
+# PyArrow Table — one row per (point, band, record)
 ```
 
-Collection-centric loop:
-`build/load/as_collection -> subset/where -> get_xarray/get_numpy/sample_points/to_torchgeo_dataset`.
+Reads only the tiles containing your points. Works with Shapely points,
+or pass a PyArrow table with coordinate columns for millions of points.
+No extras needed — available in the base install.
 
 <details>
 <summary><strong>Going further</strong></summary>
@@ -225,18 +215,21 @@ Collection-centric loop:
 |---|---|
 | Datasets not in the catalog | [`build_from_stac()`](https://terrafloww.github.io/rasteret/how-to/collection-management/) |
 | Parquet with COG URLs (Source Cooperative, STAC GeoParquet, custom) | [`build_from_table(path, name=...)`](https://terrafloww.github.io/rasteret/how-to/build-from-parquet/) |
+| Sample values at many points (Arrow-native) | [`sample_points()`](https://terrafloww.github.io/rasteret/how-to/point-sampling-and-masking/) |
 | Multi-band COGs (AEF embeddings, etc.) | [AEF Embeddings guide](https://terrafloww.github.io/rasteret/how-to/aef-embeddings/) |
 | Authenticated sources (PC, requester-pays, Earthdata, etc.) | [Custom Cloud Provider](https://terrafloww.github.io/rasteret/how-to/custom-cloud-provider/) |
 | Share a Collection | `collection.export("path/")` then `rasteret.load("path/")` |
 | Filter by cloud cover, date, bbox | [`collection.subset()`](https://terrafloww.github.io/rasteret/how-to/collection-management/) |
-| Sample values for large point sets | [`collection.sample_points()`](https://terrafloww.github.io/rasteret/how-to/point-sampling-and-masking/) |
 
 </details>
 
 ---
 
 ## Benchmarks
 
+<details>
+<summary><strong>Single request performance (time series query)</strong></summary>
+
 ### Single request performance
 
 Processing pipeline: Filter 450,000 scenes -> 22 matches -> Read 44 COG files
@@ -253,6 +246,8 @@ Run on AWS t3.xlarge (4 CPU) —
 | **Rasteret** | 3 s | 3 s |
 | **Google Earth Engine** | 10–30 s | 3–5 s |
 
+</details>
+
 ### Cold-start comparison with TorchGeo
 
 Same AOIs, same scenes, same sampler, same DataLoader. Both paths output
@@ -274,6 +269,9 @@ for full methodology.
 ![Processing time comparison](./assets/benchmark_results.png)
 ![Speedup breakdown](./assets/benchmark_breakdown.png)
 
+<details>
+<summary><strong>HF baseline (payload-Parquet patches)</strong></summary>
+
 ### HF `datasets` baseline (Major TOM keyed patches)
 
 Baseline method: `datasets.load_dataset(..., streaming=True, filters=...)` with
@@ -302,6 +300,8 @@ Notebook: [`05_torchgeo_comparison.ipynb`](docs/tutorials/05_torchgeo_comparison
 > [GitHub Discussions](https://github.com/terrafloww/rasteret/discussions/categories/show-and-tell)
 > or [Discord](https://discord.gg/V5vvuEBc).
 
+</details>
+
 ---
 
 ## Scope and stability
 
@@ -4,7 +4,21 @@
    "cell_type": "markdown",
    "id": "intro",
    "metadata": {},
-   "source": "# TorchGeo Benchmark\n\nThis notebook runs an **identical time-series workload** through two paths:\n\n- **Path A**: TorchGeo 0.9 `time_series=True`: sequential per-file reads via GDAL vsicurl\n- **Path B**: Rasteret `to_torchgeo_dataset(time_series=True)` with default backend routing\n\n**Controlled variables** (same across all paths):\n- AOI, date range, band (B04), scene count, chip size (256), batch size (2)\n- Same `RandomGeoSampler` → `DataLoader` → batch flow\n- Same output shape: `[batch, T, C, H, W]`, all timesteps stacked\n\nThe only difference is **how pixel data reaches the DataLoader**."
+   "source": [
+    "# TorchGeo Benchmark\n",
+    "\n",
+    "This notebook runs an **identical time-series workload** through two paths:\n",
+    "\n",
+    "- **Path A**: TorchGeo 0.9 `time_series=True`: sequential per-file reads via GDAL vsicurl\n",
+    "- **Path B**: Rasteret `to_torchgeo_dataset(time_series=True)` with default backend routing\n",
+    "\n",
+    "**Controlled variables** (same across all paths):\n",
+    "- AOI, date range, band (B04), scene count, chip size (256), batch size (2)\n",
+    "- Same `RandomGeoSampler` → `DataLoader` → batch flow\n",
+    "- Same output shape: `[batch, T, C, H, W]`, all timesteps stacked\n",
+    "\n",
+    "The only difference is **how pixel data reaches the DataLoader**."
+   ]
   },
   {
    "cell_type": "code",
@@ -177,7 +191,19 @@
    "cell_type": "markdown",
    "id": "path-b-header",
    "metadata": {},
-   "source": "---\n\n## Path B: Rasteret `time_series=True`\n\nRasteret caches COG header metadata (IFD offsets, byte counts, transforms) in\na local GeoParquet index.  At read time, it skips IFD parsing entirely and\nfires **concurrent** HTTP range requests for pixel data across ALL timesteps.\n\n`to_torchgeo_dataset(time_series=True)` returns a standard `GeoDataset`,\nsame samplers, same DataLoader, same `stack_samples` collate function as Path A.\nEach sample returns `[T, C, H, W]` with all timesteps stacked."
+   "source": [
+    "---\n",
+    "\n",
+    "## Path B: Rasteret `time_series=True`\n",
+    "\n",
+    "Rasteret caches COG header metadata (IFD offsets, byte counts, transforms) in\n",
+    "a local GeoParquet index.  At read time, it skips IFD parsing entirely and\n",
+    "fires **concurrent** HTTP range requests for pixel data across ALL timesteps.\n",
+    "\n",
+    "`to_torchgeo_dataset(time_series=True)` returns a standard `GeoDataset`,\n",
+    "same samplers, same DataLoader, same `stack_samples` collate function as Path A.\n",
+    "Each sample returns `[T, C, H, W]` with all timesteps stacked."
+   ]
   },
   {
    "cell_type": "code",
@@ -303,19 +329,77 @@
    "cell_type": "markdown",
    "id": "comparison",
    "metadata": {},
-   "source": "---\n\n## What's different under the hood\n\nBoth paths produce `[batch, T, C, H, W]`, all timesteps stacked per chip.\n\n| | Path A: TorchGeo | Path B: Rasteret |\n|---|---|---|\n| **Index build** | `rasterio.open()` per COG over HTTP | Pre-built GeoParquet (read from disk) |\n| **Time series read** | Sequential: one `rasterio.merge()` per timestep | All T timesteps fired concurrently |\n| **HTTP overhead per timestep** | HEAD + IFD ranges + pixel ranges | Pixel ranges only (headers cached) |\n| **Concurrency** | None; GDAL reads are serial | asyncio.gather across all T × C reads |\n\n### Where the bottleneck is\n\nTorchGeo's `_merge_or_stack` with `time_series=True`:\n```python\ndest = np.stack([rasterio.merge.merge([fh], **kwargs)[0] for fh in vrt_fhs])\n```\nEach `fh` is a `WarpedVRT` wrapping a `rasterio.open(\"/vsicurl/...\")`.\nFor cloud COGs, each `rasterio.open()` triggers HTTP HEAD + 1-3 range requests\nfor IFD headers, **all sequential, no concurrency**.\n\nFor T=15 timesteps × ~3 HTTP requests each = **45 round trips at ~100ms = 4.5s\nof pure header overhead**, before any pixel data flows.\n\nRasteret pre-caches all IFD metadata in the GeoParquet index, then fires\nT × C `read_cog()` calls via `asyncio.gather`, all concurrent."
+   "source": [
+    "---\n",
+    "\n",
+    "## What's different under the hood\n",
+    "\n",
+    "Both paths produce `[batch, T, C, H, W]`, all timesteps stacked per chip.\n",
+    "\n",
+    "| | Path A: TorchGeo | Path B: Rasteret |\n",
+    "|---|---|---|\n",
+    "| **Index build** | `rasterio.open()` per COG over HTTP | Pre-built GeoParquet (read from disk) |\n",
+    "| **Time series read** | Sequential: one `rasterio.merge()` per timestep | All T timesteps fired concurrently |\n",
+    "| **HTTP overhead per timestep** | HEAD + IFD ranges + pixel ranges | Pixel ranges only (headers cached) |\n",
+    "| **Concurrency** | None; GDAL reads are serial | asyncio.gather across all T × C reads |\n",
+    "\n",
+    "### Where the bottleneck is\n",
+    "\n",
+    "TorchGeo's `_merge_or_stack` with `time_series=True`:\n",
+    "```python\n",
+    "dest = np.stack([rasterio.merge.merge([fh], **kwargs)[0] for fh in vrt_fhs])\n",
+    "```\n",
+    "Each `fh` is a `WarpedVRT` wrapping a `rasterio.open(\"/vsicurl/...\")`.\n",
+    "For cloud COGs, each `rasterio.open()` triggers HTTP HEAD + 1-3 range requests\n",
+    "for IFD headers, **all sequential, no concurrency**.\n",
+    "\n",
+    "For T=15 timesteps × ~3 HTTP requests each = **45 round trips at ~100ms = 4.5s\n",
+    "of pure header overhead**, before any pixel data flows.\n",
+    "\n",
+    "Rasteret pre-caches all IFD metadata in the GeoParquet index, then fires\n",
+    "T × C `read_cog()` calls via `asyncio.gather`, all concurrent."
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "when-to-use",
    "metadata": {},
-   "source": "## When to use which\n\n| Scenario | Recommendation |\n|----------|---------------|\n| Cloud-hosted tiled GeoTIFFs (COGs) | **Rasteret** (over 20x faster) |\n| Local tiled GeoTIFFs | Rasteret works; speedup is smaller, but the index is still useful for filtering and sharing |\n| Non-tiled GeoTIFFs (striped layout) | TorchGeo / rasterio |\n| Non-TIFF formats (NetCDF, HDF5, GRIB) | TorchGeo / rasterio |\n\nRasteret does not replace TorchGeo - it accelerates the data loading underneath.\nFor the full ecosystem picture, see [Ecosystem Comparison](https://terrafloww.github.io/rasteret/explanation/interop/)."
+   "source": [
+    "## When to use which\n",
+    "\n",
+    "| Scenario | Recommendation |\n",
+    "|----------|---------------|\n",
+    "| Cloud-hosted tiled GeoTIFFs (COGs) | **Rasteret** (over 20x faster) |\n",
+    "| Local tiled GeoTIFFs | Rasteret works; speedup is smaller, but the index is still useful for filtering and sharing |\n",
+    "| Non-tiled GeoTIFFs (striped layout) | TorchGeo / rasterio |\n",
+    "| Non-TIFF formats (NetCDF, HDF5, GRIB) | TorchGeo / rasterio |\n",
+    "\n",
+    "Rasteret does not replace TorchGeo - it accelerates the data loading underneath.\n",
+    "For the full ecosystem picture, see [Ecosystem Comparison](https://terrafloww.github.io/rasteret/explanation/interop/)."
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "fhisivmz2xw",
    "metadata": {},
-   "source": "---\n\n## Section 2: Multi-AOI Scaling\n\nThe single-AOI comparison above uses 1 region and 15 scenes.  Real training\npipelines cover **multiple regions** across a full year of imagery.\n\nThis section tests: **does the speedup hold (or grow) when we scale up?**\n\n- 5 AOIs across southern India (~180 km spread)\n- Full-year date range → 30 scenes\n- Larger batch (4 chips × 16 samples)\n- CRS auto-detected from the data (no hardcoded EPSG)\n\nBoth paths use the same `RandomGeoSampler`, no `roi` constraint, so the\nsampler weights by scene area and draws chips from anywhere in the index."
+   "source": [
+    "---\n",
+    "\n",
+    "## Section 2: Multi-AOI Scaling\n",
+    "\n",
+    "The single-AOI comparison above uses 1 region and 15 scenes.  Real training\n",
+    "pipelines cover **multiple regions** across a full year of imagery.\n",
+    "\n",
+    "This section tests: **does the speedup hold (or grow) when we scale up?**\n",
+    "\n",
+    "- 5 AOIs across southern India (~180 km spread)\n",
+    "- Full-year date range → 30 scenes\n",
+    "- Larger batch (4 chips × 16 samples)\n",
+    "- CRS auto-detected from the data (no hardcoded EPSG)\n",
+    "\n",
+    "Both paths use the same `RandomGeoSampler`, no `roi` constraint, so the\n",
+    "sampler weights by scene area and draws chips from anywhere in the index."
+   ]
   },
   {
    "cell_type": "code",
@@ -547,13 +631,49 @@
    "cell_type": "markdown",
    "id": "w7g93r0rfgn",
    "metadata": {},
-   "source": "### Multi-AOI takeaways\n\n- **CRS auto-detection**: TorchGeo infers CRS from the first file (`crs=None`).\n  Rasteret derives it from the collection's `proj:epsg` metadata.\n  Both expose the result via `dataset.crs`: standard TorchGeo interop.\n\n- **Geometries as array**: Rasteret's `geometries=[aoi1, aoi2, ...]` accepts\n  multiple polygons in WGS84 (or any CRS via `geometries_crs=`).  Internally\n  each polygon is reprojected to the dataset's native CRS and unioned for\n  spatial filtering.  TorchGeo's `roi=` only accepts a single polygon.\n\n- **Scaling**: As T (timesteps) and the number of scenes grow, TorchGeo's\n  sequential `rasterio.open()` + `rasterio.merge()` loop scales linearly.\n  Rasteret's `asyncio.gather` fires all reads concurrently, bounded only\n  by network bandwidth and `max_concurrent`."
+   "source": [
+    "### Multi-AOI takeaways\n",
+    "\n",
+    "- **CRS auto-detection**: TorchGeo infers CRS from the first file (`crs=None`).\n",
+    "  Rasteret derives it from the collection's `proj:epsg` metadata.\n",
+    "  Both expose the result via `dataset.crs`: standard TorchGeo interop.\n",
+    "\n",
+    "- **Geometries as array**: Rasteret's `geometries=[aoi1, aoi2, ...]` accepts\n",
+    "  multiple polygons in WGS84 (or any CRS via `geometries_crs=`).  Internally\n",
+    "  each polygon is reprojected to the dataset's native CRS and unioned for\n",
+    "  spatial filtering.  TorchGeo's `roi=` only accepts a single polygon.\n",
+    "\n",
+    "- **Scaling**: As T (timesteps) and the number of scenes grow, TorchGeo's\n",
+    "  sequential `rasterio.open()` + `rasterio.merge()` loop scales linearly.\n",
+    "  Rasteret's `asyncio.gather` fires all reads concurrently, bounded only\n",
+    "  by network bandwidth and `max_concurrent`.\n",
+    "\n",
+    "- **AOI-only sampling**: `geometries=[...]` filters which records/tiles are included in the dataset, but samplers still sample over the dataset index bounds.\n",
+    "  To restrict chips to an AOI (for example a county polygon), pass `roi=<AOI polygon in dataset CRS>` to `GridGeoSampler` / `RandomGeoSampler`.\n"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "q4r60eba4x",
    "metadata": {},
-   "source": "---\n\n## Section 3: Cross-CRS Boundary (Multi-Zone Reprojection)\n\nSections 1-2 stayed within a single UTM zone (EPSG:32643).  Real workflows\noften span **UTM zone boundaries**, the 78°E meridian separates zone 43N\nfrom 44N, and Sentinel-2 tiles from each zone use different CRS.\n\nThis section places an AOI in the **overlap zone** east of Hyderabad where\ntiles `43QHV` (EPSG:32643) and `44QKE` (EPSG:32644) both provide coverage.\n\n- **TorchGeo**: Uses `WarpedVRT` to reproject each file to a common CRS on read\n- **Rasteret**: Uses `target_crs=32643` to keep all scenes and reprojects via\n  `rasterio.warp.reproject()` after its concurrent fetch\n\nBoth paths end at the same `[batch, T, C, H, W]` tensor in EPSG:32643."
+   "source": [
+    "---\n",
+    "\n",
+    "## Section 3: Cross-CRS Boundary (Multi-Zone Reprojection)\n",
+    "\n",
+    "Sections 1-2 stayed within a single UTM zone (EPSG:32643).  Real workflows\n",
+    "often span **UTM zone boundaries**, the 78°E meridian separates zone 43N\n",
+    "from 44N, and Sentinel-2 tiles from each zone use different CRS.\n",
+    "\n",
+    "This section places an AOI in the **overlap zone** east of Hyderabad where\n",
+    "tiles `43QHV` (EPSG:32643) and `44QKE` (EPSG:32644) both provide coverage.\n",
+    "\n",
+    "- **TorchGeo**: Uses `WarpedVRT` to reproject each file to a common CRS on read\n",
+    "- **Rasteret**: Uses `target_crs=32643` to keep all scenes and reprojects via\n",
+    "  `rasterio.warp.reproject()` after its concurrent fetch\n",
+    "\n",
+    "Both paths end at the same `[batch, T, C, H, W]` tensor in EPSG:32643."
+   ]
   },
   {
    "cell_type": "code",