Releases: terrafloww/rasteret

0.3.5

13 Mar 05:05

Changelog

v0.3.5

Added

  • Descriptor-backed dual-surface planning: catalog entries can now
    describe separate record-table, index, and collection surfaces via
    record_table_uri, index_uri, and collection_uri. This lets Rasteret
    keep build-time metadata sources separate from published runtime
    collections.
  • Collection.head(): first-class metadata preview API that uses the
    narrow record index when available instead of forcing a wide collection
    scan.
  • Internal Parquet read planner: ParquetReadPlanner keeps index-side and
    wide-scan filter state coherent across subset(), where(), len(),
    head(), pixel reads, and TorchGeo entry points.
  • Hugging Face streaming runtime: HFStreamingSource and batch iterators
    now provide a stable metadata path for hf://datasets/... collections
    without routing through the older datasets streaming shutdown path.
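The dual-surface idea can be sketched in plain Python. This is an illustrative model, not the actual Rasteret classes: `DatasetSurfaces` and `metadata_surface` are hypothetical names, though the three URI fields match the descriptor fields above. The point is that a metadata preview like `Collection.head()` can be answered from the narrow index surface when one exists, instead of forcing a wide collection scan.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetSurfaces:
    """Illustrative descriptor: three optional surfaces per catalog entry."""
    record_table_uri: Optional[str] = None   # build-time record table
    index_uri: Optional[str] = None          # narrow record index
    collection_uri: Optional[str] = None     # published wide, read-ready surface

def metadata_surface(s: DatasetSurfaces) -> str:
    """Prefer the narrow index for metadata previews (e.g. head()),
    falling back to the wide collection surface."""
    if s.index_uri:
        return s.index_uri
    if s.collection_uri:
        return s.collection_uri
    raise ValueError("descriptor has no readable surface")

surfaces = DatasetSurfaces(
    index_uri="s3://bucket/aef/index.parquet",
    collection_uri="s3://bucket/aef/collection.parquet",
)
print(metadata_surface(surfaces))  # index wins when both are present
```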

Changed

  • Index-first runtime filtering: when a collection has both a narrow index
    and a wide read-ready surface, Rasteret now plans metadata filters against
    the index first, then carries compatible predicates into the wide scan.
  • GeoParquet bbox contract: bbox filtering now targets the canonical
    bbox struct (xmin, ymin, xmax, ymax) instead of older scalar bbox
    columns. Catalog tests, CLI fixtures, and filter tests were updated to
    match the GeoParquet 1.1 shape.
  • TorchGeo filter propagation: to_torchgeo_dataset() now respects
    collection filters and geometry / bbox narrowing before chip sampling
    starts, reducing unnecessary raster candidates for gridded reads.
  • Local Parquet metadata reuse: local datasets prefer a _metadata
    sidecar when present, and Rasteret now keeps an in-process Parquet dataset
    cache to reduce repeated footer/schema setup during interactive sessions.
  • Catalog surface reporting: CLI output and descriptor helpers now surface
    record-table, index, and collection paths explicitly instead of collapsing
    everything into geoparquet_uri.
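The bbox-struct contract can be illustrated with a pure-Python predicate. This is a hedged sketch of the GeoParquet 1.1 shape only (the real filtering runs as Arrow expressions inside the planner, and `bbox_intersects` is a hypothetical helper): each row carries a `bbox` struct with `xmin`/`ymin`/`xmax`/`ymax` fields, and a query bbox keeps rows whose struct overlaps it.

```python
def bbox_intersects(row_bbox: dict, query: tuple) -> bool:
    """GeoParquet 1.1-style predicate: row_bbox is a struct
    {xmin, ymin, xmax, ymax}; query is (xmin, ymin, xmax, ymax)."""
    qxmin, qymin, qxmax, qymax = query
    return (row_bbox["xmin"] <= qxmax and row_bbox["xmax"] >= qxmin
            and row_bbox["ymin"] <= qymax and row_bbox["ymax"] >= qymin)

rows = [
    {"id": "a", "bbox": {"xmin": -123.0, "ymin": 37.0, "xmax": -122.0, "ymax": 38.0}},
    {"id": "b", "bbox": {"xmin": 10.0, "ymin": 45.0, "xmax": 11.0, "ymax": 46.0}},
]
query = (-122.5, 37.7, -122.3, 37.9)
hits = [r["id"] for r in rows if bbox_intersects(r["bbox"], query)]
print(hits)  # only the first row overlaps the query bbox
```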

Fixed

  • Hugging Face filter execution: Arrow expression handling on HF-backed
    metadata batches now fails more clearly and avoids the unstable runtime path
    that could crash during shutdown or long scans.
  • AEF runtime aliasing: the built-in AEF descriptor now points at the
    published Terrafloww collection/index surfaces with explicit field-role and
    filter-capability hints.
  • Read-path correctness on filtered collections: len(), head(),
    sample_points(), TorchGeo reads, and other collection-backed reads now
    stay aligned when filters are staged across both record-index and wide-data
    surfaces.

Tested

  • Expanded test_collection_filters for bbox-struct filtering and TorchGeo
    prefilter behavior.
  • Expanded test_catalog, test_cli, and test_public_api_surface for the
    new descriptor surface roles and runtime load/build paths.
  • Expanded test_huggingface, test_execution, and test_torchgeo_adapter
    for the new HF runtime path, filter propagation, and collection-read
    behavior.

0.3.4

08 Mar 10:17

v0.3.4

New features 🚀

  • HuggingFace integration: rasteret.load("hf://datasets/<org>/<repo>")
    resolves remote Parquet shards via huggingface_hub and loads them as a
    Collection. No local clone or download step needed.
  • AEF v1 Annual Rasteret Collection: prebuilt read-ready collection
    published on HuggingFace
    and Source Cooperative.
    235K tiles, 64-band int8 embeddings, global coverage 2017–2024.
  • AEF similarity search tutorial (notebooks/07_aef_similarity_search):
    end-to-end embedding similarity search using sample_points, get_xarray,
    get_gdf, and TorchGeo GridGeoSampler. Demonstrates DuckDB Arrow-native
    pivot for reference vectors and lonboard GPU-accelerated visualization.
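The mapping from an hf://datasets/&lt;org&gt;/&lt;repo&gt; URI to the repo id that huggingface_hub expects can be sketched with a small parser. This is a hypothetical helper for illustration, not the actual Rasteret internals, and the repo name used below is made up:

```python
from urllib.parse import urlparse

def parse_hf_uri(uri: str) -> str:
    """Turn 'hf://datasets/<org>/<repo>' into the '<org>/<repo>' repo id
    that huggingface_hub expects. Illustrative sketch only."""
    parsed = urlparse(uri)
    if parsed.scheme != "hf":
        raise ValueError(f"not an hf:// URI: {uri}")
    # For hf://datasets/<org>/<repo>, netloc is 'datasets' and
    # path is '/<org>/<repo>'.
    if parsed.netloc != "datasets":
        raise ValueError(f"expected hf://datasets/..., got {uri}")
    repo_id = parsed.path.strip("/")
    if repo_id.count("/") != 1:
        raise ValueError(f"expected hf://datasets/<org>/<repo>, got {uri}")
    return repo_id

print(parse_hf_uri("hf://datasets/some-org/some-repo"))  # some-org/some-repo
```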

Changed 🪛

  • Unified tile engine: RasterAccessor consolidated into a single
    _read_tile() code path shared by get_xarray, get_numpy, get_gdf,
    and TorchGeo __getitem__. Four divergent tile-read implementations
    replaced with one.
  • COGReader simplified: removed duplicate decode paths, tightened the
    fetch → decompress → crop pipeline, reduced code surface by ~200 lines.
  • TorchGeo edge-chip hardening: empty-read validation returns
    nodata-filled tensors for chips outside coverage; positive-overlap
    filtering skips false bbox-only candidates; fallback loop fills chips
    with nodata when all candidate tiles fail instead of crashing the
    DataLoader.
  • Error surfacing: get_gdf and get_numpy warn on partial
    band/geometry/record failures instead of returning silent empty results.
    Point geometry AOIs raise UnsupportedGeometryError pointing to
    sample_points(). Exception chaining (raise ... from e) throughout
    the read pipeline.
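The `raise ... from e` chaining mentioned above preserves the original failure as `__cause__`, so a crash surfaced later (for example in a DataLoader worker) still shows which low-level read actually failed. A minimal sketch with hypothetical error and function names:

```python
class TileReadError(RuntimeError):
    """Illustrative read-pipeline error (not the actual Rasteret type)."""

def read_tile(url: str) -> bytes:
    try:
        # Simulated low-level failure from the fetch layer.
        raise ConnectionError(f"fetch failed for {url}")
    except ConnectionError as e:
        # Chain instead of swallowing: the traceback keeps both errors.
        raise TileReadError(f"could not read tile {url}") from e

try:
    read_tile("https://example.com/tile.tif")
except TileReadError as err:
    print("cause:", err.__cause__)  # the original ConnectionError survives
```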

Fixed 🔨

  • sample_points COG read path: tile metadata validation was skipping
    valid tiles when the source raster had multiple matching records.
  • xr_combine parameter now correctly plumbed through get_xarray().

Tested 🗒️

  • test_torchgeo_error_propagation: edge-chip, empty-read, and fallback
    loop coverage.
  • test_huggingface: URI resolution and table loading mocks.
  • test_public_api_surface: validates all public imports from rasteret.
  • test_public_network_smoke: live AEF reads and point sampling parity
    against rasterio.sample().
  • Extended test_execution with get_gdf error path coverage.

0.3.3

06 Mar 19:40

v0.3.3

Performance 🚀

  • Arrow-batch native point sampling: sample_points() internals now use
    vectorized NumPy gathers and Table.from_batches instead of per-sample
    Python append + pa.concat_tables. Eliminates row materialization in the
    hot loop.
  • COGReader session reuse: read_cog() accepts a shared reader=
    parameter. Point sampling across multiple rasters now reuses a single
    COGReader (and its HTTP/2 connection pool) instead of creating one per
    raster, reducing connection overhead for multi-scene workloads.

Refactored 🧹

  • Dedicated point_sampling module: point sampling ownership moved from
    execution.py to rasteret.core.point_sampling. execution.py stays
    focused on area/chip reads (get_xarray, get_numpy, get_gdf).
  • POINT_SAMPLES_SCHEMA defined in types.py as the single source of
    truth for point sample output columns. Nullable fields explicitly modeled.
  • Point input helpers consolidated: ensure_point_geoarrow,
    candidate_point_indices_for_raster, and related helpers moved into
    geometry.py.
  • Strict tile/source alignment guard in RasterAccessor.sample_points:
    validates that tile metadata matches the source raster before sampling.
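The alignment guard can be sketched as a pre-sampling check. The function and field names below are hypothetical; the idea is simply to refuse to sample when cached tile metadata was built from a different raster than the one being read:

```python
def check_tile_alignment(tile_meta: dict, source_url: str) -> None:
    """Raise if cached tile metadata does not belong to the source
    raster about to be sampled. Illustrative sketch only."""
    if tile_meta.get("source_url") != source_url:
        raise ValueError(
            f"tile metadata from {tile_meta.get('source_url')!r} "
            f"does not match source raster {source_url!r}"
        )

meta = {"source_url": "s3://bucket/a.tif", "tile": (0, 0)}
check_tile_alignment(meta, "s3://bucket/a.tif")  # ok, aligned
try:
    check_tile_alignment(meta, "s3://bucket/b.tif")
except ValueError as e:
    print("rejected:", e)
```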

Changed

  • Landing page (docs/index.md) code example simplified: single PyArrow
    import, cleaner sample_points call, HF benchmark collapsed into
    admonition.

0.3.2

06 Mar 06:03

New Features 🎉

  • Collection.sample_points(): first-class point sampling API returning a
    pyarrow.Table (point_index, record_id, datetime, band, value,
    CRS columns, and metadata fields). Supports match="all" and
    match="latest".
  • New guide: Point Sampling and Masking.

Improvements 🚀

  • Masking controls surfaced at Collection level:
    get_xarray(), get_gdf(), and get_numpy() now accept
    all_touched=... directly.
  • Table-native point sampling inputs:
    Collection.sample_points(...) now accepts Arrow tables and common
    dataframe/relation inputs with x_column/y_column or
    geometry_column (WKB/GeoArrow/shapely point column).
  • Point sampling aligns to rasterio sample() semantics (nearest-pixel
    index math) for deterministic parity on real COGs.
  • Core read internals are more Arrow-native:
    • iterate_rasters() now scans columnar batches (no batch.to_pylist() in
      the core read iterator),
    • infer_data_source() uses filtered scanner.head(1),
    • multi-CRS detection in xarray path streams proj:epsg counts per batch,
    • sample_points() uses vectorized geoarrow.point_coords.
  • Error messaging for non-OGC binary geometry input now explicitly points
    DuckDB users to ST_AsWKB(geom) when needed.
  • Network parity coverage is tighter:
    • AOI windowing now matches rasterio geometry_window() edge semantics
      exactly (fixes the WorldCover 1-pixel mismatch),
    • transient STAC API timeouts are retried during live builds,
    • the AEF south-up TorchGeo oracle path is corrected for manual/explicit
      parity runs via WarpedVRT.
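The rasterio sample() parity mentioned above comes down to nearest-pixel index math on the geotransform. A pure-Python sketch for a north-up raster, using floor-based indexing like rasterio's default (`op=math.floor` in `DatasetReader.index`); the coordinates below are made-up UTM values:

```python
import math

def pixel_index(x: float, y: float, x0: float, y0: float,
                xres: float, yres: float) -> tuple[int, int]:
    """Map a world coordinate to (row, col) for a north-up raster whose
    top-left corner is (x0, y0). Floor-based, matching rasterio index()."""
    col = math.floor((x - x0) / xres)
    row = math.floor((y0 - y) / yres)
    return row, col

# 10 m pixels, top-left corner at (600000, 4200000), hypothetical values.
print(pixel_index(600005.0, 4199985.0, 600000.0, 4200000.0, 10.0, 10.0))
# The point lands 5 m right and 15 m down from the corner: row 1, col 0.
```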

Additions

  • sedonadb>=0.2.0 added to the examples extra for Arrow-native point
    workflow examples.

0.3.1

27 Feb 12:18

Rasteret 0.3.1

Made to beat cold starts. Index-first access to cloud-native GeoTIFF collections for ML and geospatial analysis.

pip install rasteret==0.3.1

Docs: terrafloww.github.io/rasteret | Discord: discord.gg/V5vvuEBc


What Rasteret does

Every cold start re-parses satellite image metadata over HTTP — per scene, per band. Rasteret parses those headers once, caches them in Parquet, and its own reader fetches pixels concurrently with no GDAL in the path. Up to 20x faster on cold starts.

Because the index is Parquet, it's also the table you work with — filter, join, enrich, and query with standard tools before you ever fetch a pixel.

import rasteret

collection = rasteret.build("earthsearch/sentinel-2-l2a",
                            name="s2", bbox=(...), date_range=(...))

sub = collection.subset(cloud_cover_lt=20, date_range=("2024-03-01", "2024-06-01"))
arr = sub.get_numpy(geometries=(-122.5, 37.7, -122.3, 37.9), bands=["B04", "B08"])

What's new in 0.3.1

as_collection() — lightweight re-entry from Arrow tables

Rasteret is designed so that you are free to enrich your dataset with whatever tools you know (DuckDB, Polars, PyArrow) and come back to Rasteret when you need pixels. as_collection() validates that the required columns and COG metadata are intact and wraps the table — no normalization, no COG parsing, no disk I/O.

base = rasteret.build("earthsearch/sentinel-2-l2a", name="s2", ...)
table = base.dataset.to_table()
table = table.append_column("split", splits)
table = table.append_column("label", labels)

collection = rasteret.as_collection(table, name="enriched", data_source=base.data_source)
collection.subset(split="train").get_numpy(geometries=..., bands=["B04", "B08"])

Three entry points, clear separation:

Function                                          When to use                                      Rebuilds?  Persists?
build() / build_from_stac() / build_from_table()  First-time ingest from STAC or external Parquet  Yes        Yes
load()                                            Reopen an existing Collection from disk          No         No
as_collection()                                   Re-wrap a read-ready Arrow table/dataset         No         No

What shipped in 0.3.0

This is a full revamp. Highlights:

License

AGPL-3.0 → Apache-2.0.

Dataset catalog

12 built-in datasets across Earth Search, Planetary Computer, and AlphaEarth Foundation. One line to build:

$ rasteret datasets list
earthsearch/sentinel-2-l2a    Sentinel-2 Level-2A                global         none
earthsearch/landsat-c2-l2     Landsat Collection 2 Level-2       global         required
earthsearch/naip              NAIP                               north-america  required
earthsearch/cop-dem-glo-30    Copernicus DEM 30m                 global         none
earthsearch/cop-dem-glo-90    Copernicus DEM 90m                 global         none
pc/sentinel-2-l2a             Sentinel-2 (Planetary Computer)    global         required
pc/io-lulc-annual-v02         ESRI 10m Land Use/Land Cover       global         required
pc/alos-dem                   ALOS World 3D 30m DEM              global         required
pc/nasadem                    NASADEM                            global         required
pc/esa-worldcover             ESA WorldCover                     global         required
pc/usda-cdl                   USDA Cropland Data Layer           conus          required
aef/v1-annual                 AlphaEarth Foundation Embeddings   global         none

The catalog is open and community-driven. Each entry is ~20 lines of Python. One PR adds a dataset; every user gets access on the next release.

Four output paths

All from the same Collection, all sharing the same async tile I/O:

ds  = collection.get_xarray(geometries=bbox, bands=["B04", "B08"])     # xarray.Dataset
arr = collection.get_numpy(geometries=bbox, bands=["B04", "B08"])      # [N, C, H, W]
gdf = collection.get_gdf(geometries=bbox, bands=["B04", "B08"])        # GeoDataFrame
dataset = collection.to_torchgeo_dataset(bands=["B04", "B03", "B02"])  # TorchGeo GeoDataset

TorchGeo adapter

to_torchgeo_dataset() returns a standard GeoDataset backed by Rasteret's async COG reader. Supports time series ([T, C, H, W]), per-sample labels via label_field, cross-CRS reprojection, mixed-resolution bands. Works with all TorchGeo samplers, IntersectionDataset, UnionDataset.

Your dataset is a table

Filter by cloud cover, date, bbox, split. Join with labels or AOI polygons. Query with DuckDB. Add train/val/test splits as columns. The Collection is the dataset.

filtered = collection.subset(cloud_cover_lt=15, date_range=("2024-03-01", "2024-06-01"))
train = collection.subset(split="train")
clear = collection.where(ds.field("eo:cloud_cover") < 5.0)

Multi-cloud

S3, Azure Blob, GCS via Obstore. Requester-pays, signed URLs, credential providers all supported.

build_from_table()

Build a Collection from any Parquet with COG URLs — Source Cooperative exports, STAC GeoParquet, custom catalogs. No STAC API required.

Native dtype preservation

uint16 stays uint16 in tensors. No silent float32 promotion.

On-the-fly Major TOM dataset creation example

Rebuild Major TOM-style patch-grid semantics on the fly from source Sentinel-2 COGs instead of storing image bytes in Parquet. 3.9–6.5x faster than HF datasets Parquet-filter reads.


Benchmarks

Cold-start comparison with TorchGeo

Same AOIs, same scenes, same sampler, same DataLoader. No HTTP cache, no OS page cache.

Scenario                        rasterio/GDAL   Rasteret   Ratio
Single AOI, 15 scenes           9.08 s          1.14 s     8x
Multi-AOI, 30 scenes            42.05 s         2.25 s     19x
Cross-CRS boundary, 12 scenes   12.47 s         0.59 s     21x

HF datasets baseline (Major TOM keyed patches)

Patches   HF datasets Parquet filters   Rasteret index+COG   Speedup
120       46.83 s                       12.09 s              3.88x
1000      771.59 s                      118.69 s             6.50x

Requirements

  • Python 3.12+
  • rasterio>=1.4.3,<1.5.0 (geometry masking, CRS reprojection; not in the tile-read path)

Install

pip install rasteret==0.3.1

# Extras
pip install "rasteret[xarray]"       # + xarray output
pip install "rasteret[torchgeo]"     # + TorchGeo for ML pipelines
pip install "rasteret[aws]"          # + requester-pays buckets
pip install "rasteret[azure]"        # + Planetary Computer signed URLs

Rasteret is Apache-2.0, fully open source. All contributions welcome.

Docs · GitHub · Discord

v0.2.1

22 Sep 06:44

What's Changed

  • Move to the uv package manager, plus CI fixes, by @satishjasthi in #15
  • Patch fix for AWS S3 URL signing by @print-sid8 in a121a2f
  • Rewrote the example code and README for clarity, and added missing
    imports in README code blocks (reported by @krishnaglodha)

Full Changelog: v0.1.20...v0.2.1

v0.1.20 - performance improvements

20 Aug 08:56

Release v0.1.20 - Stable Performance Release

Performance Improvements:

  • Migrated from httpx to aiohttp for HTTP requests
  • 5x faster rasteret collection creation
  • 2x faster COG data querying

Stability:

The library is now considered stable for its current feature set; API changes may still occur in the coming weeks.

uv pip install rasteret will install 0.1.20. The automated PyPI pipeline is currently broken, so this version was pushed manually.

This release addresses issue #6 with the migration to aiohttp, delivering significant performance gains for both collection indexing and COG data access operations.