Releases: terrafloww/rasteret

0.3.5

13 Mar 05:05

Changelog

v0.3.5

Added

  • Descriptor-backed dual-surface planning: catalog entries can now
    describe separate record-table, index, and collection surfaces via
    record_table_uri, index_uri, and collection_uri. This lets Rasteret
    keep build-time metadata sources separate from published runtime
    collections.
  • Collection.head(): first-class metadata preview API that uses the
    narrow record index when available instead of forcing a wide collection
    scan.
  • Internal Parquet read planner: ParquetReadPlanner keeps index-side and
    wide-scan filter state coherent across subset(), where(), len(),
    head(), pixel reads, and TorchGeo entry points.
  • Hugging Face streaming runtime: HFStreamingSource and batch iterators
    now provide a stable metadata path for hf://datasets/... collections
    without routing through the older datasets streaming shutdown path.
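The dual-surface idea can be sketched in plain Python. This is an illustrative model, not the actual Rasteret classes: `DatasetSurfaces` and `metadata_surface` are hypothetical names, though the three URI fields match the descriptor fields above. The point is that a metadata preview like `Collection.head()` can be answered from the narrow index surface when one exists, instead of forcing a wide collection scan.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetSurfaces:
    """Illustrative descriptor: three optional surfaces per catalog entry."""
    record_table_uri: Optional[str] = None   # build-time record table
    index_uri: Optional[str] = None          # narrow record index
    collection_uri: Optional[str] = None     # published wide, read-ready surface

def metadata_surface(s: DatasetSurfaces) -> str:
    """Prefer the narrow index for metadata previews (e.g. head()),
    falling back to the wide collection surface."""
    if s.index_uri:
        return s.index_uri
    if s.collection_uri:
        return s.collection_uri
    raise ValueError("descriptor has no readable surface")

surfaces = DatasetSurfaces(
    index_uri="s3://bucket/aef/index.parquet",
    collection_uri="s3://bucket/aef/collection.parquet",
)
print(metadata_surface(surfaces))  # index wins when both are present
```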

Changed

  • Index-first runtime filtering: when a collection has both a narrow index
    and a wide read-ready surface, Rasteret now plans metadata filters against
    the index first, then carries compatible predicates into the wide scan.
  • GeoParquet bbox contract: bbox filtering now targets the canonical
    bbox struct (xmin, ymin, xmax, ymax) instead of older scalar bbox
    columns. Catalog tests, CLI fixtures, and filter tests were updated to
    match the GeoParquet 1.1 shape.
  • TorchGeo filter propagation: to_torchgeo_dataset() now respects
    collection filters and geometry / bbox narrowing before chip sampling
    starts, reducing unnecessary raster candidates for gridded reads.
  • Local Parquet metadata reuse: local datasets prefer a _metadata
    sidecar when present, and Rasteret now keeps an in-process Parquet dataset
    cache to reduce repeated footer/schema setup during interactive sessions.
  • Catalog surface reporting: CLI output and descriptor helpers now surface
    record-table, index, and collection paths explicitly instead of collapsing
    everything into geoparquet_uri.
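The bbox-struct contract can be illustrated with a pure-Python predicate. This is a hedged sketch of the GeoParquet 1.1 shape only (the real filtering runs as Arrow expressions inside the planner, and `bbox_intersects` is a hypothetical helper): each row carries a `bbox` struct with `xmin`/`ymin`/`xmax`/`ymax` fields, and a query bbox keeps rows whose struct overlaps it.

```python
def bbox_intersects(row_bbox: dict, query: tuple) -> bool:
    """GeoParquet 1.1-style predicate: row_bbox is a struct
    {xmin, ymin, xmax, ymax}; query is (xmin, ymin, xmax, ymax)."""
    qxmin, qymin, qxmax, qymax = query
    return (row_bbox["xmin"] <= qxmax and row_bbox["xmax"] >= qxmin
            and row_bbox["ymin"] <= qymax and row_bbox["ymax"] >= qymin)

rows = [
    {"id": "a", "bbox": {"xmin": -123.0, "ymin": 37.0, "xmax": -122.0, "ymax": 38.0}},
    {"id": "b", "bbox": {"xmin": 10.0, "ymin": 45.0, "xmax": 11.0, "ymax": 46.0}},
]
query = (-122.5, 37.7, -122.3, 37.9)
hits = [r["id"] for r in rows if bbox_intersects(r["bbox"], query)]
print(hits)  # only the first row overlaps the query bbox
```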

Fixed

  • Hugging Face filter execution: Arrow expression handling on HF-backed
    metadata batches now fails more clearly and avoids the unstable runtime path
    that could crash during shutdown or long scans.
  • AEF runtime aliasing: the built-in AEF descriptor now points at the
    published Terrafloww collection/index surfaces with explicit field-role and
    filter-capability hints.
  • Read-path correctness on filtered collections: len(), head(),
    sample_points(), TorchGeo reads, and other collection-backed reads now
    stay aligned when filters are staged across both record-index and wide-data
    surfaces.

Tested

  • Expanded test_collection_filters for bbox-struct filtering and TorchGeo
    prefilter behavior.
  • Expanded test_catalog, test_cli, and test_public_api_surface for the
    new descriptor surface roles and runtime load/build paths.
  • Expanded test_huggingface, test_execution, and test_torchgeo_adapter
    for the new HF runtime path, filter propagation, and collection-read
    behavior.

0.3.4

08 Mar 10:17

v0.3.4

New features 🚀

  • HuggingFace integration: rasteret.load("hf://datasets/<org>/<repo>")
    resolves remote Parquet shards via huggingface_hub and loads them as a
    Collection. No local clone or download step needed.
  • AEF v1 Annual Rasteret Collection: prebuilt read-ready collection
    published on HuggingFace
    and Source Cooperative.
    235K tiles, 64-band int8 embeddings, global coverage 2017–2024.
  • AEF similarity search tutorial (notebooks/07_aef_similarity_search):
    end-to-end embedding similarity search using sample_points, get_xarray,
    get_gdf, and TorchGeo GridGeoSampler. Demonstrates DuckDB Arrow-native
    pivot for reference vectors and lonboard GPU-accelerated visualization.
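The mapping from an hf://datasets/&lt;org&gt;/&lt;repo&gt; URI to the repo id that huggingface_hub expects can be sketched with a small parser. This is a hypothetical helper for illustration, not the actual Rasteret internals, and the repo name used below is made up:

```python
from urllib.parse import urlparse

def parse_hf_uri(uri: str) -> str:
    """Turn 'hf://datasets/<org>/<repo>' into the '<org>/<repo>' repo id
    that huggingface_hub expects. Illustrative sketch only."""
    parsed = urlparse(uri)
    if parsed.scheme != "hf":
        raise ValueError(f"not an hf:// URI: {uri}")
    # For hf://datasets/<org>/<repo>, netloc is 'datasets' and
    # path is '/<org>/<repo>'.
    if parsed.netloc != "datasets":
        raise ValueError(f"expected hf://datasets/..., got {uri}")
    repo_id = parsed.path.strip("/")
    if repo_id.count("/") != 1:
        raise ValueError(f"expected hf://datasets/<org>/<repo>, got {uri}")
    return repo_id

print(parse_hf_uri("hf://datasets/some-org/some-repo"))  # some-org/some-repo
```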

Changed 🪛

  • Unified tile engine: RasterAccessor consolidated into a single
    _read_tile() code path shared by get_xarray, get_numpy, get_gdf,
    and TorchGeo __getitem__. Four divergent tile-read implementations
    replaced with one.
  • COGReader simplified: removed duplicate decode paths, tightened the
    fetch → decompress → crop pipeline, reduced code surface by ~200 lines.
  • TorchGeo edge-chip hardening: empty-read validation returns
    nodata-filled tensors for chips outside coverage; positive-overlap
    filtering skips false bbox-only candidates; fallback loop fills chips
    with nodata when all candidate tiles fail instead of crashing the
    DataLoader.
  • Error surfacing: get_gdf and get_numpy warn on partial
    band/geometry/record failures instead of returning silent empty results.
    Point geometry AOIs raise UnsupportedGeometryError pointing to
    sample_points(). Exception chaining (raise ... from e) throughout
    the read pipeline.
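The `raise ... from e` chaining mentioned above preserves the original failure as `__cause__`, so a crash surfaced later (for example in a DataLoader worker) still shows which low-level read actually failed. A minimal sketch with hypothetical error and function names:

```python
class TileReadError(RuntimeError):
    """Illustrative read-pipeline error (not the actual Rasteret type)."""

def read_tile(url: str) -> bytes:
    try:
        # Simulated low-level failure from the fetch layer.
        raise ConnectionError(f"fetch failed for {url}")
    except ConnectionError as e:
        # Chain instead of swallowing: the traceback keeps both errors.
        raise TileReadError(f"could not read tile {url}") from e

try:
    read_tile("https://example.com/tile.tif")
except TileReadError as err:
    print("cause:", err.__cause__)  # the original ConnectionError survives
```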

Fixed 🔨

  • sample_points COG read path: tile metadata validation was skipping
    valid tiles when the source raster had multiple matching records.
  • xr_combine parameter now correctly plumbed through get_xarray().

Tested 🗒️

  • test_torchgeo_error_propagation: edge-chip, empty-read, and fallback
    loop coverage.
  • test_huggingface: URI resolution and table loading mocks.
  • test_public_api_surface: validates all public imports from rasteret.
  • test_public_network_smoke: live AEF reads and point sampling parity
    against rasterio.sample().
  • Extended test_execution with get_gdf error path coverage.

0.3.3

06 Mar 19:40

v0.3.3

Performance 🚀

  • Arrow-batch native point sampling: sample_points() internals now use
    vectorized NumPy gathers and Table.from_batches instead of per-sample
    Python append + pa.concat_tables. Eliminates row materialization in the
    hot loop.
  • COGReader session reuse: read_cog() accepts a shared reader=
    parameter. Point sampling across multiple rasters now reuses a single
    COGReader (and its HTTP/2 connection pool) instead of creating one per
    raster, reducing connection overhead for multi-scene workloads.

Refactored 🧹

  • Dedicated point_sampling module: point sampling ownership moved from
    execution.py to rasteret.core.point_sampling. execution.py stays
    focused on area/chip reads (get_xarray, get_numpy, get_gdf).
  • POINT_SAMPLES_SCHEMA defined in types.py as the single source of
    truth for point sample output columns. Nullable fields explicitly modeled.
  • Point input helpers consolidated: ensure_point_geoarrow,
    candidate_point_indices_for_raster, and related helpers moved into
    geometry.py.
  • Strict tile/source alignment guard in RasterAccessor.sample_points:
    validates that tile metadata matches the source raster before sampling.
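The alignment guard can be sketched as a pre-sampling check. The function and field names below are hypothetical; the idea is simply to refuse to sample when cached tile metadata was built from a different raster than the one being read:

```python
def check_tile_alignment(tile_meta: dict, source_url: str) -> None:
    """Raise if cached tile metadata does not belong to the source
    raster about to be sampled. Illustrative sketch only."""
    if tile_meta.get("source_url") != source_url:
        raise ValueError(
            f"tile metadata from {tile_meta.get('source_url')!r} "
            f"does not match source raster {source_url!r}"
        )

meta = {"source_url": "s3://bucket/a.tif", "tile": (0, 0)}
check_tile_alignment(meta, "s3://bucket/a.tif")  # ok, aligned
try:
    check_tile_alignment(meta, "s3://bucket/b.tif")
except ValueError as e:
    print("rejected:", e)
```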

Changed

  • Landing page (docs/index.md) code example simplified: single PyArrow
    import, cleaner sample_points call, HF benchmark collapsed into
    admonition.

0.3.2

06 Mar 06:03

New Features 🎉

  • Collection.sample_points(): first-class point sampling API returning a
    pyarrow.Table (point_index, record_id, datetime, band, value,
    CRS columns, and metadata fields). Supports match="all" and
    match="latest".
  • New guide: Point Sampling and Masking.

Improvements 🚀

  • Masking controls surfaced at Collection level:
    get_xarray(), get_gdf(), and get_numpy() now accept
    all_touched=... directly.
  • Table-native point sampling inputs:
    Collection.sample_points(...) now accepts Arrow tables and common
    dataframe/relation inputs with x_column/y_column or
    geometry_column (WKB/GeoArrow/shapely point column).
  • Point sampling aligns to rasterio sample() semantics (nearest-pixel
    index math) for deterministic parity on real COGs.
  • Core read internals are more Arrow-native:
    • iterate_rasters() now scans columnar batches (no batch.to_pylist() in
      the core read iterator),
    • infer_data_source() uses filtered scanner.head(1),
    • multi-CRS detection in xarray path streams proj:epsg counts per batch,
    • sample_points() uses vectorized geoarrow.point_coords.
  • Error messaging for non-OGC binary geometry input now explicitly points
    DuckDB users to ST_AsWKB(geom) when needed.
  • Network parity coverage is tighter:
    • AOI windowing now matches rasterio geometry_window() edge semantics
      exactly (fixes the WorldCover 1-pixel mismatch),
    • transient STAC API timeouts are retried during live builds,
    • the AEF south-up TorchGeo oracle path is corrected for manual/explicit
      parity runs via WarpedVRT.
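The rasterio sample() parity mentioned above comes down to nearest-pixel index math on the geotransform. A pure-Python sketch for a north-up raster, using floor-based indexing like rasterio's default (`op=math.floor` in `DatasetReader.index`); the coordinates below are made-up UTM values:

```python
import math

def pixel_index(x: float, y: float, x0: float, y0: float,
                xres: float, yres: float) -> tuple[int, int]:
    """Map a world coordinate to (row, col) for a north-up raster whose
    top-left corner is (x0, y0). Floor-based, matching rasterio index()."""
    col = math.floor((x - x0) / xres)
    row = math.floor((y0 - y) / yres)
    return row, col

# 10 m pixels, top-left corner at (600000, 4200000), hypothetical values.
print(pixel_index(600005.0, 4199985.0, 600000.0, 4200000.0, 10.0, 10.0))
# The point lands 5 m right and 15 m down from the corner: row 1, col 0.
```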

Additions

  • sedonadb>=0.2.0 added to the examples extra for Arrow-native point
    workflow examples.

0.3.1

27 Feb 12:18

Rasteret 0.3.1

Made to beat cold starts. Index-first access to cloud-native GeoTIFF collections for ML and geospatial analysis.

pip install rasteret==0.3.1

Docs: terrafloww.github.io/rasteret | Discord: discord.gg/V5vvuEBc


What Rasteret does

Every cold start re-parses satellite image metadata over HTTP — per scene, per band. Rasteret parses those headers once, caches them in Parquet, and its own reader fetches pixels concurrently with no GDAL in the path. Up to 20x faster on cold starts.

Because the index is Parquet, it's also the table you work with — filter, join, enrich, and query with standard tools before you ever fetch a pixel.

import rasteret

collection = rasteret.build("earthsearch/sentinel-2-l2a",
                            name="s2", bbox=(...), date_range=(...))

sub = collection.subset(cloud_cover_lt=20, date_range=("2024-03-01", "2024-06-01"))
arr = sub.get_numpy(geometries=(-122.5, 37.7, -122.3, 37.9), bands=["B04", "B08"])

What's new in 0.3.1

as_collection() — lightweight re-entry from Arrow tables

Rasteret is designed so that you are free to enrich your dataset with whatever tools you know (DuckDB, Polars, PyArrow) and come back to Rasteret when you need pixels. as_collection() validates that the required columns and COG metadata are intact and wraps the table — no normalization, no COG parsing, no disk I/O.

base = rasteret.build("earthsearch/sentinel-2-l2a", name="s2", ...)
table = base.dataset.to_table()
table = table.append_column("split", splits)
table = table.append_column("label", labels)

collection = rasteret.as_collection(table, name="enriched", data_source=base.data_source)
collection.subset(split="train").get_numpy(geometries=..., bands=["B04", "B08"])

Three entry points, clear separation:

Function                                          When to use                                      Rebuilds?  Persists?
build() / build_from_stac() / build_from_table()  First-time ingest from STAC or external Parquet  Yes        Yes
load()                                            Reopen an existing Collection from disk          No         No
as_collection()                                   Re-wrap a read-ready Arrow table/dataset         No         No

What shipped in 0.3.0

This is a full revamp. Highlights:

License

AGPL-3.0 → Apache-2.0.

Dataset catalog

12 built-in datasets across Earth Search, Planetary Computer, and AlphaEarth Foundation. One line to build:

$ rasteret datasets list
earthsearch/sentinel-2-l2a    Sentinel-2 Level-2A                global         none
earthsearch/landsat-c2-l2     Landsat Collection 2 Level-2       global         required
earthsearch/naip              NAIP                               north-america  required
earthsearch/cop-dem-glo-30    Copernicus DEM 30m                 global         none
earthsearch/cop-dem-glo-90    Copernicus DEM 90m                 global         none
pc/sentinel-2-l2a             Sentinel-2 (Planetary Computer)    global         required
pc/io-lulc-annual-v02         ESRI 10m Land Use/Land Cover       global         required
pc/alos-dem                   ALOS World 3D 30m DEM              global         required
pc/nasadem                    NASADEM                            global         required
pc/esa-worldcover             ESA WorldCover                     global         required
pc/usda-cdl                   USDA Cropland Data Layer           conus          required
aef/v1-annual                 AlphaEarth Foundation Embeddings   global         none

The catalog is open and community-driven. Each entry is ~20 lines of Python. One PR adds a dataset; every user gets access on the next release.

Four output paths

All from the same Collection, all sharing the same async tile I/O:

ds  = collection.get_xarray(geometries=bbox, bands=["B04", "B08"])     # xarray.Dataset
arr = collection.get_numpy(geometries=bbox, bands=["B04", "B08"])      # [N, C, H, W]
gdf = collection.get_gdf(geometries=bbox, bands=["B04", "B08"])        # GeoDataFrame
dataset = collection.to_torchgeo_dataset(bands=["B04", "B03", "B02"])  # TorchGeo GeoDataset

TorchGeo adapter

to_torchgeo_dataset() returns a standard GeoDataset backed by Rasteret's async COG reader. Supports time series ([T, C, H, W]), per-sample labels via label_field, cross-CRS reprojection, mixed-resolution bands. Works with all TorchGeo samplers, IntersectionDataset, UnionDataset.

Your dataset is a table

Filter by cloud cover, date, bbox, split. Join with labels or AOI polygons. Query with DuckDB. Add train/val/test splits as columns. The Collection is the dataset.

filtered = collection.subset(cloud_cover_lt=15, date_range=("2024-03-01", "2024-06-01"))
train = collection.subset(split="train")
clear = collection.where(ds.field("eo:cloud_cover") < 5.0)

Multi-cloud

S3, Azure Blob, GCS via Obstore. Requester-pays, signed URLs, credential providers all supported.

build_from_table()

Build a Collection from any Parquet with COG URLs — Source Cooperative exports, STAC GeoParquet, custom catalogs. No STAC API required.

Native dtype preservation

uint16 stays uint16 in tensors. No silent float32 promotion.

On-the-fly Major TOM dataset creation example

Rebuild Major TOM-style patch-grid semantics on the fly from source Sentinel-2 COGs instead of storing image bytes in Parquet. 3.9–6.5x faster than HF datasets Parquet-filter reads.


Benchmarks

Cold-start comparison with TorchGeo

Same AOIs, same scenes, same sampler, same DataLoader. No HTTP cache, no OS page cache.

Scenario                        rasterio/GDAL   Rasteret   Ratio
Single AOI, 15 scenes           9.08 s          1.14 s     8x
Multi-AOI, 30 scenes            42.05 s         2.25 s     19x
Cross-CRS boundary, 12 scenes   12.47 s         0.59 s     21x

HF datasets baseline (Major TOM keyed patches)

Patches   HF datasets Parquet filters   Rasteret index+COG   Speedup
120       46.83 s                       12.09 s              3.88x
1000      771.59 s                      118.69 s             6.50x

Requirements

  • Python 3.12+
  • rasterio>=1.4.3,<1.5.0 (geometry masking, CRS reprojection; not in the tile-read path)

Install

pip install rasteret==0.3.1

# Extras
pip install "rasteret[xarray]"       # + xarray output
pip install "rasteret[torchgeo]"     # + TorchGeo for ML pipelines
pip install "rasteret[aws]"          # + requester-pays buckets
pip install "rasteret[azure]"        # + Planetary Computer signed URLs

Rasteret is Apache-2.0, fully open source. All contributions welcome.

Docs · GitHub · Discord

v0.2.1

22 Sep 06:44

What's Changed

  • Move to the uv package manager, plus CI fixes, by @satishjasthi in #15
  • Patch fix for AWS S3 URL signing by @print-sid8 in a121a2f
  • Rewrote the example code and README for clarity, and added missing
    imports in README code blocks (reported by @krishnaglodha)

Full Changelog: v0.1.20...v0.2.1

v0.1.20 - performance improvements

20 Aug 08:56

Release v0.1.20 - Stable Performance Release

Performance Improvements:

  • Migrated from httpx to aiohttp for HTTP requests
  • 5x faster rasteret collection creation
  • 2x faster COG data querying

Stability:

The library is now considered stable for its current feature set; API changes may still occur in the coming weeks.

uv pip install rasteret will install 0.1.20. The automated PyPI pipeline is currently broken, so this version was pushed manually.

This release addresses issue #6 with the migration to aiohttp, delivering significant performance gains for both collection indexing and COG data access operations.