Releases: terrafloww/rasteret

v0.3.5
Added
- Descriptor-backed dual-surface planning: catalog entries can now describe separate record-table, index, and collection surfaces via `record_table_uri`, `index_uri`, and `collection_uri`. This lets Rasteret keep build-time metadata sources separate from published runtime collections.
- `Collection.head()`: a first-class metadata preview API that uses the narrow record index when available instead of forcing a wide collection scan.
- Internal Parquet read planner: `ParquetReadPlanner` keeps index-side and wide-scan filter state coherent across `subset()`, `where()`, `len()`, `head()`, pixel reads, and TorchGeo entry points.
- Hugging Face streaming runtime: `HFStreamingSource` and batch iterators now provide a stable metadata path for `hf://datasets/...` collections without routing through the older `datasets` streaming shutdown path.
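As a rough sketch of the dual-surface idea (the data layout and helper below are illustrative, not Rasteret's internals): a cheap narrow index answers metadata previews like `head()`, and the wide read-ready surface is consulted only when a requested column lives there.

```python
# Illustrative sketch only -- not Rasteret's actual planner. A narrow
# index serves previews; the wide surface is joined in as a fallback.

NARROW_INDEX = [  # cheap-to-scan metadata rows
    {"record_id": 1, "datetime": "2024-03-01", "cloud_cover": 3.0},
    {"record_id": 2, "datetime": "2024-04-01", "cloud_cover": 40.0},
]
WIDE_SURFACE = {  # expensive read-ready rows, keyed by record_id
    1: {"asset_href": "s3://bucket/scene1/B04.tif"},
    2: {"asset_href": "s3://bucket/scene2/B04.tif"},
}

def head(n: int, columns: list[str]) -> list[dict]:
    """Serve a preview from the narrow index alone when it has every
    requested column; otherwise join against the wide surface."""
    if all(c in NARROW_INDEX[0] for c in columns):
        rows = NARROW_INDEX[:n]                    # index-only path
    else:
        rows = [{**r, **WIDE_SURFACE[r["record_id"]]}
                for r in NARROW_INDEX[:n]]         # fallback join
    return [{c: r[c] for c in columns} for r in rows]
```

The point of the split is that the index-only branch never touches the wide (and typically remote) surface at all.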
Changed
- Index-first runtime filtering: when a collection has both a narrow index and a wide read-ready surface, Rasteret now plans metadata filters against the index first, then carries compatible predicates into the wide scan.
- GeoParquet bbox contract: bbox filtering now targets the canonical `bbox` struct (`xmin`, `ymin`, `xmax`, `ymax`) instead of older scalar bbox columns. Catalog tests, CLI fixtures, and filter tests were updated to match the GeoParquet 1.1 shape.
- TorchGeo filter propagation: `to_torchgeo_dataset()` now respects collection filters and geometry/bbox narrowing before chip sampling starts, reducing unnecessary raster candidates for gridded reads.
- Local Parquet metadata reuse: local datasets prefer a `_metadata` sidecar when present, and Rasteret now keeps an in-process Parquet dataset cache to reduce repeated footer/schema setup during interactive sessions.
- Catalog surface reporting: CLI output and descriptor helpers now surface record-table, index, and collection paths explicitly instead of collapsing everything into `geoparquet_uri`.
Fixed
- Hugging Face filter execution: Arrow expression handling on HF-backed metadata batches now fails more clearly and avoids the unstable runtime path that could crash during shutdown or long scans.
- AEF runtime aliasing: the built-in AEF descriptor now points at the published Terrafloww collection/index surfaces with explicit field-role and filter-capability hints.
- Read-path correctness on filtered collections: `len()`, `head()`, `sample_points()`, TorchGeo reads, and other collection-backed reads now stay aligned when filters are staged across both record-index and wide-data surfaces.
Tested
- Expanded `test_collection_filters` for bbox-struct filtering and TorchGeo prefilter behavior.
- Expanded `test_catalog`, `test_cli`, and `test_public_api_surface` for the new descriptor surface roles and runtime load/build paths.
- Expanded `test_huggingface`, `test_execution`, and `test_torchgeo_adapter` for the new HF runtime path, filter propagation, and collection-read behavior.
v0.3.4
New features 🚀
- HuggingFace integration: `rasteret.load("hf://datasets/<org>/<repo>")` resolves remote Parquet shards via `huggingface_hub` and loads them as a Collection. No local clone or download step needed.
- AEF v1 Annual Rasteret Collection: prebuilt read-ready collection published on HuggingFace and Source Cooperative. 235K tiles, 64-band int8 embeddings, global coverage 2017–2024.
- AEF similarity search tutorial (`notebooks/07_aef_similarity_search`): end-to-end embedding similarity search using `sample_points`, `get_xarray`, `get_gdf`, and TorchGeo `GridGeoSampler`. Demonstrates a DuckDB Arrow-native pivot for reference vectors and lonboard GPU-accelerated visualization.
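The core step of that tutorial, ranking embedding pixels by similarity to a reference vector, can be sketched with NumPy alone (the data below is synthetic, not AEF embeddings):

```python
import numpy as np

# Synthetic stand-in for AEF-style embeddings: one 64-band vector per pixel.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize

reference = embeddings[0]            # pick one pixel as the reference vector

# After normalization, a dot product is the cosine similarity.
cosine = embeddings @ reference
top5 = np.argsort(-cosine)[:5]       # indices of the most-similar pixels
```

The reference pixel trivially ranks first (cosine ≈ 1.0); the remaining ranks are the nearest neighbors in embedding space.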
Changed 🪛
- Unified tile engine: `RasterAccessor` consolidated into a single `_read_tile()` code path shared by `get_xarray`, `get_numpy`, `get_gdf`, and TorchGeo `__getitem__`. Four divergent tile-read implementations replaced with one.
- COGReader simplified: removed duplicate decode paths, tightened the fetch → decompress → crop pipeline, reduced code surface by ~200 lines.
- TorchGeo edge-chip hardening: empty-read validation returns nodata-filled tensors for chips outside coverage; positive-overlap filtering skips false bbox-only candidates; the fallback loop fills chips with nodata when all candidate tiles fail instead of crashing the DataLoader.
- Error surfacing: `get_gdf` and `get_numpy` warn on partial band/geometry/record failures instead of returning silent empty results. Point geometry AOIs raise `UnsupportedGeometryError` pointing to `sample_points()`. Exception chaining (`raise ... from e`) throughout the read pipeline.
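For readers unfamiliar with the pattern, `raise ... from e` preserves the original cause on the new exception's `__cause__`, so tracebacks show both failures. A minimal illustration (the error class and function are made up for this example, not Rasteret's actual types):

```python
class RasterReadError(RuntimeError):
    """Illustrative error type, not the library's actual class."""

def read_band(href: str) -> bytes:
    try:
        # Simulate a transport failure inside the fetch step.
        raise OSError(f"connection reset while fetching {href}")
    except OSError as e:
        # Chain the cause so the traceback shows the low-level failure
        # underneath the high-level read error.
        raise RasterReadError(f"failed to read band from {href}") from e
```

Catching `RasterReadError` still gives callers access to the underlying `OSError` via `err.__cause__`.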
Fixed 🔨
- `sample_points` COG read path: tile metadata validation was skipping valid tiles when the source raster had multiple matching records.
- `xr_combine` parameter now correctly plumbed through `get_xarray()`.
Tested 🗒️
- `test_torchgeo_error_propagation`: edge-chip, empty-read, and fallback loop coverage.
- `test_huggingface`: URI resolution and table loading mocks.
- `test_public_api_surface`: validates all public imports from `rasteret`.
- `test_public_network_smoke`: live AEF reads and point sampling parity against `rasterio.sample()`.
- Extended `test_execution` with `get_gdf` error path coverage.
v0.3.3
Performance 🚀
- Arrow-batch native point sampling: `sample_points()` internals now use vectorized NumPy gathers and `Table.from_batches` instead of per-sample Python append + `pa.concat_tables`. Eliminates row materialization in the hot loop.
- COGReader session reuse: `read_cog()` accepts a shared `reader=` parameter. Point sampling across multiple rasters now reuses a single `COGReader` (and its HTTP/2 connection pool) instead of creating one per raster, reducing connection overhead for multi-scene workloads.
Refactored 🧹
- Dedicated `point_sampling` module: point sampling ownership moved from `execution.py` to `rasteret.core.point_sampling`. `execution.py` stays focused on area/chip reads (`get_xarray`, `get_numpy`, `get_gdf`).
- `POINT_SAMPLES_SCHEMA` defined in `types.py` as the single source of truth for point sample output columns. Nullable fields explicitly modeled.
- Point input helpers consolidated: `ensure_point_geoarrow`, `candidate_point_indices_for_raster`, and related helpers moved into `geometry.py`.
- Strict tile/source alignment guard in `RasterAccessor.sample_points`: validates that tile metadata matches the source raster before sampling.
Changed
- Landing page (`docs/index.md`) code example simplified: single PyArrow import, cleaner `sample_points` call, HF benchmark collapsed into an admonition.
0.3.2
New Features 🎉
- `Collection.sample_points()`: first-class point sampling API returning a `pyarrow.Table` (`point_index`, `record_id`, `datetime`, `band`, `value`, CRS columns, and metadata fields). Supports `match="all"` and `match="latest"`.
- New guide: Point Sampling and Masking.
Improvements 🚀
- Masking controls surfaced at Collection level: `get_xarray()`, `get_gdf()`, and `get_numpy()` now accept `all_touched=...` directly.
- Table-native point sampling inputs: `Collection.sample_points(...)` now accepts Arrow tables and common dataframe/relation inputs with `x_column`/`y_column` or `geometry_column` (WKB/GeoArrow/shapely point column).
- Point sampling aligns to rasterio `sample()` semantics (nearest-pixel index math) for deterministic parity on real COGs.
- Core read internals are more Arrow-native:
  - `iterate_rasters()` now scans columnar batches (no `batch.to_pylist()` in the core read iterator),
  - `infer_data_source()` uses a filtered `scanner.head(1)`,
  - multi-CRS detection in the xarray path streams `proj:epsg` counts per batch,
  - `sample_points()` uses vectorized `geoarrow.point_coords`.
- Error messaging for non-OGC binary geometry input now explicitly points DuckDB users to `ST_AsWKB(geom)` when needed.
- Network parity coverage is tighter:
  - AOI windowing now matches rasterio `geometry_window()` edge semantics exactly (fixes the WorldCover 1-pixel mismatch),
  - transient STAC API timeouts are retried during live builds,
  - the AEF south-up TorchGeo oracle path is corrected for manual/explicit parity runs via `WarpedVRT`.
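The nearest-pixel index math that gives parity with rasterio's `sample()` can be shown in a few lines (the geotransform values are illustrative; this is not Rasteret's implementation):

```python
import numpy as np

# Illustrative north-up raster: upper-left origin and square pixel size.
x0, y0 = 500_000.0, 4_100_000.0
px = 10.0

# World coordinates of two sample points.
xs = np.array([500_005.0, 500_095.0])
ys = np.array([4_099_995.0, 4_099_905.0])

# Floor to the containing pixel: each point maps to the cell it falls in,
# which matches rasterio sample()'s nearest-pixel behavior for points
# strictly inside a cell.
cols = np.floor((xs - x0) / px).astype(int)
rows = np.floor((y0 - ys) / px).astype(int)
```

Both points land inside cells rather than on edges, so the mapping is unambiguous: the first falls in pixel (0, 0) and the second in pixel (9, 9).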
Additions
- `sedonadb>=0.2.0` added to the `examples` extra for Arrow-native point workflow examples.
Rasteret 0.3.1
Made to beat cold starts. Index-first access to cloud-native GeoTIFF collections for ML and geospatial analysis.

`pip install rasteret==0.3.1`

Docs: terrafloww.github.io/rasteret | Discord: discord.gg/V5vvuEBc
What Rasteret does
Every cold start re-parses satellite image metadata over HTTP — per scene, per band. Rasteret parses those headers once, caches them in Parquet, and its own reader fetches pixels concurrently with no GDAL in the path. Up to 20x faster on cold starts.
Because the index is Parquet, it's also the table you work with — filter, join, enrich, and query with standard tools before you ever fetch a pixel.
```python
import rasteret

collection = rasteret.build("earthsearch/sentinel-2-l2a",
                            name="s2", bbox=(...), date_range=(...))
sub = collection.subset(cloud_cover_lt=20, date_range=("2024-03-01", "2024-06-01"))
arr = sub.get_numpy(geometries=(-122.5, 37.7, -122.3, 37.9), bands=["B04", "B08"])
```

What's new in 0.3.1
as_collection() — lightweight re-entry from Arrow tables
Rasteret's design is that you should be free to enrich your dataset with whatever tools you know (DuckDB, Polars, PyArrow) and come back to Rasteret when you need pixels. as_collection() validates that the required columns and COG metadata are intact and wraps the table — no normalization, no COG parsing, no disk I/O.
```python
base = rasteret.build("earthsearch/sentinel-2-l2a", name="s2", ...)
table = base.dataset.to_table()
table = table.append_column("split", splits)
table = table.append_column("label", labels)
collection = rasteret.as_collection(table, name="enriched", data_source=base.data_source)
collection.subset(split="train").get_numpy(geometries=..., bands=["B04", "B08"])
```

Three entry points, clear separation:

| Function | When to use | Rebuilds? | Persists? |
|---|---|---|---|
| `build()` / `build_from_stac()` / `build_from_table()` | First-time ingest from STAC or external Parquet | Yes | Yes |
| `load()` | Reopen an existing Collection from disk | No | No |
| `as_collection()` | Re-wrap a read-ready Arrow table/dataset | No | No |
What shipped in 0.3.0
This is a full revamp. Highlights:
License
AGPL-3.0 → Apache-2.0.
Dataset catalog
12 built-in datasets across Earth Search, Planetary Computer, and AlphaEarth Foundation. One line to build:
$ rasteret datasets list
earthsearch/sentinel-2-l2a Sentinel-2 Level-2A global none
earthsearch/landsat-c2-l2 Landsat Collection 2 Level-2 global required
earthsearch/naip NAIP north-america required
earthsearch/cop-dem-glo-30 Copernicus DEM 30m global none
earthsearch/cop-dem-glo-90 Copernicus DEM 90m global none
pc/sentinel-2-l2a Sentinel-2 (Planetary Computer) global required
pc/io-lulc-annual-v02 ESRI 10m Land Use/Land Cover global required
pc/alos-dem ALOS World 3D 30m DEM global required
pc/nasadem NASADEM global required
pc/esa-worldcover ESA WorldCover global required
pc/usda-cdl USDA Cropland Data Layer conus required
aef/v1-annual AlphaEarth Foundation Embeddings global none
The catalog is open and community-driven. Each entry is ~20 lines of Python. One PR adds a dataset, every user gets access on the next release.
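As a rough illustration of what "~20 lines of Python" per entry might look like, here is a hypothetical catalog-entry shape. The field names, class, and structure below are invented for this sketch and are not Rasteret's actual descriptor schema; only the Earth Search endpoint and dataset key are real:

```python
from dataclasses import dataclass

# Hypothetical catalog-entry shape -- field names are illustrative,
# not Rasteret's real descriptor schema.
@dataclass
class DatasetEntry:
    key: str        # e.g. "earthsearch/sentinel-2-l2a"
    title: str
    coverage: str   # "global", "conus", "north-america", ...
    auth: str       # "none" or "required"
    stac_api: str   # STAC endpoint the builder queries

SENTINEL2 = DatasetEntry(
    key="earthsearch/sentinel-2-l2a",
    title="Sentinel-2 Level-2A",
    coverage="global",
    auth="none",
    stac_api="https://earth-search.aws.element84.com/v1",
)
```

A community PR adding a dataset would amount to one such declarative entry plus its registration.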
Four output paths
All from the same Collection, all sharing the same async tile I/O:
```python
ds = collection.get_xarray(geometries=bbox, bands=["B04", "B08"])       # xarray.Dataset
arr = collection.get_numpy(geometries=bbox, bands=["B04", "B08"])       # [N, C, H, W]
gdf = collection.get_gdf(geometries=bbox, bands=["B04", "B08"])         # GeoDataFrame
dataset = collection.to_torchgeo_dataset(bands=["B04", "B03", "B02"])   # TorchGeo GeoDataset
```

TorchGeo adapter
to_torchgeo_dataset() returns a standard GeoDataset backed by Rasteret's async COG reader. Supports time series ([T, C, H, W]), per-sample labels via label_field, cross-CRS reprojection, and mixed-resolution bands. Works with all TorchGeo samplers, IntersectionDataset, and UnionDataset.
Your dataset is a table
Filter by cloud cover, date, bbox, split. Join with labels or AOI polygons. Query with DuckDB. Add train/val/test splits as columns. The Collection is the dataset.
```python
filtered = collection.subset(cloud_cover_lt=15, date_range=("2024-03-01", "2024-06-01"))
train = collection.subset(split="train")
clear = collection.where(ds.field("eo:cloud_cover") < 5.0)
```

Multi-cloud
S3, Azure Blob, GCS via Obstore. Requester-pays, signed URLs, credential providers all supported.
build_from_table()
Build a Collection from any Parquet with COG URLs — Source Cooperative exports, STAC GeoParquet, custom catalogs. No STAC API required.
Native dtype preservation
uint16 stays uint16 in tensors. No silent float32 promotion.
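The memory cost of silent promotion is easy to see with a synthetic band (the shape here is arbitrary, not a real Sentinel-2 tile size):

```python
import numpy as np

# A uint16 band kept at its native dtype is half the size of the same
# data silently promoted to float32, with no information gained.
band = np.arange(10_000, dtype=np.uint16).reshape(100, 100)
promoted = band.astype(np.float32)

assert band.nbytes == 20_000      # 2 bytes per pixel
assert promoted.nbytes == 40_000  # doubled memory footprint
```

At real scene sizes (e.g. 10,980 × 10,980 Sentinel-2 bands), that factor of two is the difference between fitting a batch in memory or not.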
Major-TOM dataset creation on-the-fly example
Rebuild Major TOM-style patch-grid semantics from source Sentinel-2 COGs instead of image-in-Parquet. 3.9–6.5x faster than HF datasets Parquet-filter reads.
Benchmarks
Cold-start comparison with TorchGeo
Same AOIs, same scenes, same sampler, same DataLoader. No HTTP cache, no OS page cache.
| Scenario | rasterio/GDAL | Rasteret | Ratio |
|---|---|---|---|
| Single AOI, 15 scenes | 9.08 s | 1.14 s | 8x |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | 19x |
| Cross-CRS boundary, 12 scenes | 12.47 s | 0.59 s | 21x |
HF datasets baseline (Major TOM keyed patches)

| Patches | HF datasets parquet filters | Rasteret index+COG | Speedup |
|---|---|---|---|
| 120 | 46.83 s | 12.09 s | 3.88x |
| 1000 | 771.59 s | 118.69 s | 6.50x |
Requirements
- Python 3.12+
- `rasterio>=1.4.3,<1.5.0` (geometry masking, CRS reprojection; not in the tile-read path)
Install
```shell
pip install rasteret==0.3.1

# Extras
pip install "rasteret[xarray]"    # + xarray output
pip install "rasteret[torchgeo]"  # + TorchGeo for ML pipelines
pip install "rasteret[aws]"       # + requester-pays buckets
pip install "rasteret[azure]"     # + Planetary Computer signed URLs
```

Rasteret is Apache-2.0, fully open source. All contributions welcome.
v0.2.1
What's Changed
- Move to the uv package manager, plus fixes to CI, by @satishjasthi in #15
- Patch fix for AWS S3 URL signing by @print-sid8 in a121a2f
- Rewrote the example code and README to be easier to follow, and added missing imports to README code blocks, as reported by @krishnaglodha
Full Changelog: v0.1.20...v0.2.1
v0.1.20 - performance improvements
Release v0.1.20 - Stable Performance Release
Performance Improvements:
- Migrated from httpx to aiohttp for HTTP requests
- 5x faster rasteret collection creation
- 2x faster COG data querying
Stability:
The library is now considered stable for its current features; future API changes can still be expected in the coming weeks.
`uv pip install rasteret` will install 0.1.20. The PyPI automated pipeline is broken, so this version was pushed manually for now.
This release addresses issue #6 with the migration to aiohttp, delivering significant performance gains for both collection indexing and COG data access operations.