BLN-AQ is a distributed spatio-temporal modeling system for estimating and forecasting air quality in Berlin, Germany, with a focus on PM2.5 measurements from citizen-science sensors.
The project combines:
- Time-series forecasting using foundation models
- Spatial interpolation via geostatistical methods
to generate continuous air-quality estimates over the Berlin metropolitan area.
Conceptually, this project builds on techniques explored in BerlinWeatherTimeSeriesAnalysis, but extends them into the spatial domain and targets particulate matter rather than meteorological variables. It also runs on more robust infrastructure, demonstrating ML systems architecture design and implementation.
The repository implements a distributed data pipeline composed of continuously running services and on-demand batch jobs:
Sensor Community -> Ingest -> Aggregate -> Forecast + Interpolate -> Frontend
Here is an example of the front end (this was not the primary focus of the project).
Contains experimental notebooks used to evaluate forecasting approaches and spatial interpolation behavior.
Stores historical weather data, intermediate artifacts, and produced forecasts.
Contains the production pipeline code responsible for ingestion, processing, forecasting, and visualization.
This directory contains containerized services and batch-compute jobs orchestrated using Kubernetes.
Continuously downloads particulate matter data from archive.sensor.community and stores daily CSV dumps on a Kubernetes persistent volume (claimed via a PVC).
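The core of the ingest step can be sketched as below. The archive URL layout and filename pattern are assumptions based on archive.sensor.community's public directory structure, and `daily_archive_url` / `fetch_daily_dump` are hypothetical helper names, not the service's actual code:

```python
import os
import urllib.request
from datetime import date

ARCHIVE_BASE = "https://archive.sensor.community"

def daily_archive_url(day: date, sensor_type: str, sensor_id: int) -> str:
    """Build the URL of one sensor's daily CSV dump (assumed layout)."""
    d = day.isoformat()
    return f"{ARCHIVE_BASE}/{d}/{d}_{sensor_type}_sensor_{sensor_id}.csv"

def fetch_daily_dump(day: date, sensor_type: str, sensor_id: int, out_dir: str) -> str:
    """Download one daily CSV into the PVC mount point (e.g. /data)."""
    url = daily_archive_url(day, sensor_type, sensor_id)
    dest = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, dest)  # network call; a real service would retry
    return dest
```

A continuously running service would loop over the sensors of interest once per day and call `fetch_daily_dump` for each.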
Uses Apache Spark to:
- Clean and harmonize sensor time series
- Resample measurements
- Align timestamps
- Produce Parquet datasets for downstream modeling
This step is executed on demand to avoid idling expensive compute resources.
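The production job runs on Spark, but its core transformations (clean, resample, align) can be illustrated with pandas on a toy sample. The semicolon-separated schema with `P2` as the PM2.5 column follows Sensor.Community CSV dumps, though the exact column set is an assumption:

```python
import io
import pandas as pd

# Toy stand-in for one sensor's daily dump (schema is an assumption;
# in Sensor.Community dumps, P2 holds the PM2.5 reading).
raw = io.StringIO(
    "sensor_id;timestamp;P2\n"
    "101;2024-01-01T00:02:11;12.4\n"
    "101;2024-01-01T00:17:40;13.0\n"
    "101;2024-01-01T01:05:02;-5.0\n"   # implausible reading, dropped below
    "101;2024-01-01T01:31:55;11.2\n"
)
df = pd.read_csv(raw, sep=";", parse_dates=["timestamp"])

# Clean: drop physically implausible PM2.5 values.
df = df[(df["P2"] >= 0) & (df["P2"] < 1000)]

# Resample and align: hourly means on a common timestamp grid per sensor.
hourly = (
    df.set_index("timestamp")
      .groupby("sensor_id")["P2"]
      .resample("1h")
      .mean()
      .reset_index()
)
# The real pipeline writes the result as Parquet, e.g.
# hourly.to_parquet("pm25_hourly.parquet")
```

The same logic maps naturally onto Spark DataFrame operations (filter, window-based grouping, Parquet sink) for the full multi-gigabyte archive.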
Performs two tightly coupled tasks:
Uses Chronos2 in in-context learning mode to produce 3-day forecasts of average PM2.5 for each sensor time series.
Uses PyKrige to krige the predicted pollution fields over a regular grid covering Berlin, producing continuous spatial estimates even where no sensors are physically present.
Outputs are written to Parquet for efficient reuse.
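The repo delegates interpolation to PyKrige's `OrdinaryKriging`. To illustrate what that computes, here is a minimal NumPy sketch of ordinary kriging at a single grid point with a fixed exponential variogram; the variogram parameters are arbitrary illustrative values, whereas PyKrige additionally fits the variogram and evaluates a whole grid:

```python
import numpy as np

def exp_variogram(h, sill=1.0, rng=2000.0):
    """Exponential semivariogram; sill/range chosen purely for illustration."""
    return sill * (1.0 - np.exp(-h / rng))

def krige_point(xy, z, target, variogram=exp_variogram):
    """Ordinary-kriging estimate of z at `target` from samples (xy, z)."""
    n = len(z)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    A = np.empty((n + 1, n + 1))
    A[:n, :n] = variogram(d)
    A[n, :n] = A[:n, n] = 1.0          # unbiasedness constraint (weights sum to 1)
    A[n, n] = 0.0
    b = np.empty(n + 1)
    b[:n] = variogram(np.linalg.norm(xy - target, axis=-1))
    b[n] = 1.0
    w = np.linalg.solve(A, b)[:n]      # kriging weights
    return float(w @ z)

# Four sensor locations (metres, projected) and forecast PM2.5 values.
xy = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 1000.0], [1000.0, 1000.0]])
z = np.array([10.0, 14.0, 12.0, 20.0])
est = krige_point(xy, z, np.array([500.0, 500.0]))
```

Ordinary kriging is an exact interpolator: evaluated at a sensor location it returns that sensor's value, and between sensors it produces a variogram-weighted estimate, which is what yields continuous fields where no sensors are physically present.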
A Streamlit service that visualizes historical forecasts and interpolated pollution fields on an interactive Berlin map.
The ingest and front-end services run continuously.
The aggregate and predict-interpolate jobs are launched on demand via orchestrate-frontend.sh so that expensive resources (GPUs and Spark clusters) do not sit idle.
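A hypothetical sketch of what orchestrate-frontend.sh might do: create each batch job from a template, wait for completion, then clean up. The job/CronJob names and the echo-based dry-run default are all assumptions, not the repo's actual script:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of orchestrate-frontend.sh.
set -euo pipefail
KUBECTL=${KUBECTL:-"echo kubectl"}   # dry-run by default; set KUBECTL=kubectl on-cluster

run_pipeline() {
  # Spin up the Spark aggregation job and wait for it to finish.
  $KUBECTL create job aggregate-run --from=cronjob/aggregate
  $KUBECTL wait --for=condition=complete job/aggregate-run --timeout=2h
  # Then the GPU forecasting + kriging job.
  $KUBECTL create job predict-run --from=cronjob/predict-interpolate
  $KUBECTL wait --for=condition=complete job/predict-run --timeout=2h
  # Clean up so GPUs and Spark executors are released immediately.
  $KUBECTL delete job aggregate-run predict-run
}

run_pipeline
```

Sequencing the two jobs in one script keeps the dependency (aggregate before predict) explicit while letting the cluster scale the expensive pools to zero between runs.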
This design cleanly separates real-time ingestion and presentation from batch-oriented, compute-heavy modeling jobs.
- Docker
- Kubernetes
- Persistent Volume Claims (PVCs)
- Apache Spark
- Parquet
- Python ETL pipelines
- Chronos2 (time-series foundation model)
- PyKrige (spatial interpolation)
- NumPy, pandas, SciPy
- Streamlit
- Geospatial plotting and mapping libraries
Berlin’s air-quality sensor network is sparse, noisy, and heterogeneous.
This project addresses these gaps by combining:
- Foundation models for temporal generalization
- Geostatistical techniques for spatial inference
to produce policy-relevant pollution maps from citizen science data.
- Forecasted pollutant time series per sensor
- Interpolated pollution maps over Berlin
- A reproducible data and modeling pipeline
- Presentation: UT-Presentation.pdf
