This repository is designed to download, preprocess, standardize, and catalog climate data from the Copernicus Climate Data Store (CDS). It provides a complete set of tools for managing climate reanalysis datasets (like ERA5 and CERRA) including downloading raw data, deriving new variables, interpolating to reference grids, standardizing formats, and maintaining catalogues of available data.
The repository automates the process of:
- Downloading climate data from CDS using the CDS API
- Deriving new variables from raw data (e.g., calculating wind speed from u and v components)
- Interpolating datasets to reference grids for spatial consistency
- Standardizing units and metadata to common conventions
- Cataloguing available datasets and generating visual reports
Edit CSV files in `requests/` to specify:
- Which variables to download
- Year ranges
- CDS API parameters
- Output paths and temporal resolution
- Interpolation method (native, gr006, etc.)
requests/*.csv → scripts/download/*.py → CDS API → NetCDF files
- Scripts read CSVs
- Create CDS API requests
- Download raw data as NetCDF files
- Files saved to:
{base}/{product_type}/{dataset}/{temporal_resolution}/{interpolation}/{variable}/
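The download step can be sketched as follows. The real scripts presumably use the official `cdsapi` client (`cdsapi.Client().retrieve(...)`); the `build_request()` and `download()` helpers here are illustrative, not the repository's actual code, and the request keys are a minimal ERA5-style example.

```python
from pathlib import Path

def build_request(variable: str, year: int) -> dict:
    """Assemble a minimal ERA5 single-levels request for one variable/year (illustrative)."""
    return {
        "product_type": "reanalysis",
        "variable": variable,
        "year": str(year),
        "month": [f"{m:02d}" for m in range(1, 13)],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "format": "netcdf",
    }

def download(variable: str, year: int, out_dir: Path) -> Path:
    """Download one variable/year to the expected filename, skipping existing files."""
    target = out_dir / f"{variable}_reanalysis-era5-single-levels_{year}.nc"
    if target.exists():  # skip files that already exist
        return target
    out_dir.mkdir(parents=True, exist_ok=True)
    import cdsapi  # requires a ~/.cdsapirc with CDS credentials
    cdsapi.Client().retrieve(
        "reanalysis-era5-single-levels", build_request(variable, year), str(target)
    )
    return target
```

The skip-if-exists check makes the scripts safe to re-run after partial downloads.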
Raw NetCDF → scripts/derived/*.py → Derived NetCDF
- Scripts identify "derived" variables from CSVs
- Load necessary raw data files
- Perform calculations (e.g., wind speed from components)
- Resample to daily values if needed
- Save derived variables with temporal resolution metadata
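The daily-resampling step above can be sketched with pandas (the actual scripts presumably operate on xarray objects, but the resampling logic is the same):

```python
import numpy as np
import pandas as pd

# 48 hours of synthetic hourly wind speed
idx = pd.date_range("2023-01-01", periods=48, freq="h")
hourly = pd.Series(np.linspace(0.0, 10.0, 48), index=idx)

daily_mean = hourly.resample("D").mean()  # one value per calendar day
daily_max = hourly.resample("D").max()    # e.g. for a daily-maximum statistic
```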
Raw NetCDF → scripts/interpolation/*.py → Interpolated NetCDF (stored as derived)
- Scripts identify variables needing interpolation from CSVs
- Load the reference grid specified in the `interpolation_file` column
- Apply conservative interpolation to regrid data
- Save to derived directory with interpolation method (e.g., gr006)
Derived/Raw NetCDF → scripts/standardization/*.py → Standardized NetCDF
- Apply unit conversions
- Update metadata attributes
- Ensure CF convention compliance
All NetCDF files → scripts/catalogue/produce_catalog.py → CSV + PDF reports
- Scan all output directories
- Check file existence for each year
- Generate availability reports with temporal resolution
- Create visual heatmaps
- Publish via GitHub Actions nightly
Contains CSV files that define what data to download
- Each CSV corresponds to a CDS catalogue (e.g., `reanalysis-era5-single-levels.csv`)
- Columns include:
  - `filename_variable`: Variable name for saved files
  - `cds_request_variable`: Variable name in the CDS API
  - `cds_years_start/end`: Year range to download
  - `product_type`: `raw` or `derived` (interpolated data is stored as derived)
  - `temporal_resolution`: hourly, daily, 3hourly, 6hourly, monthly
  - `interpolation`: native (non-interpolated) or grid specification (e.g., gr006)
  - `interpolation_file`: Reference grid file for interpolation (if needed)
  - `output_path`: Base directory for saving data
  - `script`: Which Python script handles this dataset
Example: a row specifying that u10 (10 m wind u-component) should be downloaded for the years 2022-2024 from ERA5.
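Such a row might look like the following in Python; the column names follow the description above, while the concrete values (paths, years) are purely illustrative, not copied from the repository's actual CSVs:

```python
import pandas as pd

# Hypothetical request row matching the columns described above
row = {
    "filename_variable": "u10",
    "cds_request_variable": "10m_u_component_of_wind",
    "cds_years_start": 2022,
    "cds_years_end": 2024,
    "product_type": "raw",
    "temporal_resolution": "hourly",
    "interpolation": "native",
    "interpolation_file": "",
    "output_path": "https://github.com/lustre/climate_data",  # illustrative base directory
    "script": "reanalysis-era5-single-levels.py",
}
requests_df = pd.DataFrame([row])

# The download scripts would iterate over the inclusive year range
years = list(range(row["cds_years_start"], row["cds_years_end"] + 1))
```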
Organized directory containing all Python scripts:
Scripts that download data from CDS
- One script per CDS catalogue (e.g., `reanalysis-era5-single-levels.py`)
- Reads request CSVs and creates API requests
- Downloads files to the directory structure: `{base}/{product_type}/{dataset}/{temporal_resolution}/{interpolation}/{variable}/`
- Skips files that already exist
Centralized utility functions
- `utils.py`: Core functions for path construction and file downloads
  - `build_output_path()`: Constructs directory paths with temporal resolution and interpolation
  - `load_output_path_from_row()`: Extracts the output path from a CSV row
  - `load_input_path_from_row()`: Extracts the input path from a CSV row
  - `load_path_from_df()`: Looks up the path for a variable in a DataFrame
  - `download_files()`: Orchestrates parallel downloads based on CSV configuration
- `create_folder_structure.py`: Creates the complete directory structure from CSVs without downloading
Scripts that calculate derived variables from raw data
- Example: `reanalysis-era5-single-levels.py` calculates `sfcwind` (wind speed) from the `u10` and `v10` components using: sfcwind = √(u10² + v10²)
- Uses `operations.py`, which provides utility functions:
  - `sfcwind_from_u_v()`: Calculates wind speed from components
  - `resample_to_daily()`: Aggregates hourly data to daily statistics
Workflow:
- Read CSV to identify variables marked as "derived"
- Load required raw data files
- Apply mathematical operations
- Resample to daily values if needed
- Save derived variables with new temporal resolution
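The wind-speed derivation above, sketched with NumPy; the repository's `sfcwind_from_u_v()` presumably applies the same formula to xarray DataArrays:

```python
import numpy as np

def sfcwind_from_u_v(u10, v10):
    """Wind speed magnitude from the 10 m u and v wind components."""
    return np.sqrt(u10**2 + v10**2)

# Classic 3-4-5 check: components of 3 and 4 m/s give a 5 m/s wind speed
speed = sfcwind_from_u_v(np.array([3.0]), np.array([4.0]))
```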
Scripts that interpolate datasets to reference grids
- Example: `reanalysis-cerra-single-levels.py` interpolates CERRA data
- Reference grid specified in the `interpolation_file` column of the request CSVs
- Uses the conservative interpolation method via xESMF
- Saves to the derived directory with an interpolation method identifier (e.g., gr006)
Workflow:
- Read CSV to identify variables needing interpolation (interpolation != 'native')
- Load reference grid from specified file
- Apply conservative_normed interpolation to regrid data
- Save to:
{base}/derived/{dataset}/{temporal_resolution}/{interpolation}/{variable}/
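The pipeline uses xESMF's `conservative_normed` method for the actual regridding. As an illustration of the underlying idea only, the trivial case of coarsening a uniform grid by an integer factor reduces conservative regridding to an area-weighted (here: plain) block average:

```python
import numpy as np

def block_average(field: np.ndarray, factor: int) -> np.ndarray:
    """Coarsen a 2-D field by `factor` in each dimension, conserving the mean."""
    ny, nx = field.shape
    assert ny % factor == 0 and nx % factor == 0
    return field.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

fine = np.arange(16, dtype=float).reshape(4, 4)
coarse = block_average(fine, 2)  # 2x2 field; the global mean is preserved
```

On real curvilinear grids with unequal cell areas, xESMF computes the overlap weights; this toy version only captures the conservation property.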
Scripts that standardize variables to CF conventions
- Example: `derived-era5-single-levels-daily-statistics.py` contains functions like:
  - `tp()`: Converts precipitation from m/day to kg/m²/s (flux)
  - `e()`: Converts evaporation with proper units and attributes
  - `ssrd()`: Converts solar radiation from J/m² to W/m²
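The conversion factors follow standard conventions (water density 1000 kg/m³, 86400 s per day); the sketch below shows the arithmetic, though the real scripts also update NetCDF attributes:

```python
WATER_DENSITY = 1000.0    # kg/m3, converts water depth to mass per area
SECONDS_PER_DAY = 86400.0

def tp_to_flux(tp_m_per_day: float) -> float:
    """Precipitation: m/day of water depth -> kg m-2 s-1 (flux)."""
    return tp_m_per_day * WATER_DENSITY / SECONDS_PER_DAY

def ssrd_to_wm2(ssrd_j_per_m2: float, accumulation_s: float = SECONDS_PER_DAY) -> float:
    """Solar radiation: accumulated J/m2 -> mean W/m2 over the accumulation period."""
    return ssrd_j_per_m2 / accumulation_s
```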
Scripts that generate visual catalogues of available data
- `produce_catalog.py`: Scans directories, creates CSV catalogues showing data availability, generates heatmap visualizations
- `generate_resumen.py`: Creates summary reports
- Output saved to `catalogues/catalogues/` and `catalogues/images/`
Jupyter notebooks for exploration and testing
JSON files documenting metadata and provenance for each variable
- Includes:
- Variable names and mappings
- Frequency (hourly, daily, monthly)
- Product type (raw or derived)
- Links to CMIP6 CMOR tables for standard definitions
Example:
{
"uas": {
"var_name": "u10",
"provenance": "https://github.com/PCMDI/cmip6-cmor-tables/...",
"frequency": "hourly",
"type_product": "raw"
}
}

Output directory for catalogues and visualizations
- `catalogues/`: CSV files listing all variables, datasets, date ranges, and file paths
- `images/`: PDF heatmaps showing data availability (green = downloaded, orange = partial, red = missing)
- Updated nightly via GitHub Actions CI/CD
Enhanced structure with temporal resolution and interpolation metadata:
{base}/{product_type}/{dataset}/{temporal_resolution}/{interpolation}/{variable}/
Where:
- `product_type`: `raw` or `derived` (interpolated data is stored as derived)
- `temporal_resolution`: hourly, daily, 3hourly, 6hourly, monthly
- `interpolation`: native (non-interpolated) or grid specification (e.g., gr006)
Examples:
- Raw hourly ERA5: `/lustre/.../raw/reanalysis-era5-single-levels/hourly/native/u10/`
- Derived daily wind: `/lustre/.../derived/reanalysis-era5-single-levels/daily/native/sfcwind/`
- Interpolated CERRA: `/lustre/.../derived/reanalysis-cerra-single-levels/3hourly/gr006/t2m/`
Format: {variable}_{dataset}_{date}.nc
Date formats:
- `{year}{month}`: For large datasets like CERRA (monthly files for faster downloads)
- `{year}`: For smaller datasets saved annually
Examples:
- `u10_reanalysis-era5-single-levels_2023.nc`
- `sfcwind_reanalysis-cerra-single-levels_202301.nc`
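The naming scheme can be expressed as a small helper mirroring the `{variable}_{dataset}_{date}.nc` pattern; `output_filename()` is a hypothetical name, not the repository's actual function:

```python
def output_filename(variable, dataset, year, month=None):
    """Build {variable}_{dataset}_{date}.nc, with date = YYYY or YYYYMM."""
    date = f"{year}{month:02d}" if month is not None else f"{year}"
    return f"{variable}_{dataset}_{date}.nc"
```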
- GitHub Actions workflows:
  - `catalog_executor.yml`: Runs nightly to update catalogues
  - `run_all_requests_scripts.yml`: Can trigger download scripts
- SLURM scripts:
  - `scripts/download/launch_all_requests_scripts.sh`: Batch job launcher for HPC environments
- Designed for cluster computing with job scheduling
- reanalysis-era5-single-levels
- reanalysis-cerra-single-levels
- reanalysis-cerra-land
- derived-era5-single-levels-daily-statistics
- Edit `requests/reanalysis-era5-single-levels.csv` to specify years and variables
- Run: `python scripts/download/reanalysis-era5-single-levels.py`
- Raw data downloads to: `{base}/raw/{dataset}/{temporal_resolution}/native/{variable}/`
- Ensure raw data is downloaded
- Run: `python scripts/derived/reanalysis-era5-single-levels.py`
- Derived variables saved to: `{base}/derived/{dataset}/{temporal_resolution}/native/{variable}/`
- Ensure raw data is downloaded
- Specify the reference grid in the `interpolation_file` column of the request CSV
- Run: `python scripts/interpolation/reanalysis-cerra-single-levels.py`
- Interpolated data saved to: `{base}/derived/{dataset}/{temporal_resolution}/{grid_spec}/{variable}/`
- Run: `python scripts/utilities/create_folder_structure.py --dry-run` (preview)
- Run: `python scripts/utilities/create_folder_structure.py` (create)
- Creates all directories based on CSV configurations without downloading
- Run: `python scripts/catalogue/produce_catalog.py`
- Generates CSV catalogues and PDF visualizations in `catalogues/`
- Shows data availability status with temporal resolution metadata
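The per-year existence check behind the availability report can be sketched as follows; `availability()` is an illustrative helper, not the actual code in `produce_catalog.py`:

```python
import tempfile
from pathlib import Path

def availability(var_dir: Path, variable: str, dataset: str, years) -> dict:
    """Map each year to True/False depending on whether its yearly file exists."""
    return {y: (var_dir / f"{variable}_{dataset}_{y}.nc").exists() for y in years}

# Example against a temporary directory with only the 2023 file present
tmp = Path(tempfile.mkdtemp())
(tmp / "u10_reanalysis-era5-single-levels_2023.nc").touch()
status = availability(tmp, "u10", "reanalysis-era5-single-levels", [2022, 2023])
```

A table of such per-year booleans is what the heatmaps then color green, orange, or red.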
This repository is part of the C3S Atlas ecosystem and uses the same conda environment. It serves as the data acquisition and preprocessing layer, providing standardized climate data for downstream analysis tools.
The c3s-cds repository is a comprehensive data management system for climate reanalysis datasets. It automates the entire pipeline from CDS API downloads through interpolation and standardization to catalog generation, making climate data readily accessible and well-documented for scientific research.