EnvDataPrep: High-performance Environmental Data Pre-processing

Why EnvDataPrep?

EnvDataPrep aims to help environmental scientists overcome common challenges in handling environmental datasets, incluidng:

Insufficient disk space for massive datasets
Complex conversions between different file formats
Time-consuming geospatial operations
And more ...

EnvDataPrep is designed for high performance and syntax simplicity. By leveraging vectorized operations, parallelism, and industry-standard libraries like NumPy, Xarray, and netCDF4, it streamlines heavy and complex data preparation tasks into efficient and easy-to-use APIs.

Core Capacities

Currently, the main functionality is subsetting netCDF files. Typical satellite products (e.g., TROPOMI NO₂) and model simulations (e.g., WRF outputs) can shrink by a large fraction (e.g., 90%) when you keep only the data fields you need.

Example of subsetting netCDF files

import glob
import os

import envdataprep as edp

# Directories
INPUT_DIR = "path/to/input/directory"
OUTPUT_DIR = "path/to/output/directory"

# Input files (using TROPOMI NO2 satellite products as an example)
input_files = glob.glob(os.path.join(INPUT_DIR, "S5P*.nc"))

# Explore all available variables
all_vars = edp.list_netcdf_vars(input_files[0])
print(*all_vars, sep="\n")

# Select the variables to keep
selected_vars = [
    "PRODUCT/latitude",
    "PRODUCT/longitude",
    "PRODUCT/time_utc",
    "PRODUCT/qa_value",
    "PRODUCT/nitrogendioxide_tropospheric_column",
    "PRODUCT/nitrogendioxide_tropospheric_column_precision",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/solar_zenith_angle",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/viewing_zenith_angle",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/latitude_bounds",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/longitude_bounds",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/surface_altitude",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/eastward_wind",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/northward_wind",
]

# Subset the netCDF files in parallel
edp.subset_netcdf(
    nc_input=input_files,
    output_dir=OUTPUT_DIR,
    keep_vars=selected_vars,
    workers=8,
)

Installation

Requirements: Python 3.12+.

Install from PyPI:

pip install envdataprep

⚠️ Disclaimer

Due to the massive scale and inherent diversity of environmental data, some edge cases may remain unexplored. For critical research or production workflows, it is strongly recommended to manually validate processed outputs.

If you encounter any discrepancies or unexpected behavior, please open an issue.

License

This project is licensed under the MIT License.

⬆ Back to top

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
envdataprep		envdataprep
examples		examples
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
RELEASE_NOTES.md		RELEASE_NOTES.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EnvDataPrep: High-performance Environmental Data Pre-processing

Why EnvDataPrep?

Core Capacities

Example of subsetting netCDF files

Installation

⚠️ Disclaimer

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EnvDataPrep: High-performance Environmental Data Pre-processing

Why EnvDataPrep?

Core Capacities

Example of subsetting netCDF files

Installation

⚠️ Disclaimer

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages