This guide explains how to prepare and use data with hydromodel, covering both public CAMELS datasets and custom data.
hydromodel supports two main data sources:
1. **Public CAMELS Datasets** - via the `hydrodataset` package
   - 11 global CAMELS variants (US, GB, AUS, BR, CL, etc.)
   - Automatic download and caching
   - Standardized, quality-controlled format
2. **Custom Data** - via the `hydrodatasource` package
   - Your own basin data
   - Flexible data organization
   - Integration with cloud storage
Install the package first:

```bash
pip install hydrodataset
```

hydromodel automatically uses default paths, but you can customize them.

Default paths:
- Windows: `C:\Users\YourUsername\hydromodel_data\`
- macOS/Linux: `~/hydromodel_data/`
To customize, create `~/hydro_setting.yml`:

```yaml
local_data_path:
  root: 'D:/data'
  datasets-origin: 'D:/data'  # CAMELS datasets location
```

**Important:** Provide only the `datasets-origin` directory. The system automatically appends the dataset name (e.g., `CAMELS_US`, `CAMELS_GB`). For example, if your data is in `D:/data/CAMELS_US/`, set `datasets-origin: 'D:/data'`.
The data downloads automatically on first use:

```python
from hydrodataset import SETTING
from hydrodataset.camels_us import CamelsUs

# Initialize dataset (auto-downloads if not present)
data_path = SETTING["local_data_path"]["datasets-origin"]
ds = CamelsUs(data_path, download=True)

# Get available basins
basin_ids = ds.read_object_ids()
print(f"Downloaded {len(basin_ids)} basins")
```

**Note:** The first download may take 30-120 minutes depending on dataset size; CAMELS-US is ~70GB.
```python
from hydromodel.trainers.unified_calibrate import calibrate

config = {
    "data_cfgs": {
        "data_source_type": "camels_us",  # Dataset name
        "basin_ids": ["01013500", "01022500"],
        "train_period": ["1990-10-01", "2000-09-30"],
        "test_period": ["2000-10-01", "2010-09-30"],
        "warmup_length": 365,
        "variables": ["precipitation", "potential_evapotranspiration", "streamflow"]
    },
    # ... other configs
}
results = calibrate(config)
```

| Dataset | Region | Basins | Package Name |
|---|---|---|---|
| CAMELS-US | United States | 671 | camels_us |
| CAMELS-GB | Great Britain | 671 | camels_gb |
| CAMELS-AUS | Australia | 222 | camels_aus |
| CAMELS-BR | Brazil | 897 | camels_br |
| CAMELS-CL | Chile | 516 | camels_cl |
| CAMELS-CH | Switzerland | 331 | camels_ch |
| CAMELS-DE | Germany | 1555 | camels_de |
| CAMELS-DK | Denmark | 304 | camels_dk |
| CAMELS-FR | France | 654 | camels_fr |
| CAMELS-NZ | New Zealand | 70 | camels_nz |
| CAMELS-SE | Sweden | 54 | camels_se |
Usage example:

```python
# Use different datasets by changing data_source_type
config["data_cfgs"]["data_source_type"] = "camels_gb"
config["data_cfgs"]["basin_ids"] = ["28015"]  # GB basin ID
```

CAMELS datasets provide standardized variables:
Time Series Variables:
- `precipitation` (mm/day or mm/hour)
- `potential_evapotranspiration` (mm/day)
- `streamflow` (mm/day or m³/s)
- `temperature` (°C)
- And more, depending on the dataset

Basin Attributes:
- `area` (km²)
- `elevation` (m)
- `latitude`, `longitude`
- Climate, soil, and vegetation attributes
For detailed documentation, see:
Install the package:

```bash
pip install hydrodatasource
```

Create a directory with this structure:
```
my_basin_data/
├── attributes/
│   └── attributes.csv       # Basin metadata (required)
├── timeseries/
│   ├── 1D/                  # Daily time series
│   │   ├── basin_001.csv    # One file per basin
│   │   ├── basin_002.csv
│   │   └── ...
│   └── 1D_units_info.json   # Variable units (required)
└── shapes/                  # Basin boundaries (optional)
    └── basins.shp
```
Minimum required columns: `basin_id` and `area` (km²).

```csv
basin_id,area,lat,lon,elevation
basin_001,1250.5,30.5,105.2,850
basin_002,856.3,31.2,106.1,920
```

**Important:**
- `basin_id`: String identifier (matches the filename)
- `area`: Basin area in km²
- Other columns are optional but recommended
Required column: `time`. Other columns are your variables.

```csv
time,prcp,PET,streamflow
1990-01-01,5.2,2.1,45.3
1990-01-02,0.0,2.3,42.1
1990-01-03,12.5,1.8,58.7
```

**Important:**
- `time` format: `YYYY-MM-DD` (for daily data)
- Variable names: lowercase, underscores for multi-word names
- Missing values: use empty cells or `NaN` (not `-9999` or `0`)
- No duplicate time stamps
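These rules are easy to check with pandas before loading. The helper below is an illustrative sketch (not part of hydromodel) that flags duplicate time stamps, unsorted records, and `-9999` sentinel values:

```python
import pandas as pd

def check_timeseries(df):
    """Flag common problems in a basin time-series frame (illustrative helper)."""
    problems = []
    t = pd.to_datetime(df["time"])
    if t.duplicated().any():
        problems.append("duplicate time stamps")
    if not t.is_monotonic_increasing:
        problems.append("time stamps not sorted")
    for col in df.columns.drop("time"):
        # sentinel values like -9999 should be empty cells or NaN instead
        if (df[col] == -9999).any():
            problems.append(f"{col}: replace -9999 with NaN")
    return problems

df = pd.DataFrame({
    "time": ["1990-01-01", "1990-01-02", "1990-01-03"],
    "prcp": [5.2, -9999, 12.5],
})
print(check_timeseries(df))  # ['prcp: replace -9999 with NaN']
```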
Define physical units for all variables:

```json
{
  "prcp": "mm/day",
  "PET": "mm/day",
  "streamflow": "m^3/s",
  "temp": "degC"
}
```

Common units:
- Precipitation/ET: `mm/day` or `mm/hour`
- Streamflow: `m^3/s` or `mm/day`
- Temperature: `degC` or `K`
- Area: `km^2`
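Since every variable column must have a matching units entry, a one-line check (an illustrative helper, not part of hydromodel) avoids "variable not found" errors later:

```python
def missing_units(csv_columns, units):
    """Variables present in a time-series CSV but absent from the units dict."""
    return [c for c in csv_columns if c != "time" and c not in units]

units = {"prcp": "mm/day", "PET": "mm/day", "streamflow": "m^3/s"}
cols = ["time", "prcp", "PET", "streamflow", "temp"]
print(missing_units(cols, units))  # ['temp']
```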
```python
from hydrodatasource.reader.data_source import SelfMadeHydroDataset

# Initialize dataset
dataset = SelfMadeHydroDataset(
    data_path="D:/my_basin_data",
    time_unit="1D"
)

# Check basins
basin_ids = dataset.read_object_ids()
print(f"Found {len(basin_ids)} basins: {basin_ids}")

# Check time series
data = dataset.read_timeseries(
    gage_id_lst=["basin_001"],
    t_range=["1990-01-01", "2000-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)
print(f"Data shape: {data['1D'].shape}")  # [n_basins, n_time, n_vars]
```

```python
from hydromodel.trainers.unified_calibrate import calibrate

config = {
    "data_cfgs": {
        "data_source_type": "selfmadehydrodataset",  # Use custom data
        "data_source_path": "D:/my_basin_data",      # Your data path
        "basin_ids": ["basin_001", "basin_002"],
        "train_period": ["1990-01-01", "2000-12-31"],
        "test_period": ["2001-01-01", "2010-12-31"],
        "warmup_length": 365,
    },
    "model_cfgs": {
        "model_name": "xaj_mz",
    },
    "training_cfgs": {
        "algorithm": "SCE_UA",
        "loss_func": "RMSE",
        "output_dir": "results",
        "experiment_name": "my_basins",
        "rep": 10000,
        "ngs": 100,
    },
    "evaluation_cfgs": {
        "metrics": ["NSE", "KGE", "RMSE"],
    },
}
results = calibrate(config)
```

Flood event data is designed for event-based hydrological modeling where you focus on specific flood episodes rather than continuous time series. This is particularly useful for:
- Flood forecasting and warning systems
- Peak flow estimation
- Event-based rainfall-runoff analysis
- Unit hydrograph calibration
| Feature | Continuous Data | Flood Event Data |
|---|---|---|
| Data Structure | Complete time series | Individual flood events with gaps |
| Input Features | 2D: [prcp, PET] | 4D: [prcp, PET, marker, event_id] |
| Warmup Handling | Removed after simulation | Included in each event (NaN markers) |
| Time Coverage | Full period | Only flood periods + warmup |
| Use Case | Long-term water balance | Flood peak prediction |
Each flood event contains:

```python
event = {
    "rain": np.array([...]),                 # Precipitation (with NaN in warmup)
    "ES": np.array([...]),                   # Evapotranspiration
    "inflow": np.array([...]),               # Streamflow (with NaN in warmup)
    "flood_event_markers": np.array([...]),  # NaN=warmup, 1=flood
    "event_id": 1,                           # Event identifier
    "time": np.array([...])                  # Datetime array
}
```

```
[Warmup Period] → [Flood Period] → [GAP]
  marker=NaN        marker=1       marker=0
```
Warmup Period (e.g., 30 days before the flood):
- Contains NaN values in observations
- Used to initialize model states
- Length specified by `warmup_length` in the config
- Extracted from real data before the flood event

Flood Period (the actual event):
- Contains valid observations (marker=1)
- The period of interest for simulation
- Used for model calibration and evaluation

GAP Period (between events):
- Artificial buffer (10 time steps by default)
- Designed for visualization clarity
- NOT used in simulation (marker=0, ignored by the model)
- Created with: precipitation=0, ET=0.27, flow=0
When loading multiple events, they are combined as:

```
Event1: [warmup-NaN][flood-1][GAP-0]
Event2: [warmup-NaN][flood-1][GAP-0]
Event3: [warmup-NaN][flood-1]
```

**Important:** The final data structure includes GAP periods, but these are automatically skipped during simulation based on the marker values.
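The marker layout above can be sketched with NumPy. This is an illustration of the pattern only, not the loader's actual code; it assumes the 10-step default GAP described earlier:

```python
import numpy as np

GAP_LEN = 10  # default buffer length between events, per the description above

def stack_event_markers(flood_lengths, warmup_length):
    """Build the combined marker array: NaN=warmup, 1=flood, 0=GAP."""
    pieces = []
    for i, n in enumerate(flood_lengths):
        pieces.append(np.full(warmup_length, np.nan))  # warmup before each event
        pieces.append(np.ones(n))                      # flood period
        if i < len(flood_lengths) - 1:
            pieces.append(np.zeros(GAP_LEN))           # GAP only between events
    return np.concatenate(pieces)

markers = stack_event_markers([5, 3], warmup_length=2)
print(len(markers))  # 2 + 5 + 10 + 2 + 3 = 22
```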
During simulation (`unified_simulate.py`), the system:
1. Reads the `flood_event_markers` array (3rd feature)
2. Uses `find_flood_event_segments_as_tuples()` to identify events
3. Only processes segments where `marker > 0` (i.e., marker=1)
4. Skips GAP periods (marker=0) completely

```python
# Simulation automatically identifies event segments
flood_event_array = inputs[:, basin_idx, 2]  # marker column
event_segments = find_flood_event_segments_as_tuples(
    flood_event_array, warmup_length
)

# Each event is simulated independently
for start, end, orig_start, orig_end in event_segments:
    event_inputs = inputs[start:end+1, :, :3]  # Extract event data
    result = model(event_inputs, ...)          # Simulate this event
```

The GAP period is present in the loaded data but does not participate in simulation:
- ✅ GAP helps separate events visually in plots
- ✅ GAP provides clear boundaries between independent floods
- ❌ GAP data is never fed to the hydrological model
- ❌ GAP does not contribute to loss calculation
This design ensures that:
- Each flood event is simulated independently
- Model states are reset via warmup for each event
- Events don't interfere with each other
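The marker-based event detection can be illustrated with a simplified stand-in for `find_flood_event_segments_as_tuples()`. The real function also handles warmup offsets; this sketch only finds contiguous marker==1 runs:

```python
import numpy as np

def find_flood_segments(markers):
    """Return (start, end) index pairs for contiguous runs of marker == 1."""
    valid = markers == 1  # NaN (warmup) and 0 (GAP) both compare False
    segments, start = [], None
    for i, v in enumerate(valid):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(valid) - 1))
    return segments

markers = np.array([np.nan, np.nan, 1, 1, 1, 0, 0, np.nan, 1, 1])
print(find_flood_segments(markers))  # [(2, 4), (8, 9)]
```

Warmup (NaN) and GAP (0) steps never start a segment, which is why GAP data is never fed to the model.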
```yaml
data:
  dataset: "floodevent"
  dataset_name: "my_flood_events"
  data_source_path: "D:/flood_data"
  is_event_data: true
  time_unit: ["1D"]
  # Datasource parameters
  datasource_kwargs:
    warmup_length: 30      # Days before flood for warmup
    offset_to_utc: false   # Time zone handling
    version: null
  basin_ids: ["basin_001"]
  train_period: ["2000-01-01", "2020-12-31"]
  test_period: ["2020-01-01", "2023-12-31"]
  variables: ["rain", "ES", "inflow", "flood_event"]
  warmup_length: 30
model:
  name: "xaj"
  params:
    source_type: "sources"
    source_book: "HF"
    kernel_size: 15
```

```
my_flood_events/
├── attributes/
│   └── attributes.csv
├── timeseries/
│   ├── 1D/
│   │   ├── basin_001.csv   # Must include 'flood_event' column
│   │   └── basin_002.csv
│   └── 1D_units_info.json
└── shapes/
    └── basins.shp
```
```csv
time,rain,ES,inflow,flood_event
2020-06-01,0.0,3.2,5.1,0
2020-06-02,2.5,3.5,5.8,0
2020-07-01,5.0,4.0,10.2,1
2020-07-02,12.5,3.8,25.5,1
2020-07-03,8.0,3.5,30.1,1
2020-07-04,0.0,3.2,20.5,1
2020-07-05,0.0,3.0,12.0,0
```

In this example, the flood event spans 2020-07-01 through 2020-07-04.

Key Points:
- `flood_event` column: 0 = no flood, 1 = flood period
- Continuous time series with all periods marked
- The system automatically extracts events with warmup
1. **Warmup Length:**
   - Typical: 30 days for daily data
   - Should be long enough to initialize soil moisture states
   - Too short: poor initial conditions
   - Too long: data availability issues

2. **Event Selection:**
   - Focus on significant floods (peak > threshold)
   - Include complete rising and recession limbs
   - Ensure the warmup period has valid data

3. **Data Quality:**
   - Check for missing data in warmup periods
   - Verify flood markers are correctly assigned
   - Ensure precipitation and flow are synchronized

4. **Marker Assignment:**

   ```python
   # Example: mark floods based on a flow threshold
   threshold = flow.quantile(0.95)
   flood_event = (flow > threshold).astype(int)
   ```
1. "Warmup period contains all NaN"
   - Ensure data exists before each flood event
   - Check that `warmup_length` is not too long
   - Verify the CSV has a continuous time series

2. "No flood events found"
   - Check that the `flood_event` column exists
   - Verify flood markers are 1 (not True or other values)
   - Ensure train_period covers some flood events

3. "Simulation results are all zeros"
   - Check if events are detected: markers should be 1 for flood periods
   - Verify warmup_length matches the actual warmup in data
   - Ensure model parameters are physically reasonable
For custom event extraction:

```python
from hydroutils import hydro_event

# Extract events from continuous data
events = hydro_event.extract_flood_events(
    df=continuous_data,
    warmup_length=30,
    flood_event_col="flood_event",
    time_col="time"
)

# Each event includes warmup automatically
for event in events:
    print(f"Event: {event['event_name']}")
    print(f"  Total length: {len(event['data'])}")
    print(f"  Warmup markers: {event['data']['flood_event'].isna().sum()}")
    print(f"  Flood markers: {(event['data']['flood_event']==1).sum()}")
```

| Variable | Description | Unit | Typical Source |
|---|---|---|---|
| `prcp` | Precipitation | mm/day | Rain gauge, gridded data (CHIRPS, ERA5) |
| `PET` | Potential Evapotranspiration | mm/day | Penman, Priestley-Taylor, or reanalysis |
| `streamflow` | Observed streamflow | m³/s | Stream gauge |
| `area` | Basin area | km² | GIS analysis |
| Variable | Description | Unit | Usage |
|---|---|---|---|
| `temp` | Temperature | °C | Snow module (if enabled) |
| `elevation` | Basin elevation | m | PET estimation |
| `lat`, `lon` | Coordinates | degrees | Spatial analysis |
1. **Time Resolution:** Daily (1D) is standard for the XAJ model

2. **Data Completeness:**
   - Training period: ≥5 years of continuous data
   - Warmup period: ≥1 year before training
   - Missing data: <5% acceptable, continuous gaps <7 days

3. **Physical Consistency:**
   - Precipitation ≥ 0
   - Streamflow ≥ 0
   - PET ≥ 0
   - Check the water balance: P ≈ Q + ET (within 20%)

4. **Unit Consistency:**
   - Ensure all units match `units_info.json`
   - Use consistent time stamps (no daylight saving shifts)
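A rough water-balance check is the long-term runoff ratio Q/P, converting m³/s flow to a depth in mm over the basin; a ratio above 1 usually signals a unit or area error. The helper below is an illustrative sketch, not part of hydromodel:

```python
import pandas as pd

def runoff_ratio(df, area_km2):
    """Long-term runoff ratio Q/P.
    Assumes prcp in mm/day and streamflow in m^3/s at a daily time step."""
    p_mm = df["prcp"].sum()                # total precipitation, mm
    q_m3 = df["streamflow"].sum() * 86400  # total flow volume, m^3
    q_mm = q_m3 / (area_km2 * 1e6) * 1000  # as depth over the basin, mm
    return q_mm / p_mm

# one synthetic year: 3 mm/day rain, 20 m^3/s flow, 1000 km^2 basin
df = pd.DataFrame({"prcp": [3.0] * 365, "streamflow": [20.0] * 365})
print(round(runoff_ratio(df, area_km2=1000), 3))  # 0.576
```

Ratios well outside the plausible 0-1 range are the fastest way to catch a mm/day vs m³/s mix-up.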
Convert CSV to NetCDF for roughly 10x faster access:

```python
from hydrodatasource.reader.data_source import SelfMadeHydroDataset

dataset = SelfMadeHydroDataset(
    data_path="D:/my_basin_data",
    time_unit="1D"
)

# Cache all data as NetCDF (one-time operation)
dataset.cache_xrdataset(
    gage_id_lst=basin_ids,
    t_range=["1990-01-01", "2010-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)

# Now access is much faster
data_xr = dataset.read_ts_xrdataset(
    gage_id_lst=["basin_001"],
    t_range=["1990-01-01", "2000-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)
```

Support different time scales in one dataset:
```
timeseries/
├── 1h/                      # Hourly data
│   ├── basin_001.csv
│   └── 1h_units_info.json
├── 1D/                      # Daily data (most common)
│   ├── basin_001.csv
│   └── 1D_units_info.json
└── 8D/                      # 8-day data (e.g., MODIS)
    ├── basin_001.csv
    └── 8D_units_info.json
```
Specify the time unit when constructing the dataset:

```python
dataset = SelfMadeHydroDataset(
    data_path="D:/my_basin_data",
    time_unit="1h"  # or "1D", "8D"
)
```

For large datasets in the cloud:
```python
from hydrodatasource.reader.data_source import SelfMadeHydroDataset

dataset = SelfMadeHydroDataset(
    data_path="s3://my-bucket/basin-data",
    time_unit="1D",
    minio_paras={
        "endpoint_url": "http://minio.example.com:9000",
        "key_id": "access_key",
        "secret_key": "secret_key"
    }
)
```

Here's a complete example from raw data to calibration:
```python
import json
import os

import pandas as pd
from hydrodatasource.reader.data_source import SelfMadeHydroDataset
from hydromodel.trainers.unified_calibrate import calibrate

# Step 0: Create the directory layout (to_csv fails if the folders are missing)
os.makedirs("my_data/attributes", exist_ok=True)
os.makedirs("my_data/timeseries/1D", exist_ok=True)

# Step 1: Prepare attributes
attributes = pd.DataFrame({
    'basin_id': ['basin_001', 'basin_002'],
    'area': [1250.5, 856.3],
    'lat': [30.5, 31.2],
    'lon': [105.2, 106.1]
})
attributes.to_csv("my_data/attributes/attributes.csv", index=False)

# Step 2: Prepare time series (assume you have daily_data_001.csv)
# Make sure it has columns: time, prcp, PET, streamflow
daily_data = pd.read_csv("daily_data_001.csv")
daily_data.to_csv("my_data/timeseries/1D/basin_001.csv", index=False)

# Step 3: Create units info
units = {
    "prcp": "mm/day",
    "PET": "mm/day",
    "streamflow": "m^3/s"
}
with open("my_data/timeseries/1D_units_info.json", "w") as f:
    json.dump(units, f, indent=2)

# Step 4: Verify data loads correctly
dataset = SelfMadeHydroDataset(
    data_path="my_data",
    time_unit="1D"
)
print(f"Basins: {dataset.read_object_ids()}")

# Step 5: Cache for faster access (optional)
dataset.cache_xrdataset(
    gage_id_lst=['basin_001'],
    t_range=["1990-01-01", "2010-12-31"],
    var_lst=["prcp", "PET", "streamflow"]
)

# Step 6: Run calibration with hydromodel
config = {
    "data_cfgs": {
        "data_source_type": "selfmadehydrodataset",
        "data_source_path": "my_data",
        "basin_ids": ["basin_001"],
        "train_period": ["1990-01-01", "2000-12-31"],
        "test_period": ["2001-01-01", "2010-12-31"],
        "warmup_length": 365,
    },
    "model_cfgs": {
        "model_name": "xaj_mz",
    },
    "training_cfgs": {
        "algorithm": "SCE_UA",
        "loss_func": "RMSE",
        "output_dir": "results",
        "experiment_name": "my_basin_001",
        "rep": 5000,
        "ngs": 50,
    },
    "evaluation_cfgs": {
        "metrics": ["NSE", "KGE", "RMSE"],
    },
}
results = calibrate(config)
print("Calibration complete!")
```

1. "Basin ID not found"
   - Check that the `basin_id` column in `attributes.csv` matches the CSV filenames
   - Basin IDs must be strings (not numbers)
   - Filenames: `{basin_id}.csv` (e.g., `basin_001.csv`)
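A common cause of the string/number mismatch: pandas parses numeric-looking IDs as integers and silently drops leading zeros. Reading with an explicit dtype avoids it:

```python
import pandas as pd
from io import StringIO

csv = "basin_id,area\n01013500,2252.7\n"

# Default parsing turns the ID into an integer and drops the leading zero
print(pd.read_csv(StringIO(csv))["basin_id"].iloc[0])  # 1013500

# Forcing string dtype preserves the ID exactly as written
print(pd.read_csv(StringIO(csv), dtype={"basin_id": str})["basin_id"].iloc[0])  # 01013500
```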
2. "Time column not found"
   - The CSV must have a `time` column (case-sensitive)
   - Format: `YYYY-MM-DD` for daily data, `YYYY-MM-DD HH:MM` for hourly

3. "Unit info file not found"
   - Create `{time_unit}_units_info.json` in the timeseries folder
   - Example: `1D_units_info.json` for daily data

4. "Variable not found in units info"
   - Every variable in the CSV must appear in `units_info.json`
   - Check that the spelling matches exactly (case-sensitive)

5. "Data shape mismatch"
   - All basins should have the same variables
   - All basins should cover the requested time range

6. CAMELS data download fails
   - Check your internet connection
   - Check disk space (CAMELS-US needs ~70GB)
   - Try a manual download from the official sources
   - Set `download=False` if the data already exists
Before using your custom data:

- [ ] `attributes.csv` exists with `basin_id` and `area` columns
- [ ] Time series files are named `{basin_id}.csv`
- [ ] All CSV files have a `time` column
- [ ] `{time_unit}_units_info.json` exists
- [ ] All variables in the CSVs are in `units_info.json`
- [ ] No negative precipitation or streamflow values
- [ ] The time series is continuous (no large gaps)
- [ ] Data covers the warmup + train + test periods
- [ ] Units are physically reasonable
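The structural items in this checklist can be automated. The sketch below is a hypothetical helper (not part of hydromodel) that verifies the directory layout:

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def validate_layout(root, time_unit="1D"):
    """Check a custom-data directory against the structural checklist items."""
    root, errors = Path(root), []
    attrs_path = root / "attributes" / "attributes.csv"
    if not attrs_path.exists():
        return ["attributes/attributes.csv is missing"]
    attrs = pd.read_csv(attrs_path, dtype={"basin_id": str})
    for col in ("basin_id", "area"):
        if col not in attrs.columns:
            errors.append(f"attributes.csv lacks '{col}'")
    if not (root / "timeseries" / f"{time_unit}_units_info.json").exists():
        errors.append(f"{time_unit}_units_info.json is missing")
    for basin in attrs.get("basin_id", []):
        if not (root / "timeseries" / time_unit / f"{basin}.csv").exists():
            errors.append(f"no time series file for {basin}")
    return errors

# build a minimal valid layout in a temp directory and validate it
root = Path(tempfile.mkdtemp())
(root / "attributes").mkdir()
(root / "timeseries" / "1D").mkdir(parents=True)
pd.DataFrame({"basin_id": ["basin_001"], "area": [1250.5]}).to_csv(
    root / "attributes" / "attributes.csv", index=False)
(root / "timeseries" / "1D_units_info.json").write_text(json.dumps({"prcp": "mm/day"}))
(root / "timeseries" / "1D" / "basin_001.csv").write_text("time,prcp\n1990-01-01,5.2\n")
print(validate_layout(root))  # []
```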
Extract basin attributes from shapefile:
import geopandas as gpd
# Read shapefile
basins = gpd.read_file("basins.shp")
# Calculate area (convert to km²)
basins['area'] = basins.geometry.area / 1e6
# Export to CSV
basins[['basin_id', 'area', 'lat', 'lon']].to_csv(
"attributes/attributes.csv",
index=False
)# From Excel
import pandas as pd
df = pd.read_excel("basin_data.xlsx")
df.to_csv("timeseries/1D/basin_001.csv", index=False)
# From NetCDF
import xarray as xr
ds = xr.open_dataset("data.nc")
df = ds.to_dataframe().reset_index()
df.to_csv("timeseries/1D/basin_001.csv", index=False)Choose CAMELS (hydrodataset) if:
- ✅ You need quality-controlled data
- ✅ Working with well-studied basins
- ✅ Want standardized format
- ✅ Need consistent attributes
Choose Custom Data (hydrodatasource) if:
- ✅ Using your own field data
- ✅ Working with ungauged basins
- ✅ Need specific time periods
- ✅ Have proprietary data
- **Public Data:** Use `hydrodataset` for CAMELS variants
- **Custom Data:** Use `hydrodatasource` with the `selfmadehydrodataset` format
- **Data Structure:** Follow the standard directory layout
- **Required Files:** `attributes.csv`, time series CSVs, `units_info.json`
- **Data Quality:** Check completeness, consistency, and physical validity
- **Performance:** Use NetCDF caching for large datasets
- hydrodataset GitHub: https://github.com/OuyangWenyu/hydrodataset
- hydrodataset docs: https://hydrodataset.readthedocs.io/
- hydrodatasource GitHub: https://github.com/OuyangWenyu/hydrodatasource
- hydromodel docs: usage.md, quickstart.md
- CAMELS official sites: