Feature/speed up resample again by FBumann · Pull Request #458 · flixOpt/flixopt

FBumann · 2025-11-04T21:54:55Z

Description

Further speeds up resampling and adds options for additional speedups by avoiding round trips between the FlowSystem and its dataset representation.

Type of Change

Bug fix
New feature
Documentation update
Code refactoring

Related Issues

Closes #(issue number)

Testing

I have tested my changes
Existing tests still pass

Checklist

My code follows the project style
I have updated documentation if needed
I have added tests for new functionality (if applicable)

Summary by CodeRabbit

New Features
- Selection, integer-selection, and resampling can now operate directly on datasets to enable seamless dataset-based chaining.
Refactor
- Resampling performance greatly improved; time-related metadata is now consistently computed, preserved, and updated after transforms.
Tests
- New comprehensive test suite verifies optimized resampling matches naive resampling across diverse dataset shapes and scenarios.
Documentation
- Added usage examples and power-user guidance for dataset-focused workflows.

coderabbitai · 2025-11-04T21:55:04Z

Caution

Review failed

The pull request is closed.

Walkthrough

Adds dataset-centered selection and resampling helpers to FlowSystem (new _dataset_resample, _dataset_sel, _dataset_isel, and _resample_by_dimension_groups), centralizes time-metadata computation/update, routes sel/isel/resample through dataset helpers, and introduces comprehensive resample-equivalence tests and CHANGELOG notes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Changelog `CHANGELOG.md` | Documented Unreleased: dataset-based resampling/selection helpers and a performance improvement note for FlowSystem resampling. |
| FlowSystem dataset helpers & API updates `flixopt/flow_system.py` | Added dataset-focused methods: `_dataset_resample`, `_dataset_sel`, `_dataset_isel`, `_resample_by_dimension_groups`, `_compute_time_metadata`, `_update_time_metadata`; refactored `sel`, `isel`, and `resample` to delegate to dataset helpers, preserve/update time metadata, and use grouped resampling. |
| Tests `tests/test_resample_equivalence.py` | New test suite validating that grouped/optimized resampling matches naive `Dataset.resample(...).<method>()` across frequencies, methods, mixed dims, NaNs, ordering, empty/single-variable cases, kwargs forwarding, and large datasets. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FlowSystem
    participant DatasetHelpers as Dataset Helpers
    participant GroupResampler as _resample_by_dimension_groups
    participant xr as xr.Dataset

    User->>FlowSystem: call resample/sel/isel(...)
    activate FlowSystem
    FlowSystem->>DatasetHelpers: convert FlowSystem -> xr.Dataset / call helper
    activate DatasetHelpers
    alt resample path
        DatasetHelpers->>GroupResampler: group variables by dimension sets
        activate GroupResampler
        GroupResampler->>xr: stack group -> DataArray
        GroupResampler->>xr: resample per-group (mean/sum/...)
        GroupResampler->>xr: unstack/merge -> xr.Dataset
        GroupResampler-->>DatasetHelpers: grouped resampled xr.Dataset
        deactivate GroupResampler
    else sel/isel path
        DatasetHelpers->>xr: perform sel/isel on xr.Dataset
    end
    DatasetHelpers->>DatasetHelpers: update time metadata (hours_per_timestep, last/prev)
    DatasetHelpers-->>FlowSystem: xr.Dataset
    deactivate DatasetHelpers
    FlowSystem->>FlowSystem: from_dataset(...) -> FlowSystem result
    FlowSystem-->>User: FlowSystem (result)
    deactivate FlowSystem

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Inspect grouping/stacking/unstacking logic in _resample_by_dimension_groups for axis-order, dtype, and coordinate preservation.
Verify _compute_time_metadata and _update_time_metadata for irregular timesteps, single-step, and NaN edge cases.
Confirm sel/isel early-exit behavior and correct round-trip via from_dataset.

Possibly related PRs

Feature/speed up resample #455: Modifies FlowSystem resampling internals and adds/changes _resample_by_dimension_groups; likely overlaps with these resampling optimizations.

Poem

🐰 I hopped through xr data, quick and spry,
I stacked the vars and watched time fly,
I grouped and resampled, kept timestamps neat,
Unstacked the numbers — tidy and sweet,
A carrot for tests that prove it's right! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Feature/speed up resample again' directly corresponds to the main objective of speeding up resampling operations, which is the primary change across all modified files. |
| Description check | ✅ Passed | The description contains a clear summary of changes ('speed up resampling and add options to speed up even more by avoiding round trips'), but several template sections are incomplete: Type of Change has no option selected, Related Issues lacks a specific reference, and Testing checkboxes are not marked completed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4304dca and 937a65e.

📒 Files selected for processing (1)

tests/test_resample_equivalence.py (1 hunks)


coderabbitai

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between efb8532 and 824fa31.

📒 Files selected for processing (2)

CHANGELOG.md (1 hunks)
flixopt/flow_system.py (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

flixopt/flow_system.py (1)

flixopt/structure.py (3)

copy (827-839)

to_dataset (668-691)

from_dataset (712-745)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: test (3.11)
GitHub Check: test (3.13)
GitHub Check: test (3.12)
GitHub Check: test (3.10)

flixopt/flow_system.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

flixopt/flow_system.py (1)

1203-1205: Critical: Recompute hours_per_timestep after resampling.

After resampling the time dimension, hours_per_timestep in the dataset retains aggregated values (e.g., ~1.0 after method='mean') rather than reflecting the new bin durations. Power users chaining dataset operations will compute incorrect energy totals. The resampled time index must be used to recalculate duration vectors and update both the hours_per_timestep variable and the related attributes.

The fix suggested in the previous review remains applicable:

-        if non_time_var_names:
-            non_time_dataset = dataset[non_time_var_names]
-            result = xr.merge([resampled_time_dataset, non_time_dataset])
-        else:
-            result = resampled_time_dataset
-
-        # Update time-related attributes
-        result.attrs['hours_of_last_timestep'] = hours_of_last_timestep
-        result.attrs['hours_of_previous_timesteps'] = hours_of_previous_timesteps
+        if non_time_var_names:
+            non_time_dataset = dataset[non_time_var_names]
+            result = xr.merge([resampled_time_dataset, non_time_dataset])
+        else:
+            result = resampled_time_dataset
+
+        # Recompute timestep metadata based on the resampled time index
+        new_time_index = result.indexes.get('time')
+        if new_time_index is None or len(new_time_index) < 1:
+            return result
+
+        timesteps_extra = cls._create_timesteps_with_extra(new_time_index, hours_of_last_timestep)
+        hours_per_timestep_recomputed = cls.calculate_hours_per_timestep(timesteps_extra)
+        result['hours_per_timestep'] = hours_per_timestep_recomputed
+
+        if hours_of_last_timestep is None:
+            hours_of_last_timestep = hours_per_timestep_recomputed.isel(time=-1).item()
+        if hours_of_previous_timesteps is None:
+            hours_of_previous_timesteps = hours_per_timestep_recomputed.isel(time=0).item()
+
+        # Update time-related attributes
+        result.attrs['hours_of_last_timestep'] = hours_of_last_timestep
+        result.attrs['hours_of_previous_timesteps'] = hours_of_previous_timesteps
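
The effect described in this comment can be reproduced with plain xarray. The sketch below is illustrative only: `hours_per_timestep_from_index` is a hypothetical helper, not the `_create_timesteps_with_extra` / `calculate_hours_per_timestep` pair used in the diff, and it assumes the last timestep has the same length as the one before it when no explicit value is given.

```python
import numpy as np
import pandas as pd
import xarray as xr


def hours_per_timestep_from_index(index, hours_of_last_timestep=None):
    """Hypothetical helper: durations in hours between consecutive timestamps."""
    diffs = np.diff(index.values) / np.timedelta64(1, 'h')
    if hours_of_last_timestep is None:
        if len(diffs) == 0:
            raise ValueError('A single timestep needs an explicit hours_of_last_timestep')
        hours_of_last_timestep = diffs[-1]  # assumption: last step repeats the previous one
    hours = np.append(diffs, hours_of_last_timestep)
    return xr.DataArray(hours, coords={'time': index}, dims='time')


time = pd.date_range('2020-01-01', periods=8, freq='h')
ds = xr.Dataset(
    {'hours_per_timestep': ('time', np.ones(8)), 'power': ('time', np.arange(8.0))},
    coords={'time': time},
)

resampled = ds.resample(time='4h').mean()
stale_hours = resampled['hours_per_timestep'].values.copy()  # averaged durations, still 1.0

# Replace the averaged durations with durations derived from the new time index
resampled['hours_per_timestep'] = hours_per_timestep_from_index(resampled.indexes['time'])
energy = float((resampled['power'] * resampled['hours_per_timestep']).sum())
```

With the stale value of 1.0, the same energy total would come out four times too small, which is the error the review warns about for chained dataset operations.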

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 824fa31 and f788a14.

📒 Files selected for processing (2)

CHANGELOG.md (1 hunks)
flixopt/flow_system.py (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

flixopt/flow_system.py (1)

flixopt/structure.py (6)

copy (827-839)

to_dataset (668-691)

from_dataset (712-745)

values (1199-1204)

values (1481-1482)

hours_of_previous_timesteps (189-190)

🔇 Additional comments (8)

CHANGELOG.md (1)

65-65: LGTM!

The performance improvement note appropriately documents the resampling optimization.

flixopt/flow_system.py (7)

124-139: LGTM!

The power user documentation clearly demonstrates the performance benefits of dataset-based chaining.

930-974: LGTM!

The implementation correctly handles label-based selection with appropriate early returns.

976-1004: LGTM!

The delegation to _dataset_sel is clean, and the early return for no-op selections is a good optimization.

1006-1039: LGTM!

Consistent implementation pattern with _dataset_sel for integer-based indexing.

1041-1069: LGTM!

Proper delegation to _dataset_isel with consistent handling.
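
The early-return pattern praised in these comments might look roughly like the following. Both helper names are hypothetical, and the real `_dataset_sel` / `_dataset_isel` also update time metadata, which this sketch leaves out.

```python
import numpy as np
import pandas as pd
import xarray as xr


def dataset_sel_sketch(ds, **indexers):
    """Hypothetical sketch: label-based selection that skips unset indexers."""
    active = {dim: idx for dim, idx in indexers.items() if idx is not None}
    if not active:
        return ds  # early return: nothing to select, so no work is done
    return ds.sel(active)


def dataset_isel_sketch(ds, **indexers):
    """Hypothetical sketch: the same pattern for integer-based selection."""
    active = {dim: idx for dim, idx in indexers.items() if idx is not None}
    if not active:
        return ds
    return ds.isel(active)


time = pd.date_range('2020-01-01', periods=6, freq='h')
ds = xr.Dataset({'load': ('time', np.arange(6.0))}, coords={'time': time})

unchanged = dataset_sel_sketch(ds, time=None)
by_label = dataset_sel_sketch(ds, time=slice('2020-01-01 01:00', '2020-01-01 03:00'))
by_position = dataset_isel_sketch(ds, time=slice(0, 2))
```

Label slices in xarray include both endpoints, so `by_label` holds three timesteps while the positional slice holds two.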

1071-1153: Excellent performance optimization!

The dimension-based grouping strategy elegantly prevents broadcasting issues while achieving significant speedups. The DataArray concatenation approach and proper handling of edge cases demonstrate careful implementation.

1209-1246: Delegation structure is sound.

The public resample method properly delegates to _dataset_resample. However, this inherits the critical hours_per_timestep recomputation issue flagged in _dataset_resample (lines 1203-1205).

CHANGELOG.md

coderabbitai

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f788a14 and 34509f4.

📒 Files selected for processing (1)

flixopt/flow_system.py (9 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

flixopt/flow_system.py (1)

flixopt/structure.py (6)

hours_of_previous_timesteps (189-190)

get (1328-1333)

get (1488-1490)

copy (827-839)

to_dataset (668-691)

from_dataset (712-745)

flixopt/flow_system.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/test_resample_equivalence.py (2)
26-82: Consider setting a random seed for reproducibility.

The function uses np.random.randn() without setting a seed, which makes tests non-deterministic. While xr.testing.assert_allclose should handle numerical precision differences, setting a seed (e.g., np.random.seed(42) at the start) would ensure fully reproducible test results and make debugging easier if failures occur.

Apply this diff to add a seed parameter with a default value:
-def create_dataset_with_mixed_dimensions(n_timesteps=48):
+def create_dataset_with_mixed_dimensions(n_timesteps=48, seed=42):
     """
     Create a dataset with variables having different dimension structures.
 
     This mimics realistic data with:
     - Variables with only time dimension
     - Variables with time + one other dimension
     - Variables with time + multiple dimensions
     """
+    np.random.seed(seed)
     timesteps = pd.date_range('2020-01-01', periods=n_timesteps, freq='h')
85-272: Consider testing the **kwargs parameter path.

The _resample_by_dimension_groups method accepts **kwargs that are forwarded to xarray.resample(), but none of the tests exercise this parameter. Consider adding a test that passes additional arguments (e.g., label='right', closed='right') to verify they're correctly forwarded through the optimization path.

Example test to add:
def test_resample_equivalence_with_kwargs():
    """Test that kwargs are properly forwarded to resample()."""
    timesteps = pd.date_range('2020-01-01', periods=48, freq='h')
    ds = xr.Dataset(coords={'time': timesteps})
    ds['var'] = xr.DataArray(np.random.randn(48), dims=['time'])
    
    kwargs = {'label': 'right', 'closed': 'right'}
    result_optimized = fx.FlowSystem._resample_by_dimension_groups(
        ds, '2h', 'mean', **kwargs
    )
    result_naive = getattr(ds.resample(time='2h', **kwargs), 'mean')()
    
    xr.testing.assert_allclose(result_optimized, result_naive)
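
For the seeding nitpick, an alternative to the global `np.random.seed` call in the suggested diff is a local generator from `np.random.default_rng`, which avoids mutating global state between tests. The sketch below swaps in that technique; `create_seeded_dataset` and its variable names are hypothetical and not the test suite's actual helper.

```python
import numpy as np
import pandas as pd
import xarray as xr


def create_seeded_dataset(n_timesteps=48, seed=42):
    """Hypothetical helper: reproducible mixed-dimension test data."""
    rng = np.random.default_rng(seed)  # local generator, no global state touched
    time = pd.date_range('2020-01-01', periods=n_timesteps, freq='h')
    return xr.Dataset(
        {
            'time_only': ('time', rng.standard_normal(n_timesteps)),
            'time_and_x': (('time', 'x'), rng.standard_normal((n_timesteps, 3))),
        },
        coords={'time': time, 'x': ['a', 'b', 'c']},
    )


first = create_seeded_dataset(seed=42)
second = create_seeded_dataset(seed=42)
other = create_seeded_dataset(seed=7)
```

Two calls with the same seed produce identical datasets, so a failing equivalence test can be replayed exactly.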

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f0e1f16 and 9fb3718.

📒 Files selected for processing (1)

tests/test_resample_equivalence.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/test_resample_equivalence.py (1)

flixopt/flow_system.py (2)

resample (1309-1347)

_resample_by_dimension_groups (1174-1254)

🪛 GitHub Actions: Python Package CI/CD

tests/test_resample_equivalence.py

[warning] 93-93: Ruff formatting: 1 file would be reformatted; 55 files already formatted. The change shown in the diff suggests a multi-line function call was reformatted to a single line.

[error] 1-1: Ruff formatting failed (exit code 1). Run 'ruff format' to fix code style issues.

🔇 Additional comments (5)

tests/test_resample_equivalence.py (5)

129-141: Good edge case coverage!

Testing the empty dataset case ensures the optimization handles this gracefully. The implementation should return an empty resampled dataset without errors.

159-180: Excellent NaN handling coverage!

This test ensures both the optimized and naive approaches handle NaN values consistently across aggregation methods. This is critical for correctness since NaN handling can differ subtly between xarray operations.
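
A small sketch of the property this comment refers to: with xarray's default NaN skipping, a stacked resample should yield the same per-bin means as resampling each variable separately, including bins that contain only NaN. The variable names and the stacking along a `variable` dimension are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2020-01-01', periods=4, freq='h')
a = xr.DataArray([1.0, np.nan, 3.0, 5.0], coords={'time': time}, dims='time', name='a')
b = xr.DataArray([np.nan, np.nan, 2.0, np.nan], coords={'time': time}, dims='time', name='b')

# Naive path: resample each variable on its own
naive = {da.name: da.resample(time='2h').mean().values for da in (a, b)}

# Stacked path: one resample call covering both variables
stacked = xr.concat([a, b], dim=pd.Index(['a', 'b'], name='variable'))
stacked_mean = stacked.resample(time='2h').mean()
```

The first bin of `b` holds only NaN and stays NaN in both paths, while partially filled bins average the remaining values.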

182-219: Dimension order independence is well-tested.

This test verifies that the grouping logic correctly handles variables regardless of where the time dimension appears. This is important since the _resample_by_dimension_groups implementation groups by non-time dimensions.
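
The order independence can be checked directly in xarray, since `resample` addresses the time dimension by name rather than by axis position. This is a minimal sketch with made-up data, not one of the PR's tests.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2020-01-01', periods=6, freq='h')
rng = np.random.default_rng(1)
time_first = xr.DataArray(
    rng.standard_normal((6, 2)),
    coords={'time': time, 'x': ['p', 'q']},
    dims=('time', 'x'),
)
time_last = time_first.transpose('x', 'time')  # same data, time on the last axis

resampled_first = time_first.resample(time='3h').sum()
resampled_last = time_last.resample(time='3h').sum()
```

After aligning the axis order with `transpose`, both results hold the same values.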

221-244: Key optimization case is properly validated.

This test directly exercises the core benefit of _resample_by_dimension_groups: grouping multiple variables with identical dimensions for efficient batch resampling. The comment correctly identifies this as the key optimization scenario.

246-272: Realistic scale testing with pragmatic method selection.

Testing with a larger dataset (168 timesteps, multiple dimensions) ensures the optimization works at realistic scales. The selective method testing is a sensible tradeoff between coverage and test performance.

tests/test_resample_equivalence.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tests/test_resample_equivalence.py (1)
85-86: Consider adding 'count' method to test coverage.

The _dataset_resample method in flow_system.py lists 'count' as an available method (line 1256), but it's not included in any of the parametrized test methods here. While the current coverage of 9 other methods is thorough, adding 'count' would ensure complete method coverage.

Apply this diff to include 'count' in the first parametrized test:
-@pytest.mark.parametrize('method', ['mean', 'sum', 'max', 'min', 'first', 'last'])
+@pytest.mark.parametrize('method', ['mean', 'sum', 'max', 'min', 'first', 'last', 'count'])
 @pytest.mark.parametrize('freq', ['2h', '4h', '1D'])
 def test_resample_equivalence_mixed_dimensions(method, freq):
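
One reason `count` deserves its own coverage is that it behaves differently from the value aggregations: it returns the number of non-NaN entries per bin rather than a statistic of the values. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2020-01-01', periods=4, freq='h')
da = xr.DataArray([1.0, np.nan, 3.0, 4.0], coords={'time': time}, dims='time')

# The first 2h bin holds one valid value, the second holds two
counts = da.resample(time='2h').count()
```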

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 9fb3718 and fe33c70.

📒 Files selected for processing (1)

tests/test_resample_equivalence.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/test_resample_equivalence.py (1)

flixopt/flow_system.py (3)

resample (1309-1347)

FlowSystem (49-1351)

_resample_by_dimension_groups (1174-1254)

🔇 Additional comments (3)

tests/test_resample_equivalence.py (3)

17-82: LGTM! Well-structured helper functions.

The naive_dataset_resample function provides a clear baseline for comparison, and create_dataset_with_mixed_dimensions creates comprehensive test data covering various dimension combinations that would appear in real-world energy system modeling.
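
A baseline of this kind is typically a one-liner around `Dataset.resample`. The sketch below assumes a signature of `(dataset, freq, method, **kwargs)`; the actual `naive_dataset_resample` in the test file is not shown in this conversation and may differ.

```python
import numpy as np
import pandas as pd
import xarray as xr


def naive_resample_sketch(ds, freq, method, **kwargs):
    """Hypothetical baseline: resample the whole Dataset in one xarray call."""
    return getattr(ds.resample(time=freq, **kwargs), method)()


time = pd.date_range('2020-01-01', periods=4, freq='h')
ds = xr.Dataset({'var': ('time', np.arange(4.0))}, coords={'time': time})
baseline = naive_resample_sketch(ds, '2h', 'mean')
```

Looking the method up with `getattr` lets a single helper serve every parametrized aggregation name.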

85-269: Excellent test coverage for resampling equivalence.

The test suite comprehensively validates the dimension-grouping optimization against the naive approach across multiple scenarios:

Mixed dimension structures (the critical optimization case)

Edge cases (empty datasets, single variables)

Data quality issues (NaNs)

Dimension ordering variations

Realistic scales

The systematic use of xr.testing.assert_allclose ensures proper floating-point comparison across all scenarios.

272-273: LGTM!

Standard pytest main block for direct test execution.

coderabbitai

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between fe33c70 and 4304dca.

📒 Files selected for processing (1)

tests/test_resample_equivalence.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/test_resample_equivalence.py (1)

flixopt/flow_system.py (3)

resample (1309-1347)

FlowSystem (49-1351)

_resample_by_dimension_groups (1174-1254)

tests/test_resample_equivalence.py

FBumann added 29 commits November 4, 2025 08:15

Resample a single concatenated dataarray instead of a Dataset

a504914

Performance improvements

92633a5

Use helper method for resampling speed up resampling

aa04d29

Improve docstring

dbd1405

Improve docstring

dfa6ef9

avoiding attribute conflicts and empty merge errors

75cbb64

moving method validation earlier for fail-fast behavior

b50d538

Update CHANGELOG.md

7bf1925

Add new combined select and resample method

8eacc67

Remove code duplication

05ff393

Merge branch 'main' into feature/speed-up-resample-again

7f30d15

# Conflicts: # flixopt/flow_system.py

Add benchmark

fb6e426

Improve benchmark

c770102

Improve benchmark

ec4bff4

Add power user chaining options

382f401

Remove benchmark

f5d8625

Use dask chunking in resample

2550853

Make the new methods class methods

927f275

Update benchmark_bottleneck.py

1b98d2f

Use dataframe based approach

a762fb0

registry pattern

5e2f1b3

registry pattern

e616768

Improve benchmark

5739661

Benchmark dataarray version

995b6ac

Use dataarray conversion before resampling

cf609ac

Benchmark dask speedup

96f2ea9

Add dask chunking

e4ad7c0

Remove dask chunking due to negligible improvements

af4c66a

Remove benchmark_bottleneck.py

824fa31

coderabbitai bot reviewed Nov 4, 2025

flixopt/flow_system.py (outdated, resolved)

FBumann added 2 commits November 5, 2025 08:41

Merge branch 'main' into feature/speed-up-resample-again

622db1b

# Conflicts: # CHANGELOG.md

Update CHANGELOG.md

f788a14

coderabbitai bot reviewed Nov 5, 2025

CHANGELOG.md (outdated, resolved)

FBumann added 5 commits November 5, 2025 09:02

Make ._dataset_... methods self contained, handling time index stuff directly

e79af68

Use helper method

89507bb

further deduplication and consistency improvements

5c6c38b

improve docstrings

f047b7a

ruff format

34509f4

coderabbitai bot reviewed Nov 5, 2025

flixopt/flow_system.py (resolved)

FBumann added 4 commits November 5, 2025 09:57

fixed the metadata preservation issue in flow_system.py:352-369

67ea6ec

Typo

f0e1f16

Add test

a3059be

Speed up tests

9fb3718

coderabbitai bot reviewed Nov 6, 2025

tests/test_resample_equivalence.py (outdated, resolved)

ruff format

fe33c70

coderabbitai bot reviewed Nov 6, 2025

FBumann added 2 commits November 6, 2025 16:25

Improve tests

60f6d43

Linting fixes

4304dca

coderabbitai bot reviewed Nov 6, 2025

tests/test_resample_equivalence.py (resolved)

Fix tests

937a65e

FBumann merged commit 275cd6a into main Nov 6, 2025
7 of 8 checks passed

FBumann deleted the feature/speed-up-resample-again branch November 19, 2025 13:41

FBumann merged 44 commits into main from feature/speed-up-resample-again
