Skip to content

Feature/speed up resample again#458

Merged
FBumann merged 44 commits intomainfrom
feature/speed-up-resample-again
Nov 6, 2025
Merged

Feature/speed up resample again#458
FBumann merged 44 commits intomainfrom
feature/speed-up-resample-again

Conversation

@FBumann
Copy link
Member

@FBumann FBumann commented Nov 4, 2025

Description

Further speed up resampling and add options to speed up even more by avoiding round trips to and from dataset

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Code refactoring

Related Issues

Closes #(issue number)

Testing

  • I have tested my changes
  • Existing tests still pass

Checklist

  • My code follows the project style
  • I have updated documentation if needed
  • I have added tests for new functionality (if applicable)

Summary by CodeRabbit

  • New Features

    • Selection, integer-selection, and resampling can now operate directly on datasets to enable seamless dataset-based chaining.
  • Refactor

    • Resampling performance greatly improved; time-related metadata is now consistently computed, preserved, and updated after transforms.
  • Tests

    • New comprehensive test suite verifies optimized resampling matches naive resampling across diverse dataset shapes and scenarios.
  • Documentation

    • Added usage examples and power-user guidance for dataset-focused workflows.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 4, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds dataset-centered selection and resampling helpers to FlowSystem (new _dataset_resample, _dataset_sel, _dataset_isel, and _resample_by_dimension_groups), centralizes time-metadata computation/update, routes sel/isel/resample through dataset helpers, and introduces comprehensive resample-equivalence tests and CHANGELOG notes.

Changes

Cohort / File(s) Summary
Changelog
CHANGELOG.md
Documented Unreleased: dataset-based resampling/selection helpers and a performance improvement note for FlowSystem resampling.
FlowSystem dataset helpers & API updates
flixopt/flow_system.py
Added dataset-focused methods: _dataset_resample, _dataset_sel, _dataset_isel, _resample_by_dimension_groups, _compute_time_metadata, _update_time_metadata; refactored sel, isel, and resample to delegate to dataset helpers, preserve/update time metadata, and use grouped resampling.
Tests
tests/test_resample_equivalence.py
New test suite validating that grouped/optimized resampling matches naive Dataset.resample(...).<method>() across frequencies, methods, mixed dims, NaNs, ordering, empty/single-variable cases, kwargs forwarding, and large datasets.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FlowSystem
    participant DatasetHelpers as Dataset Helpers
    participant GroupResampler as _resample_by_dimension_groups
    participant xr as xr.Dataset

    User->>FlowSystem: call resample/sel/isel(...)
    activate FlowSystem
    FlowSystem->>DatasetHelpers: convert FlowSystem -> xr.Dataset / call helper
    activate DatasetHelpers
    alt resample path
        DatasetHelpers->>GroupResampler: group variables by dimension sets
        activate GroupResampler
        GroupResampler->>xr: stack group -> DataArray
        GroupResampler->>xr: resample per-group (mean/sum/...)
        GroupResampler->>xr: unstack/merge -> xr.Dataset
        GroupResampler-->>DatasetHelpers: grouped resampled xr.Dataset
        deactivate GroupResampler
    else sel/isel path
        DatasetHelpers->>xr: perform sel/isel on xr.Dataset
    end
    DatasetHelpers->>DatasetHelpers: update time metadata (hours_per_timestep, last/prev)
    DatasetHelpers-->>FlowSystem: xr.Dataset
    deactivate DatasetHelpers
    FlowSystem->>FlowSystem: from_dataset(...) -> FlowSystem result
    FlowSystem-->>User: FlowSystem (result)
    deactivate FlowSystem
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect grouping/stacking/unstacking logic in _resample_by_dimension_groups for axis-order, dtype, and coordinate preservation.
  • Verify _compute_time_metadata and _update_time_metadata for irregular timesteps, single-step, and NaN edge cases.
  • Confirm sel/isel early-exit behavior and correct round-trip via from_dataset.

Possibly related PRs

  • Feature/speed up resample #455: Modifies FlowSystem resampling internals and adds/changes _resample_by_dimension_groups; likely overlaps with these resampling optimizations.

Poem

🐰 I hopped through xr data, quick and spry,
I stacked the vars and watched time fly,
I grouped and resampled, kept timestamps neat,
Unstacked the numbers — tidy and sweet,
A carrot for tests that prove it's right! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Feature/speed up resample again' directly corresponds to the main objective of speeding up resampling operations, which is the primary change across all modified files.
Description check ✅ Passed The description contains a clear summary of changes ('speed up resampling and add options to speed up even more by avoiding round trips'), but several template sections are incomplete: Type of Change has no option selected, Related Issues lacks a specific reference, and Testing checkboxes are not marked completed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4304dca and 937a65e.

📒 Files selected for processing (1)
  • tests/test_resample_equivalence.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efb8532 and 824fa31.

📒 Files selected for processing (2)
  • CHANGELOG.md (1 hunks)
  • flixopt/flow_system.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
flixopt/flow_system.py (1)
flixopt/structure.py (3)
  • copy (827-839)
  • to_dataset (668-691)
  • from_dataset (712-745)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.10)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
flixopt/flow_system.py (1)

1203-1205: Critical: Recompute hours_per_timestep after resampling.

After resampling the time dimension, hours_per_timestep in the dataset retains aggregated values (e.g., ~1.0 after method='mean') rather than reflecting the new bin durations. Power users chaining dataset operations will compute incorrect energy totals. The resampled time index must be used to recalculate duration vectors and update both the hours_per_timestep variable and the related attributes.

The fix suggested in the previous review remains applicable:

-        if non_time_var_names:
-            non_time_dataset = dataset[non_time_var_names]
-            result = xr.merge([resampled_time_dataset, non_time_dataset])
-        else:
-            result = resampled_time_dataset
-
-        # Update time-related attributes
-        result.attrs['hours_of_last_timestep'] = hours_of_last_timestep
-        result.attrs['hours_of_previous_timesteps'] = hours_of_previous_timesteps
+        if non_time_var_names:
+            non_time_dataset = dataset[non_time_var_names]
+            result = xr.merge([resampled_time_dataset, non_time_dataset])
+        else:
+            result = resampled_time_dataset
+
+        # Recompute timestep metadata based on the resampled time index
+        new_time_index = result.indexes.get('time')
+        if new_time_index is None or len(new_time_index) < 1:
+            return result
+
+        timesteps_extra = cls._create_timesteps_with_extra(new_time_index, hours_of_last_timestep)
+        hours_per_timestep_recomputed = cls.calculate_hours_per_timestep(timesteps_extra)
+        result['hours_per_timestep'] = hours_per_timestep_recomputed
+
+        if hours_of_last_timestep is None:
+            hours_of_last_timestep = hours_per_timestep_recomputed.isel(time=-1).item()
+        if hours_of_previous_timesteps is None:
+            hours_of_previous_timesteps = hours_per_timestep_recomputed.isel(time=0).item()
+
+        # Update time-related attributes
+        result.attrs['hours_of_last_timestep'] = hours_of_last_timestep
+        result.attrs['hours_of_previous_timesteps'] = hours_of_previous_timesteps
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 824fa31 and f788a14.

📒 Files selected for processing (2)
  • CHANGELOG.md (1 hunks)
  • flixopt/flow_system.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
flixopt/flow_system.py (1)
flixopt/structure.py (6)
  • copy (827-839)
  • to_dataset (668-691)
  • from_dataset (712-745)
  • values (1199-1204)
  • values (1481-1482)
  • hours_of_previous_timesteps (189-190)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.10)
🔇 Additional comments (8)
CHANGELOG.md (1)

65-65: LGTM!

The performance improvement note appropriately documents the resampling optimization.

flixopt/flow_system.py (7)

124-139: LGTM!

The power user documentation clearly demonstrates the performance benefits of dataset-based chaining.


930-974: LGTM!

The implementation correctly handles label-based selection with appropriate early returns.


976-1004: LGTM!

The delegation to _dataset_sel is clean, and the early return for no-op selections is a good optimization.


1006-1039: LGTM!

Consistent implementation pattern with _dataset_sel for integer-based indexing.


1041-1069: LGTM!

Proper delegation to _dataset_isel with consistent handling.


1071-1153: Excellent performance optimization!

The dimension-based grouping strategy elegantly prevents broadcasting issues while achieving significant speedups. The DataArray concatenation approach and proper handling of edge cases demonstrate careful implementation.


1209-1246: Delegation structure is sound.

The public resample method properly delegates to _dataset_resample. However, this inherits the critical hours_per_timestep recomputation issue flagged in _dataset_resample (lines 1203-1205).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f788a14 and 34509f4.

📒 Files selected for processing (1)
  • flixopt/flow_system.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
flixopt/flow_system.py (1)
flixopt/structure.py (6)
  • hours_of_previous_timesteps (189-190)
  • get (1328-1333)
  • get (1488-1490)
  • copy (827-839)
  • to_dataset (668-691)
  • from_dataset (712-745)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.10)
  • GitHub Check: test (3.12)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/test_resample_equivalence.py (2)

26-82: Consider setting a random seed for reproducibility.

The function uses np.random.randn() without setting a seed, which makes tests non-deterministic. While xr.testing.assert_allclose should handle numerical precision differences, setting a seed (e.g., np.random.seed(42) at the start) would ensure fully reproducible test results and make debugging easier if failures occur.

Apply this diff to add a seed parameter with a default value:

-def create_dataset_with_mixed_dimensions(n_timesteps=48):
+def create_dataset_with_mixed_dimensions(n_timesteps=48, seed=42):
     """
     Create a dataset with variables having different dimension structures.
 
     This mimics realistic data with:
     - Variables with only time dimension
     - Variables with time + one other dimension
     - Variables with time + multiple dimensions
     """
+    np.random.seed(seed)
     timesteps = pd.date_range('2020-01-01', periods=n_timesteps, freq='h')

85-272: Consider testing the **kwargs parameter path.

The _resample_by_dimension_groups method accepts **kwargs that are forwarded to xarray.resample(), but none of the tests exercise this parameter. Consider adding a test that passes additional arguments (e.g., label='right', closed='right') to verify they're correctly forwarded through the optimization path.

Example test to add:

def test_resample_equivalence_with_kwargs():
    """Test that kwargs are properly forwarded to resample()."""
    timesteps = pd.date_range('2020-01-01', periods=48, freq='h')
    ds = xr.Dataset(coords={'time': timesteps})
    ds['var'] = xr.DataArray(np.random.randn(48), dims=['time'])
    
    kwargs = {'label': 'right', 'closed': 'right'}
    result_optimized = fx.FlowSystem._resample_by_dimension_groups(
        ds, '2h', 'mean', **kwargs
    )
    result_naive = getattr(ds.resample(time='2h', **kwargs), 'mean')()
    
    xr.testing.assert_allclose(result_optimized, result_naive)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0e1f16 and 9fb3718.

📒 Files selected for processing (1)
  • tests/test_resample_equivalence.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_resample_equivalence.py (1)
flixopt/flow_system.py (2)
  • resample (1309-1347)
  • _resample_by_dimension_groups (1174-1254)
🪛 GitHub Actions: Python Package CI/CD
tests/test_resample_equivalence.py

[warning] 93-93: Ruff formatting: 1 file would be reformatted; 55 files already formatted. The change shown in the diff suggests a multi-line function call was reformatted to a single line.


[error] 1-1: Ruff formatting failed (exit code 1). Run 'ruff format' to fix code style issues.

🔇 Additional comments (5)
tests/test_resample_equivalence.py (5)

129-141: Good edge case coverage!

Testing the empty dataset case ensures the optimization handles this gracefully. The implementation should return an empty resampled dataset without errors.


159-180: Excellent NaN handling coverage!

This test ensures both the optimized and naive approaches handle NaN values consistently across aggregation methods. This is critical for correctness since NaN handling can differ subtly between xarray operations.


182-219: Dimension order independence is well-tested.

This test verifies that the grouping logic correctly handles variables regardless of where the time dimension appears. This is important since the _resample_by_dimension_groups implementation groups by non-time dimensions.


221-244: Key optimization case is properly validated.

This test directly exercises the core benefit of _resample_by_dimension_groups: grouping multiple variables with identical dimensions for efficient batch resampling. The comment correctly identifies this as the key optimization scenario.


246-272: Realistic scale testing with pragmatic method selection.

Testing with a larger dataset (168 timesteps, multiple dimensions) ensures the optimization works at realistic scales. The selective method testing is a sensible tradeoff between coverage and test performance.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_resample_equivalence.py (1)

85-86: Consider adding 'count' method to test coverage.

The _dataset_resample method in flow_system.py lists 'count' as an available method (line 1256), but it's not included in any of the parametrized test methods here. While the current coverage of 9 other methods is thorough, adding 'count' would ensure complete method coverage.

Apply this diff to include 'count' in the first parametrized test:

-@pytest.mark.parametrize('method', ['mean', 'sum', 'max', 'min', 'first', 'last'])
+@pytest.mark.parametrize('method', ['mean', 'sum', 'max', 'min', 'first', 'last', 'count'])
 @pytest.mark.parametrize('freq', ['2h', '4h', '1D'])
 def test_resample_equivalence_mixed_dimensions(method, freq):
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9fb3718 and fe33c70.

📒 Files selected for processing (1)
  • tests/test_resample_equivalence.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_resample_equivalence.py (1)
flixopt/flow_system.py (3)
  • resample (1309-1347)
  • FlowSystem (49-1351)
  • _resample_by_dimension_groups (1174-1254)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.10)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
🔇 Additional comments (3)
tests/test_resample_equivalence.py (3)

17-82: LGTM! Well-structured helper functions.

The naive_dataset_resample function provides a clear baseline for comparison, and create_dataset_with_mixed_dimensions creates comprehensive test data covering various dimension combinations that would appear in real-world energy system modeling.


85-269: Excellent test coverage for resampling equivalence.

The test suite comprehensively validates the dimension-grouping optimization against the naive approach across multiple scenarios:

  • Mixed dimension structures (the critical optimization case)
  • Edge cases (empty datasets, single variables)
  • Data quality issues (NaNs)
  • Dimension ordering variations
  • Realistic scales

The systematic use of xr.testing.assert_allclose ensures proper floating-point comparison across all scenarios.


272-273: LGTM!

Standard pytest main block for direct test execution.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe33c70 and 4304dca.

📒 Files selected for processing (1)
  • tests/test_resample_equivalence.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_resample_equivalence.py (1)
flixopt/flow_system.py (3)
  • resample (1309-1347)
  • FlowSystem (49-1351)
  • _resample_by_dimension_groups (1174-1254)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.10)
  • GitHub Check: test (3.11)

@FBumann FBumann merged commit 275cd6a into main Nov 6, 2025
7 of 8 checks passed
@FBumann FBumann deleted the feature/speed-up-resample-again branch November 19, 2025 13:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant