-
Notifications
You must be signed in to change notification settings - Fork 839
improvement: enhance lazy df utilities and ensure dataframe type preservation in transformations #7401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ervation in transformations Fixes #7399 When using `mo.ui.dataframe` with lazy dataframes (Ibis tables, Polars LazyFrames), the lazy behavior was being lost. The dataframe would be inadvertently collected/materialized, converting lazy Ibis tables to PyArrow tables and Polars LazyFrames to eager DataFrames. This PR refactors the lazy/eager tracking logic into a reusable utility function that properly preserves the original dataframe type. This PR adds two utilities `make_lazy` and `collect_and_preserve_type` to help with this nuanced logic and then updates usage of this throughout (and some tests).
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances lazy dataframe handling in marimo by introducing two new utility functions (make_lazy and collect_and_preserve_type) that properly preserve the original dataframe type (eager vs. lazy) throughout transformations. Previously, lazy dataframes like Ibis tables and Polars LazyFrames were inadvertently being materialized when used with mo.ui.dataframe, converting them to eager types. The new utilities track the original type and provide callback functions to restore it after operations.
Key Changes
- Added
make_lazy()utility that converts any dataframe to lazy and returns an undo callback - Added
collect_and_preserve_type()utility that collects a lazy dataframe while preserving backend information - Refactored dataframe transform handlers to use these utilities instead of manual collect/lazy calls
- Added comprehensive test coverage for type preservation across all dataframe backends
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
marimo/_utils/narwhals_utils.py |
Adds the new make_lazy and collect_and_preserve_type utility functions with type tracking and undo callbacks |
marimo/_plugins/ui/_impl/dataframes/dataframe.py |
Refactors to use make_lazy with undo callback instead of manual lazy/eager tracking |
marimo/_plugins/ui/_impl/dataframes/transforms/apply.py |
Simplifies transform application by using make_lazy and its undo callback |
marimo/_plugins/ui/_impl/dataframes/transforms/handlers.py |
Updates shuffle and sample handlers to use collect_and_preserve_type |
marimo/_plugins/ui/_impl/altair_chart.py |
Updates _filter_dataframe to use make_lazy for type preservation |
marimo/_plugins/ui/_impl/charts/altair_transformer.py |
Refactors sanitize_nan_infs to use make_lazy |
tests/_utils/test_narwhals_utils.py |
Adds comprehensive tests for the new utility functions |
tests/_plugins/ui/_impl/dataframes/test_dataframe.py |
Adds type preservation assertions to existing tests |
tests/_plugins/ui/_impl/dataframes/test_handlers.py |
Updates tests to use make_lazy and verify type preservation |
tests/_plugins/ui/_impl/test_table.py |
Adds type preservation assertions to table tests |
tests/_plugins/ui/_impl/test_altair_chart.py |
Adds test for dataframe type preservation in chart filtering |
tests/_plugins/ui/_impl/charts/test_altair_transformers.py |
Adds test for type preservation in NaN/Inf sanitization |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
marimo/_utils/narwhals_utils.py
Outdated
|
|
||
|
|
||
| if TYPE_CHECKING: | ||
| UndoCallback = Callable[[nw.LazyFrame[Any]], IntoFrame] |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UndoCallback type hint specifies that it takes a nw.LazyFrame[Any], but the actual undo function accepts Union[nw.LazyFrame[Any], nw.DataFrame[Any]]. The type hint should be updated to match the implementation: Callable[[Union[nw.LazyFrame[Any], nw.DataFrame[Any]]], IntoFrame]
| UndoCallback = Callable[[nw.LazyFrame[Any]], IntoFrame] | |
| UndoCallback = Callable[[Union[nw.LazyFrame[Any], nw.DataFrame[Any]]], IntoFrame] |
| "df", | ||
| create_dataframes( | ||
| {"A": [1, 2, 3], "B": ["a", "b", "c"]}, | ||
| # exclude=["duckdb", "ibis", "lazy-polars"], |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] There's a commented-out exclude parameter that seems to have been used during development. If this exclusion is no longer needed, the comment should be removed. If it's needed for specific cases, it should be uncommented and documented why certain backends are excluded.
| # exclude=["duckdb", "ibis", "lazy-polars"], |
tests/_utils/test_narwhals_utils.py
Outdated
| def test_passing_as_different_df_to_undo(): | ||
| """Test that passing a different dataframe type to undo raises an error.""" |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The test name test_passing_as_different_df_to_undo suggests it should test error handling when passing a different dataframe type, but the test doesn't actually verify any error behavior. The test just verifies that the undo callback returns the correct type. Consider renaming the test to better reflect what it actually tests, e.g., test_undo_preserves_lazy_or_eager_type.
| def test_passing_as_different_df_to_undo(): | |
| """Test that passing a different dataframe type to undo raises an error.""" | |
| def test_undo_preserves_lazy_or_eager_type(): | |
| """Test that the undo callback returns the correct type (eager or lazy) matching the original input.""" |
| def test_ibis_dataframe_types_is_preserved(df: IntoDataFrame): | ||
| """Test that Ibis tables remain lazy after using mo.ui.dataframe.""" |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The test name test_ibis_dataframe_types_is_preserved is misleading because it tests all dataframe types (via create_dataframes), not just Ibis tables. Consider renaming to test_dataframe_types_are_preserved or test_ui_dataframe_preserves_type to better reflect what the test actually does.
| def test_ibis_dataframe_types_is_preserved(df: IntoDataFrame): | |
| """Test that Ibis tables remain lazy after using mo.ui.dataframe.""" | |
| def test_dataframe_types_are_preserved(df: IntoDataFrame): | |
| """Test that dataframe types are preserved after using mo.ui.dataframe.""" |
| ), | ||
| ) | ||
| def test_ibis_dataframe_types_is_preserved(df: IntoDataFrame): | ||
| """Test that Ibis tables remain lazy after using mo.ui.dataframe.""" |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring "Test that Ibis tables remain lazy after using mo.ui.dataframe" doesn't match the test implementation, which tests all dataframe types, not just Ibis. Update the docstring to something like "Test that dataframe types are preserved after using mo.ui.dataframe."
| """Test that Ibis tables remain lazy after using mo.ui.dataframe.""" | |
| """Test that dataframe types are preserved after using mo.ui.dataframe.""" |
| # Return the original data using the undo callback | ||
| lazy_data, _ = make_lazy(self._data) | ||
| return self._undo(lazy_data) |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new lazy dataframe and discarding its undo callback is inefficient. Instead, use self._transform_container._original_df which already contains the lazy version of the original dataframe. This avoids redundant conversion and ensures consistency.
| lazy_data, _ = make_lazy(self._data) | ||
| return self._undo(lazy_data) |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new lazy dataframe and discarding its undo callback is inefficient. Instead, use self._transform_container._original_df which already contains the lazy version of the original dataframe. This avoids redundant conversion and ensures consistency.
marimo/_utils/narwhals_utils.py
Outdated
| nw_df = nw.from_native(df, pass_through=False) | ||
| original_backend = nw_df.implementation |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter df is already typed as nw.LazyFrame[Any], so calling nw.from_native(df, pass_through=False) is redundant. The function should directly use df.implementation and df.collect() instead. This extra conversion is unnecessary and could potentially cause issues if the LazyFrame is already a narwhals object.
| nw_df = nw.from_native(df, pass_through=False) | |
| original_backend = nw_df.implementation | |
| original_backend = df.implementation |
| LOGGER.warning( | ||
| "Expected a narwhals DataFrame, got %s", type(result) | ||
| ) | ||
| return result.lazy() |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the input is not a narwhals DataFrame, the function logs a warning but then calls result.lazy(), which will likely fail if result is not a narwhals object. The function should either return early after the warning or handle the non-DataFrame case more gracefully by raising an exception or attempting to convert it.
| return result.lazy() | |
| return result |
|
@mscolnick @Light2Dark |
glad to hear it! Myles is a beast :) |
Fixes #7399
When using
mo.ui.dataframewith lazy dataframes (Ibis tables, Polars LazyFrames), the lazy behavior was being lost. The dataframe would be inadvertently collected/materialized, converting lazy Ibis tables to PyArrow tables and Polars LazyFrames to eager DataFrames. This PR refactors the lazy/eager tracking logic into a reusable utility function that properly preserves the original dataframe type.This PR adds two utilities
make_lazyandcollect_and_preserve_typeto help with this nuanced logic and then updates usage of this throughout (and some tests).