Add native Polars DataFrame, Series, and LazyFrame support for all meta-learners #901
aman-coder03 wants to merge 12 commits
|
Thanks for looking into this, @aman-coder03. As I commented on #855, converting the dataframe into numpy is not ideal for Polars, as it doesn't benefit from its performant features. Instead, adding support for native Pandas and Polars DataFrames is recommended. It will require careful inspection of the current indexing to make it comparable to the indexing on DataFrames. |
|
Thanks for the feedback, @jeongyoonlee. I understand the concern: converting to NumPy at the boundary means Polars users lose the performance benefits (lazy evaluation, zero-copy operations, columnar efficiency) that they came for in the first place. For the native approach, my plan would be...
Before I go ahead and rewrite, a couple of questions to make sure I am heading in the right direction...
Happy to update the PR once we are aligned |
|
Thanks for the detailed plan, @aman-coder03. One important correction before you start: converting to NumPy at the sklearn boundary isn't necessary.
So the rewrite scope collapses to causalml's own indexing/concat/mutation — not the model calls. Answers to your questions:
Suggested phasing:
A few practical notes for the audit:
Happy to review Phase 1 once it's ready. |
|
Thanks for the detailed feedback, @jeongyoonlee. I'm happy to reorganize if you'd prefer the phased approach: I can split this into separate PRs for Phase 1 (pandas cleanup) and Phase 2/3 (Polars per learner). But if the current implementation looks sound to you, I'd love to land it as is to avoid rebasing overhead. |
jeongyoonlee
left a comment
Thanks for pushing this through to a full implementation, @aman-coder03 — the helper-based
dispatch (filter_mask / filter_index / prepend_column / concat_treatment_col) is the
right abstraction, and the T-learner is a good reference (docstrings kept + updated, explicit
LazyFrame.collect() guard). The direction is solid.
One process note before the specifics: this landed all of Phase 1–3 plus a large cosmetic
refactor in a single ~1k-line PR, which makes it hard to review and bisect. I'm not going to
ask you to re-split it now, but the four items below are blocking regardless of how it's
packaged.
Blocking
1. Circular import — import causalml.propensity crashes on a cold import
propensity.py adds a top-level from causalml.inference.meta.utils import convert_pd_to_np.
Importing that submodule runs meta/__init__.py → slearner → base.py (from causalml.propensity import compute_propensity_score) back into the half-initialized
propensity module. Reproduced by applying just that one line:
File ".../causalml/propensity.py", line 11, in <module>
from causalml.inference.meta.utils import convert_pd_to_np
File ".../causalml/inference/meta/__init__.py", line 1, in <module>
from .slearner import LRSRegressor, BaseSLearner, BaseSRegressor, BaseSClassifier
File ".../causalml/inference/meta/slearner.py", line 9, in <module>
from causalml.inference.meta.base import BaseLearner
File ".../causalml/inference/meta/base.py", line 8, in <module>
from causalml.propensity import compute_propensity_score
ImportError: cannot import name 'compute_propensity_score' from partially initialized
module 'causalml.propensity' (most likely due to a circular import)
Both import causalml.propensity and from causalml.propensity import ElasticNetPropensityModel fail cold. It passes locally only because the test process imports
causalml.inference.meta first, which caches it before propensity runs its new import — so
the cycle never re-triggers under pytest, but it breaks normal user entry points.
Fix: make the import function-local in the three methods that use it, or drop it — sklearn
≥1.6 accepts pandas/Polars natively, so PropensityModel.fit/predict may not need the
conversion at all.
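The failure mode and the function-local fix can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: the modules prop_broken, meta_broken, prop_fixed, and meta_fixed are hypothetical stand-ins written to a temporary directory (prop_* plays the role of causalml.propensity, meta_* the role of causalml.inference.meta), and none of them exist in causalml.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in modules; the names are illustrative only.
MODULES = {
    "prop_broken": """
        from meta_broken import convert  # top-level import closes the cycle
        def compute_score(x):
            return sum(convert(x))
    """,
    "meta_broken": """
        from prop_broken import compute_score
        def convert(x):
            return list(x)
    """,
    "prop_fixed": """
        def compute_score(x):
            from meta_fixed import convert  # deferred until the first call
            return sum(convert(x))
    """,
    "meta_fixed": """
        from prop_fixed import compute_score
        def convert(x):
            return list(x)
    """,
}


def cold_import(name, root):
    """Import `name` from `root`; return the module, or the ImportError raised."""
    if root not in sys.path:
        sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        return exc


root = tempfile.mkdtemp()
for name, source in MODULES.items():
    Path(root, f"{name}.py").write_text(textwrap.dedent(source))

broken = cold_import("prop_broken", root)  # fails: partially initialized module
fixed = cold_import("prop_fixed", root)    # succeeds: no top-level cycle

print(type(broken).__name__)
print(fixed.compute_score([1, 2, 3]))
```

With the import moved inside the function, the module finishes initializing before the cycle is ever entered, so the cold import no longer depends on import order.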
2. polars is not declared as a dependency → the test file is skipped in CI
pyproject.toml is untouched, and tests/test_polars_support.py starts with
pytest.importorskip("polars"), so the entire feature is silently skipped in CI (same point
from the 2026-05-28 round). The code is already optional-ready (try/except in utils.py,
guarded import polars in tlearner.py), so keep polars optional — don't add it to core
[project.dependencies]. Add an optional extra and wire it into test so CI (which installs
-e ".[test]") actually runs the suite:
[project.optional-dependencies]
polars = ["polars>=1.0.0"] # users: pip install causalml[polars]
test = [
"pytest>=4.6",
"pytest-cov>=4.0",
"causalml[polars]", # self-referencing extra so .[test] pulls polars in CI
]
A dev without polars still gets a clean importorskip skip; no --runpolars flag needed.
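On the library side, keeping polars optional usually comes down to a guarded import plus type checks that short-circuit when the package is missing. The following is a minimal sketch of that pattern, not the code in utils.py; optional_import and is_polars_input are hypothetical helper names introduced for illustration.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# None when polars is absent, so the rest of the module still imports cleanly.
pl = optional_import("polars")


def is_polars_input(obj):
    """True only when polars is installed and obj is one of its containers."""
    if pl is None:
        return False
    return isinstance(obj, (pl.DataFrame, pl.LazyFrame, pl.Series))


print(is_polars_input([1, 2, 3]))
```

Because every Polars-specific branch is gated on `pl is not None`, users without the extra installed never hit an ImportError at import time.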
3. Docstrings deleted across S/X/R/DR
Every class / __init__ / fit / predict / fit_predict / estimate_ate docstring in
slearner.py, xlearner.py, rlearner.py, drlearner.py was removed, including the paper
references (Kennedy 2020, Nie & Wager 2019, Künzel et al. 2018). These render on readthedocs.
The T-learner correctly kept and updated its docstrings to mention pl.DataFrame — please
follow that pattern for the other four rather than deleting.
4. LazyFrame handling is inconsistent — and it points at the core design rule
Only the T-learner guards predict with if isinstance(X, pl.LazyFrame): X = X.collect().
In S/X/R/DR, predict passes X to prepend_column / model.predict directly — and
prepend_column/concat_treatment_col have no LazyFrame branch and call len(X), which
raises on a pl.LazyFrame. It's untested because the only test_lazyframe_input lives in
TestTLearnerPolars.
We want to keep native end-to-end (that's the whole point — down-converting X to numpy
would throw away the Polars benefit, and matters in particular for the xgboost zero-copy path
and column-name-aware Pipeline/ColumnTransformer learners). The fix is to apply the
contract consistently:
- X stays native end-to-end;
- treatment / y / p / sample_weight normalize to numpy at entry. Those are 1-D vectors that masking / np.unique / .astype need, and they're unrelated to the wide-frame promise.
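One way to picture this contract is the sketch below. It borrows the helper names collect_if_lazy and filter_mask from the PR, but the bodies are assumptions written for illustration, and to_numpy_1d and split_control are hypothetical names that do not appear in the PR. polars and pandas are treated as optional; numpy is assumed to be installed.

```python
import numpy as np

try:
    import polars as pl
except ImportError:  # polars stays optional
    pl = None
try:
    import pandas as pd
except ImportError:
    pd = None


def collect_if_lazy(X):
    """Materialize a LazyFrame into a pl.DataFrame; pass anything else through."""
    if pl is not None and isinstance(X, pl.LazyFrame):
        return X.collect()
    return X


def to_numpy_1d(v):
    """Normalize a 1-D input (treatment, y, p, sample_weight) to a numpy array."""
    if pl is not None and isinstance(v, pl.Series):
        return v.to_numpy()
    return np.asarray(v).ravel()


def filter_mask(X, mask):
    """Row-filter X with a boolean mask while keeping X's container type."""
    mask = np.asarray(mask, dtype=bool)
    if pl is not None and isinstance(X, (pl.DataFrame, pl.Series)):
        return X.filter(pl.Series(mask))
    if pd is not None and isinstance(X, (pd.DataFrame, pd.Series)):
        return X.loc[mask]
    return np.asarray(X)[mask]


def split_control(X, treatment, control_name="control"):
    """Hypothetical entry point: X stays native, treatment becomes numpy."""
    X = collect_if_lazy(X)
    treatment = to_numpy_1d(treatment)
    control_mask = treatment == control_name
    return filter_mask(X, control_mask), filter_mask(X, ~control_mask)


X = np.arange(8.0).reshape(4, 2)
treatment = ["control", "treated", "control", "treated"]
X_control, X_treated = split_control(X, treatment)
print(type(X_control).__name__, X_control.shape)
```

The same call with a pandas or Polars frame returns a frame of the same type, which is the property the contract is meant to guarantee.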
Concretely:
- Collect LazyFrame once at the top of each public method into a pl.DataFrame (not numpy). You have to collect to row-mask anyway, so this is the natural single point. After that, the helpers only need to handle pl.DataFrame / pl.Series, and S/X/R/DR get LazyFrame support for free.
- Never to_numpy(X) just to read a row count: drlearner.bootstrap does to_numpy(X).shape[0], and the te = np.zeros((X.shape[0], ...)) allocations should read the count natively (X.shape[0] / len(X) work for numpy/pandas/polars).
- Let bootstrap resample on the native X via filter_index (callers currently pre-convert to X_np before calling it).
- Tests to lock the contract in: numpy == pandas == polars equivalence for every learner (regressor and classifier); a fake learner asserting isinstance(X, pl.DataFrame) inside fit/predict so a reintroduced to_numpy(X) fails loudly; a no-feature-name-warning assertion on the fit-DataFrame/predict-DataFrame path; and a by-name Pipeline learner.
Note: lightgbm still has the sklearn-API Polars bug (lightgbm-org/LightGBM#6849), so native will
break it — document as a caveat or convert only at the lightgbm boundary.
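The fake-learner test idea above could look roughly like the sketch below. Everything here is hypothetical scaffolding rather than code from the PR: TypeAssertingLearner, native_fit_predict, and leaky_fit_predict are illustrative names, and a small list subclass stands in for pl.DataFrame when polars is not installed so the sketch still runs.

```python
try:
    import polars as pl

    FrameType = pl.DataFrame
    X = pl.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
except ImportError:

    class FrameType(list):
        """Stand-in frame used only when polars is not installed."""

    X = FrameType([[1.0, 3.0], [2.0, 4.0]])


class TypeAssertingLearner:
    """Fake learner that fails loudly if X was down-converted before it arrived."""

    def __init__(self, expected_type):
        self.expected_type = expected_type

    def _check(self, X):
        if not isinstance(X, self.expected_type):
            raise AssertionError(
                f"expected {self.expected_type.__name__}, got {type(X).__name__}"
            )

    def fit(self, X, y):
        self._check(X)
        return self

    def predict(self, X):
        self._check(X)
        return [0.0] * len(X)


def native_fit_predict(learner, X, y):
    """Passes X through untouched, as the native contract requires."""
    return learner.fit(X, y).predict(X)


def leaky_fit_predict(learner, X, y):
    """Simulates a regression: X is converted to plain rows before the learner."""
    rows = X.rows() if hasattr(X, "rows") else list(X)
    return learner.fit(rows, y).predict(rows)


y = [0.5, 1.5]
preds = native_fit_predict(TypeAssertingLearner(FrameType), X, y)

try:
    leaky_fit_predict(TypeAssertingLearner(FrameType), X, y)
    leak_detected = False
except AssertionError:
    leak_detected = True

print(preds, leak_detected)
```

In a real test suite the same learner would be passed as the base model of each meta-learner, so any internal conversion of X surfaces as a test failure instead of a silent slowdown.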
Non-blocking polish (smaller correctness + test-coverage items) is in a follow-up note so it
doesn't clutter the merge gate. Happy to re-review quickly once the four above are addressed.
|
@aman-coder03, can you check and fix the build error? |
|
done @jeongyoonlee |
jeongyoonlee
left a comment
The native-X approach looks right and matches the existing sklearn >=1.6 floor, but the diff carries some merge cruft and is broader than the feature itself. Two things before merge:
- The PR description is stale: it says the individual learner files weren't changed (all five were), and that convert_pd_to_np() calls were added to propensity.py fit/predict/compute_propensity_score (the diff adds none, only docstrings and a .copy() guard). Please update it to match what shipped.
- The classifier Polars paths (BaseXClassifier / BaseTClassifier / BaseSClassifier) have no tests, yet they're the most heavily rewritten code in this PR. Please add coverage mirroring the regressor tests.
Inline notes below — none are correctness-breaking, but the duplicated/dead lines should be cleaned up.
# Build separate frames for control and treatment to avoid in-place
# mutation, which fails when learners like CatBoost set the
# writeable flag to False on arrays passed to predict().
X_new_c = prepend_column(0.0, X)
Lines 132-133 (X_new_c/X_new_t = np.hstack(...)) are now dead — they're overwritten by prepend_column() here before use — and they force a full numpy copy of X on every predict(), which defeats the native-X path this PR adds. Remove them.
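For context, a type-preserving prepend_column might look like the sketch below. The name and argument order follow the call in the hunk (prepend_column(0.0, X)), but the body and the name parameter are assumptions for illustration, not the PR's implementation. numpy is assumed to be installed; pandas and polars are optional.

```python
import numpy as np

try:
    import polars as pl
except ImportError:  # polars stays optional
    pl = None
try:
    import pandas as pd
except ImportError:
    pd = None


def prepend_column(value, X, name="treatment"):
    """Return a new container with a constant first column; X is never mutated."""
    if pl is not None and isinstance(X, pl.DataFrame):
        # Polars frames are immutable, so select() builds a new frame.
        return X.select(pl.lit(value).alias(name), pl.all())
    if pd is not None and isinstance(X, pd.DataFrame):
        out = X.copy()
        out.insert(0, name, value)
        return out
    X = np.asarray(X)
    return np.column_stack([np.full(X.shape[0], value), X])


X = np.ones((3, 2))
X.setflags(write=False)  # mimic a learner that marks its input read-only

X_new_c = prepend_column(0.0, X)
X_new_t = prepend_column(1.0, X)
print(X_new_c.shape, X_new_t[:, 0])
```

Building two separate frames this way avoids writing into X, which is what the in-code comment about CatBoost's read-only arrays is guarding against.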
yhat_cs[group] = model.predict(X_new_c)
X_new_t = prepend_column(1.0, X)
yhat_cs[group] = model.predict(X_new_c)
Duplicate: yhat_cs[group] was already computed at line 142, so this re-runs model.predict(X_new_c) for nothing. Drop this line.
yhat_cs[group] = model.predict_proba(X_new_c)[:, 1]
X_new_t = prepend_column(1.0, X)
yhat_cs[group] = model.predict_proba(X_new_c)[:, 1]
Same duplicate as the regressor: yhat_cs[group] was set at line 392; this repeats predict_proba(X_new_c). Drop it.
te = np.zeros((n_rows(X), self.t_groups.shape[0]))
yhat_cs = {}
te = np.zeros((X.shape[0], self.t_groups.shape[0]))
Duplicate te allocation: line 272 already set it via the Polars-safe n_rows(X), and this line overwrites it with X.shape[0]. Remove this line and keep 272.
te = np.zeros((n_rows(X), self.t_groups.shape[0]))
yhat_cs = {}
te = np.zeros((X.shape[0], self.t_groups.shape[0]))
Same duplicate te as BaseDRLearner.predict: line 634 already set it via n_rows(X). Remove this line.
yhat[w == 1] = yhat_ts[group][mask][w == 1]
logger.info("Error metrics for group {}".format(group))
from causalml.metrics import classification_metrics
Redundant local import — classification_metrics is already imported at module top (line 19). Remove.
p = self._format_p(p, self.t_groups)
self._classes = {group: i for i, group in enumerate(self.t_groups)}
self.models_mu_c = {group: deepcopy(self.model_mu_c) for group in self.t_groups}
This fits a per-group control model, but every group's mask selects the full control set, so for multi-treatment data you now train N identical control models (the regressor still shares one — line 137). Results are unchanged but it's a perf regression; consider keeping the shared model or noting why it was dropped.
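The cost difference can be seen with a toy model. This is a sketch under stated assumptions: CountingModel is a hypothetical stand-in learner, and the dictionaries only mimic the shape of models_mu_c; it is not the X-learner code.

```python
from copy import deepcopy


class CountingModel:
    """Hypothetical learner that counts how many times fit() ran across instances."""

    n_fits = 0

    def fit(self, X, y):
        type(self).n_fits += 1
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


t_groups = ["treatment_a", "treatment_b", "treatment_c"]
X_control = [[0.0], [1.0], [2.0]]
y_control = [1.0, 2.0, 3.0]

# Per-group copies: every group refits on the same full control set.
template = CountingModel()
per_group = {g: deepcopy(template).fit(X_control, y_control) for g in t_groups}
fits_per_group = CountingModel.n_fits

# Shared model: fit once, then expose the same object under every group key.
CountingModel.n_fits = 0
shared = CountingModel().fit(X_control, y_control)
shared_models = {g: shared for g in t_groups}
fits_shared = CountingModel.n_fits

print(fits_per_group, fits_shared)
```

Both dictionaries give identical predictions for every group; the shared-reference version just avoids the redundant fits, which matters when the control set is large.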
# Preserve the unfitted template so repeated fit() calls always start fresh.
self._model_mu_c_template = self.model_mu_c
We need this for fit() to start fresh.
control_mask = treatment_np == self.control_name
X_control = filter_mask(X, control_mask)
y_control = to_numpy(filter_mask(y, control_mask))
self.model_mu_c = deepcopy(self.model_mu_c)
We should keep master's behavior here, i.e., deepcopy the unfitted _model_mu_c_template instead of model_mu_c, so that each call to fit() starts fresh.
self.model_mu_c = deepcopy(self._model_mu_c_template)
self.model_mu_c.fit(X[control_mask], y[control_mask])
# model_mu_c is trained on control only (identical across groups) — fit once.
control_mask = treatment_np == self.control_name
X_control = filter_mask(X, control_mask)
y_control = filter_mask(y, control_mask)
self.model_mu_c = deepcopy(self.model_mu_c)
Same as above: copy _model_mu_c_template instead of model_mu_c.
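Why the template matters can be shown with a learner that keeps state between fits. The sketch below is illustrative only: AccumulatingModel is a hypothetical learner whose fit() appends to existing state (similar in spirit to a warm-started estimator), and the two wrapper classes are simplified stand-ins for the meta-learner, not its actual code.

```python
from copy import deepcopy


class AccumulatingModel:
    """Hypothetical learner whose fit() adds to whatever state it already has."""

    def __init__(self):
        self.seen_ = []

    def fit(self, X, y):
        self.seen_.extend(y)
        return self


class CopiesFittedModel:
    """Deepcopies model_mu_c itself, so the second fit() inherits fitted state."""

    def __init__(self, model):
        self.model_mu_c = model

    def fit(self, X, y):
        self.model_mu_c = deepcopy(self.model_mu_c)
        self.model_mu_c.fit(X, y)
        return self


class CopiesTemplate:
    """Deepcopies the untouched template, so every fit() starts from scratch."""

    def __init__(self, model):
        self.model_mu_c = model
        self._model_mu_c_template = model  # never fitted: fit() rebinds model_mu_c

    def fit(self, X, y):
        self.model_mu_c = deepcopy(self._model_mu_c_template)
        self.model_mu_c.fit(X, y)
        return self


X, y = [[0.0], [1.0]], [1.0, 2.0]

stale = CopiesFittedModel(AccumulatingModel()).fit(X, y).fit(X, y)
fresh = CopiesTemplate(AccumulatingModel()).fit(X, y).fit(X, y)

print(stale.model_mu_c.seen_)
print(fresh.model_mu_c.seen_)
```

Many sklearn estimators reset their state inside fit(), so the difference may be invisible for them, but copying the template keeps repeated fit() calls safe for learners that do not.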
Proposed changes
Closes #855
Adds native Polars support to CausalML, keeping feature matrices (X) in their native format (numpy / pandas / polars) end-to-end across all five meta-learners (T, S, X, R, DR).
What changed
- causalml/inference/meta/utils.py: added collect_if_lazy(X), n_rows(X), filter_mask(), filter_index(), prepend_column(), concat_treatment_col(), and to_numpy(). Kept convert_pd_to_np as a deprecated backward-compat alias for explainer.py and other existing callers.
- causalml/inference/meta/base.py: bootstrap() resamples via filter_index(X, idxs) so X stays native; _format_p() uses to_numpy; _set_propensity_models() filters X natively and passes it as-is to sklearn (≥ 1.6 accepts pandas/Polars).
- causalml/inference/meta/tlearner.py: collect_if_lazy(X) at the top of every public method; model_c fitted once on the full control set and exposed as a shared-reference dict; store_bootstraps / return_ci API merged from upstream (#904, "Follow-ups from #886: bootstrap clone(safe=False) deepcopies fitted models; predict() validation ordering").
- causalml/inference/meta/slearner.py: np.hstack replaced with concat_treatment_col / prepend_column (type-safe across numpy/pandas/polars).
- causalml/inference/meta/xlearner.py: model_mu_c fitted once on the full control set (regressor: .predict; classifier: .predict_proba); exposed as a shared-reference dict; self.var_c stored as a finite scalar.
- causalml/inference/meta/rlearner.py: cross_val_predict receives native X; treatment / y / sample_weight normalised to numpy at entry.
- causalml/inference/meta/drlearner.py: KFold partition slices use filter_index(X, idx) so X stays native across all three cross-fit folds.
- causalml/propensity.py: removed top-level import of convert_pd_to_np (caused circular import on cold import of causalml.propensity); PropensityModel.fit/predict and compute_propensity_score pass X straight through to sklearn/XGBoost, which accept pandas/Polars natively.
Design decisions
- try/except ImportError guard in utils.py; polars>=1.0.0 declared as [project.optional-dependencies] polars and pulled into [test] so CI runs the suite
- pl.LazyFrame support: collected once to pl.DataFrame at the public-method boundary via collect_if_lazy(); all helpers work on pl.DataFrame / pl.Series only
Testing
Added tests/test_polars_support.py with coverage across all five learner types and both regressors and classifiers (BaseTClassifier, BaseSClassifier, BaseXClassifier), verifying that numpy/pandas/polars/LazyFrame inputs produce identical results, and covering edge cases like mixed inputs and fit-on-numpy/predict-on-polars.
Types of changes
What types of changes does your code introduce to CausalML?
Put an x in the boxes that apply
Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.
Further comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc. This PR template is adopted from appium.