Add native Polars DataFrame, Series, and LazyFrame support for all meta-learners #901

Open
aman-coder03 wants to merge 12 commits into uber:master from aman-coder03:feature/polars-support

Conversation

@aman-coder03

@aman-coder03 aman-coder03 commented May 23, 2026

Contributor

Proposed changes

Closes #855

Adds native Polars support to CausalML, keeping feature matrices (X) in their native format (numpy / pandas / polars) end-to-end across all five meta-learners (T, S, X, R, DR)

What changed

  • causalml/inference/meta/utils.py added collect_if_lazy(X), n_rows(X), filter_mask(), filter_index(), prepend_column(), concat_treatment_col(), and to_numpy(). Kept convert_pd_to_np as a deprecated backward-compat alias for explainer.py and other existing callers.
  • causalml/inference/meta/base.py.... bootstrap() resamples via filter_index(X, idxs) so X stays native; _format_p() uses to_numpy; _set_propensity_models() filters X natively and passes it as-is to sklearn (≥ 1.6 accepts pandas/Polars).
  • causalml/inference/meta/tlearner.py ....collect_if_lazy(X) at the top of every public method; model_c fitted once on the full control set and exposed as a shared-reference dict; store_bootstraps/return_ci API merged from upstream (#904, "Follow-ups from #886: bootstrap clone(safe=False) deepcopies fitted models; predict() validation ordering").
  • causalml/inference/meta/slearner.py....np.hstack replaced with concat_treatment_col/prepend_column (type-safe across numpy/pandas/polars).
  • causalml/inference/meta/xlearner.py .... model_mu_c fitted once on the full control set (regressor: .predict; classifier: .predict_proba); exposed as a shared-reference dict; self.var_c stored as a finite scalar.
  • causalml/inference/meta/rlearner.py.... cross_val_predict receives native X; treatment/y/sample_weight normalised to numpy at entry.
  • causalml/inference/meta/drlearner.py .... KFold partition slices use filter_index(X, idx) so X stays native across all three cross-fit folds.
  • causalml/propensity.py.... removed top-level import of convert_pd_to_np (caused circular import on cold import of causalml.propensity); PropensityModel.fit/predict and compute_propensity_score pass X straight through to sklearn/XGBoost which accept pandas/Polars natively.
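
The helper bodies are not shown in this description. As a rough sketch only, type-dispatching helpers of this kind might look like the following. The function names come from the list above; the bodies, the pandas guard (added so the sketch runs standalone), and the fallback to numpy for unknown inputs are assumptions, not the PR's code.

```python
import numpy as np

try:  # guarded only so this sketch runs standalone
    import pandas as pd
except ImportError:
    pd = None

try:  # Polars is optional, mirroring the try/except guard described for utils.py
    import polars as pl
except ImportError:
    pl = None


def collect_if_lazy(X):
    """Materialize a pl.LazyFrame once; return any other input unchanged."""
    if pl is not None and isinstance(X, pl.LazyFrame):
        return X.collect()
    return X


def n_rows(X):
    """Row count read natively, without converting X to numpy."""
    return X.shape[0]


def filter_mask(X, mask):
    """Keep rows where the boolean mask is True, preserving X's type."""
    mask = np.asarray(mask, dtype=bool)
    if pl is not None and isinstance(X, (pl.DataFrame, pl.Series)):
        return X.filter(pl.Series(mask))
    if pd is not None and isinstance(X, (pd.DataFrame, pd.Series)):
        return X.loc[mask]
    return np.asarray(X)[mask]


def filter_index(X, idx):
    """Select rows by integer position (repeats allowed, as in bootstrap)."""
    idx = np.asarray(idx)
    if pl is not None and isinstance(X, (pl.DataFrame, pl.Series)):
        return X[idx]
    if pd is not None and isinstance(X, (pd.DataFrame, pd.Series)):
        return X.iloc[idx]
    return np.asarray(X)[idx]


X = np.arange(12).reshape(4, 3)
kept = filter_mask(X, [True, False, True, False])
resampled = filter_index(X, [0, 0, 3])
```

Because each helper returns the same container type it received, callers such as bootstrap() can resample without knowing which library produced X.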

Design decisions

  • Polars is an optional dependency ....try/except ImportError guard in utils.py; polars>=1.0.0 declared as [project.optional-dependencies] polars and pulled into [test] so CI runs the suite
  • pl.LazyFrame support ....collected once to pl.DataFrame at the public-method boundary via collect_if_lazy(); all helpers work on pl.DataFrame/pl.Series only
  • Return types unchanged .... all methods still return numpy arrays
  • LightGBM caveat .... LightGBM has a known sklearn-API bug with Polars (lightgbm-org/LightGBM#6849, "[python-package] Fitting on Polars Dataframe fails due to missing setter for fetures_names_in_"); users should convert X to numpy/pandas when using LightGBM as a base learner

Testing

Added tests/test_polars_support.py with coverage across all five learner types and both regressors and classifiers (BaseTClassifier, BaseSClassifier, BaseXClassifier), verifying that numpy/pandas/polars/LazyFrame inputs produce identical results, and covering edge cases like mixed inputs and fit-on-numpy/predict-on-polars.

Types of changes

What types of changes does your code introduce to CausalML?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc. This PR template is adopted from appium.

@jeongyoonlee

Collaborator

Thanks for looking into this, @aman-coder03.

As I commented on #855, converting the dataframe into numpy is not ideal for Polars, as it doesn't benefit from its performant features. Instead, adding support for native Pandas and Polars DataFrames is recommended. It will require careful inspection of the current indexing to make it comparable to the indexing on DataFrames.

@aman-coder03

Contributor Author

thanks for the feedback @jeongyoonlee

i understand the concern: converting to NumPy at the boundary means Polars users lose the performance benefits (lazy evaluation, zero-copy operations, columnar efficiency) that they came for in the first place

for the native approach, my plan would be...

  • add a thin abstraction layer in utils.py with helpers like filter_by_mask(), concat_cols(), and get_values() that dispatch to the correct pandas/polars/numpy operation based on the input type
  • audit every indexing pattern across all five learner files (tlearner.py, slearner.py, xlearner.py, rlearner.py, drlearner.py), then replace them with these helpers
  • key patterns to handle are: boolean mask filtering (X[mask]), column concatenation (np.hstack), and in-place mutation (Polars is immutable)
  • keep convert_pd_to_np() only at the final NumPy-only boundaries (sklearn's cross_val_predict in the R-learner), since sklearn doesn't accept Polars natively

Before I go ahead and rewrite, a couple of questions to make sure i am heading in the right direction...

  1. should the output of predict() remain a NumPy array, or would you like an option to return a Polars DataFrame when the input was Polars?
  2. for sklearn calls like cross_val_predict (R-learner) and propensity model fitting, converting to NumPy at that specific boundary seems unavoidable....does that approach work for you, or do you have a preferred alternative?

Happy to update the PR once we are aligned

@jeongyoonlee

Collaborator

Thanks for the detailed plan, @aman-coder03. One important correction before you start: converting to NumPy at the sklearn boundary isn't necessary.

  • scikit-learn ≥ 1.4 accepts Polars (and pandas) DataFrames natively via the DataFrame Interchange Protocol (release notes); causalml pins scikit-learn>=1.6.0.
  • XGBoost ≥ 3.1 accepts pl.DataFrame and pl.LazyFrame directly (docs).
  • LightGBM nominally accepts pl.DataFrame but has a known sklearn-API bug (issue #6849) — document as a caveat.

So the rewrite scope collapses to causalml's own indexing/concat/mutation — not the model calls.

Answers to your questions:

  1. predict() output type. Keep returning NumPy. The entire underlying stack does the same: scikit-learn's .predict() returns NumPy regardless of input (only transformers honor set_output("polars"), not predictors), XGBoost's .predict() returns NumPy (or cupy.ndarray on GPU) even when fed a pl.DataFrame via its zero-copy Arrow path, and LightGBM's .predict() returns NumPy as documented. Matching that convention keeps causalml's API consistent with sklearn/xgboost/lightgbm and preserves backward compatibility. A return_type="polars" opt-in can be added later if users ask.

  2. NumPy conversion at sklearn boundaries. Skip it — pass Polars/pandas straight through. Zero materialization overhead, and dtype/feature-name plumbing is handled by the ML library.

Suggested phasing:

  • Phase 1: Remove convert_pd_to_np() in favor of native pandas DataFrame support across all meta-learners. No Polars yet — this isolates the indexing-polymorphism work and gives a clean baseline.
  • Phase 2: Add Polars support to one estimator (e.g. BaseTLearner) as a reference implementation with equivalence tests (NumPy / pandas / Polars produce identical te).
  • Phase 3+: Extend Polars to the remaining learners (S/X/R/DR), one PR each.

A few practical notes for the audit:

  • Enumerate the actual patterns first. Grep all five learners for indexing/concat/mutation (X[mask], np.hstack, .iloc, in-place assignment) and design the minimal helper set from real usage, not guesses.
  • Polars is immutable — anywhere current code does X[mask] = value or accumulates by index assignment, you'll need an explicit rewrite (with_columns, when/then/otherwise), not just a dispatch helper.
  • Add polars to test deps in Phase 2. The current PR's tests don't run in CI because Polars isn't installed there — pytest.importorskip("polars") skips the whole file silently.
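
To illustrate the immutability note, here is a minimal sketch of replacing a masked in-place assignment with an expression-based rewrite. The helper name set_where is hypothetical and does not appear in the PR; for numpy input the column is an integer position, for Polars input it is a column name.

```python
import numpy as np

try:  # Polars is optional; the numpy branch still runs without it
    import polars as pl
except ImportError:
    pl = None


def set_where(X, mask, column, value):
    """Return a new X with `value` written to `column` on rows where mask is True.

    Hypothetical helper: the input is never mutated, for either backend.
    """
    mask = np.asarray(mask, dtype=bool)
    if pl is not None and isinstance(X, pl.DataFrame):
        # No X[mask] = value here: build the new column with when/then/otherwise.
        return X.with_columns(
            pl.when(pl.lit(pl.Series(mask)))
            .then(pl.lit(value))
            .otherwise(pl.col(column))
            .alias(column)
        )
    out = np.array(X, copy=True)  # copy so the caller's array is left untouched
    out[mask, column] = value
    return out


X = np.zeros((3, 2))
updated = set_where(X, [True, False, True], 1, 5.0)
```

Returning a new object in both branches also sidesteps the read-only-array problem that some learners cause by clearing the writeable flag on arrays passed to predict().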

Happy to review Phase 1 once it's ready.

@aman-coder03

Contributor Author

thanks for the detailed feedback @jeongyoonlee
I understand the phasing suggestion, but I've intentionally tried to complete the full Polars support in this single PR....covering all five learners (T/S/X/R/DR) together. My reasoning was that the indexing patterns are largely the same across learners, so the helper set (filter_mask, filter_index, prepend_column, etc.) naturally fell out of auditing all of them at once rather than discovering gaps PR by PR.

i'm happy to reorganize if you'd prefer the phased approach...I can split this into separate PRs for Phase 1 (pandas cleanup) and Phase 2/3 (Polars per learner). But if the current implementation looks sound to you, I'd love to land it as is to avoid rebasing overhead

@jeongyoonlee jeongyoonlee left a comment

Collaborator

Thanks for pushing this through to a full implementation, @aman-coder03 — the helper-based
dispatch (filter_mask / filter_index / prepend_column / concat_treatment_col) is the
right abstraction, and the T-learner is a good reference (docstrings kept + updated, explicit
LazyFrame.collect() guard). The direction is solid.

One process note before the specifics: this landed all of Phase 1–3 plus a large cosmetic
refactor in a single ~1k-line PR, which makes it hard to review and bisect. I'm not going to
ask you to re-split it now, but the four items below are blocking regardless of how it's
packaged.

Blocking

1. Circular import — import causalml.propensity crashes on a cold import

propensity.py adds a top-level from causalml.inference.meta.utils import convert_pd_to_np.
Importing that submodule runs meta/__init__.py → slearner → base.py (from causalml.propensity import compute_propensity_score), which reaches back into the half-initialized
propensity module. Reproduced by applying just that one line:

File ".../causalml/propensity.py", line 11, in <module>
    from causalml.inference.meta.utils import convert_pd_to_np
File ".../causalml/inference/meta/__init__.py", line 1, in <module>
    from .slearner import LRSRegressor, BaseSLearner, BaseSRegressor, BaseSClassifier
File ".../causalml/inference/meta/slearner.py", line 9, in <module>
    from causalml.inference.meta.base import BaseLearner
File ".../causalml/inference/meta/base.py", line 8, in <module>
    from causalml.propensity import compute_propensity_score
ImportError: cannot import name 'compute_propensity_score' from partially initialized
module 'causalml.propensity' (most likely due to a circular import)

Both import causalml.propensity and from causalml.propensity import ElasticNetPropensityModel fail cold. It passes locally only because the test process imports
causalml.inference.meta first, which caches it before propensity runs its new import — so
the cycle never re-triggers under pytest, but it breaks normal user entry points.

Fix: make the import function-local in the three methods that use it, or drop it — sklearn
≥1.6 accepts pandas/Polars natively, so PropensityModel.fit/predict may not need the
conversion at all.
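
The cycle and the function-local fix can be reproduced in isolation. The sketch below builds two throwaway packages in a temporary directory; the package and function names (demo_cyclic, demo_fixed, compute_score, convert) are made up for illustration and only mirror the shape of the propensity/meta cycle described above.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Stand-in for meta/base.py: imports a name from the propensity module.
META_SRC = """
from {pkg}.propensity import compute_score

def convert(x):
    return list(x)
"""

# Stand-in for the problematic propensity.py: top-level import closes the cycle.
TOP_LEVEL_SRC = """
from {pkg}.meta import convert

def compute_score(x):
    return sum(convert(x))
"""

# Fixed variant: the import is deferred until the function is called.
LOCAL_SRC = """
def compute_score(x):
    from {pkg}.meta import convert
    return sum(convert(x))
"""


def build_package(root, pkg, propensity_src):
    pkg_dir = root / pkg
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "meta.py").write_text(textwrap.dedent(META_SRC.format(pkg=pkg)))
    (pkg_dir / "propensity.py").write_text(
        textwrap.dedent(propensity_src.format(pkg=pkg))
    )


def cold_import_result(pkg, propensity_src):
    """Import <pkg>.propensity first (cold) and call compute_score, or return the error."""
    with tempfile.TemporaryDirectory() as tmp:
        build_package(Path(tmp), pkg, propensity_src)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(f"{pkg}.propensity")
            return module.compute_score([1, 2, 3])
        except ImportError as exc:
            return exc
        finally:
            sys.path.remove(tmp)


broken = cold_import_result("demo_cyclic", TOP_LEVEL_SRC)
fixed = cold_import_result("demo_fixed", LOCAL_SRC)
```

Here broken is an ImportError about a partially initialized module, while fixed evaluates normally because the deferred import only runs after the propensity module has finished initializing.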

2. polars is not declared as a dependency → the test file is skipped in CI

pyproject.toml is untouched, and tests/test_polars_support.py starts with
pytest.importorskip("polars"), so the entire feature is silently skipped in CI (same point
from the 2026-05-28 round). The code is already optional-ready (try/except in utils.py,
guarded import polars in tlearner.py), so keep polars optional — don't add it to core
[project.dependencies]. Add an optional extra and wire it into test so CI (which installs
-e ".[test]") actually runs the suite:

[project.optional-dependencies]
polars = ["polars>=1.0.0"]      # users: pip install causalml[polars]
test = [
    "pytest>=4.6",
    "pytest-cov>=4.0",
    "causalml[polars]",          # self-referencing extra so .[test] pulls polars in CI
]

A dev without polars still gets a clean importorskip skip; no --runpolars flag needed.

3. Docstrings deleted across S/X/R/DR

Every class / __init__ / fit / predict / fit_predict / estimate_ate docstring in
slearner.py, xlearner.py, rlearner.py, drlearner.py was removed, including the paper
references (Kennedy 2020, Nie & Wager 2019, Künzel et al. 2018). These render on readthedocs.
The T-learner correctly kept and updated its docstrings to mention pl.DataFrame — please
follow that pattern for the other four rather than deleting.

4. LazyFrame handling is inconsistent — and it points at the core design rule

Only the T-learner guards predict with if isinstance(X, pl.LazyFrame): X = X.collect().
In S/X/R/DR, predict passes X to prepend_column / model.predict directly — and
prepend_column/concat_treatment_col have no LazyFrame branch and call len(X), which
raises on a pl.LazyFrame. It's untested because the only test_lazyframe_input lives in
TestTLearnerPolars.

We want to keep native end-to-end (that's the whole point — down-converting X to numpy
would throw away the Polars benefit, and matters in particular for the xgboost zero-copy path
and column-name-aware Pipeline/ColumnTransformer learners). The fix is to apply the
contract consistently:

X stays native end-to-end; treatment / y / p / sample_weight normalize to numpy at
entry.
Those are 1-D vectors that masking/np.unique/.astype need, and they're unrelated
to the wide-frame promise.

Concretely:

  • Collect LazyFrame once at the top of each public method into a pl.DataFrame (not
    numpy). You have to collect to row-mask anyway, so this is the natural single point. After
    that, the helpers only need to handle pl.DataFrame/pl.Series, and S/X/R/DR get LazyFrame
    support for free.
  • Never to_numpy(X) just to read a row count: drlearner.bootstrap does
    to_numpy(X).shape[0], and the te = np.zeros((X.shape[0], ...)) allocations should read
    the count natively (X.shape[0] / len(X) work for numpy/pandas/polars).
  • Let bootstrap resample on the native X via filter_index (callers currently pre-convert
    to X_np before calling it).
  • Tests to lock the contract in: numpy == pandas == polars equivalence for every learner
    (regressor and classifier); a fake learner asserting isinstance(X, pl.DataFrame) inside
    fit/predict so a reintroduced to_numpy(X) fails loudly; a no-feature-name-warning
    assertion on the fit-DataFrame/predict-DataFrame path; and a by-name Pipeline learner.

Note: lightgbm still has the sklearn-API Polars bug (lightgbm-org/LightGBM#6849), so native will
break it — document as a caveat or convert only at the lightgbm boundary.
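
A sketch of the "fake learner" test idea from the list above, under stated assumptions: NativeFrame is a stand-in ndarray subclass used only so the example runs without Polars installed (a real test would pass a pl.DataFrame and expect pl.DataFrame), and fit_native / fit_downconverted are hypothetical wrappers, not causalml functions.

```python
import numpy as np


class NativeFrame(np.ndarray):
    """Stand-in for pl.DataFrame so this sketch runs without Polars."""


class NativeOnlyRegressor:
    """Fake base learner that fails loudly if X arrives in the wrong type."""

    def __init__(self, expected_type):
        self.expected_type = expected_type

    def fit(self, X, y):
        assert isinstance(X, self.expected_type), f"fit got {type(X).__name__}"
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        assert isinstance(X, self.expected_type), f"predict got {type(X).__name__}"
        return np.full(X.shape[0], self.mean_)


def fit_native(model, X, y):
    # Contract: y is normalized to numpy at entry, X is passed through untouched.
    return model.fit(X, np.asarray(y))


def fit_downconverted(model, X, y):
    # The regression the test should catch: X silently converted to a plain ndarray.
    return model.fit(np.asarray(X), np.asarray(y))


X = np.arange(6, dtype=float).reshape(3, 2).view(NativeFrame)
y = [1.0, 2.0, 3.0]

model = fit_native(NativeOnlyRegressor(NativeFrame), X, y)
preds = model.predict(X)

try:
    fit_downconverted(NativeOnlyRegressor(NativeFrame), X, y)
    regression_caught = False
except AssertionError:
    regression_caught = True
```

The point of the design is that the type check lives inside the learner, so any code path in the meta-learner that down-converts X trips it, not just the paths a test author thought to inspect.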


Non-blocking polish (smaller correctness + test-coverage items) is in a follow-up note so it
doesn't clutter the merge gate. Happy to re-review quickly once the four above are addressed.

@jeongyoonlee

Collaborator

@aman-coder03, can you check and fix the build error?

@aman-coder03

Contributor Author

done @jeongyoonlee

@jeongyoonlee jeongyoonlee left a comment

Collaborator

The native-X approach looks right and matches the existing sklearn >=1.6 floor, but the diff carries some merge cruft and is broader than the feature itself. Two things before merge:

  • The PR description is stale: it says the individual learner files weren't changed (all five were), and that convert_pd_to_np() calls were added to propensity.py fit/predict/compute_propensity_score (the diff adds none — only docstrings and a .copy() guard). Please update it to match what shipped.
  • The classifier Polars paths (BaseXClassifier/BaseTClassifier/BaseSClassifier) have no tests, yet they're the most heavily rewritten code in this PR. Please add coverage mirroring the regressor tests.

Inline notes below — none are correctness-breaking, but the duplicated/dead lines should be cleaned up.

# Build separate frames for control and treatment to avoid in-place
# mutation, which fails when learners like CatBoost set the
# writeable flag to False on arrays passed to predict().
X_new_c = prepend_column(0.0, X)

Collaborator

Lines 132-133 (X_new_c/X_new_t = np.hstack(...)) are now dead — they're overwritten by prepend_column() here before use — and they force a full numpy copy of X on every predict(), which defeats the native-X path this PR adds. Remove them.

yhat_cs[group] = model.predict(X_new_c)

X_new_t = prepend_column(1.0, X)
yhat_cs[group] = model.predict(X_new_c)

Collaborator

Duplicate: yhat_cs[group] was already computed at line 142, so this re-runs model.predict(X_new_c) for nothing. Drop this line.

yhat_cs[group] = model.predict_proba(X_new_c)[:, 1]

X_new_t = prepend_column(1.0, X)
yhat_cs[group] = model.predict_proba(X_new_c)[:, 1]

Collaborator

Same duplicate as the regressor: yhat_cs[group] was set at line 392; this repeats predict_proba(X_new_c). Drop it.

Comment thread causalml/inference/meta/drlearner.py Outdated

te = np.zeros((n_rows(X), self.t_groups.shape[0]))
yhat_cs = {}
te = np.zeros((X.shape[0], self.t_groups.shape[0]))

Collaborator

Duplicate te allocation: line 272 already set it via the Polars-safe n_rows(X), and this line overwrites it with X.shape[0]. Remove this line and keep 272.

Comment thread causalml/inference/meta/drlearner.py Outdated

te = np.zeros((n_rows(X), self.t_groups.shape[0]))
yhat_cs = {}
te = np.zeros((X.shape[0], self.t_groups.shape[0]))

Collaborator

Same duplicate te as BaseDRLearner.predict: line 634 already set it via n_rows(X). Remove this line.

Comment thread causalml/inference/meta/tlearner.py Outdated
yhat[w == 1] = yhat_ts[group][mask][w == 1]

logger.info("Error metrics for group {}".format(group))
from causalml.metrics import classification_metrics

Collaborator

Redundant local import — classification_metrics is already imported at module top (line 19). Remove.

Comment thread causalml/inference/meta/xlearner.py Outdated
p = self._format_p(p, self.t_groups)

self._classes = {group: i for i, group in enumerate(self.t_groups)}
self.models_mu_c = {group: deepcopy(self.model_mu_c) for group in self.t_groups}

Collaborator

This fits a per-group control model, but every group's mask selects the full control set, so for multi-treatment data you now train N identical control models (the regressor still shares one — line 137). Results are unchanged but it's a perf regression; consider keeping the shared model or noting why it was dropped.

@jeongyoonlee jeongyoonlee added the enhancement New feature or request label Jun 20, 2026
Comment on lines -59 to -60
# Preserve the unfitted template so repeated fit() calls always start fresh.
self._model_mu_c_template = self.model_mu_c

Collaborator

We need this for fit() to start fresh.

control_mask = treatment_np == self.control_name
X_control = filter_mask(X, control_mask)
y_control = to_numpy(filter_mask(y, control_mask))
self.model_mu_c = deepcopy(self.model_mu_c)

Collaborator

We should keep master, i.e., deepcopying the unfit model_mu_c_template instead of model_mu_c to start fresh at each call to fit().
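
The difference between the two deepcopy targets can be seen with a toy model. This is only an illustration: CountingModel and the two learner classes are made up, and stand in for model_mu_c and the X-learner's fit().

```python
from copy import deepcopy


class CountingModel:
    """Toy estimator whose state records how many fits it has accumulated."""

    def __init__(self):
        self.n_fits = 0

    def fit(self, X, y):
        self.n_fits += 1
        return self


class LearnerFromTemplate:
    """Copies a never-fitted template, so every fit() starts fresh."""

    def __init__(self, model):
        self._model_template = model
        self.model = model

    def fit(self, X, y):
        self.model = deepcopy(self._model_template)
        self.model.fit(X, y)
        return self


class LearnerFromCurrent:
    """Copies the current model, so a second fit() inherits fitted state."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        self.model = deepcopy(self.model)
        self.model.fit(X, y)
        return self


from_template = LearnerFromTemplate(CountingModel()).fit(None, None).fit(None, None)
from_current = LearnerFromCurrent(CountingModel()).fit(None, None).fit(None, None)
```

After two calls, from_template.model has been fitted once from a clean state, while from_current.model carries state from both calls. For estimators that support warm starts or incremental fitting, that leftover state can change results between repeated fit() calls.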

Comment on lines -558 to +566
self.model_mu_c = deepcopy(self._model_mu_c_template)
self.model_mu_c.fit(X[control_mask], y[control_mask])
# model_mu_c is trained on control only (identical across groups) — fit once.
control_mask = treatment_np == self.control_name
X_control = filter_mask(X, control_mask)
y_control = filter_mask(y, control_mask)
self.model_mu_c = deepcopy(self.model_mu_c)

Collaborator

same as above. copy model_mu_c_template instead of model_mu_c

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Polars support on CausalML

2 participants