[TO_REVIEW] Add automatic target label masking to prevent data leakage#330

Merged
tgnassou merged 14 commits into scikit-adaptation:main from YanisLalou:auto_mask_target_labels
Sep 22, 2025

@YanisLalou
Collaborator

This PR introduces a mechanism to automatically mask target labels in unsupervised domain adaptation settings. This feature prevents data leakage from the target domain while the estimators are being fitted.

Key Changes:

  • Automatic Label Masking: A new _auto_mask_target_labels method has been added to automatically replace target labels with a default masked value before they are passed to the estimators. This is enabled by default to ensure that no data leakage can occur.

  • Control via mask_target_labels parameter: The masking behavior can be controlled with the mask_target_labels parameter in make_da_pipeline and the selectors (Shared, PerDomain, etc.).
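As an illustration of the masking idea described above, here is a minimal standalone sketch. It is not the PR's `_auto_mask_target_labels` implementation: the helper name `mask_target_labels_sketch` is hypothetical, and the conventions it relies on (target samples carry a negative `sample_domain`, masked classification labels are `-1`, masked regression labels are NaN) are assumptions made for the example. Plain Python lists are used to keep it dependency-free.

```python
import math

# Assumed conventions, not confirmed by the PR text: target samples have a
# negative sample_domain, and masked labels are -1 (classification) or NaN
# (regression).
MASKED_CLASSIFICATION_LABEL = -1
MASKED_REGRESSION_LABEL = math.nan


def mask_target_labels_sketch(y, sample_domain, masked_value=MASKED_CLASSIFICATION_LABEL):
    """Return a copy of y with every target-domain label replaced by masked_value."""
    if len(y) != len(sample_domain):
        raise ValueError("y and sample_domain must have the same length")
    return [
        masked_value if domain < 0 else label
        for label, domain in zip(y, sample_domain)
    ]


y = [0, 1, 1, 0, 1]
sample_domain = [1, 1, 1, -2, -2]  # three source samples, two target samples

y_masked = mask_target_labels_sketch(y, sample_domain)
print(y_masked)  # [0, 1, 1, -1, -1]
```

The input list is left untouched, so a caller that opts out of masking (the role `mask_target_labels=False` plays in the PR) can keep passing the original labels.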

@codecov

codecov bot commented Jun 25, 2025

Codecov Report

❌ Patch coverage is 98.21429% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.47%. Comparing base (98d6acc) to head (d20c7bb).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #330      +/-   ##
==========================================
+ Coverage   96.18%   96.47%   +0.29%     
==========================================
  Files          63       50      -13     
  Lines        6919     6044     -875     
==========================================
- Hits         6655     5831     -824     
+ Misses        264      213      -51     

@YanisLalou YanisLalou changed the title [WIP] Add automatic target label masking to prevent data leakage [TO_REVIEW] Add automatic target label masking to prevent data leakage Jun 25, 2025
PCA(n_components=2),
SelectSource(SVC()),
default_selector=SelectSourceTarget,
mask_target_labels=False,
Collaborator

Why is it false here?

Collaborator

I don't understand why SelectSourceTarget is used here.

Collaborator

When you do that, you have one PCA for source and one for target, but the SVC is trained only on source.

Collaborator

So we should be able to mask target labels with SelectSourceTarget, no? We don't want data leakage even if we have one PCA for source and one for target.
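The routing discussed in this thread (one transformer fitted per domain, final estimator trained on source only) can be sketched with a toy example. This is only an illustration of the data flow, not skada's selector code: `Centerer` stands in for PCA, the function name is hypothetical, and the negative-`sample_domain`-means-target convention is an assumption.

```python
from statistics import mean


class Centerer:
    """Toy transformer standing in for PCA: subtracts the mean seen during fit."""

    def fit(self, X):
        self.mean_ = mean(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


def fit_per_domain_then_source_only(X, y, sample_domain):
    # Assumption: non-negative sample_domain marks source, negative marks target.
    is_source = [d >= 0 for d in sample_domain]
    X_source = [x for x, s in zip(X, is_source) if s]
    X_target = [x for x, s in zip(X, is_source) if not s]
    y_source = [label for label, s in zip(y, is_source) if s]

    # One transformer per domain, as described for the source/target selector.
    source_step = Centerer().fit(X_source)
    target_step = Centerer().fit(X_target)

    # The final estimator would only ever see transformed source samples.
    final_training_set = list(zip(source_step.transform(X_source), y_source))
    return source_step, target_step, final_training_set


X = [1.0, 3.0, 10.0, 14.0]
y = [0, 1, 1, 0]
sample_domain = [1, 1, -2, -2]

source_step, target_step, final_training_set = fit_per_domain_then_source_only(
    X, y, sample_domain
)
print(final_training_set)  # [(-1.0, 0), (1.0, 1)]
```

In this toy version the target-side step never receives `y` at all. A supervised per-domain step would receive it, which is why masking target labels is still relevant even when the final estimator is source-only.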

skada/_utils.py Outdated
unmasked_idx = y != _DEFAULT_MASKED_TARGET_CLASSIFICATION_LABEL
elif y_type == Y_Type.CONTINUOUS:
unmasked_idx = np.isfinite(y)
if "sample_domain" in params:
Collaborator

With these two lines, we rule out semi-supervised DA. I think this is a leftover from an earlier version, no?
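To make the concern concrete, here is a small sketch of the label-based filter from the snippet under review, applied to a semi-supervised case where some target samples keep their labels. The function name `unmasked_indices` and the string `y_type` values are hypothetical, the masked value `-1` is an assumption, and the contrast with a domain-based filter reflects only one possible reading of the "two lines" being discussed.

```python
import math

MASKED_CLASSIFICATION_LABEL = -1  # assumed value of the masked label


def unmasked_indices(y, y_type):
    """Return a boolean list marking samples whose label is not masked."""
    if y_type == "classification":
        return [label != MASKED_CLASSIFICATION_LABEL for label in y]
    if y_type == "continuous":
        # Masked regression labels are assumed to be NaN (non-finite).
        return [math.isfinite(label) for label in y]
    raise ValueError(f"unsupported y_type: {y_type!r}")


# Semi-supervised setting: the fourth sample is a labelled target sample.
y = [0, 1, -1, 1, -1]
sample_domain = [1, 1, -2, -2, -2]

keep_by_label = unmasked_indices(y, "classification")
keep_by_domain = [d >= 0 for d in sample_domain]

print(keep_by_label)   # [True, True, False, True, False]
print(keep_by_domain)  # [True, True, False, False, False]
```

Filtering on the label keeps the labelled target sample, whereas additionally filtering on the domain would discard it, which is the semi-supervised case the comment refers to.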


clf = make_da_pipeline(DensityReweightAdapter(), mediator, FakeEstimator())
clf = make_da_pipeline(
Shared(DensityReweightAdapter(), mask_target_labels=False),
Collaborator

Why is it false here?

Collaborator

If we mask the target labels, it breaks when we use SelectTarget for the standard scaler. That means that with the SelectTarget selector, the source domain is not propagated through the pipeline :/

Collaborator

Something to fix in another issue, I think.

@tgnassou tgnassou merged commit 8ccfd75 into scikit-adaptation:main Sep 22, 2025
16 checks passed
4 participants