[TO_REVIEW] Add automatic target label masking to prevent data leakage #330
…a da_pipeline with SelectSourceTarget
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #330 +/- ##
==========================================
+ Coverage 96.18% 96.47% +0.29%
==========================================
Files 63 50 -13
Lines 6919 6044 -875
==========================================
- Hits 6655 5831 -824
+ Misses 264 213 -51
    PCA(n_components=2),
    SelectSource(SVC()),
    default_selector=SelectSourceTarget,
    mask_target_labels=False,
I don't understand why SelectSourceTarget is used here.
When you do that, you have one PCA for source and one for target, but the SVC is trained only on source.
So we should be able to mask target labels with SelectSourceTarget, no? We don't want data leakage even if we have one PCA for source and one for target.
skada/_utils.py
Outdated
    unmasked_idx = y != _DEFAULT_MASKED_TARGET_CLASSIFICATION_LABEL
elif y_type == Y_Type.CONTINUOUS:
    unmasked_idx = np.isfinite(y)
if "sample_domain" in params:
With these two lines, we rule out semi-supervised DA. I think they are a leftover from an earlier version, no?
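For context, here is a minimal sketch of what the label-based filtering in the snippet above appears to do. It is not the code from skada/_utils.py: the helper names `unmasked_index` and `drop_masked` are hypothetical, and the masked classification value of -1 is an assumption. It only illustrates that filtering on the label alone keeps any target sample that still carries a label, which is the case semi-supervised DA relies on.

```python
import numpy as np

# Assumed stand-in for _DEFAULT_MASKED_TARGET_CLASSIFICATION_LABEL.
MASKED_CLASSIFICATION_LABEL = -1


def unmasked_index(y, continuous=False):
    """Boolean index of samples whose label is not masked (hypothetical helper)."""
    y = np.asarray(y)
    if continuous:
        # Masked regression labels are assumed to be NaN (or otherwise non-finite).
        return np.isfinite(y)
    return y != MASKED_CLASSIFICATION_LABEL


def drop_masked(X, y, sample_domain, continuous=False):
    """Keep only the samples with an unmasked label, whatever their domain."""
    idx = unmasked_index(y, continuous)
    return X[idx], y[idx], sample_domain[idx]


X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, -1, 1, -1])
# Assumed convention: positive values mark source samples, negative ones target samples.
sample_domain = np.array([1, 1, -2, -2, -2])

X_kept, y_kept, sd_kept = drop_masked(X, y, sample_domain)
# The labeled target sample (index 3) survives because only the label is inspected.
```

Any additional filtering keyed on sample_domain would have to preserve such labeled target samples for semi-supervised use to keep working.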
clf = make_da_pipeline(DensityReweightAdapter(), mediator, FakeEstimator())
clf = make_da_pipeline(
    Shared(DensityReweightAdapter(), mask_target_labels=False),
If we mask the target labels, it breaks when we use SelectTarget for the standard scaler. That means that with the SelectTarget selector, the source domain is not propagated through the pipeline :/
Something to fix in another issue, I think.
This PR introduces a mechanism to automatically mask target labels in unsupervised domain adaptation settings. This feature prevents data leakage from the target domain when the estimators are fitted.
Key Changes:
Automatic Label Masking: A new _auto_mask_target_labels method has been added to automatically replace target labels with a default masked value before they are passed to the estimators. This is enabled by default to ensure that no data leakage can occur.
Control via mask_target_labels parameter: The masking behavior can be controlled with the mask_target_labels parameter in make_da_pipeline and the selectors (Shared, PerDomain, etc.).
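To illustrate the idea, here is a minimal NumPy sketch of target label masking. It is not the `_auto_mask_target_labels` implementation from this PR: the function `mask_target_labels` is a hypothetical standalone helper, and it assumes that target samples have a negative sample_domain value, that masked classification labels are -1, and that masked regression labels are NaN.

```python
import numpy as np

# Assumed mask values; the real defaults live in the library, not in this sketch.
MASKED_CLASSIFICATION_LABEL = -1
MASKED_REGRESSION_LABEL = np.nan


def mask_target_labels(y, sample_domain, continuous=False):
    """Return a copy of y with target-domain labels replaced by the mask value."""
    y = np.asarray(y)
    sample_domain = np.asarray(sample_domain)
    # Assumed convention: negative sample_domain values mark target samples.
    target_idx = sample_domain < 0
    if continuous:
        # astype returns a copy; NaN can only be stored in a float array.
        y_masked = y.astype(float)
        y_masked[target_idx] = MASKED_REGRESSION_LABEL
    else:
        y_masked = y.copy()
        y_masked[target_idx] = MASKED_CLASSIFICATION_LABEL
    return y_masked


y = np.array([0, 1, 1, 0, 1])
sample_domain = np.array([1, 1, 1, -2, -2])

y_masked = mask_target_labels(y, sample_domain)
y_reg_masked = mask_target_labels([0.5, 1.5], [1, -1], continuous=True)
```

Working on a copy leaves the caller's labels untouched, so the true target labels remain available for scoring after the fit.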