Fix selection direction, scorer handling, and fit kwargs; resolve sktime doctest #182
best_index = np.argmin(scores)
# choose selection direction based on experiment tag
hib = experiment.get_tag("property:higher_or_lower_is_better", "higher")
Something must be wrong; I do not think a case distinction should happen here. Per design, score always returns "higher is better".
best_index = int(np.argmin(scores))  # lower-is-better convention
hib = experiment.get_tag("property:higher_or_lower_is_better", "higher")
Same here: no case distinction should happen here.
fkiraly left a comment
I am not sure if this is correct - are we making a sign error somewhere? At least, in the optimizers, nothing should change in my opinion, as per the design, score always returns a "higher is better" score.
What I would also suggest: wherever you noticed that the wrong parameters were selected, let's add this as a test case. General principle: if a bug gets fixed, and there was no test failure prior, a test should be added that fails before the fix and passes after. I presume it would be using a "naive experiment" where we know what the best parameters are? Or one of the toy test functions?
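A regression test along these lines could be sketched as follows. It assumes a hypothetical `ToyExperiment` (not part of the library) whose `evaluate` returns a raw objective and whose `score` standardizes it to higher-is-better, mirroring the design described above. The optimum is known in advance, so a sign error in the selection would make the check fail for at least one direction.

```python
class ToyExperiment:
    """Hypothetical stand-in with a known optimum at x = 3."""

    def __init__(self, higher_is_better=False):
        self.higher_is_better = higher_is_better

    def evaluate(self, x):
        # raw objective; its direction depends on the flag
        raw = (x - 3) ** 2
        return -raw if self.higher_is_better else raw

    def score(self, x):
        # standardized score: always higher-is-better
        raw = self.evaluate(x)
        return raw if self.higher_is_better else -raw


def select_best(experiment, grid):
    """Pick the candidate with the largest standardized score (argmax)."""
    scores = [experiment.score(x) for x in grid]
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index]


grid = [0, 1, 2, 3, 4, 5]
best_low, _ = select_best(ToyExperiment(higher_is_better=False), grid)
best_high, _ = select_best(ToyExperiment(higher_is_better=True), grid)
```

Because `score` absorbs the direction, the selection code needs no case distinction: both experiments yield the same best parameter.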
# store public attributes
self.best_index_ = best_index
self.best_score_ = scores[best_index]
self.best_score_ = float(scores[best_index])
is this necessary if the score function internally also does float coercion? I think if we have contracts, we should rely on contracts (instead of anticipating non-conformance)
metric_func = getattr(scorer, "_score_func", None)
if metric_func is None:
    metric_func = _default_metric_for(estimator)
try:
this feels risky, can we avoid this?
fkiraly left a comment
Looks good!
One issue that I have with _coerce_to_scorer is that its guarantees no longer seem to be met, i.e., that it returns an sklearn scorer. Is this true? The try/except block strikes me as particularly hacky; what are we trying to "fix"?
One option to avoid this could be to rework the metric and wrap things in a stable scorer interface that is always guaranteed to work. That way, the coupling (which you are probably trying to address) where optimizers reach into _scoring etc. is no longer needed.
How about that?
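One way to picture such a stable scorer interface is the sketch below. `StableScorer` and `mean_abs_error` are hypothetical names made up for illustration, and the call signature is simplified to `(y_true, y_pred)` rather than the sklearn scorer signature. The point is only that the metric and its direction are public, so callers never need to reach into private attributes such as `_score_func`.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class StableScorer:
    """Hypothetical wrapper: a metric plus an explicit direction."""

    metric: Callable[[Sequence[float], Sequence[float]], float]
    higher_is_better: bool = True

    def raw(self, y_true, y_pred):
        # raw metric value, in the metric's own direction
        return float(self.metric(y_true, y_pred))

    def __call__(self, y_true, y_pred):
        # standardized value: always higher-is-better
        value = self.raw(y_true, y_pred)
        return value if self.higher_is_better else -value


def mean_abs_error(y_true, y_pred):
    """Toy loss metric (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae_scorer = StableScorer(mean_abs_error, higher_is_better=False)
standardized = mae_scorer([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
raw_value = mae_scorer.raw([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

With the direction stored on the wrapper, no try/except fallback is needed to guess the sign.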
tried to refactor it - feel free to revert if you do not like it
(from my side, this is all fine now)
fkiraly left a comment
I would suggest:
- check what is happening with the jupyter notebook - why is it reformatted?
- I would recommend we add a test for the failing sign. I think doing grid search on one of the toy datasets and checking explicitly for optimal parameters should ensure the sign is correct.
f"Optimizer should select argmax of standardized score. "
f"Expected {good}, got {best_params}."
)
I think you can avoid lots of repetition by using set_params, i.e., inst = object_instance.clone().set_params(**{"experiment": exp, "param_space": see_below}).
Besides this, can we avoid hard-coding a lot of these parameters per estimator? This is not too extensible. It is fine if we use it primarily for checking the sign, but I wonder whether we can avoid all the hard coding.
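The clone/set_params pattern suggested above could look roughly like the following sketch. `MiniGridSearch`, `neg_parabola`, and `check_sign` are hypothetical stand-ins written only for this illustration; the real optimizers and test fixtures will differ. The idea is that one generic check receives the instance, a toy experiment, and a parameter space, instead of hard-coding constructor arguments per estimator.

```python
import copy


class MiniGridSearch:
    """Hypothetical optimizer exposing clone/set_params, as the comment assumes."""

    def __init__(self, experiment=None, param_space=None):
        self.experiment = experiment
        self.param_space = param_space

    def clone(self):
        return copy.deepcopy(self)

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def run(self):
        # argmax over standardized (higher-is-better) scores
        return max(self.param_space, key=lambda p: self.experiment(**p))


def neg_parabola(x):
    """Toy experiment, already higher-is-better; optimum at x = 2."""
    return -((x - 2) ** 2)


def check_sign(object_instance, exp, param_space, expected):
    """Generic sign check: same code for every optimizer instance."""
    inst = object_instance.clone().set_params(experiment=exp, param_space=param_space)
    best_params = inst.run()
    assert best_params == expected, f"Expected {expected}, got {best_params}."
    return best_params


space = [{"x": v} for v in range(5)]
result = check_sign(MiniGridSearch(), neg_parabola, space, {"x": 2})
```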
Problems
- […] params (argmin on signed scores) and didn’t consistently set best_* attributes.
- kwargs were silently dropped by the verify_fit decorator.
- […] scorer._score_func (not present for _PassthroughScorer).
- […] experiment(**params), which calls score(). score() applies a sign flip based on the experiment tag, so Grid/Random search were selecting on “signed” scores and then also making assumptions about direction (previously hardcoded argmin).
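The interaction between the sign flip in score() and the hardcoded argmin can be illustrated with a small sketch. The numbers are invented for illustration and the sign flip is written out by hand rather than calling the real score().

```python
raw_losses = [4.0, 1.0, 0.0, 1.0]   # hypothetical lower-is-better objective values
signed = [-v for v in raw_losses]   # what a higher-is-better score() would return

# buggy selection: argmin applied to scores that were already sign-flipped
buggy_index = min(range(len(signed)), key=signed.__getitem__)

# consistent selection: argmax on the standardized scores
fixed_index = max(range(len(signed)), key=signed.__getitem__)
```

Here the buggy selection lands on the configuration with the largest loss, while argmax on the standardized scores picks the configuration with the smallest loss.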
Solutions
- […] evaluate() values; set best_params_, best_index_, and compute signed best_score_ via experiment.score(...).
- _coerce_to_scorer now attaches a safe ._metric_func fallback (e.g., accuracy/r2) and robust sign inference.
- verify_fit now preserves *args, **kwargs and marks fit success.
- score() raises on "mixed" to avoid undefined behavior (users should define a concrete direction or override).
- _score_params now returns the raw evaluate() value (float), not the signed score(). Selecting the best config should use raw objective values and then choose min or max based on the tag (higher/lower). This removes ambiguity, avoids double sign logic, and makes selection correct and explicit. We still compute the public best_score_ via experiment.score(best_params) so external consumers see the standardized “higher-is-better” score_.
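As a rough illustration of the verify_fit change, a decorator that forwards all arguments and records a successful fit might look like the sketch below. This is not the PR's actual implementation: the `_is_fitted` attribute name and the `Dummy` class are assumptions made for the example.

```python
import functools


def verify_fit(fit_method):
    """Hypothetical decorator: forward all arguments and flag a successful fit."""

    @functools.wraps(fit_method)
    def wrapper(self, *args, **kwargs):
        result = fit_method(self, *args, **kwargs)  # nothing is dropped
        self._is_fitted = True  # only reached if fit did not raise
        return result

    return wrapper


class Dummy:
    """Toy estimator that records what its fit method received."""

    @verify_fit
    def fit(self, X, y=None, **fit_params):
        self.seen_ = {"X": X, "y": y, **fit_params}
        return self


est = Dummy().fit([1, 2], y=[0, 1], sample_weight=[0.5, 0.5])
```

Using functools.wraps keeps the wrapped method's name and docstring intact, which matters for doctests and introspection.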