
Commit acc97a8

fix: add ensemble test, change to "use cross-validation if possible" in workflow spec (microsoft#634)

* change to "use cross-validation if possible" in workflow spec
* Limit the evaluation indicator to only one
* add metric tips
* string change
1 parent edb552e commit acc97a8

File tree

2 files changed (+3, −2 lines)


rdagent/components/coder/data_science/ensemble/eval_tests/ensemble_test.txt

Lines changed: 1 addition & 0 deletions

@@ -95,5 +95,6 @@ assert model_set_in_scores == set({{model_names}}).union({"ensemble"}), (
     f"The scores dataframe does not contain the correct model names as index.\ncorrect model names are: {{model_names}} + ['ensemble']\nscore_df is:\n{score_df}"
 )
 assert score_df.index.is_unique, "The scores dataframe has duplicate model names."
+assert len(score_df.columns) == 1, f"The scores dataframe should have exactly one column for the scores of the evaluation indicator, but has these columns: {score_df.columns.tolist()}"
 
 print("Ensemble test end.")
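The new assertion tightens the contract for `scores.csv`: one row per model plus an `"ensemble"` row, unique index, and exactly one metric column. A minimal sketch of a `score_df` that satisfies these checks, using hypothetical model names and AUC values (both are assumptions, not from the repo):

```python
import pandas as pd

# Hypothetical model names and metric values; the real ones come from the workflow.
model_names = ["lgbm", "xgboost"]
score_df = pd.DataFrame(
    {"AUC": [0.91, 0.89, 0.93]},          # exactly one column: the evaluation indicator
    index=model_names + ["ensemble"],      # one row per model, plus the ensemble
)

# The same checks the updated ensemble test performs.
assert set(score_df.index) == set(model_names).union({"ensemble"})
assert score_df.index.is_unique, "The scores dataframe has duplicate model names."
assert len(score_df.columns) == 1, (
    f"Expected exactly one score column, got: {score_df.columns.tolist()}"
)
score_df.to_csv("scores.csv", index=True)
```

A `score_df` with extra columns (e.g. one column per fold) would now fail the test, so per-fold results have to be aggregated before saving.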

rdagent/components/coder/data_science/raw_data_loader/prompts.yaml

Lines changed: 2 additions & 2 deletions

@@ -204,7 +204,7 @@ spec:
     - Verify that `val_label` is provided and matches the length of `val_preds_dict` predictions.
     - Handle empty or invalid inputs gracefully with appropriate error messages.
   - Metric Calculation and Storage:
-    - Calculate the metric for each model and ensemble strategy, and save the results in `scores.csv`, e.g.:
+    - Calculate the metric (mentioned in the evaluation section of the competition information) for each model and ensemble strategy, and save the results in `scores.csv`, e.g.:
      ```python
      scores = {}
      for model_name, val_pred in val_preds_dict.items():
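The snippet embedded in the spec is truncated by the diff context. A minimal completion under stated assumptions: the labels, predictions, and the accuracy metric below are illustrative placeholders, since the actual metric is whatever the competition's evaluation section names:

```python
import pandas as pd

# Hypothetical validation labels and per-model predictions (stand-ins for
# the real val_label / val_preds_dict produced by the workflow).
val_label = [0, 1, 1, 0]
val_preds_dict = {
    "lgbm": [0, 1, 1, 1],
    "xgboost": [0, 1, 0, 0],
    "ensemble": [0, 1, 1, 0],
}

scores = {}
for model_name, val_pred in val_preds_dict.items():
    # Accuracy is an assumed metric; swap in the competition's actual metric.
    scores[model_name] = sum(p == y for p, y in zip(val_pred, val_label)) / len(val_label)

# One row per model (including "ensemble"), one column for the single metric.
score_df = pd.Series(scores, name="accuracy").to_frame()
score_df.to_csv("scores.csv", index=True)
```

Keeping a single metric column here is what makes the output pass the new one-column assertion in `ensemble_test.txt`.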
@@ -259,7 +259,7 @@ spec:
 
 3. Dataset Splitting
    - The dataset returned by `load_data` is not split into training and testing sets, so the dataset splitting should happen after calling `feat_eng`.
-   - Decide whether to use a **static train-test split** or **cross-validation**, based on what is most suitable given the `Competition Information`.
+   - Use cross-validation if possible, as it provides a more robust evaluation of the model's performance.
 
 4. Submission File:
    - Save the final predictions as `submission.csv`, ensuring the format matches the competition requirements (refer to `sample_submission` in the Folder Description for the correct structure).
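The spec change prefers cross-validation over a single static split because every sample then serves in a validation fold, giving a lower-variance estimate of model performance. A minimal sketch of a K-fold split that could run after `feat_eng`; `kfold_indices` is a hypothetical helper, not part of the repo:

```python
import numpy as np

def kfold_indices(n_samples: int, n_splits: int = 5, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx

# Every sample lands in exactly one validation fold.
all_val = np.concatenate([val for _, val in kfold_indices(10, 5)])
assert sorted(all_val.tolist()) == list(range(10))
```

For competitions with grouped or time-ordered data a plain shuffled K-fold leaks information, which is presumably why the spec says "if possible" rather than mandating it.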

0 commit comments