fix: refine prompt to generate the simplest task in init stage (microsoft#546)

peteryang1 · web-flow · commit 9d6feed28ce0 · 2025-01-27T00:51:40.000+08:00
* refine prompt to generate the simplest task in init stage

* improve dtype check in feature test
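
The relaxed dtype check introduced in feature_test.txt can be read as a small predicate. The following is a minimal sketch of that rule, not the code from the commit: the helper name `dtypes_allowed` is hypothetical, and dtypes are compared by their string names (for example "float64") so the sketch runs without pandas or NumPy installed.

```python
def dtypes_allowed(loaded_dtypes, engineered_dtypes):
    """Return True if the feature-engineering dtypes pass the relaxed check.

    Hypothetical helper mirroring the assertion in the diff:
    - if the data loader produced exactly one dtype and it is float64 or
      float32, any set of feature-engineering dtypes is accepted;
    - otherwise the unique dtypes must match exactly.
    Dtypes are normalised to strings, which is an assumption of this sketch.
    """
    loaded = sorted({str(d) for d in loaded_dtypes})
    engineered = sorted({str(d) for d in engineered_dtypes})

    # First branch of the assertion: loader output is purely float.
    loader_is_all_float = len(loaded) == 1 and loaded[0] in ("float64", "float32")

    # Second branch: the unique dtype sets are identical.
    return loader_is_all_float or engineered == loaded


if __name__ == "__main__":
    # All-float loader output: new dtypes from feature engineering are tolerated.
    print(dtypes_allowed(["float64", "float64"], ["float64", "int64"]))
    # Mixed loader output: feature engineering must not introduce new dtypes.
    print(dtypes_allowed(["int64"], ["int64", "object"]))
```

With pandas DataFrames, the assumed call would be `dtypes_allowed(X_loaded.dtypes, X.dtypes)`, since iterating over `DataFrame.dtypes` yields dtype objects whose string form is the dtype name. Note that the first branch only inspects the loader dtypes, so an all-float input places no restriction on what feature engineering outputs.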
diff --git a/rdagent/components/coder/data_science/feature/eval_tests/feature_test.txt b/rdagent/components/coder/data_science/feature/eval_tests/feature_test.txt
@@ -57,9 +57,14 @@ if isinstance(X, pd.DataFrame) and isinstance(X_test, pd.DataFrame):
     assert get_column_list(X) == get_column_list(X_test), "Mismatch in column names of training and test data."
 
 if isinstance(X, pd.DataFrame):
-    assert sorted(X.dtypes.unique().tolist()) == sorted(
-        X_loaded.dtypes.unique().tolist()
-    ), f"feature engineering has produced new data types which is not allowed, data loader data types are {X_loaded.dtypes.unique().tolist()} and feature engineering data types are {X.dtypes.unique().tolist()}"
+    X_dtypes_unique_sorted = sorted(X.dtypes.unique().tolist())
+    X_loaded_dtypes_unique_sorted = sorted(X_loaded.dtypes.unique().tolist())
+    assert (
+        len(X_loaded_dtypes_unique_sorted) == 1
+        and (X_loaded_dtypes_unique_sorted[0] == np.float64 or X_loaded_dtypes_unique_sorted[0] == np.float32)
+    ) or (
+        X_dtypes_unique_sorted == X_loaded_dtypes_unique_sorted
+    ), f"feature engineering has produced new data types which is not allowed, data loader data types are {X_loaded_dtypes_unique_sorted} and feature engineering data types are {X_dtypes_unique_sorted}"
 
 print(
     "Feature Engineering test passed successfully. All checks including length, width, and data types have been validated."
diff --git a/rdagent/scenarios/data_science/proposal/prompts.yaml b/rdagent/scenarios/data_science/proposal/prompts.yaml
@@ -73,7 +73,7 @@ task_gen: # It is deprecated now, please refer to direct_exp_gen
     {% if hypothesis is not none %}
     The user is trying to generate new {{ targets }} based on the hypothesis generated in the previous step. 
     {% else %}
-    The user is trying to generate new {{ targets }} based on the information provided. 
+    The user is trying to generate a very simple new {{ targets }} based on the information provided. 
     {% endif %}
     The {{ targets }} are used in certain scenario, the scenario is as follows:
     {{ scenario }}
@@ -84,7 +84,9 @@ task_gen: # It is deprecated now, please refer to direct_exp_gen
     Your task should adhere to the specification above.
     {% endif %}
 
-    {% if hypothesis is not none %}
+    {% if hypothesis is none %}
+    Since we are at the very beginning stage, we plan to start with a very simple task. For each component, please generate only the task that implements the simplest, most basic function of the component. For example, the feature engineering component should only implement a function that outputs the raw data without any transformation. The model component should use only the most basic, easy-to-implement model without any tuning. The ensemble component should use only the simplest ensemble method. The main focus at this stage is to build the first runnable version of the solution.
+    {% else %}
     The user will use the {{ targets }} generated to do some experiments. The user will provide this information to you:
     1. The target hypothesis you are targeting to generate {{ targets }} for.
     2. The hypothesis generated in the previous steps and their corresponding feedbacks.
@@ -260,7 +262,7 @@ component_gen:
 
     Please select the component you are going to improve the latest implementation or sota implementation.
 
-    Please generate the output following the format below:
+    Please generate the output in JSON format following the format below:
     {{ component_output_format }}
 
   user: |-