feat: update prompts and descriptions for data science components (microsoft#731)

you-n-g · web-flow · commit 34eb4e883bb0 · 2025-04-01T20:54:42.000+08:00
* docs: Update prompts and descriptions for data science components

* chore: Remove outdated comments from conf.py

* feat: Add metric_name attribute to DataScienceScen class

* style: Update description in prompts.yaml and reorder metric_name init

* docs: Update prompts.yaml with feature engineering guidelines
diff --git a/rdagent/app/data_science/conf.py b/rdagent/app/data_science/conf.py
@@ -23,11 +23,6 @@ class DataScienceBasePropSetting(KaggleBasePropSetting):
 
     #### enable specification
     spec_enabled: bool = True
-    # - [ ] rdagent/components/coder/data_science/raw_data_loader/__init__.py: make spec implementation optional
-    # - [ ] move spec responsibility into  rdagent/scenarios/data_science/share.yaml
-    # - [ ] make all spec.md optional;  but replace it with the test & responsibility.   "spec/.*\.md".
-    # - [ ] replace yaml render with target test.  "spec > .yaml data_science !out_spec !task_spec model_spec"
-    # - [ ] At the head of all tests, emphasis the function to be tested.
 
 
 DS_RD_SETTING = DataScienceBasePropSetting()
diff --git a/rdagent/components/coder/data_science/ensemble/__init__.py b/rdagent/components/coder/data_science/ensemble/__init__.py
@@ -95,7 +95,8 @@ def implement_one_task(
                 .render(
                     model_names=[
                         fn[:-3] for fn in workspace.file_dict.keys() if fn.startswith("model_") and "test" not in fn
-                    ]
+                    ],
+                    metric_name=self.scen.metric_name,
                 )
             )
             code_spec = T("scenarios.data_science.share:component_spec.general").r(
diff --git a/rdagent/components/coder/data_science/raw_data_loader/prompts.yaml b/rdagent/components/coder/data_science/raw_data_loader/prompts.yaml
@@ -109,10 +109,11 @@ spec:
       2. Precautions for Feature Engineering:
         - Well handle the shape of the data:
           - The sample size of the train data and the test data should be the same in all scenarios.
-          - To most of the scenario, the input shape and the output shape should be exactly the same.
-          - To some tabular data, you may add or remove some columns so your inferred column number may be unsure.
+          - To some tabular or time-series data, you may add or remove some columns so your inferred column number may be unsure.
+          - For scenarios where each dimension does not have a special meaning (like image, audio, and so on), the input shape and the output shape should be exactly the same in most cases unless there is a compelling reason to change them.
         - Integration with the Model Pipeline:
           - If feature engineering is deferred to the model pipeline for better overall performance, state explicitly that it will be handled at the model stage.
+            - Model-related operations should not be implemented in this step. (e.g., it uses tools combined with models like torch.Dataset with rich data transformation/augmentation)
           - Otherwise, ensure this function applies all required transformations while avoiding data leakage.
         - General Considerations:
           - Ensure scalability for large datasets.
@@ -174,6 +175,7 @@ spec:
       4. Notes:
         - Align `DT` (data type) with the definitions used in Feature Engineering specifications.
         - The device has GPU support, so you are encouraged to use it for training if necessary to accelerate the process.
+        - Some data transformations/augmentations can be included in this step (e.g., data tools provided by TensorFlow and Torch)
 
       {% if latest_spec %}
       5. Former Specification:
diff --git a/rdagent/scenarios/data_science/proposal/prompts.yaml b/rdagent/scenarios/data_science/proposal/prompts.yaml
@@ -85,7 +85,7 @@ task_gen:
     {% endif %}
 
     {% if hypothesis is none %}
-    Since we are at the very beginning stage, we plan to start from a very simple task. To each component, please only generate the task to implement the most simple and basic function of the component. For example, the feature engineering should only implement the function which output the raw data without any transformation. The model component only uses the most basic and easy to implement model without any tuning. The ensemble component only uses the simplest ensemble method. The main focus at this stage is to build the first runnable version of the solution.
+    Since we are at the very beginning stage, we plan to start from a very simple task. To each component, please only generate the task to implement the most simple and basic function of the component. For example, the feature engineering should only implement the function which output the raw data without any transformation. The model component only uses the most basic and easy to implement model without any tuning (but the model type should suit the task). The ensemble component only uses the simplest ensemble method. The main focus at this stage is to build the first runnable version of the solution.
     {% else %}
     The user will use the {{ targets }} generated to do some experiments. The user will provide this information to you:
     1. The target hypothesis you are targeting to generate {{ targets }} for.
@@ -358,5 +358,5 @@ output_format:
     Design a specific and detailed workflow task based on the given hypothesis. The output should be detailed enough to directly implement the corresponding code.
     The output should follow JSON format. The schema is as follows:
     {
-        "description": "A precise and comprehensive description of the workflow",
+        "description": "A precise and comprehensive description of the main workflow script (`main.py`)",
     }
diff --git a/rdagent/scenarios/data_science/scen/__init__.py b/rdagent/scenarios/data_science/scen/__init__.py
@@ -228,6 +228,10 @@ class DataScienceScen(Scenario):
     """Data Science Scenario"""
 
     def __init__(self, competition: str) -> None:
+        self.metric_name: str | None = (
+            None  # It is None when initialization. After analysing, we'll assign the metric name
+        )
+
         self.competition = competition
         self.raw_description = self._get_description()
         self.processed_data_folder_description = self._get_data_folder_description()
diff --git a/rdagent/scenarios/data_science/scen/prompts.yaml b/rdagent/scenarios/data_science/scen/prompts.yaml
@@ -19,6 +19,7 @@ scenario_description: |-
   {% endif %}
 
   {% if eda_output is not none %}
+  ------Data Overview(EDA)------
   {{ eda_output }}
   {% endif %}
 
@@ -83,4 +84,4 @@ rich_style_description: |-
 
   #### [Objective](#_summary)
 
-  To automatically optimize performance metrics within the validation set, ultimately discovering the most efficient features and models through autonomous research and development.
+  To automatically optimize performance metrics within the validation set, ultimately discovering the most efficient features and models through autonomous research and development.
diff --git a/rdagent/scenarios/data_science/share.yaml b/rdagent/scenarios/data_science/share.yaml
@@ -67,8 +67,10 @@ component_description:
     Loads and preprocesses competition data, ensuring proper data types, handling missing values, and providing an exploratory data analysis summary.
   FeatureEng: |-
     Transforms raw data into meaningful features while maintaining shape consistency, avoiding data leakage, and optimizing for model performance.
+    It should be model-agnostic (data transformations/augmentations that apply only to specific model frameworks should not be included here).
   Model: |-
     Perform one of three tasks: model building, which develops a model to address the problem; model tuning, which optimizes an existing model for better performance; or model removal, which discards models that do not contribute effectively.
+    Handle data operations or augmentations that are closely tied to the model framework, such as tools provided by PyTorch or TensorFlow.
   Ensemble: |-
     Combines predictions from multiple models using ensemble strategies, evaluates their performance, and generates the final test predictions.
   Workflow: |-
@@ -106,11 +108,12 @@ component_spec:
   FeatureEng: |-
     1. Well handle the shape of the data
       - The sample size of the train data and the test data should be the same in all scenarios.
-      - To most of the scenario, the input shape and the output shape should be exactly the same.
-      - To some tabular data, you may add or remove some columns so your inferred column number may be unsure.
+      - To some tabular or time-series data, you may add or remove some columns so your inferred column number may be unsure.
+      - For scenarios where each dimension does not have a special meaning (like image, audio, and so on), the input shape and the output shape should be exactly the same in most cases unless there is a compelling reason to change them.
 
     2. Integration with the Model Pipeline:
       - If feature engineering is deferred to the model pipeline for better overall performance, state explicitly that it will be handled at the model stage.
+        - Model-related operations should not be implemented in this step. (e.g., it uses tools combined with models like torch.Dataset with rich data transformation/augmentation)
       - Otherwise, ensure this function applies all required transformations while avoiding data leakage.
 
     3. General Considerations:
@@ -129,6 +132,7 @@ component_spec:
   Model: |-
     - Do not use progress bars (e.g., `tqdm`) in the implementation.
     - The device has GPU support, so you are encouraged to use it for training if necessary to accelerate the process.
+    - Some data transformations/augmentations can be included in this step (e.g., data tools provided by TensorFlow and Torch)
   
   Ensemble: |-
     1. Input Validation:
@@ -225,4 +229,4 @@ component_spec:
       test_preds_dict[model_module.__name__] = test_pred
       {% endfor %}
       final_pred = ensemble_workflow(test_preds_dict, val_preds_dict, val_y)
-      {% endraw %}
+      {% endraw %}
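
Note on the ensemble hunk: the sketch below illustrates, under stated assumptions, what the changed `.render(...)` call receives. `SketchScen`, `collect_model_names`, and `render_kwargs` are hypothetical names introduced only for illustration and are not part of the repository. Only the list-comprehension filter and the `metric_name` default of `None` are taken from the diff.

```python
from __future__ import annotations


class SketchScen:
    """Hypothetical stand-in for DataScienceScen; only metric_name is modelled."""

    def __init__(self, competition: str) -> None:
        # As in the diff: unknown at construction, assigned after analysis.
        self.metric_name: str | None = None
        self.competition = competition


def collect_model_names(file_dict: dict[str, str]) -> list[str]:
    # Same filter as the ensemble hunk: keep files whose name starts with
    # "model_", skip any name containing "test", and drop the last three
    # characters (the ".py" suffix).
    return [fn[:-3] for fn in file_dict if fn.startswith("model_") and "test" not in fn]


def render_kwargs(scen: SketchScen, file_dict: dict[str, str]) -> dict[str, object]:
    # Keyword arguments analogous to those passed to the template render call.
    return {
        "model_names": collect_model_names(file_dict),
        "metric_name": scen.metric_name,
    }


if __name__ == "__main__":
    files = {"model_xgb.py": "", "model_xgb_test.py": "", "feature.py": ""}
    scen = SketchScen("demo-competition")
    print(render_kwargs(scen, files))  # metric_name is still None here
    scen.metric_name = "AUC"  # assumed to be set once the analysis step has run
    print(render_kwargs(scen, files))
```

Because `metric_name` starts as `None`, a template that renders it before the analysis step has run would likely see `None`, so the template may need to tolerate that value.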
