Commit 34eb4e8

feat: update prompts and descriptions for data science components (microsoft#731)
* docs: Update prompts and descriptions for data science components
* chore: Remove outdated comments from conf.py
* feat: Add metric_name attribute to DataScienceScen class
* style: Update description in prompts.yaml and reorder metric_name init
* docs: Update prompts.yaml with feature engineering guidelines
1 parent 7352755 commit 34eb4e8

7 files changed, +21 -14 lines changed

rdagent/app/data_science/conf.py

Lines changed: 0 additions & 5 deletions
@@ -23,11 +23,6 @@ class DataScienceBasePropSetting(KaggleBasePropSetting):

 #### enable specification
 spec_enabled: bool = True
-# - [ ] rdagent/components/coder/data_science/raw_data_loader/__init__.py: make spec implementation optional
-# - [ ] move spec responsibility into rdagent/scenarios/data_science/share.yaml
-# - [ ] make all spec.md optional; but replace it with the test & responsibility. "spec/.*\.md".
-# - [ ] replace yaml render with target test. "spec > .yaml data_science !out_spec !task_spec model_spec"
-# - [ ] At the head of all tests, emphasis the function to be tested.


 DS_RD_SETTING = DataScienceBasePropSetting()

rdagent/components/coder/data_science/ensemble/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -95,7 +95,8 @@ def implement_one_task(
 .render(
 model_names=[
 fn[:-3] for fn in workspace.file_dict.keys() if fn.startswith("model_") and "test" not in fn
-]
+],
+metric_name=self.scen.metric_name,
 )
 )
 code_spec = T("scenarios.data_science.share:component_spec.general").r(
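
The `model_names` filter in this hunk can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `file_dict` with made-up file names; the real contents of `workspace.file_dict` and the template that receives `metric_name` are not shown in this diff.

```python
# Hypothetical stand-in for workspace.file_dict (file name -> source text).
file_dict = {
    "load_data.py": "...",
    "model_xgb.py": "...",
    "model_nn.py": "...",
    "model_xgb_test.py": "...",
    "ensemble.py": "...",
}

# Same filter as in the diff: keep model files, skip their tests, drop ".py".
model_names = [
    fn[:-3] for fn in file_dict.keys() if fn.startswith("model_") and "test" not in fn
]

# Hypothetical keyword arguments that a template render call could receive.
# metric_name is None here because the scenario has not been analysed yet.
render_kwargs = {"model_names": model_names, "metric_name": None}
print(render_kwargs)
```

Note that `"test" not in fn` is a substring check, so a model file whose name merely contains `test` would also be skipped.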

rdagent/components/coder/data_science/raw_data_loader/prompts.yaml

Lines changed: 4 additions & 2 deletions
@@ -109,10 +109,11 @@ spec:
 2. Precautions for Feature Engineering:
 - Well handle the shape of the data:
 - The sample size of the train data and the test data should be the same in all scenarios.
-- To most of the scenario, the input shape and the output shape should be exactly the same.
-- To some tabular data, you may add or remove some columns so your inferred column number may be unsure.
+- To some tabular or time-series data, you may add or remove some columns so your inferred column number may be unsure.
+- For scenarios where each dimension does not have a special meaning (like image, audio, and so on), the input shape and the output shape should be exactly the same in most cases unless there is a compelling reason to change them.
 - Integration with the Model Pipeline:
 - If feature engineering is deferred to the model pipeline for better overall performance, state explicitly that it will be handled at the model stage.
+- Model-related operations should not be implemented in this step. (e.g., it uses tools combined with models like torch.Dataset with rich data transformation/augmentation)
 - Otherwise, ensure this function applies all required transformations while avoiding data leakage.
 - General Considerations:
 - Ensure scalability for large datasets.
@@ -174,6 +175,7 @@ spec:
 4. Notes:
 - Align `DT` (data type) with the definitions used in Feature Engineering specifications.
 - The device has GPU support, so you are encouraged to use it for training if necessary to accelerate the process.
+- Some data transformations/augmentations can be included in this step (e.g., data tools provided by TensorFlow and Torch)

 {% if latest_spec %}
 5. Former Specification:
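
The shape precautions added to this prompt can be read as a simple check. The sketch below is one possible interpretation, not code from the repository: it assumes the "sample size" rule means feature engineering must not change the number of rows, and it ignores the "compelling reason" exception. The function name `check_feature_shape` and the `columns_may_change` flag are hypothetical.

```python
from typing import Tuple


def check_feature_shape(
    before: Tuple[int, ...],
    after: Tuple[int, ...],
    columns_may_change: bool,
) -> bool:
    """Return True if `after` is an acceptable shape for data of shape `before`."""
    # Assumption: the number of samples (first axis) must never change.
    if before[0] != after[0]:
        return False
    # Tabular or time-series data may gain or lose columns.
    if columns_may_change:
        return True
    # Otherwise (e.g. image or audio data) the full shape is expected to match.
    return before == after


print(check_feature_shape((100, 10), (100, 25), columns_may_change=True))
print(check_feature_shape((32, 3, 224, 224), (32, 1, 224, 224), columns_may_change=False))
```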

rdagent/scenarios/data_science/proposal/prompts.yaml

Lines changed: 2 additions & 2 deletions
@@ -85,7 +85,7 @@ task_gen:
 {% endif %}

 {% if hypothesis is none %}
-Since we are at the very beginning stage, we plan to start from a very simple task. To each component, please only generate the task to implement the most simple and basic function of the component. For example, the feature engineering should only implement the function which output the raw data without any transformation. The model component only uses the most basic and easy to implement model without any tuning. The ensemble component only uses the simplest ensemble method. The main focus at this stage is to build the first runnable version of the solution.
+Since we are at the very beginning stage, we plan to start from a very simple task. To each component, please only generate the task to implement the most simple and basic function of the component. For example, the feature engineering should only implement the function which output the raw data without any transformation. The model component only uses the most basic and easy to implement model without any tuning (but the model type should suit the task). The ensemble component only uses the simplest ensemble method. The main focus at this stage is to build the first runnable version of the solution.
 {% else %}
 The user will use the {{ targets }} generated to do some experiments. The user will provide this information to you:
 1. The target hypothesis you are targeting to generate {{ targets }} for.
@@ -358,5 +358,5 @@ output_format:
 Design a specific and detailed workflow task based on the given hypothesis. The output should be detailed enough to directly implement the corresponding code.
 The output should follow JSON format. The schema is as follows:
 {
-"description": "A precise and comprehensive description of the workflow",
+"description": "A precise and comprehensive description of the main workflow script (`main.py`)",
 }

rdagent/scenarios/data_science/scen/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -228,6 +228,10 @@ class DataScienceScen(Scenario):
 """Data Science Scenario"""

 def __init__(self, competition: str) -> None:
+self.metric_name: str | None = (
+None # It is None when initialization. After analysing, we'll assign the metric name
+)
+
 self.competition = competition
 self.raw_description = self._get_description()
 self.processed_data_folder_description = self._get_data_folder_description()
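
Because `metric_name` starts as `None` and is only assigned after analysis, code that reads it has to cope with the unset state. The following is a minimal sketch of that pattern, assuming a reduced stand-in class; `ScenSketch`, `analyze`, and `describe_metric` are hypothetical names and not part of `DataScienceScen`.

```python
from typing import Optional


class ScenSketch:
    """Hypothetical, reduced stand-in for DataScienceScen."""

    def __init__(self, competition: str) -> None:
        # Unknown at construction time, as in the diff.
        self.metric_name: Optional[str] = None
        self.competition = competition

    def analyze(self, metric_name: str) -> None:
        # Hypothetical analysis step that fills in the metric name.
        self.metric_name = metric_name


def describe_metric(scen: ScenSketch) -> str:
    # Consumers handle the not-yet-analysed case explicitly.
    if scen.metric_name is None:
        return "metric not determined yet"
    return f"metric: {scen.metric_name}"


scen = ScenSketch("example-competition")
before = describe_metric(scen)
scen.analyze("accuracy")
after = describe_metric(scen)
print(before, "|", after)
```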

rdagent/scenarios/data_science/scen/prompts.yaml

Lines changed: 2 additions & 1 deletion
@@ -19,6 +19,7 @@ scenario_description: |-
 {% endif %}

 {% if eda_output is not none %}
+------Data Overview(EDA)------
 {{ eda_output }}
 {% endif %}

@@ -83,4 +84,4 @@ rich_style_description: |-

 #### [Objective](#_summary)

-To automatically optimize performance metrics within the validation set, ultimately discovering the most efficient features and models through autonomous research and development.
+To automatically optimize performance metrics within the validation set, ultimately discovering the most efficient features and models through autonomous research and development.

rdagent/scenarios/data_science/share.yaml

Lines changed: 7 additions & 3 deletions
@@ -67,8 +67,10 @@ component_description:
 Loads and preprocesses competition data, ensuring proper data types, handling missing values, and providing an exploratory data analysis summary.
 FeatureEng: |-
 Transforms raw data into meaningful features while maintaining shape consistency, avoiding data leakage, and optimizing for model performance.
+It should be model-agnostic (data transformations/augmentations that apply only to specific model frameworks should not be included here).
 Model: |-
 Perform one of three tasks: model building, which develops a model to address the problem; model tuning, which optimizes an existing model for better performance; or model removal, which discards models that do not contribute effectively.
+Handle data operations or augmentations that are closely tied to the model framework, such as tools provided by PyTorch or TensorFlow.
 Ensemble: |-
 Combines predictions from multiple models using ensemble strategies, evaluates their performance, and generates the final test predictions.
 Workflow: |-
@@ -106,11 +108,12 @@ component_spec:
 FeatureEng: |-
 1. Well handle the shape of the data
 - The sample size of the train data and the test data should be the same in all scenarios.
-- To most of the scenario, the input shape and the output shape should be exactly the same.
-- To some tabular data, you may add or remove some columns so your inferred column number may be unsure.
+- To some tabular or time-series data, you may add or remove some columns so your inferred column number may be unsure.
+- For scenarios where each dimension does not have a special meaning (like image, audio, and so on), the input shape and the output shape should be exactly the same in most cases unless there is a compelling reason to change them.

 2. Integration with the Model Pipeline:
 - If feature engineering is deferred to the model pipeline for better overall performance, state explicitly that it will be handled at the model stage.
+- Model-related operations should not be implemented in this step. (e.g., it uses tools combined with models like torch.Dataset with rich data transformation/augmentation)
 - Otherwise, ensure this function applies all required transformations while avoiding data leakage.

 3. General Considerations:
@@ -129,6 +132,7 @@ component_spec:
 Model: |-
 - Do not use progress bars (e.g., `tqdm`) in the implementation.
 - The device has GPU support, so you are encouraged to use it for training if necessary to accelerate the process.
+- Some data transformations/augmentations can be included in this step (e.g., data tools provided by TensorFlow and Torch)

 Ensemble: |-
 1. Input Validation:
1. Input Validation:
@@ -225,4 +229,4 @@ component_spec:
225229
test_preds_dict[model_module.__name__] = test_pred
226230
{% endfor %}
227231
final_pred = ensemble_workflow(test_preds_dict, val_preds_dict, val_y)
228-
{% endraw %}
232+
{% endraw %}
