
Commit 0cd250e

Authored by RolandMinrui, Xu, jingyuanlm, and you-n-g
feat: enable drafting with knowledge (microsoft#998)
* add pipeline for drafting v2
* fix the pipeline and add general knowledge
* debug
* fix bug
* fix bug
* change draft version1
* add function calling to task gen
* fix circular import bug
* change draft version3
* exp1_test
* feat: add DraftRouterExpGen and make summarizer configurable
* Update rdagent/scenarios/data_science/proposal/exp_gen/proposal.py
* change code structure
* stashed changes
* test
* test1
* revert conf.py
* add runtime environment info to general knowledge
* remove redundant code
* clean code
* remove files
* reformat
* fix bug
* fix bug
* simplify code
* fix minor bug
* fix bug and reformat
* revert config
* remove unused prompt
* add general knowledge
* fix ci

---------

Co-authored-by: Xu <v-xuminrui@microsoft.com>
Co-authored-by: jingyuanlm <842442862@qq.com>
Co-authored-by: Young <afe.young@gmail.com>
Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
1 parent ebfcf31 commit 0cd250e

File tree

12 files changed (+618, -175 lines)

rdagent/app/data_science/conf.py

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,10 @@ class DataScienceBasePropSetting(KaggleBasePropSetting):
     hypothesis_gen: str = "rdagent.scenarios.data_science.proposal.exp_gen.proposal.DSProposalV2ExpGen"
     """Hypothesis generation class"""
 
+    summarizer: str = "rdagent.scenarios.data_science.dev.feedback.DSExperiment2Feedback"
+    summarizer_init_kwargs: dict = {
+        "version": "exp_feedback",
+    }
     ## Workflow Related
     consecutive_errors: int = 5
 
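The two settings added above travel as a pair: a dotted class path plus the keyword arguments used to construct that class later in the loop. A minimal sketch of the pattern, using a hypothetical dataclass stand-in rather than the real `DataScienceBasePropSetting`:

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the settings class; only the pairing of a
# dotted class path with init kwargs matters here, not the real base class.
@dataclass
class PropSettingSketch:
    summarizer: str = (
        "rdagent.scenarios.data_science.dev.feedback.DSExperiment2Feedback"
    )
    summarizer_init_kwargs: dict = field(
        default_factory=lambda: {"version": "exp_feedback"}
    )


settings = PropSettingSketch()
# Overriding the version switches which prompt block the summarizer renders.
settings.summarizer_init_kwargs["version"] = "exp_feedback_draft"
print(settings.summarizer_init_kwargs["version"])
```

Keeping the kwargs in a plain dict means new summarizer variants can be configured without touching the loop code.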

rdagent/scenarios/data_science/dev/coder.py

Whitespace-only changes.

rdagent/scenarios/data_science/dev/feedback.py

Lines changed: 8 additions & 2 deletions
@@ -9,6 +9,7 @@
     ExperimentFeedback,
     HypothesisFeedback,
 )
+from rdagent.core.scenario import Scenario
 from rdagent.log.utils import dict_get_with_warning
 from rdagent.oai.llm_utils import APIBackend
 from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
@@ -20,6 +21,10 @@
 
 
 class DSExperiment2Feedback(Experiment2Feedback):
+    def __init__(self, scen: Scenario, version: str = "exp_feedback") -> None:
+        super().__init__(scen)
+        self.version = version
+
     def generate_feedback(self, exp: DSExperiment, trace: DSTrace) -> ExperimentFeedback:
         # Which information is used to generate the feedback:
         # 1. pending_tasks_list[0][0]: the description of the task
@@ -63,10 +68,11 @@ def generate_feedback(self, exp: DSExperiment, trace: DSTrace) -> ExperimentFeed
         )
 
         eda_output = exp.experiment_workspace.file_dict.get("EDA.md", None)
-        system_prompt = T(".prompts:exp_feedback.system").r(
+
+        system_prompt = T(f".prompts:{self.version}.system").r(
             scenario=self.scen.get_scenario_all_desc(eda_output=eda_output)
         )
-        user_prompt = T(".prompts:exp_feedback.user").r(
+        user_prompt = T(f".prompts:{self.version}.user").r(
             sota_desc=sota_desc,
             cur_exp=exp,
             diff_edition=diff_edition,
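The `version` attribute selects which prompt block is rendered. A hedged sketch of the version-keyed lookup that `T(f".prompts:{self.version}.system")` performs; `PROMPTS` and `render` below are illustrative stand-ins, not rdagent's actual template loader:

```python
# Hypothetical in-memory stand-in for the prompts.yaml sections.
PROMPTS = {
    "exp_feedback": {
        "system": "You are an advanced assistant analyzing results in {scenario}.",
    },
    "exp_feedback_draft": {
        "system": "You are analyzing a draft-stage experiment in {scenario}.",
    },
}


def render(version: str, part: str, **kwargs) -> str:
    # Same shape as T(f".prompts:{version}.{part}").r(**kwargs):
    # pick the version's block, then fill in the template variables.
    return PROMPTS[version][part].format(**kwargs)


print(render("exp_feedback", "system", scenario="a Kaggle competition"))
```

Because the key is a plain string, adding a new feedback flavor only requires a new top-level section in prompts.yaml and a matching `version` kwarg.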

rdagent/scenarios/data_science/dev/prompts.yaml

Lines changed: 36 additions & 13 deletions
@@ -80,7 +80,7 @@ exp_feedback:
 
   user: |-
     We are currently in the process of validating hypotheses to iteratively improve our models for Kaggle competitions. Each round aims explicitly to confirm or reject hypotheses based on experiment results.
-    
+
     ## SOTA Solution
     {{ sota_desc }}
 
@@ -126,21 +126,22 @@ exp_feedback:
     {{ feedback_desc or "There has not been any experiments yet." }}
     Please refer to these hypotheses and feedback to help you recommend a new experiment and hypothesis
 
+
     Tips:
     - Step 1: If submission format has issues, prioritize fixing them before proceeding. If the format is correct and it's the first valid submission ever (there has never been a valid submission in the past), set `"Replace Best Result": "yes"`. If the format is correct and this is not the first valid submission, proceed to Step 2.
     - Step 2: If evaluation alignment issues are identified (validation approach does not follow competition requirements), address these methodological discrepancies immediately.
     - Step 3: If new results are significantly worse than SOTA, or repeated hyperparameter adjustments yield no improvement, it might be time to rethink or shift focus.
 
-exp_feedback_v3:
+exp_feedback_draft:
   system: |-
     You are an advanced assistant analyzing results in data-driven R&D.
 
     Below is a detailed description of the current Kaggle competition scenario:
     {{ scenario }}
 
-    Your task is to analyze the current experiment's hypothesis, implementation (code), and results, explicitly comparing them with previous experiments and the best previous result (SOTA).
+    Your task is to analyze the current experiment's hypothesis, implementation (code and its changes), and results, explicitly comparing them with the previous best SOTA result step by step.
 
-    Step-by-step Analysis Process:
+    # Step-by-step Analysis Process:
 
     Step 1: Verify Submission Format
     - If the submission format check fails:
@@ -159,9 +160,11 @@ exp_feedback_v3:
       - Consistent prediction methodologies between validation and test datasets.
       - No shortcuts or fold-specific strategies applied inconsistently.
       - Rigorous checks for corner-case consistency.
+      - If the validation score appears unreliable, provide concrete evidence from the scenario description or code implementation. Do not rely on assumptions without direct supporting evidence.
     - Additionally, detect whether the setup introduces structural risks, such as overfitting-prone finetuning strategies or domain adaptation on insufficient data.
+    - If overfitting is detected, provide a detailed analysis explaining how and why it occurs, referencing the scenario description, code implementation, and validation scores to support your findings.
     - If such discrepancies or risks are found:
-      - Clearly document these issues in `Reasoning`.
+      - Clearly document these issues in `Reasoning`, referencing both the scenario description and the code implementation, not just validation scores.
       - Set `"Evaluation Aligned With Task": "no"` and `"Replace Best Result": "no"`.
       - Begin your `reasoning` with `[Evaluation error]`, explicitly stating the evaluation alignment issues causing experiment failure.
     - If evaluation alignment passes, set `"Evaluation Aligned With Task": "yes"`, and then proceed to Step 3.
@@ -177,6 +180,7 @@ exp_feedback_v3:
     - NOTES:
       - The experiments focus on the comparison of the final ensemble results (don't reject the results because they are still not perfect).
      - If the `ensemble` score does not exceed the best individual model or single fold, it is still acceptable unless the gap is significant.
+
     Step 4: Analyze Code With Similar Validation Results
     - If the current `ensemble` validation score is similar to the SOTA `ensemble` validation score, give the decision based on the comparison between the current experiment and SOTA.
    - The current code should replace the best result if the code is:
@@ -185,23 +189,39 @@ exp_feedback_v3:
       - Interpretable and domain-aligned. The code should be tied to solid domain knowledge and be interpretable.
      - More resource-efficient. The code should be more efficient in terms of time and space complexity.
    - Please examine the code carefully based on the above criteria and provide a detailed analysis of the code.
-    - Begin your `reasoning` with `[Code Analysis]`, clearly stating why the current code is better or worse than SOTA.
+    - Begin your `reasoning` with `[Code Analysis]`, clearly stating why the current code is better or worse than SOTA, based on the analysis of the code implementation.
    - If the current code is not better than SOTA, set `"Replace Best Result": "no"`. Otherwise, set `"Replace Best Result": "yes"`.
-
-    Provide detailed and constructive feedback structured as follows:
-    Example JSON Structure for Result Analysis:
+
+    Step 5: EDA improvement analysis (if needed)
+    - The user might provide a Data Overview in EDA format, which is the output of the EDA code. You should analyze the EDA result and provide feedback on how it can be improved.
+    - The improvement might include additions, modifications, or deletions to parts of the EDA code.
+    - You should provide your feedback based on the current code and the SOTA code, focusing especially on the feature engineering part.
+    - For example, if the code truncates lines at N words, you can suggest printing the mean, median, or quantiles of the line length for a better understanding of the data in the next rounds of experiments.
+
+    Provide detailed and constructive feedback structured as follows, without anything else:
     {
       "Submission Format Check": "yes or no",
      "First Valid Submission": "yes or no",
-      "Observations": "Clearly summarize current and SOTA ensemble results with exact scores and notable patterns. Limit to no more than three concise, data-focused sentences.",
+      "Code Change Summary": "Clearly summarize the changes made to the code (please cover the most important changes while being concise); during development, extra modifications may be made beyond the intent of the hypothesis, so these changes should also be included to provide complete information",
+      "Observations": "Clearly summarize current and SOTA ensemble results with exact scores and notable patterns. Limit to no more than three concise, data-focused sentences. Your observation must be grounded in explicit evidence from the scenario description or code implementation, not just validation scores.",
      "Feedback for Hypothesis": "Explicitly confirm or refute the hypothesis based on specific data points or performance trends. Limit to two sentences.",
      "Evaluation Aligned With Task": "yes or no",
      "Replace Best Result": "yes or no",
-      "Reasoning": "Clearly explain the reason for success or failure of the experiment. Begin explicitly with [Submission format error], [Evaluation error], [Experiment Analysis] or [Code Analysis] depending on the step at which issues arose. Reference specific scores and methodological differences with SOTA. Limit to three sentences."
+      "Refine Decision": "yes or no",
+      "Reasoning": "Clearly explain the reason for success or failure of the experiment. Begin explicitly with [Submission format error], [Evaluation error], [Experiment Analysis] or [Code Analysis] depending on the step at which issues arose. Reference specific scores and methodological differences with SOTA. Limit to three sentences.",
+      "EDA Improvement": "improvement suggestion for the EDA code, if needed; otherwise set to 'no'. If there is no EDA code, set to 'no'."
    }
 
  user: |-
    We are currently in the process of validating hypotheses to iteratively improve our models for Kaggle competitions. Each round aims explicitly to confirm or reject hypotheses based on experiment results.
+    **We prioritize minimal, incremental code changes that lead to measurable improvements.**
+    - Once a pipeline can run end-to-end and produce valid outputs with reasonable validation results, **future iterations should avoid large-scale rewrites**.
+    - Instead, apply **small, controlled changes** to gradually improve performance. Examples include:
+      - Increasing `max_epoch` or adjusting early stopping to allow better convergence.
+      - Slightly modifying model architecture (e.g., unfreezing layers, switching backbone).
+      - Tuning hyperparameters like learning rate, batch size, or dropout.
+      - Introducing one new augmentation or feature at a time.
+    - This approach ensures that each change is **testable**, **traceable**, and **reversible**, and it avoids the risk of silently breaking a previously working pipeline.
 
    ## SOTA Solution
    {{ sota_desc }}
@@ -227,8 +247,9 @@ exp_feedback_v3:
    1. Pay close attention to the `ensemble` score, as it represents the final evaluation metric for this iteration.
    2. If any individual model significantly outperforms the ensemble, this may indicate an issue in the ensemble method. But if the final `ensemble` score surpasses the current SOTA, you should update the SOTA record. However, if there seem to be noticeable issues in the ensemble component, be sure to highlight them explicitly.
 
-    Below are the results for this experiment:
-    {{ cur_exp.result }}
+    Below are the results and running time for this experiment:
+    Running time: {{ cur_exp.running_info.running_time }} seconds.
+    Results: {{ cur_exp.result }}
 
    {% if cur_vs_sota_score is not none %}
    Below is the comparison of the current `ensemble` performance with the SOTA results:
@@ -247,7 +268,9 @@ exp_feedback_v3:
    {{ feedback_desc or "There has not been any experiments yet." }}
    Please refer to these hypotheses and feedback to help you recommend a new experiment and hypothesis
 
+
    Tips:
    - Step 1: If submission format has issues, prioritize fixing them before proceeding. If the format is correct and it's the first valid submission ever (there has never been a valid submission in the past), set `"Replace Best Result": "yes"`. If the format is correct and this is not the first valid submission, proceed to Step 2.
    - Step 2: If evaluation alignment issues are identified (validation approach does not follow competition requirements), address these methodological discrepancies immediately.
    - Step 3: If new results are significantly worse than SOTA, or repeated hyperparameter adjustments yield no improvement, it might be time to rethink or shift focus.
+    - Step 4: If the result is only slightly better than the SOTA, but the code modifications are extensive (e.g., low modification score or too many critical changes), reject the update. Prefer small-step improvements with minimal changes. Set `"Replace Best Result": "no"` and explain in `"Reasoning"` starting with `[Code Change Too Large]`.
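The `exp_feedback_draft` prompt asks the model to return a fixed JSON structure. A small, illustrative validator (not part of rdagent) for that structure; the key names are taken from the prompt, while `parse_feedback` itself is a hypothetical helper:

```python
import json

# Key names as listed in the exp_feedback_draft JSON structure above.
REQUIRED_KEYS = {
    "Submission Format Check",
    "First Valid Submission",
    "Code Change Summary",
    "Observations",
    "Feedback for Hypothesis",
    "Evaluation Aligned With Task",
    "Replace Best Result",
    "Refine Decision",
    "Reasoning",
    "EDA Improvement",
}


def parse_feedback(raw: str) -> dict:
    """Parse the model's reply and fail loudly if any expected key is missing."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"feedback missing keys: {sorted(missing)}")
    return data


sample = json.dumps({key: "no" for key in REQUIRED_KEYS})
print(parse_feedback(sample)["Replace Best Result"])
```

Validating up front turns a silently dropped key into an immediate, debuggable error instead of a downstream `KeyError`.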

rdagent/scenarios/data_science/loop.py

Lines changed: 4 additions & 4 deletions
@@ -30,7 +30,7 @@
 from rdagent.scenarios.data_science.dev.feedback import DSExperiment2Feedback
 from rdagent.scenarios.data_science.dev.runner import DSCoSTEERRunner
 from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
-from rdagent.scenarios.data_science.proposal.exp_gen import DSExpGen, DSTrace
+from rdagent.scenarios.data_science.proposal.exp_gen import DSTrace
 from rdagent.scenarios.data_science.proposal.exp_gen.idea_pool import DSKnowledgeBase
 from rdagent.scenarios.data_science.proposal.exp_gen.proposal import DSProposalV2ExpGen
 from rdagent.utils.workflow.misc import wait_retry
@@ -112,8 +112,6 @@ def __init__(self, PROP_SETTING: BasePropSetting):
         self.runner = DSCoSTEERRunner(scen)
         if DS_RD_SETTING.enable_doc_dev:
             self.docdev = DocDev(scen)
-        # self.summarizer: Experiment2Feedback = import_class(PROP_SETTING.summarizer)(scen)
-        # logger.log_object(self.summarizer, tag="summarizer")
 
         if DS_RD_SETTING.enable_knowledge_base and DS_RD_SETTING.knowledge_base_version == "v1":
             knowledge_base = DSKnowledgeBase(
@@ -122,7 +120,9 @@ def __init__(self, PROP_SETTING: BasePropSetting):
             self.trace = DSTrace(scen=scen, knowledge_base=knowledge_base)
         else:
             self.trace = DSTrace(scen=scen)
-        self.summarizer = DSExperiment2Feedback(scen)
+
+        self.summarizer = import_class(PROP_SETTING.summarizer)(scen=scen, **PROP_SETTING.summarizer_init_kwargs)
+
         super(RDLoop, self).__init__()
 
     async def direct_exp_gen(self, prev_out: dict[str, Any]):
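The loop now instantiates the summarizer from the configured dotted path instead of hard-coding `DSExperiment2Feedback`. A sketch of a dotted-path class loader in the spirit of rdagent's `import_class` (the real helper may differ in details), demonstrated on a stdlib class:

```python
import importlib


def load_class(path: str):
    """Resolve 'pkg.module.ClassName' to the class object."""
    module_path, _, cls_name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), cls_name)


# Mirrors import_class(PROP_SETTING.summarizer)(scen=scen, **kwargs),
# using collections.OrderedDict as a stand-in target.
cls = load_class("collections.OrderedDict")
obj = cls(a=1)
print(cls.__name__, obj["a"])
```

Because both the class path and the kwargs come from settings, swapping feedback implementations is a configuration change rather than a code change.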
Lines changed: 2 additions & 37 deletions
@@ -1,38 +1,3 @@
-from rdagent.app.data_science.conf import DS_RD_SETTING
-from rdagent.core.proposal import ExpGen
-from rdagent.core.scenario import Scenario
-from rdagent.log import rdagent_logger as logger
-from rdagent.oai.llm_utils import APIBackend, md5_hash
-from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
-from rdagent.scenarios.data_science.proposal.exp_gen.base import DSHypothesis, DSTrace
-from rdagent.scenarios.data_science.proposal.exp_gen.draft import DSDraftExpGen
-from rdagent.scenarios.data_science.proposal.exp_gen.proposal import (
-    DSProposalV1ExpGen,
-    DSProposalV2ExpGen,
-)
-from rdagent.scenarios.data_science.scen import DataScienceScen
-from rdagent.utils.agent.tpl import T
+from rdagent.scenarios.data_science.proposal.exp_gen.base import DSTrace
 
-
-class DSExpGen(ExpGen):
-    """
-    Data Science Task Generator.
-    This is a experiment router generator;
-    """
-
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-    def gen(self, trace: DSTrace) -> DSExperiment:
-        # sota_exp = trace.sota_experiment()
-
-        # # Draft
-        # # TODO: draft here
-        # if sota_exp is None:
-        #     pass
-
-        # Propose
-        if DS_RD_SETTING.proposal_version == "v1":
-            return DSProposalV1ExpGen(scen=self.scen).gen(trace=trace)
-        if DS_RD_SETTING.proposal_version == "v2":
-            return DSProposalV2ExpGen(scen=self.scen).gen(trace=trace)
+__all__ = ["DSTrace"]
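The trimmed module keeps old import paths working through a re-export: the package `__init__` imports one symbol and lists it in `__all__`. A hedged, self-contained sketch of that pattern using a synthetic module (`exp_gen_sketch` is a made-up name, and `OrderedDict` stands in for `DSTrace`):

```python
import sys
import types

# Build a fake package module whose __init__ body re-exports one symbol,
# the same shape as the trimmed exp_gen/__init__.py.
pkg = types.ModuleType("exp_gen_sketch")
exec(
    "from collections import OrderedDict as DSTrace\n"
    "__all__ = ['DSTrace']",
    pkg.__dict__,
)
sys.modules["exp_gen_sketch"] = pkg

# Callers keep importing from the package, even though the class lives elsewhere.
from exp_gen_sketch import DSTrace

print(pkg.__all__)
```

This is why `loop.py` can still do `from ...proposal.exp_gen import DSTrace` after the router class was removed.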
