* feat: add iterative evolve and evaluation support with partial chain stop
* feat: add FTDataEvaluator and support multiple implement functions in finetune

…1303)

* feat: (1) support for multi-layer dataset extraction (2) add category.json for dataset in datasets/
* fix: fix bug for generating category.json
* feat: add get_dataset_folder_desc
* init data proposal and merge qzli/ft
* update data proposal prompts, add max_position_embeddings, and resolve conflicts
* remove sample counts in data proposal
* turn data and train into unified hypo_gen
* refine prompts
* remove category.json and add it to dataset_info
* fix jinja problem; proposal done
* lint
* add AI-generated description and raw README into dataset_info.json
* update prompt for description
* add datasets
* initial fix for proposal of data
* final version for data proposal
* lint
* refactor(dataset): add stats into dataset_info.json, and remove dataset from gitignore_folder
* feat: enable data coder and run data process
* feat: implement finetune data coding, evaluation, and config improvements
* fix: deepspeed config path
* fix: dataset info columns

---------

Co-authored-by: Young <afe.young@gmail.com>
…soning token limits
Pull request overview
This PR introduces a comprehensive LLM fine-tuning system to RDAgent, adding support for automated fine-tuning experiments with benchmarking, dataset management, and evaluation pipelines. The changes span roughly 15,000 lines across multiple modules, including core framework modifications, scenario implementation, benchmarking infrastructure, and UI components.
Changes:
- Added LLM fine-tuning scenario with training pipeline, benchmark evaluation, and dataset management
- Modified core framework to support iterative evaluation and evolving strategies
- Added Docker environments for training (LLaMA-Factory) and benchmarking (OpenCompass)
- Implemented UI for monitoring fine-tuning jobs and experiments
- Extended configuration system with fine-tuning specific settings
Reviewed changes
Copilot reviewed 115 out of 121 changed files in this pull request and generated 56 comments.
| File | Description |
|---|---|
| rdagent/core/experiment.py | Changed stdout handling from truncated to full output |
| rdagent/core/evaluation.py | Made evaluate() method optional instead of abstract |
| rdagent/core/evolving_framework.py | Added iterative evaluation support |
| rdagent/core/evolving_agent.py | Implemented RAGEvaluator with evaluate_iter |
| rdagent/core/proposal.py | Added SOTA tracking and DAG parent synchronization |
| rdagent/core/exception.py | Added CodeBlockParseError for extraction failures |
| rdagent/utils/workflow/loop.py | Added skip_loop_error_stepname for error recovery |
| rdagent/components/coder/CoSTEER/* | Extended with iterative evolving and evaluation |
| rdagent/scenarios/finetune/* | Complete fine-tuning scenario implementation |
| rdagent/oai/backend/* | Enhanced code block parsing and token counting |
| test/* | Added test files for fine-tuning components |
Comments suppressed due to low confidence (1)
rdagent/scenarios/data_science/dev/runner/eval.py:91
- This assignment to 'stdout' is unnecessary as it is redefined before this value is used.
| """ | ||
| result = self.run(env, entry) | ||
| return result.get_truncated_stdout() # NOTE: truncating just for aligning with the old code. | ||
| return result.stdout # NOTE: truncating just for aligning with the old code. |
The get_truncated_stdout() method calls have been replaced with direct stdout access. Verify that the stdout attribute contains the full output or is appropriately handled in all calling code, as this changes the behavior from truncated to full output which may cause issues with very large outputs.
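If some call sites do need protection against very large outputs, one option is a small capping helper along these lines. This is a hypothetical sketch, not code from the PR; `truncate_stdout` and `MAX_STDOUT_CHARS` are illustrative names.

```python
# Illustrative helper (not from the PR): cap stdout at call sites that
# cannot handle arbitrarily large output, keeping both head and tail.
MAX_STDOUT_CHARS = 50_000

def truncate_stdout(stdout: str, limit: int = MAX_STDOUT_CHARS) -> str:
    """Return stdout unchanged when short; otherwise keep head and tail
    around an explicit truncation marker."""
    if len(stdout) <= limit:
        return stdout
    half = limit // 2
    return stdout[:half] + "\n... [truncated] ...\n" + stdout[-half:]
```

Callers that can handle full output would use `result.stdout` directly; only the fragile ones would wrap it.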
```python
def assign_code_list_to_evo(self, code_list: list[dict | None], evo) -> None:
    """Assign code modifications to evolving item.

    For runner, coder already generated full training config, so typically no modifications.
    But this method is required by the abstract base class.
    """
    for index in range(len(evo.sub_tasks)):
        if code_list[index] is None:
            continue
        if evo.sub_workspace_list[index] is None:
            evo.sub_workspace_list[index] = evo.experiment_workspace

        # If there are any modifications (usually empty for runner)
        if code_list[index]:
            # Handle change summary if present
            if self.KEY_CHANGE_SUMMARY in code_list[index]:
                evo.sub_workspace_list[index].change_summary = code_list[index].pop(self.KEY_CHANGE_SUMMARY)
            # Inject any modified files
            evo.sub_workspace_list[index].inject_files(**code_list[index])

    return evo
```
The duplicate method definition assign_code_list_to_evo at line 86 will override the abstract method at line 75. This appears to be a concrete implementation that should either replace the abstract method or be renamed. The duplicate definition will cause the abstract method to be shadowed.
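The shadowing behavior described here is plain Python semantics and can be demonstrated in isolation (minimal illustration, not RDAgent code): in a class body, a second definition with the same name silently replaces the first, with no error or warning.

```python
# Minimal demonstration of method shadowing in a class body.
class Coder:
    def assign_code_list_to_evo(self, code_list, evo):
        raise NotImplementedError  # "abstract" placeholder, never reachable

    def assign_code_list_to_evo(self, code_list, evo):  # shadows the one above
        return evo

sentinel = object()
# The second definition wins; NotImplementedError is never raised.
assert Coder().assign_code_list_to_evo([], sentinel) is sentinel
```

Renaming one of the two definitions (or deleting the stale one) removes the ambiguity.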
```python
if self.skip_loop_error_stepname:
    next_step_idx = self.steps.index(self.skip_loop_error_stepname)
    if next_step_idx <= si:
        raise RuntimeError(
            f"Cannot skip backwards or to same step. Current: {si} ({name}), Target: {next_step_idx} ({self.skip_loop_error_stepname})"
        ) from e
```
In the skip_loop_error handling, when skip_loop_error_stepname is provided, the code raises a RuntimeError if the target step is before or at the current step. However, this exception is raised using from e, which chains it with the original skip_loop_error exception. This might lead to confusion about which exception caused the failure. Consider whether this is the intended behavior or if a separate exception type would be clearer.
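The chaining the comment refers to can be seen in isolation. With `raise ... from e`, the new exception's `__cause__` is set to the original one, so tracebacks show both errors ("The above exception was the direct cause of ..."). This is illustrative code, not from the PR.

```python
# Demonstration of explicit exception chaining via `raise ... from e`.
def skip(target_idx: int, current_idx: int) -> None:
    try:
        raise ValueError("original skip_loop_error")
    except ValueError as e:
        if target_idx <= current_idx:
            # Chains the RuntimeError to the original ValueError.
            raise RuntimeError("Cannot skip backwards or to same step") from e

try:
    skip(target_idx=1, current_idx=2)
except RuntimeError as err:
    # The original exception is preserved as the cause.
    assert isinstance(err.__cause__, ValueError)
```

Whether chaining helps or confuses here depends on whether readers of the traceback need to see the original skip_loop_error; `raise ... from None` would suppress it.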
| # for path in Path(local_path).rglob("*"): | ||
| # p = str(path.relative_to(Path(local_path))) | ||
| # if p.startswith("__pycache__"): | ||
| # continue | ||
| # data_key.append(p) |
This comment appears to contain commented-out code.
| # if entry.name.lower() in {"readme.md", "readme.txt"}: | ||
| # results.append(entry) |
This comment appears to contain commented-out code.
```python
        dict: Merged configuration (model-specific overrides default)

    Uses exact match first, then longest prefix match, finally default only.
    """
    config_data = yaml.safe_load(open(Path(__file__).parent / "configs" / "models.yaml", "r"))
```
File is opened but is not closed.
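A sketch of the usual fix: a `with` block guarantees the file handle is closed even if `yaml.safe_load` raises. `load_model_config` is an illustrative name, not the function in the PR.

```python
from pathlib import Path

import yaml  # PyYAML


def load_model_config(config_dir: Path) -> dict:
    """Load models.yaml, closing the file handle deterministically."""
    with open(config_dir / "models.yaml", "r") as f:
        return yaml.safe_load(f)
```

`(config_dir / "models.yaml").read_text()` followed by `yaml.safe_load` would work equally well.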
```
vllm>=0.12.0

# OpenCompass benchmark framework (custom fork with cascade eval support)
opencompass @ git+https://github.com/Jensen246/opencompass.git
```
The conda requirements file installs opencompass directly from a mutable GitHub URL (opencompass @ git+https://github.com/Jensen246/opencompass.git), so each environment build can pull and execute different code over time from that external repo. If the upstream repository or its default branch is compromised, attackers can introduce malicious code into your evaluation environment and potentially exfiltrate credentials (e.g., HF tokens) used there. Prefer pinning this VCS dependency to a specific commit SHA or signed release artifact, or mirroring it to a controlled internal registry.
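A pinned form of that requirement line might look like the following; `<commit-sha>` is a placeholder for a vetted revision, not a real reference.

```
# Pin the fork to an immutable commit so environment builds are reproducible
opencompass @ git+https://github.com/Jensen246/opencompass.git@<commit-sha>
```

pip's VCS requirement syntax accepts a commit SHA, tag, or branch after `@`; only a full commit SHA is immutable.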
```diff
@@ -0,0 +1,20 @@
+FROM hiyouga/llamafactory:0.9.4
```
The Dockerfile bases the fine-tuning environment on the third-party image hiyouga/llamafactory:0.9.4 referenced only by a mutable tag, which is a single point of supply-chain trust for all subsequent workloads that may handle API keys or training data. If that image tag is ever replaced or compromised in the upstream registry, builds will silently pull a tampered image and execute attacker-controlled code. To mitigate this, pin the base image to a specific immutable digest (and/or mirror it to a trusted internal registry) so builds are reproducible and resilient to upstream tag hijacking.
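A digest-pinned variant of that `FROM` line is sketched below; `<digest>` is a placeholder to be replaced with the actual image digest (e.g. from `docker inspect`).

```dockerfile
# Pin the base image to an immutable content digest; the tag stays for readability
FROM hiyouga/llamafactory:0.9.4@sha256:<digest>
```

With a digest present, Docker ignores the mutable tag and only pulls the exact pinned image content.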
| BLOB_URL="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/${REMOTE_PATH}?${TOKEN}" | ||
| echo "Full Blob URL:" | ||
| echo "$BLOB_URL" |
gen_token.sh prints the full Azure Blob SAS URL including the TOKEN to stdout (echo "$BLOB_URL"), which can leak write-enabled storage credentials into shell history or centralized logs. Anyone with access to these logs could reuse the SAS URL to read, write or delete blob data for the configured container/path. To reduce exposure, avoid echoing the full SAS token/URL (or gate it behind an explicit debug flag) and ensure tokens are only written to controlled files or displayed interactively when absolutely necessary.
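One way to implement the suggested gating is sketched below. `DEBUG_SAS` and the placeholder variable values are illustrative, not taken from gen_token.sh.

```shell
# Sketch: only print the full SAS URL behind an explicit debug flag.
# ACCOUNT/CONTAINER/REMOTE_PATH/TOKEN values are placeholders.
ACCOUNT="myaccount"; CONTAINER="mycontainer"; REMOTE_PATH="path"; TOKEN="secret-sas-token"
BLOB_URL="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/${REMOTE_PATH}?${TOKEN}"

if [ "${DEBUG_SAS:-0}" = "1" ]; then
    echo "Full Blob URL: $BLOB_URL"
else
    # Default path never prints the token; it stays in the variable/file only.
    echo "Blob URL generated for ${ACCOUNT}/${CONTAINER} (set DEBUG_SAS=1 to print it)"
fi
```

The token remains available to downstream code via `$BLOB_URL`; only the logging path is gated.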
```dockerfile
RUN git clone https://github.com/Jensen246/opencompass.git /opencompass
WORKDIR /opencompass

RUN pip install ".[vllm]" --no-cache-dir
```
The Dockerfile clones https://github.com/Jensen246/opencompass.git at the default branch and immediately runs pip install ".[vllm]", meaning image builds always execute mutable third-party code fetched directly from GitHub. If that repository or its default branch is compromised, an attacker can inject arbitrary code into your build and runtime environment with access to any secrets mounted into the container. To harden the supply chain, pin the dependency to an immutable reference (tagged release or commit SHA) and, if possible, vendor or mirror the code under tighter control.
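A pinned variant of those Dockerfile instructions is sketched below; `<commit-sha>` is a placeholder for a vetted revision of the fork.

```dockerfile
# Clone, then check out an immutable commit before installing
RUN git clone https://github.com/Jensen246/opencompass.git /opencompass \
    && git -C /opencompass checkout <commit-sha>
WORKDIR /opencompass

RUN pip install ".[vllm]" --no-cache-dir
```

This guarantees every image build installs the same reviewed code regardless of what lands on the fork's default branch later.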
Force-pushed from cbb6281 to f986ad5.
When a skip_loop_error exception occurs and skip_loop_error_stepname is not explicitly set, default to jumping to the 'feedback' step if it exists; otherwise fall back to the last step (record). This prevents a KeyError when the record step tries to access feedback data that doesn't exist because the feedback phase was skipped. Also removed the now-redundant skip_loop_error_stepname from the finetune loop, since this is the default behavior.
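The described fallback can be sketched as a small helper (illustrative only; the actual loop implementation is not shown in this thread):

```python
# Illustrative sketch of the default skip target: prefer "feedback" if the
# loop defines it, otherwise fall back to the last step (e.g. "record").
def default_skip_target(steps: list[str]) -> str:
    if "feedback" in steps:
        return "feedback"
    return steps[-1]

assert default_skip_target(["propose", "coding", "feedback", "record"]) == "feedback"
assert default_skip_target(["propose", "coding", "record"]) == "record"
```

Jumping to "feedback" (rather than straight to "record") keeps the feedback data populated for the record step, which is exactly the KeyError the change prevents.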
how about merging this PR? @Jensen246

Only the data science scenario has a bug left to fix.
Description
Motivation and Context
How Has This Been Tested?
Screenshots of Test Results (if appropriate):
Types of changes
📚 Documentation preview 📚: https://RDAgent--1314.org.readthedocs.build/en/1314/