* feat: add iterative evolve and evaluation support with partial chain stop
* feat: add FTDataEvaluator and support multiple implement functions in finetune

…1303)

* feat: (1) support for multi-layer dataset extraction (2) add category.json for dataset in datasets/
* fix: fix bug for generating category.json
* feat: add get_dataset_folder_desc
* init data proposal and merge qzli/ft
* update data proposal prompts, add max_position_embeddings, and resolve conflicts
* remove sample counts in data proposal
* turn data and train into unified hypo_gen
* refine prompts
* remove category.json and add it to dataset_info
* fix jinja problem; proposal done
* lint
* add AI-generated description and raw README into dataset_info.json
* update prompt for description
* add datasets
* initial fix for proposal of data
* final version for data proposal
* lint
* refactor(dataset): add stats into dataset_info.json, and remove dataset from gitignore_folder
* feat: enable data coder and run data process
* feat: implement finetune data coding, evaluation, and config improvements
* fix: deepspeed config path
* fix: dataset info columns

---------

Co-authored-by: Young <afe.young@gmail.com>
…soning token limits
Pull request overview
This PR introduces a comprehensive LLM fine-tuning system to RDAgent, adding support for automated fine-tuning experiments with benchmarking, dataset management, and evaluation pipelines. The changes span roughly 15,000 lines across multiple modules, including core framework modifications, scenario implementation, benchmarking infrastructure, and UI components.
Changes:
- Added LLM fine-tuning scenario with training pipeline, benchmark evaluation, and dataset management
- Modified core framework to support iterative evaluation and evolving strategies
- Added Docker environments for training (LLaMA-Factory) and benchmarking (OpenCompass)
- Implemented UI for monitoring fine-tuning jobs and experiments
- Extended configuration system with fine-tuning specific settings
Reviewed changes
Copilot reviewed 115 out of 121 changed files in this pull request and generated 56 comments.
| File | Description |
|---|---|
| rdagent/core/experiment.py | Changed stdout handling from truncated to full output |
| rdagent/core/evaluation.py | Made evaluate() method optional instead of abstract |
| rdagent/core/evolving_framework.py | Added iterative evaluation support |
| rdagent/core/evolving_agent.py | Implemented RAGEvaluator with evaluate_iter |
| rdagent/core/proposal.py | Added SOTA tracking and DAG parent synchronization |
| rdagent/core/exception.py | Added CodeBlockParseError for extraction failures |
| rdagent/utils/workflow/loop.py | Added skip_loop_error_stepname for error recovery |
| rdagent/components/coder/CoSTEER/* | Extended with iterative evolving and evaluation |
| rdagent/scenarios/finetune/* | Complete fine-tuning scenario implementation |
| rdagent/oai/backend/* | Enhanced code block parsing and token counting |
| test/* | Added test files for fine-tuning components |
Comments suppressed due to low confidence (1)
rdagent/scenarios/data_science/dev/runner/eval.py:91
- This assignment to 'stdout' is unnecessary as it is redefined before this value is used.
| """ | ||
| result = self.run(env, entry) | ||
| return result.get_truncated_stdout() # NOTE: truncating just for aligning with the old code. | ||
| return result.stdout # NOTE: truncating just for aligning with the old code. |
The get_truncated_stdout() method calls have been replaced with direct stdout access. Verify that the stdout attribute contains the full output or is appropriately handled in all calling code, as this changes the behavior from truncated to full output which may cause issues with very large outputs.
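If some call sites do need protection against very large outputs, one option is a small capping helper along these lines. This is a hypothetical sketch, not code from the PR; `truncate_stdout` and `MAX_STDOUT_CHARS` are illustrative names.

```python
# Illustrative helper (not from the PR): cap stdout at call sites that
# cannot handle arbitrarily large output, keeping both head and tail.
MAX_STDOUT_CHARS = 50_000

def truncate_stdout(stdout: str, limit: int = MAX_STDOUT_CHARS) -> str:
    """Return stdout unchanged when short; otherwise keep head and tail
    around an explicit truncation marker."""
    if len(stdout) <= limit:
        return stdout
    half = limit // 2
    return stdout[:half] + "\n... [truncated] ...\n" + stdout[-half:]
```

Callers that can handle full output would use `result.stdout` directly; only the fragile ones would wrap it.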
```python
def assign_code_list_to_evo(self, code_list: list[dict | None], evo) -> None:
    """Assign code modifications to evolving item.

    For runner, coder already generated full training config, so typically no modifications.
    But this method is required by the abstract base class.
    """
    for index in range(len(evo.sub_tasks)):
        if code_list[index] is None:
            continue
        if evo.sub_workspace_list[index] is None:
            evo.sub_workspace_list[index] = evo.experiment_workspace

        # If there are any modifications (usually empty for runner)
        if code_list[index]:
            # Handle change summary if present
            if self.KEY_CHANGE_SUMMARY in code_list[index]:
                evo.sub_workspace_list[index].change_summary = code_list[index].pop(self.KEY_CHANGE_SUMMARY)
            # Inject any modified files
            evo.sub_workspace_list[index].inject_files(**code_list[index])

    return evo
```
The duplicate method definition assign_code_list_to_evo at line 86 will override the abstract method at line 75. This appears to be a concrete implementation that should either replace the abstract method or be renamed. The duplicate definition will cause the abstract method to be shadowed.
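The shadowing behavior described here is plain Python semantics and can be demonstrated in isolation (minimal illustration, not RDAgent code): in a class body, a second definition with the same name silently replaces the first, with no error or warning.

```python
# Minimal demonstration of method shadowing in a class body.
class Coder:
    def assign_code_list_to_evo(self, code_list, evo):
        raise NotImplementedError  # "abstract" placeholder, never reachable

    def assign_code_list_to_evo(self, code_list, evo):  # shadows the one above
        return evo

sentinel = object()
# The second definition wins; NotImplementedError is never raised.
assert Coder().assign_code_list_to_evo([], sentinel) is sentinel
```

Renaming one of the two definitions (or deleting the stale one) removes the ambiguity.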
```python
if self.skip_loop_error_stepname:
    next_step_idx = self.steps.index(self.skip_loop_error_stepname)
    if next_step_idx <= si:
        raise RuntimeError(
            f"Cannot skip backwards or to same step. Current: {si} ({name}), Target: {next_step_idx} ({self.skip_loop_error_stepname})"
        ) from e
```
In the skip_loop_error handling, when skip_loop_error_stepname is provided, the code raises a RuntimeError if the target step is before or at the current step. However, this exception is raised using from e, which chains it with the original skip_loop_error exception. This might lead to confusion about which exception caused the failure. Consider whether this is the intended behavior or if a separate exception type would be clearer.
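The chaining the comment refers to can be seen in isolation. With `raise ... from e`, the new exception's `__cause__` is set to the original one, so tracebacks show both errors ("The above exception was the direct cause of ..."). This is illustrative code, not from the PR.

```python
# Demonstration of explicit exception chaining via `raise ... from e`.
def skip(target_idx: int, current_idx: int) -> None:
    try:
        raise ValueError("original skip_loop_error")
    except ValueError as e:
        if target_idx <= current_idx:
            # Chains the RuntimeError to the original ValueError.
            raise RuntimeError("Cannot skip backwards or to same step") from e

try:
    skip(target_idx=1, current_idx=2)
except RuntimeError as err:
    # The original exception is preserved as the cause.
    assert isinstance(err.__cause__, ValueError)
```

Whether chaining helps or confuses here depends on whether readers of the traceback need to see the original skip_loop_error; `raise ... from None` would suppress it.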
| # for path in Path(local_path).rglob("*"): | ||
| # p = str(path.relative_to(Path(local_path))) | ||
| # if p.startswith("__pycache__"): | ||
| # continue | ||
| # data_key.append(p) |
This comment appears to contain commented-out code.
| # if entry.name.lower() in {"readme.md", "readme.txt"}: | ||
| # results.append(entry) |
This comment appears to contain commented-out code.
```python
        dict: Merged configuration (model-specific overrides default)

    Uses exact match first, then longest prefix match, finally default only.
    """
    config_data = yaml.safe_load(open(Path(__file__).parent / "configs" / "models.yaml", "r"))
```
File is opened but is not closed.
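A sketch of the usual fix: a `with` block guarantees the file handle is closed even if `yaml.safe_load` raises. `load_model_config` is an illustrative name, not the function in the PR.

```python
from pathlib import Path

import yaml  # PyYAML


def load_model_config(config_dir: Path) -> dict:
    """Load models.yaml, closing the file handle deterministically."""
    with open(config_dir / "models.yaml", "r") as f:
        return yaml.safe_load(f)
```

`(config_dir / "models.yaml").read_text()` followed by `yaml.safe_load` would work equally well.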
```
vllm>=0.12.0

# OpenCompass benchmark framework (custom fork with cascade eval support)
opencompass @ git+https://github.com/Jensen246/opencompass.git
```
The conda requirements file installs opencompass directly from a mutable GitHub URL (opencompass @ git+https://github.com/Jensen246/opencompass.git), so each environment build can pull and execute different code over time from that external repo. If the upstream repository or its default branch is compromised, attackers can introduce malicious code into your evaluation environment and potentially exfiltrate credentials (e.g., HF tokens) used there. Prefer pinning this VCS dependency to a specific commit SHA or signed release artifact, or mirroring it to a controlled internal registry.
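A pinned form of that requirement line might look like the following; `<commit-sha>` is a placeholder for a vetted revision, not a real reference.

```
# Pin the fork to an immutable commit so environment builds are reproducible
opencompass @ git+https://github.com/Jensen246/opencompass.git@<commit-sha>
```

pip's VCS requirement syntax accepts a commit SHA, tag, or branch after `@`; only a full commit SHA is immutable.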
```diff
@@ -0,0 +1,20 @@
+FROM hiyouga/llamafactory:0.9.4
```
The Dockerfile bases the fine-tuning environment on the third-party image hiyouga/llamafactory:0.9.4 referenced only by a mutable tag, which is a single point of supply-chain trust for all subsequent workloads that may handle API keys or training data. If that image tag is ever replaced or compromised in the upstream registry, builds will silently pull a tampered image and execute attacker-controlled code. To mitigate this, pin the base image to a specific immutable digest (and/or mirror it to a trusted internal registry) so builds are reproducible and resilient to upstream tag hijacking.
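A digest-pinned variant of that `FROM` line is sketched below; `<digest>` is a placeholder to be replaced with the actual image digest (e.g. from `docker inspect`).

```dockerfile
# Pin the base image to an immutable content digest; the tag stays for readability
FROM hiyouga/llamafactory:0.9.4@sha256:<digest>
```

With a digest present, Docker ignores the mutable tag and only pulls the exact pinned image content.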
| BLOB_URL="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/${REMOTE_PATH}?${TOKEN}" | ||
| echo "Full Blob URL:" | ||
| echo "$BLOB_URL" |
gen_token.sh prints the full Azure Blob SAS URL including the TOKEN to stdout (echo "$BLOB_URL"), which can leak write-enabled storage credentials into shell history or centralized logs. Anyone with access to these logs could reuse the SAS URL to read, write or delete blob data for the configured container/path. To reduce exposure, avoid echoing the full SAS token/URL (or gate it behind an explicit debug flag) and ensure tokens are only written to controlled files or displayed interactively when absolutely necessary.
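One way to implement the suggested gating is sketched below. `DEBUG_SAS` and the placeholder variable values are illustrative, not taken from gen_token.sh.

```shell
# Sketch: only print the full SAS URL behind an explicit debug flag.
# ACCOUNT/CONTAINER/REMOTE_PATH/TOKEN values are placeholders.
ACCOUNT="myaccount"; CONTAINER="mycontainer"; REMOTE_PATH="path"; TOKEN="secret-sas-token"
BLOB_URL="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/${REMOTE_PATH}?${TOKEN}"

if [ "${DEBUG_SAS:-0}" = "1" ]; then
    echo "Full Blob URL: $BLOB_URL"
else
    # Default path never prints the token; it stays in the variable/file only.
    echo "Blob URL generated for ${ACCOUNT}/${CONTAINER} (set DEBUG_SAS=1 to print it)"
fi
```

The token remains available to downstream code via `$BLOB_URL`; only the logging path is gated.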
```dockerfile
RUN git clone https://github.com/Jensen246/opencompass.git /opencompass
WORKDIR /opencompass

RUN pip install ".[vllm]" --no-cache-dir
```
The Dockerfile clones https://github.com/Jensen246/opencompass.git at the default branch and immediately runs pip install ".[vllm]", meaning image builds always execute mutable third-party code fetched directly from GitHub. If that repository or its default branch is compromised, an attacker can inject arbitrary code into your build and runtime environment with access to any secrets mounted into the container. To harden the supply chain, pin the dependency to an immutable reference (tagged release or commit SHA) and, if possible, vendor or mirror the code under tighter control.
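A pinned variant of those Dockerfile instructions is sketched below; `<commit-sha>` is a placeholder for a vetted revision of the fork.

```dockerfile
# Clone, then check out an immutable commit before installing
RUN git clone https://github.com/Jensen246/opencompass.git /opencompass \
    && git -C /opencompass checkout <commit-sha>
WORKDIR /opencompass

RUN pip install ".[vllm]" --no-cache-dir
```

This guarantees every image build installs the same reviewed code regardless of what lands on the fork's default branch later.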
Force-pushed from cbb6281 to f986ad5.
When a skip_loop_error exception occurs and skip_loop_error_stepname is not explicitly set, default to jumping to the 'feedback' step if it exists; otherwise fall back to the last step (record). This prevents a KeyError when the record step tries to access feedback data that doesn't exist because the feedback phase was skipped. Also removed the now-redundant skip_loop_error_stepname from the finetune loop, since this is the default behavior.
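The described fallback can be sketched as a small helper (illustrative only; the actual loop implementation is not shown in this thread):

```python
# Illustrative sketch of the default skip target: prefer "feedback" if the
# loop defines it, otherwise fall back to the last step (e.g. "record").
def default_skip_target(steps: list[str]) -> str:
    if "feedback" in steps:
        return "feedback"
    return steps[-1]

assert default_skip_target(["propose", "coding", "feedback", "record"]) == "feedback"
assert default_skip_target(["propose", "coding", "record"]) == "record"
```

Jumping to "feedback" (rather than straight to "record") keeps the feedback data populated for the record step, which is exactly the KeyError the change prevents.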
how about merging this PR? @Jensen246

Only the data science scenario has a bug left to fix.
Description
Motivation and Context
How Has This Been Tested?
Screenshots of Test Results (if appropriate):
Types of changes
📚 Documentation preview 📚: https://RDAgent--1314.org.readthedocs.build/en/1314/