add DPA-ADAPT toolkit for downstream property adaptation by zhaiwenxi · Pull Request #5572 · deepmodeling/deepmd-kit

zhaiwenxi · 2026-06-22T11:27:26Z

Summary

This PR adds DPA-ADAPT, a toolkit for adapting pretrained DPA models to downstream atomistic property prediction tasks.

The new package provides a scikit-learn-style Python API and standalone CLI for fine-tuning, descriptor extraction, prediction, evaluation, cross-validation, and data preparation, without requiring users to manually write DeePMD-kit training input files.
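As a rough illustration of what a "scikit-learn-style" `fit`/`predict` workflow on frozen descriptors can look like, here is a minimal sketch. The class `FrozenDescriptorRidge` and the random "descriptors" are hypothetical stand-ins and are not the dpa_adapt API; the real classes in this PR may use different names and signatures.

```python
import numpy as np


class FrozenDescriptorRidge:
    """Hypothetical estimator: ridge regression on precomputed descriptors.

    Not part of dpa_adapt; it only illustrates the fit/predict convention.
    """

    def __init__(self, alpha: float = 1e-3):
        self.alpha = alpha

    def fit(self, X: np.ndarray, y: np.ndarray) -> "FrozenDescriptorRidge":
        # Append a bias column, then solve the regularized normal equations.
        # For simplicity the bias term is regularized as well.
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        gram = Xb.T @ Xb + self.alpha * np.eye(Xb.shape[1])
        self.coef_ = np.linalg.solve(gram, Xb.T @ y)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return Xb @ self.coef_


# Stand-in for per-structure descriptors from a frozen backbone (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.5

model = FrozenDescriptorRidge(alpha=1e-6).fit(X[:150], y[:150])
pred = model.predict(X[150:])
```

The point of the convention is that the regressor head can be swapped without touching the descriptor extraction step.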

Main changes

- Add the top-level dpa_adapt Python package.
- Add standalone CLI entry points:
  - dpa-adapt
  - dpaad
- Support multiple adaptation strategies:
  - frozen_sklearn: frozen DPA descriptors with scikit-learn regressors
  - frozen_head: train a property head on top of a frozen DPA backbone
  - finetune: end-to-end DPA fine-tuning
  - mft: multi-task fine-tuning with auxiliary energy/force training
- Add data utilities for:
  - DeepMD/npy loading and validation
  - label attachment
  - descriptor caching
  - train/test split and cross-validation
  - SMILES/formula-based conversion workflows
  - optional frame parameters via fparam.npy
- Add prediction and evaluation helpers with MAE, RMSE, and R2 reporting.
- Add documentation under doc/dpa_adapt/.
- Add a runnable QM9 HOMO-LUMO gap example under examples/dpa_adapt/.
- Add dpa-adapt optional dependencies in pyproject.toml.
- Add dedicated lightweight CI for source/tests/dpa_adapt/.
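For reference, the three metrics named in the list above (MAE, RMSE, R2) can be computed as in the sketch below. The function `regression_metrics` is a hypothetical helper written for illustration, not the evaluation code added by this PR, and the NaN handling for a constant target is an assumption.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return MAE, RMSE and R2 for one-dimensional targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = float(np.sum(err**2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err**2))),
        # R2 is undefined for a constant target; report NaN in that case.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```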

Co-authored-by: zirenjin <zirenjin@umich.edu>

Summary by CodeRabbit

New Features
- Added the DPA-ADAPT toolkit with a new command-line interface for data conversion, validation, training, prediction, evaluation, and descriptor extraction.
- Introduced support for multiple adaptation workflows, including frozen-sklearn, frozen-head, fine-tuning, and multi-task training.
- Added data handling for SMILES, formulas, structures, label attachment, and condition features.
- Included a new example workflow and expanded user documentation for setup and usage.


…utput parsing
- DPAFineTuner: extract _FrozenSklearnPipeline helper; keep public API unchanged
- MFTFineTuner: defer _read_fitting_net_from_ckpt to first access
- DPATrainer._parse_test_output: single anchored regex per metric, auto-detect format

…perty metrics
- _load_labels: accept str | list[str], stack columns for multi-property
- build_sklearn_head: n_outputs param, wrap RF/Ridge with MultiOutputRegressor
- evaluate: per-property mae/rmse/r2 dict when target_key is a list
- freeze/DPAPredictor: store and load target_key as-is (str or list)
- CLI: --target-key homo,lumo parsed via _maybe_split_list
- 6 new tests covering fit, evaluate, freeze/load round-trip
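A minimal sketch of the multi-property idea in this commit: a comma-separated `--target-key` value becomes a list of keys, and metrics are reported per property. The helpers `split_target_key` and `per_property_mae` are hypothetical and only approximate what `_maybe_split_list` and `evaluate` are described as doing; the actual implementations are not shown in this PR page.

```python
from statistics import fmean


def split_target_key(value: str):
    """Return a str for a single key, or a list[str] for 'a,b,...'."""
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if not parts:
        raise ValueError("target key must not be empty")
    return parts[0] if len(parts) == 1 else parts


def per_property_mae(keys, y_true, y_pred) -> dict:
    """y_true / y_pred are lists of rows, one column per property in `keys`."""
    result = {}
    for col, key in enumerate(keys):
        result[key] = fmean(abs(t[col] - p[col]) for t, p in zip(y_true, y_pred))
    return result


keys = split_target_key("homo,lumo")
y_true = [[-7.0, 1.0], [-6.0, 2.0]]
y_pred = [[-6.5, 1.0], [-6.0, 1.0]]
mae = per_property_mae(keys, y_true, y_pred)
```

Returning a plain string for the single-key case keeps the single-property code path unchanged, which matches the "store and load target_key as-is (str or list)" note above.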

The old _load_descriptor_model, _validate_type_map, _remap_atom_types, _extract_features_cached, and _extract_features method bodies were left in place alongside the new thin delegators, causing CodeQL 'variable defined multiple times' warnings. Removed the old bodies; kept _extract_features_cached on DPAFineTuner directly so that test patches on DPAFineTuner._extract_features are honoured through the cache wrapper.

… method
- Replace try/except ImportError in _unwrap_multioutput with direct import (sklearn is always available when dpa_tools is loaded)
- Remove _FrozenSklearnPipeline.extract_features_cached (dead code; the caching wrapper lives on DPAFineTuner so test patches work)


Remove the `--fmt formula` pipeline that converted elemental composition formulas plus a template POSCAR into deepmd/npy systems via random atomic substitution on the host-element sublattice.
- delete dpa_adapt/data/formula.py and source/tests/dpa_adapt/test_formula.py
- drop the formula_to_npy exports and the fmt="formula" branch in convert()
- remove the --poscar/--base-element/--formula-col/--sets CLI flags and the formula result handler from the data convert command
- prune formula tests from test_convert.py and test_cli_smoke.py
- drop the Formula Tables docs from the README and dpa_adapt guide

The unrelated cross-validation group_by="formula" grouping is unchanged.

iProzd

Others LGTM.


njzjz-bot

Thanks for the large DPA-ADAPT contribution. I found several correctness / integration issues that should be fixed before merge, so I’m requesting changes.

DPA-ADAPT tests require scikit-learn, but the test extra does not install it.

The new tests live under source/tests/dpa_adapt, so they are included by the normal pytest source/tests paths. Some of them import DPA-ADAPT modules that import sklearn at module import time (source/tests/dpa_adapt/test_split_cv.py → dpa_adapt/cv.py). The CPU workflow currently masks this by manually adding scikit-learn, but other test paths do not: CUDA installs .[gpu,test,lmp,cu12,torch,jax] and runs python -m pytest source/tests, and the default tox env installs only test,cpu. Those will fail with ModuleNotFoundError: sklearn unless the dependency is added to the appropriate test/CI extras or the optional tests skip cleanly. Prefer installing/testing the advertised dpa-adapt extra in CI so the package metadata is validated too.
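One way to make optional tests "skip cleanly" is to guard them on whether the dependency is importable. The sketch below uses only the standard library (`importlib.util.find_spec` and `unittest.skipUnless`); the test class and its contents are hypothetical. In a pytest suite, `pytest.importorskip("sklearn")` at module level achieves the same effect.

```python
import importlib.util
import unittest

# True only when scikit-learn can be imported in this environment.
HAS_SKLEARN = importlib.util.find_spec("sklearn") is not None


@unittest.skipUnless(HAS_SKLEARN, "scikit-learn is not installed")
class TestNeedsSklearn(unittest.TestCase):
    def test_kfold_split_sizes(self):
        # Imported inside the test so importing this module never fails.
        from sklearn.model_selection import KFold

        folds = list(KFold(n_splits=2).split(list(range(4))))
        self.assertEqual(len(folds), 2)
```

Skipping only hides the missing dependency, so installing the advertised extra in CI, as the review suggests, remains the more thorough option.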

The descriptor cache key can return stale descriptors for changed systems.

dpa_adapt/data/desc_cache.py builds _system_fingerprint() from metadata plus only the first/last 64 flattened entries of coords and cells. Any change in the middle of a larger trajectory/system keeps the same cache key, so both the aggregate cache and per-system cache can reuse descriptors from a different structure. Since these descriptors directly drive fitting and prediction, the cache identity needs to cover the full relevant array contents (for example a streaming hash over all arrays, or a robust hash of all set.*/*.npy inputs) before this is safe.
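The stale-cache risk can be reproduced in a few lines. The sketch below compares a sampled fingerprint (first and last 64 flattened entries only) with a fingerprint over the full array contents. Both functions are hypothetical illustrations and omit the metadata that `_system_fingerprint()` is described as including.

```python
import hashlib

import numpy as np


def sampled_fingerprint(arr: np.ndarray, n: int = 64) -> str:
    """Hash only the first and last n flattened entries (the risky scheme)."""
    flat = np.ascontiguousarray(arr).ravel()
    h = hashlib.sha256()
    h.update(flat[:n].tobytes())
    h.update(flat[-n:].tobytes())
    return h.hexdigest()


def full_fingerprint(*arrays: np.ndarray) -> str:
    """Hash dtype, shape, and every byte of each array."""
    h = hashlib.sha256()
    for arr in arrays:
        arr = np.ascontiguousarray(arr)
        h.update(str(arr.dtype).encode())
        h.update(str(arr.shape).encode())
        h.update(arr.tobytes())
    return h.hexdigest()


coords = np.zeros((100, 30))
edited = coords.copy()
edited[50, 0] = 0.1  # change one value in the middle of the trajectory

same_sampled = sampled_fingerprint(coords) == sampled_fingerprint(edited)
same_full = full_fingerprint(coords) == full_fingerprint(edited)
```

Here `same_sampled` is true (the edit goes unnoticed) while `same_full` is false. For very large arrays the `tobytes()` copy could be replaced by feeding the hash in chunks.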

Unsupported training elements are silently ignored in type-map resolution.

In dpa_adapt/finetuner.py, _resolve_type_maps() catches every ValueError from both read_data_type_map_union() and validate_type_map_subset(). That means a real validation failure such as data containing an element not covered by the checkpoint is swallowed as if it were merely the “no atom_names” case, and training proceeds with an incompatible checkpoint type map. Please split the try block so only the missing-atom-names case is ignored, and unsupported-element validation errors propagate clearly.
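The requested split can be sketched as follows. All names here (`MissingAtomNamesError`, `read_type_map`, `validate_subset`, `resolve_type_map`) are hypothetical stand-ins for the functions named in the review, and the dedicated exception subclass is an assumption made to keep the two failure modes distinguishable.

```python
class MissingAtomNamesError(ValueError):
    """Hypothetical marker for data that carries no element names."""


def read_type_map(system: dict) -> list[str]:
    names = system.get("atom_names")
    if not names:
        raise MissingAtomNamesError("no atom_names in data")
    return list(names)


def validate_subset(data_map: list[str], ckpt_map: list[str]) -> None:
    missing = [el for el in data_map if el not in ckpt_map]
    if missing:
        raise ValueError(f"elements not covered by checkpoint: {missing}")


def resolve_type_map(system: dict, ckpt_map: list[str]) -> list[str]:
    try:
        data_map = read_type_map(system)
    except MissingAtomNamesError:
        # Only the "no names" case falls back to the checkpoint map.
        return list(ckpt_map)
    # Outside the try block: an unsupported element raises to the caller.
    validate_subset(data_map, ckpt_map)
    return data_map
```

Keeping the validation call outside the `try` means a later refactor cannot accidentally widen what gets swallowed.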

MFT rejects valid raw-index data without type_map.raw.

read_data_type_map_union() treats dpdata placeholder names such as Type_0/Type_1 as real elements. MFT then validates those against the checkpoint type map and fails, although the surrounding code comments say data without atom_names should be allowed to use raw atom indices. This is inconsistent with finetuner._read_data_type_map(), which explicitly filters all-Type_* placeholder maps. Please make the shared type-map reader ignore all-placeholder maps, or reuse the same filtering logic before subset validation.
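A placeholder check along these lines might look like the sketch below. The function name and the exact `Type_<n>` pattern are assumptions based on the names quoted in the review; treating an empty map as "not a placeholder map" is also an assumption.

```python
import re

# Matches names such as "Type_0" or "Type_12" and nothing else.
_PLACEHOLDER = re.compile(r"Type_\d+")


def is_placeholder_type_map(names) -> bool:
    """True when every entry looks like a 'Type_<n>' placeholder."""
    names = [str(n) for n in names]
    return bool(names) and all(_PLACEHOLDER.fullmatch(n) for n in names)
```

A mixed map such as `["Type_0", "H"]` is deliberately not treated as a placeholder map, so it would still go through subset validation.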

load_dataset() does not use the direct custom-label fallback used elsewhere.

dpa_adapt/data/dataset.py only checks resolved_key in system.data. Custom labels such as property.npy, homo.npy, or bandgap.npy under set.*/ are not generally loaded into dpdata.System.data; _load_labels() in finetuner.py already has a direct set.*/{key}.npy fallback for this. Since the CV CLI calls load_dataset(args.data, label_key=...), valid datasets with these custom label files can be skipped as missing. Please share/reuse the same label-discovery semantics here.
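The `set.*/{key}.npy` fallback amounts to a glob over the system directory. A minimal sketch, assuming the usual deepmd/npy layout with `set.000`, `set.001`, ... subdirectories; `find_label_files` is a hypothetical name, and the demo creates empty placeholder files rather than real arrays.

```python
import tempfile
from pathlib import Path


def find_label_files(system_dir, key: str) -> list[Path]:
    """Return sorted set.*/{key}.npy paths under one system directory."""
    return sorted(Path(system_dir).glob(f"set.*/{key}.npy"))


# Demo on a throwaway directory with two sets carrying a custom label file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("set.000", "set.001"):
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "homo.npy").touch()
    found = [p.parent.name for p in find_label_files(tmp, "homo")]
    missing = find_label_files(tmp, "bandgap")
```

Sharing one such helper between the dataset loader and the fine-tuner keeps both paths in agreement on which systems count as labeled.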

A smaller consistency issue: DPAFineTuner.__init__ should validate fparam_dim the same way DPATrainer and MFTFineTuner do. Currently the default frozen_sklearn path accepts fparam_dim=-1 and silently treats it as disabled.
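A validation of this kind is small. The sketch below uses a hypothetical `check_fparam_dim`; the choice of `TypeError` versus `ValueError` and the rejection of `bool` are assumptions, since the PR page does not show which exceptions the existing checks raise.

```python
def check_fparam_dim(fparam_dim) -> int:
    """Accept only non-negative integers; reject everything else loudly."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(fparam_dim, bool) or not isinstance(fparam_dim, int):
        raise TypeError(
            f"fparam_dim must be an int, got {type(fparam_dim).__name__}"
        )
    if fparam_dim < 0:
        raise ValueError(f"fparam_dim must be non-negative, got {fparam_dim}")
    return fparam_dim
```

With this in the constructor, `fparam_dim=-1` fails at construction time and is no longer silently treated as disabled.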

I verified the code locations against PR head c47834c83d54b8e372bc2119474f4a944a5618f7 and ran python3 -m compileall -q dpa_adapt successfully. I could not run the new pytest suite in this local checkout because pytest/numpy/dpdata/sklearn/torch are not installed in the available environment.

— OpenClaw 2026.6.8 (model: custom-chat-jinzhezeng-group/gpt-5.5)

…ading, and fparam_dim validation issues
- Split try/except in _resolve_type_maps so unsupported-element errors propagate instead of being silently swallowed as missing-atom-names
- Make read_data_type_map_union skip all-Type_* placeholder names, consistent with _read_data_type_map, so MFT does not reject valid raw-index data
- Add set.*/{key}.npy direct fallback to load_dataset for custom label files (e.g. homo.npy, bandgap.npy) not loaded into dpdata.System.data
- Replace first/last-64 sampling in _system_fingerprint with full-array hashing so descriptor cache keys correctly invalidate when structures change
- Validate fparam_dim as non-negative int in DPAFineTuner.__init__, matching DPATrainer and MFTFineTuner
- Add scikit-learn to the test extra so DPA-ADAPT tests can run in all CI paths


…ew-fix duplication

Follow-up to the review fixes in ba1f17c: the fixes were correct but copy-pasted logic across modules. Consolidate into shared helpers.
- Add dpa_adapt/_validation.py with validate_fparam_dim(); reuse in DPATrainer, MFTFineTuner and DPAFineTuner __init__ (was triplicated)
- Add _is_placeholder_type_map() in data/type_map.py; reuse in read_data_type_map_union and finetuner._read_data_type_map (also unifies the str() handling that previously differed between them)
- Add _find_label_npys() in data/loader.py for set.*/{key}.npy discovery; reuse in dataset.load_dataset and finetuner._load_labels
- Drop the redundant manual scikit-learn from test_python.yml; the test extra already provides it

No behavior change: helper outputs verified identical to the inlined logic, and the dpa_adapt suite is unchanged (314 passed, 10 skipped; the one pre-existing test-isolation failure is unrelated).

…sion

test_type_map.py and test_conditions.py injected a MagicMock as `torch` into sys.modules at import time via an unconditional `sys.modules.setdefault("torch", _mock_torch)`.

During a full pytest run all test modules are imported in the collection phase, so when one of these files was imported before the real torch, the mock won the race and stayed in sys.modules for the whole session (no teardown). A later test doing real tensor math then got the mock: `feat.detach().cpu().numpy()` returned a MagicMock and `np.concatenate([mock])` collapsed to `array([], dtype=float64)`, failing test_extract_features_detaches_grad_tensors_before_numpy.

Guard the stub behind `try: import torch / except` so it is only installed when torch is genuinely absent, matching the existing pattern in test_predictor.py. No effect when torch is missing.

Full dpa_adapt suite: 318 passed, 7 skipped, 0 failed (was 314/10/1; the fix also un-skips 3 tests that the mock was falsely masking).
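The guard described in this commit can be sketched generically: try the real import first, and register a stub in `sys.modules` only when that import fails. The helper `import_or_stub` and the module name `surely_not_installed_pkg_xyz` are hypothetical and chosen for illustration; the actual test files are not shown here.

```python
import importlib
import sys
from unittest.mock import MagicMock


def import_or_stub(name: str):
    """Return the real module if importable; otherwise register a stub."""
    try:
        return importlib.import_module(name)
    except ImportError:
        stub = MagicMock(name=f"stub_{name}")
        sys.modules[name] = stub
        return stub


real = import_or_stub("json")  # stdlib module, always importable
fake = import_or_stub("surely_not_installed_pkg_xyz")  # assumed to be absent
```

A stub registered this way still persists for the whole session. Where per-test isolation matters, a fixture with teardown (for example pytest's `monkeypatch.setitem(sys.modules, ...)`) is an alternative.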


zhaiwenxi and others added 30 commits May 27, 2026 16:08

feat: add DeePMD property tools

30351e9

[pre-commit.ci] auto fixes from pre-commit.com hooks

e9fe00f

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

db05969

for more information, see https://pre-commit.ci

Merge pull request #1 from zhaiwenxi/add-property-tools

311a620

feat: add DeePMD property tools

Add SMILES coordinate generation for property tools

05479d4

[pre-commit.ci] auto fixes from pre-commit.com hooks

4445f1d

for more information, see https://pre-commit.ci

Merge branch 'deepmodeling:master' into master

9be45cd

Merge pull request #2 from zhaiwenxi/add-property-tools

d5df6fa

Add property tools

feat: add dpa_tools as self-contained subpackage (PR 1)

52033d7

feat: add dp dpa CLI subcommand group (Branch A)

3e0c3f9

feat: centralize deepmd API calls into _backend.py chokepoint (Branch B)

ffe609c

Merge branch-b-backend (_backend.py chokepoint)

beb7b42

fix: use yield fixture for contract test hook cleanup (prevents state leak)

ab024dc

docs: add dpa_tools Python and CLI API reference

da3f26f

Merge pull request #3 from zirenjin/master

bb3c971

dpa_tools merge

feat: merge property_tools SMILES pipeline into dpa_tools

57f61bd

feat: auto-detect format in dp dpa data convert, unify SMILES+structure paths

f61f0c2

chore: remove deepmd_property_tools, migrate tests+data to dpa_tools

392a1a5

chore: rename DATA/ → demo/

871d600

docs: update README — add SMILES pipeline, auto_convert, demo data

8a8ec93

refactor: fold mft into fit --strategy mft, batch-convert into convert, unify --target-key

ae78fea

docs: update README for refactored CLI and API (mft→fit, batch-convert→convert)

fbfb5a0

feat: auto-download built-in pretrained models via resolve_pretrained_path

5bb1b53

fix: update property_tools_tests CI after migration to dpa_tools

217868c

The workflow still referenced the deleted deepmd_property_tools/ directory. Updated paths trigger to deepmd/dpa_tools/** and test command to source/tests/dpa_tools/. Added torch to lightweight dependencies.

fix: pin numpy<2.2 in lightweight CI for Python 3.10 compat

3b1ed2c

numpy 2.3+ requires Python>=3.11, but the property_tools_tests workflow runs on Python 3.10. Pin numpy>=1.21,<2.2 to keep the lightweight dependency install working on older Python.

Merge pull request #4 from zirenjin/master

93b2c5d

refactor: unify dpa_tools CLI/API and merge deepmd_property_tools

iProzd approved these changes Jun 30, 2026


Comment thread .gitignore Outdated

zirenjin and others added 2 commits June 30, 2026 15:32

chore(dpa-adapt): remove local output paths from .gitignore

5f5e735

Merge pull request #54 from zirenjin/chore/remove-formula-feature

d6e4ed3

refactor(dpa-adapt): remove formula-table conversion feature

njzjz enabled auto-merge June 30, 2026 11:10

njzjz disabled auto-merge June 30, 2026 11:13

Merge branch 'master' into master

c47834c

zhaiwenxi requested a review from njzjz-bot June 30, 2026 11:15

njzjz-bot suggested changes Jun 30, 2026


zirenjin and others added 10 commits June 30, 2026 22:00

style(dpa-adapt): remove redundant quotes from type annotation in _hash_array

6605d5d

style(dpa-adapt): apply isort and ruff format to changed modules

9f6ca86

Merge branch 'zhaiwenxi:master' into chore/remove-formula-feature

13f3fb4

Merge pull request #55 from zirenjin/chore/remove-formula-feature

8807d3d

fix(dpa-adapt): resolve type-map validation, cache identity, label loading, and fparam_dim validation issues

Merge branch 'master' into master

88f5dad

[pre-commit.ci] auto fixes from pre-commit.com hooks

a7624ca

for more information, see https://pre-commit.ci

Merge branch 'master' into master

3f1f947

njzjz enabled auto-merge July 1, 2026 05:42

github-advanced-security AI found potential problems Jul 1, 2026


Comment thread source/tests/dpa_adapt/test_conditions.py Fixed

Comment thread source/tests/dpa_adapt/test_type_map.py Fixed

zhaiwenxi and others added 4 commits July 1, 2026 14:18

Fix CodeQL torch import warnings

7a1afbe

Potential fix for pull request finding 'CodeQL / Unused global variable'

b8d63ed

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Signed-off-by: zhaiwenxi <144502730+zhaiwenxi@users.noreply.github.com>

Potential fix for pull request finding 'CodeQL / Unused global variable'

8b636d9

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Signed-off-by: zhaiwenxi <144502730+zhaiwenxi@users.noreply.github.com>

Merge pull request #56 from zhaiwenxi/fix-codeql-torch-imports

f06ddce

Fix CodeQL torch import warnings

auto-merge was automatically disabled July 1, 2026 08:50
Head branch was pushed to by a user without write access

Merge branch 'master' into master

6268058

njzjz added this pull request to the merge queue Jul 1, 2026

Merged via the queue into deepmodeling:master with commit a49b6f7 Jul 1, 2026
57 checks passed

njzjz merged 247 commits into deepmodeling:master from zhaiwenxi:master


7 participants
