add DPA-ADAPT toolkit for downstream property adaptation #5572

Merged
njzjz merged 247 commits into deepmodeling:master from zhaiwenxi:master
Jul 1, 2026

@zhaiwenxi zhaiwenxi commented Jun 22, 2026

Summary

This PR adds DPA-ADAPT, a toolkit for adapting pretrained DPA models to downstream atomistic property prediction tasks.

The new package provides a scikit-learn-style Python API and standalone CLI for fine-tuning, descriptor extraction, prediction, evaluation, cross-validation, and data preparation, without requiring users to manually write DeePMD-kit training input files.

Main changes

  • Add the top-level dpa_adapt Python package.
  • Add standalone CLI entry points:
    • dpa-adapt
    • dpaad
  • Support multiple adaptation strategies:
    • frozen_sklearn: frozen DPA descriptors with scikit-learn regressors
    • frozen_head: train a property head on top of a frozen DPA backbone
    • finetune: end-to-end DPA fine-tuning
    • mft: multi-task fine-tuning with auxiliary energy/force training
  • Add data utilities for:
    • DeepMD/npy loading and validation
    • label attachment
    • descriptor caching
    • train/test split and cross-validation
    • SMILES/formula-based conversion workflows
    • optional frame parameters via fparam.npy
  • Add prediction and evaluation helpers with MAE, RMSE, and R2 reporting.
  • Add documentation under doc/dpa_adapt/.
  • Add a runnable QM9 HOMO-LUMO gap example under examples/dpa_adapt/.
  • Add dpa-adapt optional dependencies in pyproject.toml.
  • Add dedicated lightweight CI for source/tests/dpa_adapt/.
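
For readers unfamiliar with the three reported metrics, the following is a minimal, dependency-free sketch of how MAE, RMSE, and R2 are conventionally computed. The function name `regression_metrics` and its return format are assumptions made for illustration; they are not taken from the dpa_adapt API.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 for two equal-length sequences (hypothetical helper)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,  # undefined when all targets are equal
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

A constant offset of 1 gives MAE and RMSE of 1, while R2 stays well below 1 because the residual is large relative to the spread of the targets.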

Co-authored-by: zirenjin <zirenjin@umich.edu>

Summary by CodeRabbit

  • New Features
    • Added the DPA-ADAPT toolkit with a new command-line interface for data conversion, validation, training, prediction, evaluation, and descriptor extraction.
    • Introduced support for multiple adaptation workflows, including frozen-sklearn, frozen-head, fine-tuning, and multi-task training.
    • Added data handling for SMILES, formulas, structures, label attachment, and condition features.
    • Included a new example workflow and expanded user documentation for setup and usage.

zhaiwenxi and others added 30 commits May 27, 2026 16:08
…utput parsing

- DPAFineTuner: extract _FrozenSklearnPipeline helper; keep public API unchanged
- MFTFineTuner: defer _read_fitting_net_from_ckpt to first access
- DPATrainer._parse_test_output: single anchored regex per metric, auto-detect format
…perty metrics

- _load_labels: accept str | list[str], stack columns for multi-property
- build_sklearn_head: n_outputs param, wrap RF/Ridge with MultiOutputRegressor
- evaluate: per-property mae/rmse/r2 dict when target_key is a list
- freeze/DPAPredictor: store and load target_key as-is (str or list)
- CLI: --target-key homo,lumo parsed via _maybe_split_list
- 6 new tests covering fit, evaluate, freeze/load round-trip
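
The `--target-key homo,lumo` handling described in this commit can be pictured with a small sketch. The function below is a hypothetical stand-in: the commit only names `_maybe_split_list`, so the exact splitting rules (whitespace stripping, dropping empty parts) are assumptions.

```python
def maybe_split_list(value):
    """Return a list for comma-separated input, else the string unchanged.

    Hypothetical stand-in for the CLI helper; the real behavior may differ.
    """
    if "," not in value:
        return value  # single property: keep target_key as a plain str
    # Multi-property: strip whitespace and drop empty fragments such as "a,,b".
    return [part.strip() for part in value.split(",") if part.strip()]


single = maybe_split_list("homo")
multi = maybe_split_list("homo, lumo")
```

Keeping the single-property case as a plain string matches the commit's note that `target_key` is stored and loaded as-is, whether it is a str or a list.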
The old _load_descriptor_model, _validate_type_map, _remap_atom_types,
_extract_features_cached, and _extract_features method bodies were left
in place alongside the new thin delegators, causing CodeQL 'variable
defined multiple times' warnings.  Removed the old bodies; kept
_extract_features_cached on DPAFineTuner directly so that test patches
on DPAFineTuner._extract_features are honoured through the cache wrapper.
… method

- Replace try/except ImportError in _unwrap_multioutput with direct import
  (sklearn is always available when dpa_tools is loaded)
- Remove _FrozenSklearnPipeline.extract_features_cached (dead code;
  the caching wrapper lives on DPAFineTuner so test patches work)
The workflow still referenced the deleted deepmd_property_tools/ directory.
Updated paths trigger to deepmd/dpa_tools/** and test command to
source/tests/dpa_tools/. Added torch to lightweight dependencies.
numpy 2.3+ requires Python>=3.11, but the property_tools_tests workflow
runs on Python 3.10. Pin numpy>=1.21,<2.2 to keep the lightweight
dependency install working on older Python.
refactor: unify dpa_tools CLI/API and merge deepmd_property_tools
Remove the `--fmt formula` pipeline that converted elemental composition
formulas plus a template POSCAR into deepmd/npy systems via random atomic
substitution on the host-element sublattice.

- delete dpa_adapt/data/formula.py and source/tests/dpa_adapt/test_formula.py
- drop the formula_to_npy exports and the fmt="formula" branch in convert()
- remove the --poscar/--base-element/--formula-col/--sets CLI flags and the
  formula result handler from the data convert command
- prune formula tests from test_convert.py and test_cli_smoke.py
- drop the Formula Tables docs from the README and dpa_adapt guide

The unrelated cross-validation group_by="formula" grouping is unchanged.

@iProzd iProzd left a comment

Others LGTM.

Comment thread .gitignore Outdated
@njzjz njzjz enabled auto-merge June 30, 2026 11:10
@njzjz njzjz disabled auto-merge June 30, 2026 11:13
@zhaiwenxi zhaiwenxi requested a review from njzjz-bot June 30, 2026 11:15

@njzjz-bot njzjz-bot left a comment

Thanks for the large DPA-ADAPT contribution. I found several correctness / integration issues that should be fixed before merge, so I’m requesting changes.

  1. DPA-ADAPT tests require scikit-learn, but the test extra does not install it.

    The new tests live under source/tests/dpa_adapt, so they are included by the normal pytest source/tests paths. Some of them import DPA-ADAPT modules that import sklearn at module import time (source/tests/dpa_adapt/test_split_cv.py → dpa_adapt/cv.py). The CPU workflow currently masks this by manually adding scikit-learn, but other test paths do not: CUDA installs .[gpu,test,lmp,cu12,torch,jax] and runs python -m pytest source/tests, and the default tox env installs only test,cpu. Those will fail with ModuleNotFoundError: sklearn unless the dependency is added to the appropriate test/CI extras or the optional tests skip cleanly. Prefer installing/testing the advertised dpa-adapt extra in CI so the package metadata is validated too.

  2. The descriptor cache key can return stale descriptors for changed systems.

    dpa_adapt/data/desc_cache.py builds _system_fingerprint() from metadata plus only the first/last 64 flattened entries of coords and cells. Any change in the middle of a larger trajectory/system keeps the same cache key, so both the aggregate cache and per-system cache can reuse descriptors from a different structure. Since these descriptors directly drive fitting and prediction, the cache identity needs to cover the full relevant array contents (for example a streaming hash over all arrays, or a robust hash of all set.*/*.npy inputs) before this is safe.

  3. Unsupported training elements are silently ignored in type-map resolution.

    In dpa_adapt/finetuner.py, _resolve_type_maps() catches every ValueError from both read_data_type_map_union() and validate_type_map_subset(). That means a real validation failure such as data containing an element not covered by the checkpoint is swallowed as if it were merely the “no atom_names” case, and training proceeds with an incompatible checkpoint type map. Please split the try block so only the missing-atom-names case is ignored, and unsupported-element validation errors propagate clearly.

  4. MFT rejects valid raw-index data without type_map.raw.

    read_data_type_map_union() treats dpdata placeholder names such as Type_0/Type_1 as real elements. MFT then validates those against the checkpoint type map and fails, although the surrounding code comments say data without atom_names should be allowed to use raw atom indices. This is inconsistent with finetuner._read_data_type_map(), which explicitly filters all-Type_* placeholder maps. Please make the shared type-map reader ignore all-placeholder maps, or reuse the same filtering logic before subset validation.

  5. load_dataset() does not use the direct custom-label fallback used elsewhere.

    dpa_adapt/data/dataset.py only checks resolved_key in system.data. Custom labels such as property.npy, homo.npy, or bandgap.npy under set.*/ are not generally loaded into dpdata.System.data; _load_labels() in finetuner.py already has a direct set.*/{key}.npy fallback for this. Since the CV CLI calls load_dataset(args.data, label_key=...), valid datasets with these custom label files can be skipped as missing. Please share/reuse the same label-discovery semantics here.
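
To illustrate the cache-identity concern in item 2, here is a minimal sketch of a fingerprint that covers the full array contents. It is not the code in dpa_adapt/data/desc_cache.py: the name `full_fingerprint`, the dictionary input, and the choice of SHA-256 are assumptions. It accepts any C-contiguous buffer (for example a NumPy array or a standard-library `array`).

```python
import hashlib
from array import array


def full_fingerprint(arrays, metadata=""):
    """Hash metadata plus the complete contents of every array (hypothetical helper)."""
    h = hashlib.sha256()
    h.update(metadata.encode("utf-8"))
    for name in sorted(arrays):  # sorted so the key does not depend on dict order
        view = memoryview(arrays[name])
        # Include name, element format and shape so retyped or reshaped data differs.
        h.update(f"{name}|{view.format}|{view.shape}|".encode("utf-8"))
        # Hash every element, not only the head and tail of the flattened data.
        # tobytes() copies; a chunked update would avoid that for large arrays.
        h.update(view.tobytes())
    return h.hexdigest()


coords = array("d", [float(i) for i in range(1000)])
changed = array("d", coords)
changed[500] += 0.25  # a change far from both ends of the array

key_original = full_fingerprint({"coords": coords})
key_changed = full_fingerprint({"coords": changed})
```

Because the edit at index 500 lies outside the first and last 64 entries, a head/tail sample would miss it, whereas the full hash produces a different key.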

A smaller consistency issue: DPAFineTuner.__init__ should validate fparam_dim the same way DPATrainer and MFTFineTuner do. Currently the default frozen_sklearn path accepts fparam_dim=-1 and silently treats it as disabled.
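
As a sketch of the kind of check requested here, the function below rejects negative or non-integer values instead of silently treating them as disabled. Its name, signature, and choice of exception types are assumptions, not the actual DPATrainer or MFTFineTuner code.

```python
def validate_fparam_dim(fparam_dim):
    """Return fparam_dim if it is a non-negative int, else raise (hypothetical helper)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(fparam_dim, bool) or not isinstance(fparam_dim, int):
        raise TypeError(
            f"fparam_dim must be an int, got {type(fparam_dim).__name__}"
        )
    if fparam_dim < 0:
        raise ValueError(f"fparam_dim must be >= 0, got {fparam_dim}")
    return fparam_dim


ok = validate_fparam_dim(0)  # 0 is accepted and means "no frame parameters" here
```

Calling one such helper from every constructor keeps the three classes consistent, so `fparam_dim=-1` fails in the same way on every code path.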

I verified the code locations against PR head c47834c83d54b8e372bc2119474f4a944a5618f7 and ran python3 -m compileall -q dpa_adapt successfully. I could not run the new pytest suite in this local checkout because pytest/numpy/dpdata/sklearn/torch are not installed in the available environment.

— OpenClaw 2026.6.8 (model: custom-chat-jinzhezeng-group/gpt-5.5)

zirenjin and others added 10 commits June 30, 2026 22:00
…ading, and fparam_dim validation issues

- Split try/except in _resolve_type_maps so unsupported-element errors propagate
  instead of being silently swallowed as missing-atom-names
- Make read_data_type_map_union skip all-Type_* placeholder names, consistent
  with _read_data_type_map, so MFT does not reject valid raw-index data
- Add set.*/{key}.npy direct fallback to load_dataset for custom label files
  (e.g. homo.npy, bandgap.npy) not loaded into dpdata.System.data
- Replace first/last-64 sampling in _system_fingerprint with full-array hashing
  so descriptor cache keys correctly invalidate when structures change
- Validate fparam_dim as non-negative int in DPAFineTuner.__init__, matching
  DPATrainer and MFTFineTuner
- Add scikit-learn to the test extra so DPA-ADAPT tests can run in all CI paths
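
The first two fixes in this list can be sketched together. Everything below is illustrative: the function names, the `MissingAtomNames` exception, and the `Type_<n>` pattern are assumptions modeled on the review text, not the actual dpa_adapt code.

```python
import re

_PLACEHOLDER = re.compile(r"^Type_\d+$")


class MissingAtomNames(ValueError):
    """Raised when data carries no real element names (hypothetical)."""


def is_placeholder_type_map(names):
    """True when every name is a dpdata-style placeholder such as Type_0."""
    names = [str(n) for n in names]
    return bool(names) and all(_PLACEHOLDER.match(n) for n in names)


def read_type_map(names):
    if not names or is_placeholder_type_map(names):
        raise MissingAtomNames("data has no real atom names")
    return [str(n) for n in names]


def resolve_type_map(names, checkpoint_map):
    try:
        data_map = read_type_map(names)
    except MissingAtomNames:
        return None  # tolerated: fall back to raw atom indices
    # Validation sits outside the try block, so its errors are not swallowed.
    unsupported = [e for e in data_map if e not in checkpoint_map]
    if unsupported:
        raise ValueError(f"elements not covered by checkpoint: {unsupported}")
    return data_map


raw_index = resolve_type_map(["Type_0", "Type_1"], ["H", "C", "O"])
resolved = resolve_type_map(["H", "O"], ["H", "C", "O"])
```

Only the narrow "missing names" case is caught; a dataset containing an element the checkpoint does not cover still raises.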
…ew-fix duplication

Follow-up to the review fixes in ba1f17c: the fixes were correct but
copy-pasted logic across modules. Consolidate into shared helpers.

- Add dpa_adapt/_validation.py with validate_fparam_dim(); reuse in
  DPATrainer, MFTFineTuner and DPAFineTuner __init__ (was triplicated)
- Add _is_placeholder_type_map() in data/type_map.py; reuse in
  read_data_type_map_union and finetuner._read_data_type_map (also
  unifies the str() handling that previously differed between them)
- Add _find_label_npys() in data/loader.py for set.*/{key}.npy discovery;
  reuse in dataset.load_dataset and finetuner._load_labels
- Drop the redundant manual scikit-learn from test_python.yml; the test
  extra already provides it

No behavior change: helper outputs verified identical to the inlined
logic, and the dpa_adapt suite is unchanged (314 passed, 10 skipped;
the one pre-existing test-isolation failure is unrelated).
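
The `set.*/{key}.npy` discovery that this commit consolidates can be sketched with `pathlib`. The function below is a hypothetical stand-in for `_find_label_npys()`; its real signature and return type are not shown in this PR text.

```python
import tempfile
from pathlib import Path


def find_label_npys(system_dir, key):
    """Return sorted paths matching set.*/{key}.npy under system_dir (hypothetical)."""
    return sorted(Path(system_dir).glob(f"set.*/{key}.npy"))


# Build a throwaway system directory with two sets, each holding a custom label file.
with tempfile.TemporaryDirectory() as tmp:
    for set_name in ("set.000", "set.001"):
        set_dir = Path(tmp) / set_name
        set_dir.mkdir()
        (set_dir / "homo.npy").touch()  # empty placeholder, not a real .npy payload
    found = [p.parent.name for p in find_label_npys(tmp, "homo")]
    missing = find_label_npys(tmp, "bandgap")
```

Sharing one discovery function between the dataset loader and the fine-tuner means both agree on which systems have a given label.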
…sion

test_type_map.py and test_conditions.py injected a MagicMock as `torch`
into sys.modules at import time via an unconditional
`sys.modules.setdefault("torch", _mock_torch)`. During a full pytest run
all test modules are imported in the collection phase, so when one of
these files was imported before the real torch, the mock won the race and
stayed in sys.modules for the whole session (no teardown). A later test
doing real tensor math then got the mock: `feat.detach().cpu().numpy()`
returned a MagicMock and `np.concatenate([mock])` collapsed to
`array([], dtype=float64)`, failing
test_extract_features_detaches_grad_tensors_before_numpy.

Guard the stub behind `try: import torch / except` so it is only
installed when torch is genuinely absent, matching the existing pattern
in test_predictor.py. No effect when torch is missing.

Full dpa_adapt suite: 318 passed, 7 skipped, 0 failed (was 314/10/1; the
fix also un-skips 3 tests that the mock was falsely masking).
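
The guarded-stub pattern described above can be sketched generically. The helper name `stub_if_missing` and the use of `importlib` are assumptions made so the example can run without torch; the commit itself describes a plain `try: import torch` guard.

```python
import importlib
import sys
from unittest.mock import MagicMock


def stub_if_missing(module_name):
    """Install a MagicMock only when the real module cannot be imported.

    Returns True if a stub was installed, False if the real module exists.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Real module is absent: a stub cannot shadow a working import.
        sys.modules.setdefault(module_name, MagicMock(name=module_name))
        return True
    return False


stubbed_json = stub_if_missing("json")  # stdlib module, always importable
stubbed_fake = stub_if_missing("no_such_module_for_demo")
```

Trying the real import first removes the collection-order race: the stub can only end up in `sys.modules` when there is no real module for it to displace.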
fix(dpa-adapt): resolve type-map validation, cache identity, label loading, and fparam_dim validation issues
@njzjz njzjz enabled auto-merge July 1, 2026 05:42
Comment thread source/tests/dpa_adapt/test_conditions.py Fixed
Comment thread source/tests/dpa_adapt/test_type_map.py Fixed
zhaiwenxi and others added 4 commits July 1, 2026 14:18
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: zhaiwenxi <144502730+zhaiwenxi@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: zhaiwenxi <144502730+zhaiwenxi@users.noreply.github.com>
auto-merge was automatically disabled July 1, 2026 08:50

Head branch was pushed to by a user without write access

@njzjz njzjz added this pull request to the merge queue Jul 1, 2026
Merged via the queue into deepmodeling:master with commit a49b6f7 Jul 1, 2026
57 checks passed

7 participants