add DPA-ADAPT toolkit for downstream property adaptation #5572
for more information, see https://pre-commit.ci
feat: add DeePMD property tools
Add property tools
dpa_tools merge
…t, unify --target-key
…utput parsing
- DPAFineTuner: extract _FrozenSklearnPipeline helper; keep public API unchanged
- MFTFineTuner: defer _read_fitting_net_from_ckpt to first access
- DPATrainer._parse_test_output: single anchored regex per metric, auto-detect format
…perty metrics
- _load_labels: accept str | list[str], stack columns for multi-property
- build_sklearn_head: n_outputs param, wrap RF/Ridge with MultiOutputRegressor
- evaluate: per-property mae/rmse/r2 dict when target_key is a list
- freeze/DPAPredictor: store and load target_key as-is (str or list)
- CLI: --target-key homo,lumo parsed via _maybe_split_list
- 6 new tests covering fit, evaluate, freeze/load round-trip
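The multi-property behavior described in this commit can be sketched in plain Python. This is a minimal illustration, not the code from the PR: `maybe_split_list` and `per_property_mae` are hypothetical stand-ins, and the assumption that a single key stays a plain string while a comma-separated value becomes a list is inferred from the commit message.

```python
def maybe_split_list(value):
    """Return a list for comma-separated input, otherwise a single string.

    Hypothetical stand-in for the CLI helper named in the commit message.
    """
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if len(parts) > 1:
        return parts
    return parts[0] if parts else value.strip()


def per_property_mae(target_key, y_true, y_pred):
    """Mean absolute error per property.

    Rows are samples; columns follow the order of target_key.
    """
    keys = target_key if isinstance(target_key, list) else [target_key]
    n = len(y_true)
    return {
        key: sum(abs(t[j] - p[j]) for t, p in zip(y_true, y_pred)) / n
        for j, key in enumerate(keys)
    }


keys = maybe_split_list("homo,lumo")   # two properties -> list
single = maybe_split_list("homo")      # one property -> str
y_true = [[-5.0, 1.0], [-6.0, 2.0]]
y_pred = [[-5.5, 1.0], [-6.5, 3.0]]
metrics = per_property_mae(keys, y_true, y_pred)
```

Keeping the single-key case as a plain string matches the commit's note that freeze/DPAPredictor store target_key as-is (str or list).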
The old _load_descriptor_model, _validate_type_map, _remap_atom_types, _extract_features_cached, and _extract_features method bodies were left in place alongside the new thin delegators, causing CodeQL 'variable defined multiple times' warnings. Removed the old bodies; kept _extract_features_cached on DPAFineTuner directly so that test patches on DPAFineTuner._extract_features are honoured through the cache wrapper.
… method
- Replace try/except ImportError in _unwrap_multioutput with direct import (sklearn is always available when dpa_tools is loaded)
- Remove _FrozenSklearnPipeline.extract_features_cached (dead code; the caching wrapper lives on DPAFineTuner so test patches work)
The workflow still referenced the deleted deepmd_property_tools/ directory. Updated paths trigger to deepmd/dpa_tools/** and test command to source/tests/dpa_tools/. Added torch to lightweight dependencies.
numpy 2.3+ requires Python>=3.11, but the property_tools_tests workflow runs on Python 3.10. Pin numpy>=1.21,<2.2 to keep the lightweight dependency install working on older Python.
refactor: unify dpa_tools CLI/API and merge deepmd_property_tools
Remove the `--fmt formula` pipeline that converted elemental composition formulas plus a template POSCAR into deepmd/npy systems via random atomic substitution on the host-element sublattice.
- delete dpa_adapt/data/formula.py and source/tests/dpa_adapt/test_formula.py
- drop the formula_to_npy exports and the fmt="formula" branch in convert()
- remove the --poscar/--base-element/--formula-col/--sets CLI flags and the formula result handler from the data convert command
- prune formula tests from test_convert.py and test_cli_smoke.py
- drop the Formula Tables docs from the README and dpa_adapt guide
The unrelated cross-validation group_by="formula" grouping is unchanged.
refactor(dpa-adapt): remove formula-table conversion feature
njzjz-bot
left a comment
Thanks for the large DPA-ADAPT contribution. I found several correctness / integration issues that should be fixed before merge, so I’m requesting changes.
- DPA-ADAPT tests require scikit-learn, but the test extra does not install it.
  The new tests live under source/tests/dpa_adapt, so they are included by the normal pytest source/tests paths. Some of them import DPA-ADAPT modules that import sklearn at module import time (source/tests/dpa_adapt/test_split_cv.py → dpa_adapt/cv.py). The CPU workflow currently masks this by manually adding scikit-learn, but other test paths do not: CUDA installs .[gpu,test,lmp,cu12,torch,jax] and runs python -m pytest source/tests, and the default tox env installs only test,cpu. Those will fail with ModuleNotFoundError: sklearn unless the dependency is added to the appropriate test/CI extras or the optional tests skip cleanly. Prefer installing/testing the advertised dpa-adapt extra in CI so the package metadata is validated too.
- The descriptor cache key can return stale descriptors for changed systems.
  dpa_adapt/data/desc_cache.py builds _system_fingerprint() from metadata plus only the first/last 64 flattened entries of coords and cells. Any change in the middle of a larger trajectory/system keeps the same cache key, so both the aggregate cache and per-system cache can reuse descriptors from a different structure. Since these descriptors directly drive fitting and prediction, the cache identity needs to cover the full relevant array contents (for example a streaming hash over all arrays, or a robust hash of all set.*/*.npy inputs) before this is safe.
- Unsupported training elements are silently ignored in type-map resolution.
  In dpa_adapt/finetuner.py, _resolve_type_maps() catches every ValueError from both read_data_type_map_union() and validate_type_map_subset(). That means a real validation failure such as data containing an element not covered by the checkpoint is swallowed as if it were merely the “no atom_names” case, and training proceeds with an incompatible checkpoint type map. Please split the try block so only the missing-atom-names case is ignored, and unsupported-element validation errors propagate clearly.
- MFT rejects valid raw-index data without type_map.raw.
  read_data_type_map_union() treats dpdata placeholder names such as Type_0/Type_1 as real elements. MFT then validates those against the checkpoint type map and fails, although the surrounding code comments say data without atom_names should be allowed to use raw atom indices. This is inconsistent with finetuner._read_data_type_map(), which explicitly filters all-Type_* placeholder maps. Please make the shared type-map reader ignore all-placeholder maps, or reuse the same filtering logic before subset validation.
- load_dataset() does not use the direct custom-label fallback used elsewhere.
  dpa_adapt/data/dataset.py only checks resolved_key in system.data. Custom labels such as property.npy, homo.npy, or bandgap.npy under set.*/ are not generally loaded into dpdata.System.data; _load_labels() in finetuner.py already has a direct set.*/{key}.npy fallback for this. Since the CV CLI calls load_dataset(args.data, label_key=...), valid datasets with these custom label files can be skipped as missing. Please share/reuse the same label-discovery semantics here.
A smaller consistency issue: DPAFineTuner.__init__ should validate fparam_dim the same way DPATrainer and MFTFineTuner do. Currently the default frozen_sklearn path accepts fparam_dim=-1 and silently treats it as disabled.
I verified the code locations against PR head c47834c83d54b8e372bc2119474f4a944a5618f7 and ran python3 -m compileall -q dpa_adapt successfully. I could not run the new pytest suite in this local checkout because pytest/numpy/dpdata/sklearn/torch are not installed in the available environment.
— OpenClaw 2026.6.8 (model: custom-chat-jinzhezeng-group/gpt-5.5)
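To make the stale-cache concern in the review concrete, here is a small sketch contrasting first/last-64 sampling with a hash over the full array contents. It is an illustration under stated assumptions, not the code in dpa_adapt/data/desc_cache.py: both function names are hypothetical, and `array.array` stands in for the coords/cells arrays so that the example needs only the standard library. Any C-contiguous buffer (which should include NumPy arrays) is expected to work with `full_fingerprint` in the same way.

```python
import hashlib
from array import array


def sampled_fingerprint(arrays, n=64):
    """First/last-n sampling of flat arrays; shown only to reproduce the failure mode."""
    h = hashlib.sha256()
    for name in sorted(arrays):
        arr = arrays[name]
        h.update(name.encode())
        h.update(arr[:n].tobytes())
        h.update(arr[-n:].tobytes())
    return h.hexdigest()


def full_fingerprint(arrays):
    """Hash every byte of every named array, plus its name, format, and shape."""
    h = hashlib.sha256()
    for name in sorted(arrays):
        view = memoryview(arrays[name])
        # Mixing in format and shape keeps reshaped or retyped data from colliding.
        h.update(f"{name}|{view.format}|{view.shape}|".encode())
        # cast("B") exposes the raw bytes without copying the array.
        h.update(view.cast("B"))
    return h.hexdigest()


coords = array("d", (float(i) for i in range(1000)))
edited = array("d", coords)
edited[500] += 1.0  # a change in the middle, outside the first/last 64 entries

same_sampled = sampled_fingerprint({"coords": coords}) == sampled_fingerprint({"coords": edited})
same_full = full_fingerprint({"coords": coords}) == full_fingerprint({"coords": edited})
```

Here `same_sampled` is true (the stale-cache case) while `same_full` is false, which is the property a cache key needs.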
…ading, and fparam_dim validation issues
- Split try/except in _resolve_type_maps so unsupported-element errors propagate
instead of being silently swallowed as missing-atom-names
- Make read_data_type_map_union skip all-Type_* placeholder names, consistent
with _read_data_type_map, so MFT does not reject valid raw-index data
- Add set.*/{key}.npy direct fallback to load_dataset for custom label files
(e.g. homo.npy, bandgap.npy) not loaded into dpdata.System.data
- Replace first/last-64 sampling in _system_fingerprint with full-array hashing
so descriptor cache keys correctly invalidate when structures change
- Validate fparam_dim as non-negative int in DPAFineTuner.__init__, matching
DPATrainer and MFTFineTuner
- Add scikit-learn to the test extra so DPA-ADAPT tests can run in all CI paths
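The two type-map fixes listed above can be sketched together. This is a hedged illustration of the control flow only: the function names, the dict-based `system` argument, and the error messages are assumptions made for the example and are not the signatures used in dpa_adapt.

```python
import re

_PLACEHOLDER = re.compile(r"Type_\d+")


def is_placeholder_type_map(names):
    """True when every name is a dpdata-style placeholder such as Type_0."""
    return bool(names) and all(_PLACEHOLDER.fullmatch(str(n)) for n in names)


def read_data_type_map(system):
    """Return real element names, or raise ValueError when none are usable."""
    names = system.get("atom_names")
    if not names or is_placeholder_type_map(names):
        raise ValueError("no usable atom_names")
    return [str(n) for n in names]


def validate_subset(data_map, ckpt_map):
    """Raise ValueError if the data uses elements the checkpoint does not cover."""
    unsupported = sorted(set(data_map) - set(ckpt_map))
    if unsupported:
        raise ValueError(f"elements not in checkpoint type map: {unsupported}")


def resolve_type_map(system, ckpt_map):
    # Only the "no atom names" case is tolerated; fall back to raw atom indices.
    try:
        data_map = read_data_type_map(system)
    except ValueError:
        return None
    # Validation sits outside the try block, so real errors propagate.
    validate_subset(data_map, ckpt_map)
    return data_map


ckpt = ["H", "C", "O"]
raw_index = resolve_type_map({"atom_names": ["Type_0", "Type_1"]}, ckpt)
supported = resolve_type_map({"atom_names": ["O", "H"]}, ckpt)
```

A dedicated exception subclass for the missing-names case would be a stricter alternative to catching ValueError around the reader alone.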
…ew-fix duplication
Follow-up to the review fixes in ba1f17c: the fixes were correct but copy-pasted logic across modules. Consolidate into shared helpers.
- Add dpa_adapt/_validation.py with validate_fparam_dim(); reuse in DPATrainer, MFTFineTuner and DPAFineTuner __init__ (was triplicated)
- Add _is_placeholder_type_map() in data/type_map.py; reuse in read_data_type_map_union and finetuner._read_data_type_map (also unifies the str() handling that previously differed between them)
- Add _find_label_npys() in data/loader.py for set.*/{key}.npy discovery; reuse in dataset.load_dataset and finetuner._load_labels
- Drop the redundant manual scikit-learn from test_python.yml; the test extra already provides it
No behavior change: helper outputs verified identical to the inlined logic, and the dpa_adapt suite is unchanged (314 passed, 10 skipped; the one pre-existing test-isolation failure is unrelated).
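As an illustration of the set.*/{key}.npy discovery mentioned above, the following sketch shows one way such a shared helper could look. The name `find_label_npys` and its signature are hypothetical; the actual helper in data/loader.py may differ. The demo creates empty placeholder files in a temporary directory only to exercise the glob pattern.

```python
import tempfile
from pathlib import Path


def find_label_npys(system_dir, key):
    """Return sorted paths matching set.*/{key}.npy under a system directory."""
    return sorted(Path(system_dir).glob(f"set.*/{key}.npy"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for set_name in ("set.000", "set.001"):
        (root / set_name).mkdir()
        (root / set_name / "homo.npy").touch()  # empty placeholder label file
    found = find_label_npys(root, "homo")
    missing = find_label_npys(root, "bandgap")
    found_sets = [p.parent.name for p in found]
```

Routing every caller through one function like this keeps the fallback for custom labels consistent between dataset loading and fine-tuning.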
…sion
test_type_map.py and test_conditions.py injected a MagicMock as `torch`
into sys.modules at import time via an unconditional
`sys.modules.setdefault("torch", _mock_torch)`. During a full pytest run
all test modules are imported in the collection phase, so when one of
these files was imported before the real torch, the mock won the race and
stayed in sys.modules for the whole session (no teardown). A later test
doing real tensor math then got the mock: `feat.detach().cpu().numpy()`
returned a MagicMock and `np.concatenate([mock])` collapsed to
`array([], dtype=float64)`, failing
test_extract_features_detaches_grad_tensors_before_numpy.
Guard the stub behind `try: import torch / except` so it is only
installed when torch is genuinely absent, matching the existing pattern
in test_predictor.py. No effect when torch is missing.
Full dpa_adapt suite: 318 passed, 7 skipped, 0 failed (was 314/10/1; the
fix also un-skips 3 tests that the mock was falsely masking).
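The guarded-stub pattern described in this commit message can be sketched as follows. The helper name `ensure_module` and the demo module names are assumptions made for illustration; the test files in the PR inline a `try: import torch` guard rather than calling a helper.

```python
import importlib
import sys
from unittest.mock import MagicMock


def ensure_module(name):
    """Import the real module when available; register a MagicMock stub only if it is absent."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # setdefault leaves any module that is already registered untouched.
        return sys.modules.setdefault(name, MagicMock(name=name))


# "json" stands in for an installed dependency; the second name is assumed not to exist.
real = ensure_module("json")
stub = ensure_module("hypothetical_missing_pkg_for_demo")
```

Because the import is attempted first, the stub can no longer shadow a real installation regardless of the order in which test modules are collected.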
fix(dpa-adapt): resolve type-map validation, cache identity, label loading, and fparam_dim validation issues
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Signed-off-by: zhaiwenxi <144502730+zhaiwenxi@users.noreply.github.com>
Fix CodeQL torch import warnings
Head branch was pushed to by a user without write access
Summary
This PR adds DPA-ADAPT, a toolkit for adapting pretrained DPA models to downstream atomistic property prediction tasks.
The new package provides a scikit-learn-style Python API and standalone CLI for fine-tuning, descriptor extraction, prediction, evaluation, cross-validation, and data preparation, without requiring users to manually write DeePMD-kit training input files.
Main changes
- dpa_adapt Python package.
- dpa-adapt dpaad
- frozen_sklearn: frozen DPA descriptors with scikit-learn regressors
- frozen_head: train a property head on top of a frozen DPA backbone
- finetune: end-to-end DPA fine-tuning
- mft: multi-task fine-tuning with auxiliary energy/force training
- fparam.npy
- doc/dpa_adapt/.
- examples/dpa_adapt/.
- dpa-adapt optional dependencies in pyproject.toml.
- source/tests/dpa_adapt/.
Co-authored-by: zirenjin <zirenjin@umich.edu>
Summary by CodeRabbit