[BUG] The Error from Neighbor Statistics in dpmd v2.2.10 #3960

Description

@robinzyb

Bug summary

Using the Dockerfile provided by the CSCS engineer, I compiled DeePMD-kit (dpmd) version 2.2.10 on the new cluster. When I tested dp train with my dataset, it failed during the neighbor statistics step with the following error:

WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
/usr/local/lib/python3.10/dist-packages/deepmd_utils/utils/compat.py:362: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
  warnings.warn(
DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
2024-07-09 14:03:12.068634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94677 MB memory:  -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-09 14:03:12.089546: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-09 14:03:12.143871: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaaf5979450
Traceback (most recent call last):
  File "/usr/local/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/deepmd_utils/main.py", line 657, in main
    deepmd_main(args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/main.py", line 74, in main
    train_dp(**dict_args)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 149, in train
    jdata = update_sel(jdata)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 512, in update_sel
    jdata_cpy["model"] = Model.update_sel(jdata, jdata["model"])
  File "/usr/local/lib/python3.10/dist-packages/deepmd/model/model.py", line 566, in update_sel
    return cls.update_sel(global_jdata, local_jdata)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/model/model.py", line 723, in update_sel
    local_jdata_cpy["descriptor"] = Descriptor.update_sel(
  File "/usr/local/lib/python3.10/dist-packages/deepmd/descriptor/descriptor.py", line 511, in update_sel
    return cls.update_sel(global_jdata, local_jdata)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/descriptor/se.py", line 162, in update_sel
    return update_one_sel(global_jdata, local_jdata_cpy, False)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 479, in update_one_sel
    tmp_sel = get_sel(
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 440, in get_sel
    _, max_nbor_size = get_nbor_stat(jdata, rcut, one_type=one_type)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 425, in get_nbor_stat
    min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
  File "/usr/local/lib/python3.10/dist-packages/deepmd/utils/neighbor_stat.py", line 220, in get_stat
    raise RuntimeError(
RuntimeError: Some atoms are overlapping in /capstor/scratch/cscs/zyongbin/dpmd_test/all_data_set/data_set_cll_v1/set.000. Please check your training data to remove duplicated atoms.

My own validation found no overlapping atoms in the dataset. The dataset and the Python script used for this check are provided below.
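
For readers without the attachment, an independent overlap check of this kind can be sketched as below. This is not the script in check.tgz; it is a minimal sketch that assumes one frame is given as a list of Cartesian coordinates plus a 3x3 cell whose rows are the lattice vectors, and that the cell is not strongly skewed (the minimum-image step simply rounds fractional coordinates). The names `invert_3x3` and `min_pair_distance` are hypothetical helpers.

```python
import itertools
import math


def invert_3x3(m):
    """Return the inverse of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("cell matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def min_pair_distance(coords, cell):
    """Smallest interatomic distance in one periodic frame.

    coords: list of (x, y, z) Cartesian positions.
    cell:   3x3 nested list, rows are lattice vectors (assumed convention).
    Uses a simple minimum-image convention; O(N^2) in the number of atoms.
    """
    inv = invert_3x3(cell)
    best = math.inf
    for p, q in itertools.combinations(coords, 2):
        delta = [q[k] - p[k] for k in range(3)]
        # Cartesian -> fractional: s = delta @ inv(cell)
        frac = [sum(delta[k] * inv[k][j] for k in range(3)) for j in range(3)]
        # Wrap each fractional component into [-0.5, 0.5]
        frac = [x - round(x) for x in frac]
        # Fractional -> Cartesian: r = s @ cell
        cart = [sum(frac[k] * cell[k][j] for k in range(3)) for j in range(3)]
        best = min(best, math.sqrt(sum(x * x for x in cart)))
    return best
```

A result of zero (or very close to it) for any frame would indicate genuinely duplicated atoms. A clearly positive minimum in every frame, such as the 0.8696584261076291 reported by v2.2.9 below, would suggest the data are not at fault.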

When I change only the dpmd version to v2.2.9 in the Dockerfile, the neighbor statistics run without error:

DeePMD-kit v2.2.9
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD INFO    training data with min nbor dist: 0.8696584261076291
DEEPMD INFO    training data with max nbor size: [13 68 54 14]
DEEPMD INFO    min_nbor_dist: 0.869658
DEEPMD INFO    max_nbor_size: [13 68 54 14]

I believe updating to dpmd v2.2.11 does not fix this, because the only change to deepmd/utils/neighbor_stat.py between v2.2.10 and v2.2.11 is f-string formatting.

DeePMD-kit Version

2.2.10 and 2.2.9

Backend and its version

Not provided

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

A homemade Docker image

Steps to Reproduce

  1. Download my dataset.
  2. With version 2.2.9, run: dp neighbor-stat -s data_set_cll_v1 -t Bi H O V -r 6.0
  3. Run the same command with version 2.2.10.
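
To compare the two runs, the reported statistics can be pulled out of the captured logs. The sketch below assumes the log contains lines of the form `min_nbor_dist: ...` and `max_nbor_size: [...]`, as in the v2.2.9 output above; `parse_neighbor_stat` is a hypothetical helper, not part of DeePMD-kit.

```python
import re


def parse_neighbor_stat(log_text):
    """Extract (min_nbor_dist, max_nbor_size) from captured log text.

    Either element is None when the corresponding line is absent,
    for example when the run aborted with a RuntimeError.
    """
    dist = re.search(r"min_nbor_dist:\s*([0-9.eE+-]+)", log_text)
    size = re.search(r"max_nbor_size:\s*\[([^\]]*)\]", log_text)
    return (
        float(dist.group(1)) if dist else None,
        [int(x) for x in size.group(1).split()] if size else None,
    )
```

Applied to the v2.2.9 log, this yields the distance and the per-type neighbor counts. Applied to the v2.2.10 log, both values are None because the run stops before printing them.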

Further Information, Files, and Links

check.tgz
