Bug summary
Using the Dockerfile provided by a CSCS engineer, I built DeePMD-kit v2.2.10 on the new cluster. When I tested dp train with my dataset, it failed with the following error:
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:deepmd.train.run_options:Switch to serial execution due to lack of horovod module.
/usr/local/lib/python3.10/dist-packages/deepmd_utils/utils/compat.py:362: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
warnings.warn(
DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
2024-07-09 14:03:12.068634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 94677 MB memory: -> device: 0, name: GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2024-07-09 14:03:12.089546: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-07-09 14:03:12.143871: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaaf5979450
Traceback (most recent call last):
File "/usr/local/bin/dp", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/deepmd_utils/main.py", line 657, in main
deepmd_main(args)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/main.py", line 74, in main
train_dp(**dict_args)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 149, in train
jdata = update_sel(jdata)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 512, in update_sel
jdata_cpy["model"] = Model.update_sel(jdata, jdata["model"])
File "/usr/local/lib/python3.10/dist-packages/deepmd/model/model.py", line 566, in update_sel
return cls.update_sel(global_jdata, local_jdata)
File "/usr/local/lib/python3.10/dist-packages/deepmd/model/model.py", line 723, in update_sel
local_jdata_cpy["descriptor"] = Descriptor.update_sel(
File "/usr/local/lib/python3.10/dist-packages/deepmd/descriptor/descriptor.py", line 511, in update_sel
return cls.update_sel(global_jdata, local_jdata)
File "/usr/local/lib/python3.10/dist-packages/deepmd/descriptor/se.py", line 162, in update_sel
return update_one_sel(global_jdata, local_jdata_cpy, False)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 479, in update_one_sel
tmp_sel = get_sel(
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 440, in get_sel
_, max_nbor_size = get_nbor_stat(jdata, rcut, one_type=one_type)
File "/usr/local/lib/python3.10/dist-packages/deepmd/entrypoints/train.py", line 425, in get_nbor_stat
min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
File "/usr/local/lib/python3.10/dist-packages/deepmd/utils/neighbor_stat.py", line 220, in get_stat
raise RuntimeError(
RuntimeError: Some atoms are overlapping in /capstor/scratch/cscs/zyongbin/dpmd_test/all_data_set/data_set_cll_v1/set.000. Please check your training data to remove duplicated atoms.
My own validation found no overlapping atoms in the dataset. The dataset and the Python script I used to check for overlaps are provided below.
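For readers who want to reproduce this kind of check independently, here is a minimal sketch of an overlap test. It is not the script shipped in check.tgz. It assumes the usual DeePMD-kit layout, where each row of set.000/coord.npy holds natoms*3 Cartesian coordinates and each row of set.000/box.npy holds the 9 cell components with the cell vectors as rows. The function names and the 1e-3 tolerance are illustrative only. Wrapping in fractional coordinates is exact for detecting coincident atoms; for strongly skewed cells the reported minimum distance may be slightly overestimated.

```python
import math
from itertools import combinations


def _inverse_3x3(m):
    """Invert a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("cell matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def _vec_mat(v, m):
    """Row vector times 3x3 matrix."""
    return [sum(v[k] * m[k][j] for k in range(3)) for j in range(3)]


def min_pair_distance(coords, cell):
    """Smallest periodic distance between any two atoms in one frame.

    coords: list of [x, y, z]; cell: three cell vectors as rows (assumption).
    Brute force O(N^2), fine for a sanity check on a few hundred atoms.
    """
    inv = _inverse_3x3(cell)
    best = math.inf
    for p, q in combinations(coords, 2):
        delta = [q[k] - p[k] for k in range(3)]
        # Wrap the displacement into [-0.5, 0.5] in fractional coordinates.
        frac = [s - round(s) for s in _vec_mat(delta, inv)]
        cart = _vec_mat(frac, cell)
        best = min(best, math.sqrt(sum(x * x for x in cart)))
    return best


def overlapping_frames(coord_rows, box_rows, tol=1e-3):
    """Indices of frames whose minimum pair distance is below tol."""
    bad = []
    for idx, (flat, box) in enumerate(zip(coord_rows, box_rows)):
        coords = [list(flat[k:k + 3]) for k in range(0, len(flat), 3)]
        cell = [list(box[k:k + 3]) for k in range(0, 9, 3)]
        if min_pair_distance(coords, cell) < tol:
            bad.append(idx)
    return bad
```

To run it on a real set, load the arrays with numpy.load("set.000/coord.npy").tolist() and numpy.load("set.000/box.npy").tolist() and pass them to overlapping_frames. An empty result means no pair of atoms is closer than the tolerance in any frame.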
When I change only the DeePMD-kit version to v2.2.9 in the Dockerfile, the neighbor statistics step works correctly:
DeePMD-kit v2.2.9
WARNING:tensorflow:From /users/zyongbin/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD INFO training data with min nbor dist: 0.8696584261076291
DEEPMD INFO training data with max nbor size: [13 68 54 14]
DEEPMD INFO min_nbor_dist: 0.869658
DEEPMD INFO max_nbor_size: [13 68 54 14]
I believe this is not fixed in DeePMD-kit v2.2.11, because the only change to deepmd/utils/neighbor_stat.py between v2.2.10 and v2.2.11 is f-string formatting.
DeePMD-kit Version
2.2.10 and 2.2.9
Backend and its version
Not provided
How did you download the software?
Others (write below)
Input Files, Running Commands, Error Log, etc.
A homemade Docker image
Steps to Reproduce
- Download my dataset.
- Run dp neighbor-stat -s data_set_cll_v1 -t Bi H O V -r 6.0 with version 2.2.9.
- Run the same command with version 2.2.10.
Further Information, Files, and Links
check.tgz