DEEPMD INFO restart from model /expanse/lustre/scratch/njzjz/temp_project/dpgen_workdir/edcc7d7eee0b46c95272cecf3e92035e627bd422/000/model.ckpt
2023-11-12 23:20:53.685453: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at save_restore_v2_ops.cc:230 : NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
2023-11-12 23:20:53.685540: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11581671788850863407
2023-11-12 23:20:53.685560: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8934105269770306061
2023-11-12 23:20:53.685576: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10222796056855848299
2023-11-12 23:20:53.685590: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1811572328383791
2023-11-12 23:20:53.685605: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6657606174836959017
2023-11-12 23:20:53.685619: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10257389219655477663
2023-11-12 23:20:53.685633: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14209040036261972729
2023-11-12 23:20:53.685647: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4478105972376267715
2023-11-12 23:20:53.685660: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4390465125761313783
2023-11-12 23:20:53.685674: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16662903473336077489
2023-11-12 23:20:53.685689: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7518786342170914469
2023-11-12 23:20:53.685702: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 13605621365309410923
2023-11-12 23:20:53.685716: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6034899277312585957
2023-11-12 23:20:53.685730: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2307083772700013163
2023-11-12 23:20:53.685745: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15690849590373675979
2023-11-12 23:20:53.685774: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 3836809912172884891
2023-11-12 23:20:53.685788: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 9360860283521929193
2023-11-12 23:20:53.685813: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 3198785021169096273
2023-11-12 23:20:53.685828: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10977160041772414833
2023-11-12 23:20:53.685841: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2054480923378125325
2023-11-12 23:20:53.685855: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8925475438488664739
2023-11-12 23:20:53.685868: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 17639971752919395663
2023-11-12 23:20:53.685882: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 9131381259486199081
2023-11-12 23:20:53.685895: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 18340517148283714999
2023-11-12 23:20:53.685908: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8859999761892474387
2023-11-12 23:20:53.685927: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 17508783014083560588
2023-11-12 23:20:53.685940: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16617325314970674672
2023-11-12 23:20:53.685953: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2210524981010933620
2023-11-12 23:20:53.685965: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6071039091021874948
2023-11-12 23:20:53.685978: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14269539609294891488
2023-11-12 23:20:53.685991: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16164682523471055994
2023-11-12 23:20:53.686004: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15833920992627133682
2023-11-12 23:20:53.686016: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16550654589173640560
2023-11-12 23:20:53.686029: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10479345600891009476
2023-11-12 23:20:53.686042: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4660860100295588560
2023-11-12 23:20:53.686056: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11467969978973060000
2023-11-12 23:20:53.686069: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 12562560747120106584
2023-11-12 23:20:53.686083: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6250945325056408782
2023-11-12 23:20:53.686096: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1451226510702157216
2023-11-12 23:20:53.686121: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10145272671814975468
Traceback (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1402, in _do_call
return fn(*args)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1385, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1478, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_49]]
(1) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
[[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1418, in restore
sess.run(self.saver_def.restore_op_name,
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 972, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1215, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1395, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1421, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:
Detected at node 'save/RestoreV2' defined at (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
Detected at node 'save/RestoreV2' defined at (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
2 root error(s) found.
(0) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_49]]
(1) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
[[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'save/RestoreV2':
File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 934, in __init__
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 946, in build
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 974, in _build
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 543, in _build_internal
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 360, in _AddRestoreOps
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 611, in bulk_restore
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1521, in restore_v2
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 66, in get_tensor
return CheckpointReader.CheckpointReader_GetTensor(
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1429, in restore
names_to_keys = object_graph_key_mapping(save_path
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1751, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 71, in get_tensor
error_translator(e)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
sys.exit(main())
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
deepmd_main(args)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
train_dp(**dict_args)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
_do_work(jdata, run_opt, is_compress)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
model.train(train_data, valid_data)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
self._init_session()
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 513, in _init_session
self.saver.restore(self.sess, self.run_opt.restart)
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1434, in restore
raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Graph execution error:
Detected at node 'save/RestoreV2' defined at (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
Detected at node 'save/RestoreV2' defined at (most recent call last):
File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
2 root error(s) found.
(0) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_49]]
(1) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
[[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'save/RestoreV2':
File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 934, in __init__
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 946, in build
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 974, in _build
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 543, in _build_internal
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 360, in _AddRestoreOps
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 611, in bulk_restore
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1521, in restore_v2
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal
File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def
Bug summary
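Although #2253 added support for restarting training from a compressed model, the restart fails when the model contains a type embedding network, whose compression was only recently supported. Restoring the checkpoint raises `NotFoundError: Key type_embed_net/bias_1 not found in checkpoint`.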
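To check which variables the checkpoint actually contains, a minimal sketch along these lines may help. It assumes TensorFlow's `tf.train.list_variables` can read the checkpoint prefix shown in the log. The helper names `missing_keys` and `checkpoint_keys` are hypothetical and not part of DeePMD-kit.

```python
from typing import Iterable, List


def missing_keys(expected: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the expected variable names that are absent from `available`."""
    present = set(available)
    return [name for name in expected if name not in present]


def checkpoint_keys(ckpt_prefix: str) -> List[str]:
    """List the variable names stored in a TensorFlow checkpoint.

    TensorFlow is imported lazily so that `missing_keys` can be used without it.
    """
    import tensorflow as tf

    # tf.train.list_variables returns (name, shape) tuples.
    return [name for name, _shape in tf.train.list_variables(ckpt_prefix)]


# Hypothetical usage, with the checkpoint prefix taken from the log above:
#   keys = checkpoint_keys(".../000/model.ckpt")
#   print(missing_keys(["type_embed_net/bias_1"], keys))
```

If `type_embed_net/bias_1` is reported as missing, that would be consistent with the graph built at restart expecting a type embedding variable that the checkpoint does not contain.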
DeePMD-kit Version
v2.2.6-5-g05e0d277 05e0d27
TensorFlow Version
2.14.0
How did you download the software?
pip
Input Files, Running Commands, Error Log, etc.
Steps to Reproduce
Further Information, Files, and Links
No response