[BUG] Restarting from compressed training with type embedding throws errors #2989

Description

@njzjz

Bug summary

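Although #2253 added support for restarting from compressed training, the restart fails when the model contains a type embedding network, whose compression was only recently supported. The restore step looks for the variable type_embed_net/bias_1, which is not present in the checkpoint.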

DeePMD-kit Version

v2.2.6-5-g05e0d277 05e0d27

TensorFlow Version

2.14.0

How did you download the software?

pip

Input Files, Running Commands, Error Log, etc.

DEEPMD INFO    restart from model /expanse/lustre/scratch/njzjz/temp_project/dpgen_workdir/edcc7d7eee0b46c95272cecf3e92035e627bd422/000/model.ckpt
2023-11-12 23:20:53.685453: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at save_restore_v2_ops.cc:230 : NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
2023-11-12 23:20:53.685540: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11581671788850863407
2023-11-12 23:20:53.685560: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8934105269770306061
2023-11-12 23:20:53.685576: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10222796056855848299
2023-11-12 23:20:53.685590: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1811572328383791
2023-11-12 23:20:53.685605: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6657606174836959017
2023-11-12 23:20:53.685619: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10257389219655477663
2023-11-12 23:20:53.685633: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14209040036261972729
2023-11-12 23:20:53.685647: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4478105972376267715
2023-11-12 23:20:53.685660: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4390465125761313783
2023-11-12 23:20:53.685674: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16662903473336077489
2023-11-12 23:20:53.685689: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7518786342170914469
2023-11-12 23:20:53.685702: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 13605621365309410923
2023-11-12 23:20:53.685716: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6034899277312585957
2023-11-12 23:20:53.685730: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2307083772700013163
2023-11-12 23:20:53.685745: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15690849590373675979
2023-11-12 23:20:53.685774: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 3836809912172884891
2023-11-12 23:20:53.685788: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 9360860283521929193
2023-11-12 23:20:53.685813: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 3198785021169096273
2023-11-12 23:20:53.685828: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10977160041772414833
2023-11-12 23:20:53.685841: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2054480923378125325
2023-11-12 23:20:53.685855: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8925475438488664739
2023-11-12 23:20:53.685868: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 17639971752919395663
2023-11-12 23:20:53.685882: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 9131381259486199081
2023-11-12 23:20:53.685895: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 18340517148283714999
2023-11-12 23:20:53.685908: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8859999761892474387
2023-11-12 23:20:53.685927: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 17508783014083560588
2023-11-12 23:20:53.685940: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16617325314970674672
2023-11-12 23:20:53.685953: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2210524981010933620
2023-11-12 23:20:53.685965: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6071039091021874948
2023-11-12 23:20:53.685978: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14269539609294891488
2023-11-12 23:20:53.685991: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16164682523471055994
2023-11-12 23:20:53.686004: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15833920992627133682
2023-11-12 23:20:53.686016: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16550654589173640560
2023-11-12 23:20:53.686029: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10479345600891009476
2023-11-12 23:20:53.686042: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4660860100295588560
2023-11-12 23:20:53.686056: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11467969978973060000
2023-11-12 23:20:53.686069: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 12562560747120106584
2023-11-12 23:20:53.686083: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6250945325056408782
2023-11-12 23:20:53.686096: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1451226510702157216
2023-11-12 23:20:53.686121: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10145272671814975468
Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1402, in _do_call
    return fn(*args)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1385, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1478, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
         [[{{node save/RestoreV2}}]]
         [[save/RestoreV2/_49]]
  (1) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
         [[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1418, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 972, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1215, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1395, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1421, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:

Detected at node 'save/RestoreV2' defined at (most recent call last):
    File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
Detected at node 'save/RestoreV2' defined at (most recent call last):
    File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
2 root error(s) found.
  (0) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
         [[{{node save/RestoreV2}}]]
         [[save/RestoreV2/_49]]
  (1) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
         [[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 934, in __init__
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 946, in build
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 974, in _build
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 543, in _build_internal
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 360, in _AddRestoreOps
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 611, in bulk_restore
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1521, in restore_v2
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 66, in get_tensor
    return CheckpointReader.CheckpointReader_GetTensor(
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1429, in restore
    names_to_keys = object_graph_key_mapping(save_path
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1751, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 71, in get_tensor
    error_translator(e)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
    deepmd_main(args)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
    train_dp(**dict_args)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
    model.train(train_data, valid_data)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
    self._init_session()
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 513, in _init_session
    self.saver.restore(self.sess, self.run_opt.restart)
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 1434, in restore
    raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Graph execution error:

Detected at node 'save/RestoreV2' defined at (most recent call last):
    File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
Detected at node 'save/RestoreV2' defined at (most recent call last):
    File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
    File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
Node: 'save/RestoreV2'
2 root error(s) found.
  (0) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
         [[{{node save/RestoreV2}}]]
         [[save/RestoreV2/_49]]
  (1) NOT_FOUND: Key type_embed_net/bias_1 not found in checkpoint
         [[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "/home/njzjz/anaconda3/envs/pip/bin/dp", line 8, in <module>
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd_cli/main.py", line 635, in main
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 74, in main
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 168, in train
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 285, in _do_work
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 544, in train
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/train/trainer.py", line 496, in _init_session
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 934, in __init__
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 946, in build
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 974, in _build
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 543, in _build_internal
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 360, in _AddRestoreOps
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/training/saver.py", line 611, in bulk_restore
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1521, in restore_v2
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal
  File "/home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def

Steps to Reproduce

dp train input.json
dp train input.json -r model.ckpt
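
One way to confirm the mismatch before restarting is to list the variables actually stored in the checkpoint and compare them with the key named in the error. The sketch below is only a diagnostic aid, not part of DeePMD-kit: the helper names missing_keys and list_checkpoint_variables are hypothetical, and it assumes the checkpoint prefix is model.ckpt as in the commands above. tf.train.list_variables is the TensorFlow call that returns (name, shape) pairs for a checkpoint.

```python
from typing import Iterable, List, Sequence, Tuple


def missing_keys(
    expected: Iterable[str],
    checkpoint_vars: Iterable[Tuple[str, Sequence[int]]],
) -> List[str]:
    """Return the expected variable names that are absent from the checkpoint.

    checkpoint_vars is an iterable of (name, shape) pairs, the format
    returned by tf.train.list_variables.
    """
    present = {name for name, _shape in checkpoint_vars}
    return sorted(key for key in expected if key not in present)


def list_checkpoint_variables(ckpt_prefix: str):
    """Read (name, shape) pairs from a TensorFlow checkpoint.

    TensorFlow is imported lazily so that missing_keys can be used
    without it, for example on a saved listing.
    """
    import tensorflow as tf

    return tf.train.list_variables(ckpt_prefix)
```

With TensorFlow installed, calling missing_keys(["type_embed_net/bias_1"], list_checkpoint_variables("model.ckpt")) would return a non-empty list if the checkpoint lacks the key reported in the traceback.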

Further Information, Files, and Links

No response
