Problem when using CUDA to accelerate dp #650

Description

@343333333

When I use the CPU version, it works well. But a problem occurs when I switch to the GPU version after updating the dependencies.
It seems the CUDA is too old to start, but I wrote "module load cuda/10.2" in the submission script, and the server does have CUDA 10.2 installed.
The log file:

# DEEPMD: installed to:         /tmp/pip-req-build-8l1_0ns9/_skbuild/linux-x86_64-3.7/cmake-install
# DEEPMD: source :              v1.3.3
# DEEPMD: source brach:         HEAD
# DEEPMD: source commit:        3a59596
# DEEPMD: source commit at:     2021-03-20 00:53:44 +0800
# DEEPMD: build float prec:     double
# DEEPMD: build with tf inc:    /work/Software/miniconda3/lib/python3.7/site-packages/tensorflow/include;/work/Software/miniconda3/lib/python3.7/site-packages/tensorflow/include
# DEEPMD: build with tf lib:    
# DEEPMD: running on:           gpu03
# DEEPMD: CUDA_VISIBLE_DEVICES: unset
# DEEPMD: num_intra_threads:    0
# DEEPMD: num_inter_threads:    0
# DEEPMD: -----------------------------------------------------------------
# DEEPMD: 
2021-05-21 10:25:15.310984: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-21 10:25:15.343416: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-21 10:25:16.227336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:2f:00.0 name: Tesla V100-PCIE-16GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-05-21 10:25:16.228063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:86:00.0 name: Tesla V100-PCIE-16GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-05-21 10:25:16.228101: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-05-21 10:25:16.407775: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-05-21 10:25:16.407922: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-05-21 10:25:16.518689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-21 10:25:16.718146: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-21 10:25:16.877558: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-21 10:25:17.027248: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-05-21 10:25:17.272541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-05-21 10:25:17.275906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2021-05-21 10:25:17.275974: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-05-21 10:25:17.276270: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
2021-05-21 10:25:17.276289: E tensorflow/c/c_api.cc:2184] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/work/chem-wangyg/Software/miniconda3/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/main.py", line 73, in main
    train(args)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/train.py", line 87, in train
    _do_work(jdata, run_opt)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/train.py", line 91, in _do_work
    model = NNPTrainer (jdata, run_opt = run_opt)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/Trainer.py", line 49, in __init__
    self._init_param(jdata)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/Trainer.py", line 62, in _init_param
    self.descrpt = DescrptSeA(descrpt_param)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/deepmd/DescrptSeA.py", line 87, in __init__
    self.sub_sess = tf.Session(graph = sub_graph, config=default_tf_session_config)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1596, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/work/chem-wangyg/Software/miniconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
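
One reading of the log above: TensorFlow opened libcudart.so.10.1 (not 10.2), and the final error says the NVIDIA driver on gpu03 is older than that runtime requires. Loading a CUDA module changes which toolkit libraries are found, not the kernel driver. A minimal sketch for checking this on the compute node follows. The helper names `cudart_version_from_log` and `version_ge` are hypothetical, `sort -V` assumes GNU coreutils, and the minimum driver version for a given toolkit has to be looked up in NVIDIA's CUDA release notes.

```shell
#!/bin/bash
# Sketch: compare the CUDA runtime that TensorFlow loaded with the node's driver.
# Helper names are hypothetical; they are not part of DeePMD-kit or TensorFlow.

# Read a TensorFlow log on stdin and print the first libcudart version found,
# e.g. "10.1" for "... Successfully opened dynamic library libcudart.so.10.1".
cudart_version_from_log() {
  grep -o 'libcudart\.so\.[0-9][0-9.]*' | head -n 1 | sed 's/^libcudart\.so\.//'
}

# Return 0 if version $1 >= version $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Query the driver only when nvidia-smi is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
  driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  echo "driver version: $driver"
fi
```

For example, `cudart_version_from_log < train.log` should print the runtime version TensorFlow picked up, and `version_ge "$driver" "$min_driver"` (with `min_driver` taken from the release notes) tells whether the driver is new enough. If it is not, the options are a driver update by the administrators or a TensorFlow build that targets an older CUDA runtime.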

This is my submission script:

#!/bin/bash
#BSUB -J dpmd
#BSUB -q gpu
#BSUB -n 12
#BSUB -e %J.err
#BSUB -o %J.out
#BSUB -R "span[ptile=24]"

module load cuda/10.2

cd 000
test $? -ne 0 && exit 1

if [ ! -f tag_0_finished ] ;then
  { if [ ! -f model.ckpt.index ]; then ~/Software/miniconda3/bin/dp train input.json; else ~/Software/miniconda3/bin/dp train input.json --restart model.ckpt; fi }  1>> train.log 2>> train.log 
  if test $? -ne 0; then exit 1; else touch tag_0_finished; fi 
fi

cd /work/dpgen/test/temp/160837f9-78bd-426e-8d22-3af727ea0ca4
test $? -ne 0 && exit 1

wait

cd 000
test $? -ne 0 && exit 1

if [ ! -f tag_1_finished ] ;then
  ~/Software/miniconda3/bin/dp freeze  1>> train.log 2>> train.log 
  if test $? -ne 0; then exit 1; else touch tag_1_finished; fi 
fi

cd /work/dpgen/test/temp/160837f9-78bd-426e-8d22-3af727ea0ca4
test $? -ne 0 && exit 1

wait


touch 160837f9-78bd-426e-8d22-3af727ea0ca4_tag_finished
