Enable mixed precision support for deepmd-kit by denghuilu · Pull Request #1285 · deepmodeling/deepmd-kit

denghuilu · 2021-11-15T18:51:49Z

This PR enables mixed-precision training and mixed-precision inference for deepmd-kit. Without changing the input script, one can enable mixed-precision training by setting the environment variable DP_ENABLE_MIXED_PREC to fp16.

Main changes:

Add the DP_ENABLE_MIXED_PREC environment variable to control mixed-precision training. Note that currently only tf.float16 precision is supported by the mixed-precision setting.
Set the default embedding-net and fitting-net precision in argcheck.py according to the environment variable DP_INTERFACE_PREC.
Use dynamic loss scaling for gradient updates.
Add documentation for mixed-precision support.

On our example water benchmark system, with TF-2.6.0, CUDA-11.0 and an NVIDIA V100 GPU, the speed of the dp training process decreased slightly, while the inference process with 12288 atoms gained a speedup by a factor of 3.

It is strongly recommended to use the mixed-precision settings with CUDA toolkit 11.0 or above.


njzjz · 2021-11-15T19:08:40Z

@wanghan-iapcm an import error is caught in the latest dpdata

codecov-commenter · 2021-11-15T19:34:41Z

Codecov Report

Merging #1285 (e7d357b) into devel (4af4ea5) will not change coverage.
The diff coverage is n/a.

❗ Current head e7d357b differs from pull request most recent head 6fa19c9. Consider uploading reports for the commit 6fa19c9 to get more accurate results

@@           Coverage Diff           @@
##            devel    #1285   +/-   ##
=======================================
  Coverage   64.28%   64.28%           
=======================================
  Files           5        5           
  Lines          14       14           
=======================================
  Hits            9        9           
  Misses          5        5


njzjz · 2021-11-15T22:09:29Z

            optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
+        if DP_ENABLE_MIXED_PRECISION:
+            # enable dynamic loss scale of the gradients
+            optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)


This function has been moved to tf.mixed_precision.enable_mixed_precision_graph_rewrite. https://www.tensorflow.org/api_docs/python/tf/compat/v1/mixed_precision/enable_mixed_precision_graph_rewrite What TF version do you use? Do you know if it is supported in all TF versions?

This function was found in NVIDIA's official documentation. I have tested it in TF-1.14.0 and TF-2.6.0 environments. Since it is deprecated, I will use the new tf.mixed_precision.enable_mixed_precision_graph_rewrite function.

The method has been available since v1.12 (tensorflow/tensorflow@02730dc) and was renamed in v2.4 (tensorflow/tensorflow@0112286). We may need to raise an error for TF<1.12.
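For readers unfamiliar with the dynamic loss scale mentioned in the diff comment, the general idea can be sketched in plain Python. This is a toy illustration, not TensorFlow's implementation: the class name `DynamicLossScale` and its parameters (`initial_scale`, `growth_interval`, `factor`) are hypothetical and chosen only for this example.

```python
class DynamicLossScale:
    """Toy dynamic loss scale: shrink on overflow, grow after a run of finite steps."""

    def __init__(self, initial_scale=2.0 ** 15, growth_interval=2000, factor=2.0):
        self.scale = initial_scale              # multiplier applied to the loss
        self.growth_interval = growth_interval  # finite steps needed before growing
        self.factor = factor                    # grow/shrink ratio
        self._good_steps = 0

    def update(self, grads_are_finite):
        """Return True if the optimizer step should be applied."""
        if not grads_are_finite:
            # Overflow in the scaled gradients: skip the step and lower the scale.
            self.scale = max(self.scale / self.factor, 1.0)
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run without overflow: try a larger scale.
            self.scale *= self.factor
            self._good_steps = 0
        return True


# Hypothetical run: two finite steps, one overflow, one finite step.
scaler = DynamicLossScale(initial_scale=1024.0, growth_interval=2)
history = [scaler.update(ok) for ok in (True, True, False, True)]
print(history, scaler.scale)
```

In a real training loop the loss is multiplied by the current scale before back-propagation and the gradients are divided by it before the update; the graph rewrite discussed above is expected to handle that bookkeeping inside TensorFlow.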

amcadmus · 2021-11-16T00:22:44Z

@wanghan-iapcm an import error is caught in the latest dpdata

pymatgen... could you please help fix it? thanks!

njzjz · 2021-11-16T01:47:43Z

pymatgen... could you please help fix it? thanks!

See deepmodeling/dpdata#217.

denghuilu · 2021-11-17T06:51:52Z

There are some problems in the mixed precision training on the descriptors of se_r and se_t types, which are under investigation.

denghuilu · 2021-11-21T13:40:07Z

There are some problems in the mixed precision training on the descriptors of se_r and se_t types, which are under investigation.

@amcadmus @njzjz There are still some errors when training in mixed precision with the se_r or se_t types, so I suggest that we merge the se_a type first.

njzjz · 2021-11-21T15:13:11Z

            optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
+        if self.mixed_prec is not None:
+            # check the TF_VERSION, when TF < 1.12, mixed precision is not allowed 
+            if TF_VERSION < "1.12":


>>> "1.8" < "1.12"
False
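The pitfall in the check above is that Python compares strings character by character. A minimal sketch of the difference, using a hypothetical helper `version_tuple` that is not part of the PR:

```python
def version_tuple(version):
    """Convert a plain 'X.Y.Z' release string to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# String comparison is lexicographic: "8" sorts after "1", so "1.8" is not < "1.12".
string_check = "1.8" < "1.12"
# Tuple comparison is numeric per component: (1, 8) < (1, 12).
tuple_check = version_tuple("1.8") < version_tuple("1.12")
print(string_check, tuple_check)
```

Note that `int()` raises `ValueError` on components that are not purely numeric, so this helper only suits plain release strings.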

njzjz · 2021-11-21T19:52:06Z

Can you also support hybrid?

denghuilu · 2021-11-21T23:46:44Z

Can you also support hybrid?

As we said, there are still some errors when using the se_r or se_t type descriptors. Hybrid is not yet ready for use.

njzjz · 2021-11-22T00:08:34Z

It would be useful to support a hybrid descriptor composed of two se_a descriptors.

wanghan-iapcm · 2021-11-22T00:58:23Z

-              trainable = True,
+              trainable = False,


Why introduce this change?

It is a leftover from debugging; I'll fix it.

wanghan-iapcm · 2021-11-22T01:00:10Z

                            trainable = trainable)
        variable_summaries(b, 'bias')
+
+        if mixed_prec is not None and outputs_size != 1:


I do not like this idea.
For dipole and polar, the size of output layer is not 1, but they are using fp16, which is not what we want.

wanghan-iapcm · 2021-11-22T01:00:38Z

+                    if mixed_prec is not None and outputs_size != 1:
+                       idt = tf.cast(idt, get_precision(mixed_prec['compute_prec']))


Again outputs_size != 1 may not be a good idea.

wanghan-iapcm · 2021-11-22T01:01:44Z

+        if self.mixed_prec is not None:
+            inputs = tf.cast(inputs, get_precision(self.mixed_prec['compute_prec']))


Do we need this line? The inputs are cast to compute_prec anyway in networks.one_layer or networks.embedding_net.

There is a matrix multiplication outside the embedding net, so we need to cast the inputs to match the dtype of the embedding net output.

Half precision slicing will be more efficient.

njzjz · 2021-11-22T03:43:23Z

+    def enable_mixed_precision(self, mixed_prec : dict = None) -> None:
+        """
+        Receive the mixed precision setting.
+
+        Parameters
+        ----------
+        mixed_prec
+                The mixed precision setting used in the embedding net
+
+        Notes
+        -----
+        This method is called by others when the descriptor supported compression.
+        """
+        raise NotImplementedError(
+            "Descriptor %s doesn't support mixed precision training!" % type(self).__name__)
+
+


lint errors appear here

njzjz · 2021-11-22T03:48:58Z

        else:
            optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
+        if self.mixed_prec is not None:
+            TF_VERSION_LIST = [int(item) for item in TF_VERSION.split('.')]


int(item) will cause an error if the version is a pre-release, e.g. v2.6.0-rc1. See https://github.com/tensorflow/tensorflow/blob/ff68385595088304cf772086b9a259a65b007622/tensorflow/core/public/version.h#L35-L37

I suggest using the third-party Specifiers class.
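A sketch of a version check that tolerates pre-release tags, assuming the suggestion refers to the `packaging` library (`packaging.specifiers.SpecifierSet` and `packaging.version.Version`). The helper name `supports_mixed_precision` and the constant `MIN_TF` are hypothetical, not taken from the PR.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Assumed lower bound, taken from the "TF < 1.12" discussion in this thread.
MIN_TF = SpecifierSet(">=1.12")


def supports_mixed_precision(tf_version):
    """Return True when tf_version satisfies the minimum-version specifier."""
    # prereleases=True so that release candidates are not rejected outright;
    # by default a SpecifierSet excludes pre-release versions.
    return MIN_TF.contains(Version(tf_version), prereleases=True)


for v in ("1.8", "1.12.0", "2.6.0-rc1"):
    print(v, supports_mixed_precision(v))
```

`Version` parses a string such as `2.6.0-rc1` as a release candidate of 2.6.0, so no manual `int()` conversion of the components is needed.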

njzjz · 2021-11-22T03:53:13Z

+        Argument("output_prec", str, optional=True, default="float32", doc=doc_output_prec),
+        Argument("compute_prec", str, optional=False, default="float16", doc=doc_compute_prec),


The default behavior is to enable mixed precision?

The mixed_precision section is optional within the training section (see line 617), so mixed precision is disabled by default. However, once the mixed_precision section is set, one must provide the compute_prec key.
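The behavior described in this reply (an optional section with one mandatory key) can be sketched as follows. The key names `mixed_precision`, `output_prec` and `compute_prec` come from the thread; the helper `check_mixed_precision` is hypothetical and is not how deepmd-kit's argcheck.py validates input.

```python
import json


def check_mixed_precision(training):
    """Return the mixed-precision settings, or None when the section is absent."""
    section = training.get("mixed_precision")
    if section is None:
        return None  # optional section: mixed precision stays disabled
    if "compute_prec" not in section:
        raise KeyError("mixed_precision requires the 'compute_prec' key")
    # output_prec is optional; float32 mirrors the default in the Argument above.
    return {
        "output_prec": section.get("output_prec", "float32"),
        "compute_prec": section["compute_prec"],
    }


# Hypothetical "training" section of an input script.
training = {"mixed_precision": {"compute_prec": "float16"}}
print(json.dumps(check_mixed_precision(training)))
print(check_mixed_precision({}))
```

An empty `mixed_precision` section raises `KeyError` in this sketch, matching the statement that compute_prec must be given once the section is present.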

denghuilu added 5 commits November 16, 2021 01:34

enable mixed precision support for dp

05086f5

set the default embedding net & fitting net precision

e1cc674

add doc for mixed precision

1589b12

fix typo

fb48b01

Merge branch 'mixed-precision' of https://github.com/denghuilu/deepmd…

78f2914

…-kit into mixed-precision

denghuilu requested review from njzjz and wanghan-iapcm November 15, 2021 18:53

fix UT bug

4aae04b

njzjz reviewed Nov 15, 2021


denghuilu added 2 commits November 21, 2021 21:12

use input script to control the mixed precision workflow

5b633a8

add tf version check for mixed precision

b47c56d

denghuilu requested a review from njzjz November 21, 2021 13:40

denghuilu added 2 commits November 21, 2021 21:54

Update training-advanced.md

af3fcfb

fix typo

646233e

njzjz reviewed Nov 21, 2021


denghuilu added 2 commits November 21, 2021 23:22

fix TF_VERSION control

e945ed0

fix TF_VERSION comparison

972a5b1

njzjz approved these changes Nov 21, 2021


wanghan-iapcm reviewed Nov 22, 2021


denghuilu added 2 commits November 22, 2021 09:02

enable mixed precision for hybrid descriptor

fcdfb31

Update network.py

b868ea3

use parameter to control the network mixed precision output precision

6d517ed

njzjz reviewed Nov 22, 2021


denghuilu added 2 commits November 22, 2021 13:52

add example for mixed precision training workflow

e447ab5

fix lint errors

6fa19c9

wanghan-iapcm approved these changes Nov 23, 2021


wanghan-iapcm merged commit f40e14e into deepmodeling:devel Nov 23, 2021

