diff --git a/doc/model/dplr.md b/doc/model/dplr.md
index 1fb34fca46..14db3facfc 100644
--- a/doc/model/dplr.md
+++ b/doc/model/dplr.md
@@ -22,7 +22,7 @@ Two settings make the training input script different from an energy training in
 	    "seed":		1
 	},
 ```
-The type of fitting is set to `"dipole"`. The dipole is associate to type 0 atoms (oxygens), by the setting `"dipole_type": [0]`. What we trained is the displacement of the WC from the corresponding oxygen atom. It shares the same training input as atomic dipole because both are 3-dimensional vectors defined on atoms. 
+The type of fitting is set to {ref}`dipole <model/fitting_net[dipole]>`. The dipole is associated with type 0 atoms (oxygens) by the setting `"dipole_type": [0]`. What we train is the displacement of the WC from the corresponding oxygen atom. It shares the same training input as the atomic dipole because both are 3-dimensional vectors defined on atoms.
 The loss section is provided as follows
 ```json
     "loss": {
@@ -51,7 +51,7 @@ The training of the DPLR model is very similar to the standard short-range DP mo
             "ewald_beta":       0.40
         },
 ```
-The `"model_name"` specifies which DW model is used to predict the position of WCs. `"model_charge_map"` gives the amount of charge assigned to WCs. `"sys_charge_map"` provides the nuclear charge of oxygen (type 0) and hydrogen (type 1) atoms. `"ewald_beta"` (unit $\text{Å}^{-1}$) gives the spread parameter controls the spread of Gaussian charges, and `"ewald_h"`  (unit Å) assigns the grid size of Fourier transform. 
+The {ref}`model_name <model/modifier[dipole_charge]/model_name>` specifies which DW model is used to predict the position of WCs. {ref}`model_charge_map <model/modifier[dipole_charge]/model_charge_map>` gives the amount of charge assigned to WCs. {ref}`sys_charge_map <model/modifier[dipole_charge]/sys_charge_map>` provides the nuclear charge of oxygen (type 0) and hydrogen (type 1) atoms. {ref}`ewald_beta <model/modifier[dipole_charge]/ewald_beta>` (unit $\text{Å}^{-1}$) gives the parameter that controls the spread of Gaussian charges, and {ref}`ewald_h <model/modifier[dipole_charge]/ewald_h>` (unit Å) assigns the grid size of the Fourier transform.
 The DPLR model can be trained and frozen by (from the example directory)
 ```
 dp train ener.json && dp freeze -o ener.pb
diff --git a/doc/model/dprc.md b/doc/model/dprc.md
index 47f1e63fe8..2e4c2220e8 100644
--- a/doc/model/dprc.md
+++ b/doc/model/dprc.md
@@ -54,7 +54,7 @@ As described in the paper, the DPRc model only corrects $E_\text{QM}$ and $E_\te
 }
 ```
 
-`exclude_types` can be generated by the following Python script:
+{ref}`exclude_types <model/descriptor[se_e2_a]/exclude_types>` can be generated by the following Python script:
 ```py
 from itertools import combinations_with_replacement, product
 qm = (0, 1, 3, 5)
@@ -63,7 +63,7 @@ print("QM/QM:", list(map(list, list(combinations_with_replacement(mm, 2)) + list
 print("QM/MM:", list(map(list, list(combinations_with_replacement(qm, 2)) + list(combinations_with_replacement(mm, 2)))))
 ```
 
-Also, DPRc assumes MM atom energies (`atom_ener`) are zero:
+Also, DPRc assumes MM atom energies ({ref}`atom_ener <model/fitting_net[ener]/atom_ener>`) are zero:
 
 ```json
 "fitting_net": {
@@ -73,7 +73,7 @@ Also, DPRc assumes MM atom energies (`atom_ener`) are zero:
 }
 ```
 
-Note that `atom_ener` only works when `descriptor/set_davg_zero` is `true`.
+Note that {ref}`atom_ener <model/fitting_net[ener]/atom_ener>` only works when {ref}`descriptor/set_davg_zero <model/descriptor[se_e2_a]/set_davg_zero>` is `true`.
 
 ## Run MD simulations
 
diff --git a/doc/model/overall.md b/doc/model/overall.md
index bd1986f2ea..fd043a162b 100644
--- a/doc/model/overall.md
+++ b/doc/model/overall.md
@@ -1,6 +1,6 @@
 # Overall
 
-A model has two parts, a descriptor that maps atomic configuration to a set of symmetry invariant features, and a fitting net that takes descriptor as input and predicts the atomic contribution to the target physical property. It's defined in the `model` section of the `input.json`, for example,
+A model has two parts, a descriptor that maps atomic configuration to a set of symmetry invariant features, and a fitting net that takes descriptor as input and predicts the atomic contribution to the target physical property. It's defined in the {ref}`model <model>` section of the `input.json`, for example,
 ```json
     "model": {
         "type_map":	["O", "H"],
@@ -12,9 +12,9 @@ A model has two parts, a descriptor that maps atomic configuration to a set of s
         }
     }
 ```
-The two subsections, `descriptor` and `fitting_net`, define the descriptor and the fitting net, respectively.
+The two subsections, {ref}`descriptor <model/descriptor>` and {ref}`fitting_net <model/fitting_net>`, define the descriptor and the fitting net, respectively.
 
-The `type_map` is optional, which provides the element names (but not necessarily same with the actual name of the element) of the corresponding atom types. A model for water, as in this example, has two kinds of atoms. The atom types are internally recorded as integers, e.g., `0` for oxygen and `1` for hydrogen here. A mapping from the atom type to their names is provided by `type_map`. 
+The {ref}`type_map <model/type_map>` is optional, which provides the element names (but not necessarily same with the actual name of the element) of the corresponding atom types. A model for water, as in this example, has two kinds of atoms. The atom types are internally recorded as integers, e.g., `0` for oxygen and `1` for hydrogen here. A mapping from the atom type to their names is provided by {ref}`type_map <model/type_map>`. 
 
 DeePMD-kit implements the following descriptors:
 1. [`se_e2_a`](train-se-e2-a.md): DeepPot-SE constructed from all information (both angular and radial) of atomic configurations. The embedding takes the distance between atoms as input.
diff --git a/doc/model/train-energy.md b/doc/model/train-energy.md
index fb69c9d9aa..cbe6ad1801 100644
--- a/doc/model/train-energy.md
+++ b/doc/model/train-energy.md
@@ -4,7 +4,7 @@ In this section, we will take `$deepmd_source_dir/examples/water/se_e2_a/input.j
 
 ## The fitting network
 
-The construction of the fitting net is give by section `fitting_net`
+The construction of the fitting net is given by the section {ref}`fitting_net <model/fitting_net>`
 ```json
 	"fitting_net" : {
 	    "neuron":		[240, 240, 240],
@@ -12,9 +12,9 @@ The construction of the fitting net is give by section `fitting_net`
 	    "seed":		1
 	},
 ```
-* `neuron` specifies the size of the fitting net. If two neighboring layers are of the same size, then a [ResNet architecture](https://arxiv.org/abs/1512.03385) is built between them. 
-* If the option `resnet_dt` is set `true`, then a timestep is used in the ResNet. 
-* `seed` gives the random seed that is used to generate random numbers when initializing the model parameters.
+* {ref}`neuron <model/fitting_net[ener]/neuron>` specifies the size of the fitting net. If two neighboring layers are of the same size, then a [ResNet architecture](https://arxiv.org/abs/1512.03385) is built between them. 
+* If the option {ref}`resnet_dt <model/fitting_net[ener]/resnet_dt>` is set to `true`, then a timestep is used in the ResNet. 
+* {ref}`seed <model/fitting_net[ener]/seed>` gives the random seed that is used to generate random numbers when initializing the model parameters.
 
 ## Loss
 
@@ -26,12 +26,12 @@ where $L_e$, $L_f$, and $L_v$ denote the loss in energy, force and virial, respe
 
 $$p_f(t) = p_f^0 \frac{ \alpha(t) }{ \alpha(0) } + p_f^\infty ( 1 - \frac{ \alpha(t) }{ \alpha(0) })$$
 
-where $\alpha(t)$ denotes the learning rate at step $t$. $p_f^0$ and $p_f^\infty$ specifies the $p_f$ at the start of the training and at the limit of $t \to \infty$ (set by `start_pref_f` and `limit_pref_f`, respectively), i.e.
+where $\alpha(t)$ denotes the learning rate at step $t$. $p_f^0$ and $p_f^\infty$ specify $p_f$ at the start of the training and in the limit $t \to \infty$ (set by {ref}`start_pref_f <loss[ener]/start_pref_f>` and {ref}`limit_pref_f <loss[ener]/limit_pref_f>`, respectively), i.e.
 ```math
 pref_f(t) = start_pref_f * ( lr(t) / start_lr ) + limit_pref_f * ( 1 - lr(t) / start_lr )
 ```
 
-The `loss` section in the `input.json` is 
+The {ref}`loss <loss>` section in the `input.json` is 
 ```json
     "loss" : {
 	"start_pref_e":	0.02,
@@ -42,6 +42,6 @@ The `loss` section in the `input.json` is
 	"limit_pref_v":	0
     }
 ```
-The options `start_pref_e`, `limit_pref_e`, `start_pref_f`, `limit_pref_f`, `start_pref_v` and `limit_pref_v` determine the start and limit prefactors of energy, force and virial, respectively.
+The options {ref}`start_pref_e <loss[ener]/start_pref_e>`, {ref}`limit_pref_e <loss[ener]/limit_pref_e>`, {ref}`start_pref_f <loss[ener]/start_pref_f>`, {ref}`limit_pref_f <loss[ener]/limit_pref_f>`, {ref}`start_pref_v <loss[ener]/start_pref_v>` and {ref}`limit_pref_v <loss[ener]/limit_pref_v>` determine the start and limit prefactors of energy, force and virial, respectively.
 
-If one does not want to train with virial, then he/she may set the virial prefactors `start_pref_v` and `limit_pref_v` to 0.
+If one does not want to train with virial, one may set the virial prefactors {ref}`start_pref_v <loss[ener]/start_pref_v>` and {ref}`limit_pref_v <loss[ener]/limit_pref_v>` to 0.
diff --git a/doc/model/train-fitting-tensor.md b/doc/model/train-fitting-tensor.md
index fde557a073..6d48c34c86 100644
--- a/doc/model/train-fitting-tensor.md
+++ b/doc/model/train-fitting-tensor.md
@@ -9,11 +9,11 @@ $deepmd_source_dir/examples/water_tensor/polar/polar_input.json
 
 The training and validation data are also provided our examples. But note that **the data provided along with the examples are of limited amount, and should not be used to train a production model.**
 
-Similar to the `input.json` used in `ener` mode, training json is also divided into `model`, `learning_rate`, `loss` and `training`. Most keywords remains the same as `ener` mode, and their meaning can be found [here](train-se-e2-a.md). To fit a tensor, one need to modify `model.fitting_net` and `loss`.
+Similar to the `input.json` used in `ener` mode, the training json is also divided into {ref}`model <model>`, {ref}`learning_rate <learning_rate>`, {ref}`loss <loss>` and {ref}`training <training>`. Most keywords remain the same as in `ener` mode, and their meaning can be found [here](train-se-e2-a.md). To fit a tensor, one needs to modify {ref}`model/fitting_net <model/fitting_net>` and {ref}`loss <loss>`.
 
 ## The fitting Network
 
-The `fitting_net` section tells DP which fitting net to use.
+The {ref}`fitting_net <model/fitting_net>` section tells DP which fitting net to use.
 
 The json of `dipole` type should be provided like
 
@@ -47,7 +47,7 @@ The json of `polar` type should be provided like
 
 DP supports a combinational training of global system (only a global `tensor` label, i.e. dipole or polar, is provided in a frame) and atomic system (labels for **each** atom included in `sel_type` are provided). In a global system, each frame has just **one** `tensor` label. For example, when fitting `polar`, each frame will just provide a `1 x 9` vector which gives the elements of the polarizability tensor of that frame in order XX, XY, XZ, YX, YY, YZ, XZ, ZY, ZZ. By contrast, in a atomic system, each atom in `sel_type` has a `tensor` label. For example, when fitting dipole, each frame will provide a `#sel_atom x 3` matrix, where `#sel_atom` is the number of atoms whose type are in `sel_type`.
 
-The `loss` section tells DP the weight of this two kind of loss, i.e.
+The {ref}`loss <loss>` section tells DP the weights of these two kinds of loss, i.e.
 
 ```python
 loss = pref * global_loss + pref_atomic * atomic_loss
@@ -63,8 +63,8 @@ The loss section should be provided like
 	},
 ```
 
--   `type` should be written as `tensor` as a distinction from `ener` mode.
--   `pref` and `pref_atomic` respectively specify the weight of global loss and atomic loss. It can not be left unset. If set to 0, system with corresponding label will NOT be included in the training process.
+-   {ref}`type <loss/type>` should be written as `tensor` as a distinction from `ener` mode.
+-   {ref}`pref <loss[tensor]/pref>` and {ref}`pref_atomic <loss[tensor]/pref_atomic>` respectively specify the weight of global loss and atomic loss. Neither can be left unset. If one of them is set to 0, systems with the corresponding label will NOT be included in the training process.
 
 ## Training Data Preparation
 
diff --git a/doc/model/train-hybrid.md b/doc/model/train-hybrid.md
index 7383d5c08b..b69b49ea21 100644
--- a/doc/model/train-hybrid.md
+++ b/doc/model/train-hybrid.md
@@ -2,7 +2,7 @@
 
 This descriptor hybridize multiple descriptors to form a new descriptor. For example we have a list of descriptor denoted by $\mathcal D_1$, $\mathcal D_2$, ..., $\mathcal D_N$, the hybrid descriptor this the concatenation of the list, i.e. $\mathcal D = (\mathcal D_1, \mathcal D_2, \cdots, \mathcal D_N)$.
 
-To use the descriptor in DeePMD-kit, one firstly set the `type` to `"hybrid"`, then provide the definitions of the descriptors by the items in the `list`,
+To use the descriptor in DeePMD-kit, one first sets the {ref}`type <model/descriptor/type>` to {ref}`hybrid <model/descriptor[hybrid]>`, then provides the definitions of the descriptors by the items in the `list`,
 ```json
         "descriptor" :{
             "type": "hybrid",
diff --git a/doc/model/train-se-e2-a-tebd.md b/doc/model/train-se-e2-a-tebd.md
index 2132bd4df2..c80127939d 100644
--- a/doc/model/train-se-e2-a-tebd.md
+++ b/doc/model/train-se-e2-a-tebd.md
@@ -2,10 +2,10 @@
  
 We generate specific type embedding vector for each atom type, so that we can share one descriptor embedding net and one fitting net in total, which decline training complexity largely. 
 
-The training input script is similar to that of [`se_e2_a`](train-se-e2-a.md), but different by adding the `type_embedding` section. 
+The training input script is similar to that of [`se_e2_a`](train-se-e2-a.md), but differs by adding the {ref}`type_embedding <model/type_embedding>` section.
 
 ## Type embedding net
-The `model` defines how the model is constructed, adding a section of type embedding net:
+The {ref}`model <model>` defines how the model is constructed, adding a section of type embedding net:
 ```json
     "model": {
 	"type_map":	["O", "H"],
@@ -22,7 +22,7 @@ The `model` defines how the model is constructed, adding a section of type embed
 ```
 Model will automatically apply type embedding approach and generate type embedding vectors. If type embedding vector is detected, descriptor and fitting net would take it as a part of input.
 
-The construction of type embedding net is given by `type_embedding`. An example of `type_embedding` is provided as follows
+The construction of type embedding net is given by {ref}`type_embedding <model/type_embedding>`. An example of {ref}`type_embedding <model/type_embedding>` is provided as follows
 ```json
 	"type_embedding":{
 	    "neuron":		[2, 4, 8],
@@ -30,9 +30,9 @@ The construction of type embedding net is given by `type_embedding`. An example
 	    "seed":		1
 	}
 ```
-* The `neuron` specifies the size of the type embedding net. From left to right the members denote the sizes of each hidden layer from input end to the output end, respectively. It takes one-hot vector as input and output dimension equals to the last dimension of the `neuron` list. If the outer layer is of twice size as the inner layer, then the inner layer is copied and concatenated, then a [ResNet architecture](https://arxiv.org/abs/1512.03385) is built between them.
-* If the option `resnet_dt` is set `true`, then a timestep is used in the ResNet.
-* `seed` gives the random seed that is used to generate random numbers when initializing the model parameters.
+* The {ref}`neuron <model/type_embedding/neuron>` specifies the size of the type embedding net. From left to right the members denote the sizes of each hidden layer from the input end to the output end, respectively. It takes a one-hot vector as input, and the output dimension equals the last element of the {ref}`neuron <model/type_embedding/neuron>` list. If the outer layer is twice the size of the inner layer, then the inner layer is copied and concatenated, and a [ResNet architecture](https://arxiv.org/abs/1512.03385) is built between them.
+* If the option {ref}`resnet_dt <model/type_embedding/resnet_dt>` is set to `true`, then a timestep is used in the ResNet.
+* {ref}`seed <model/type_embedding/seed>` gives the random seed that is used to generate random numbers when initializing the model parameters.
 
 
 A complete training input script of this example can be find in the directory. 
diff --git a/doc/model/train-se-e2-a.md b/doc/model/train-se-e2-a.md
index c5c3644f15..2a28bf9658 100644
--- a/doc/model/train-se-e2-a.md
+++ b/doc/model/train-se-e2-a.md
@@ -8,7 +8,7 @@ $deepmd_source_dir/examples/water/se_e2_a/input.json
 ```
 With the training input script, data are also provided in the example directory. One may train the model with the DeePMD-kit from the directory.
 
-The construction of the descriptor is given by section `descriptor`. An example of the descriptor is provided as follows
+The construction of the descriptor is given by section {ref}`descriptor <model/descriptor>`. An example of the descriptor is provided as follows
 ```json
 	"descriptor" :{
 	    "type":		"se_e2_a",
@@ -22,12 +22,12 @@ The construction of the descriptor is given by section `descriptor`. An example
 	    "seed":		1
 	}
 ```
-* The `type` of the descriptor is set to `"se_e2_a"`. 
-* `rcut` is the cut-off radius for neighbor searching, and the `rcut_smth` gives where the smoothing starts. 
-* `sel` gives the maximum possible number of neighbors in the cut-off radius. It is a list, the length of which is the same as the number of atom types in the system, and `sel[i]` denote the maximum possible number of neighbors with type `i`. 
-* The `neuron` specifies the size of the embedding net. From left to right the members denote the sizes of each hidden layer from input end to the output end, respectively. If the outer layer is of twice size as the inner layer, then the inner layer is copied and concatenated, then a [ResNet architecture](https://arxiv.org/abs/1512.03385) is built between them.
-* If the option `type_one_side` is set to `true`, then descriptor will consider the types of neighbor atoms. Otherwise, both the types of centric and  neighbor atoms are considered.
-* The `axis_neuron` specifies the size of submatrix of the embedding matrix, the axis matrix as explained in the [DeepPot-SE paper](https://arxiv.org/abs/1805.09003) 
-* If the option `resnet_dt` is set `true`, then a timestep is used in the ResNet.
-* `seed` gives the random seed that is used to generate random numbers when initializing the model parameters.
+* The {ref}`type <model/descriptor/type>` of the descriptor is set to `"se_e2_a"`. 
+* {ref}`rcut <model/descriptor[se_e2_a]/rcut>` is the cut-off radius for neighbor searching, and the {ref}`rcut_smth <model/descriptor[se_e2_a]/rcut_smth>` gives where the smoothing starts. 
+* {ref}`sel <model/descriptor[se_e2_a]/sel>` gives the maximum possible number of neighbors in the cut-off radius. It is a list, the length of which is the same as the number of atom types in the system, and `sel[i]` denotes the maximum possible number of neighbors with type `i`.
+* The {ref}`neuron <model/descriptor[se_e2_a]/neuron>` specifies the size of the embedding net. From left to right the members denote the sizes of each hidden layer from the input end to the output end, respectively. If the outer layer is twice the size of the inner layer, then the inner layer is copied and concatenated, and a [ResNet architecture](https://arxiv.org/abs/1512.03385) is built between them.
+* If the option {ref}`type_one_side <model/descriptor[se_e2_a]/type_one_side>` is set to `true`, then the descriptor will consider only the types of neighbor atoms. Otherwise, the types of both central and neighbor atoms are considered.
+* The {ref}`axis_neuron <model/descriptor[se_e2_a]/axis_neuron>` specifies the size of the submatrix of the embedding matrix, i.e., the axis matrix as explained in the [DeepPot-SE paper](https://arxiv.org/abs/1805.09003).
+* If the option {ref}`resnet_dt <model/descriptor[se_e2_a]/resnet_dt>` is set to `true`, then a timestep is used in the ResNet.
+* {ref}`seed <model/descriptor[se_e2_a]/seed>` gives the random seed that is used to generate random numbers when initializing the model parameters.
 
diff --git a/doc/model/train-se-e2-r.md b/doc/model/train-se-e2-r.md
index 2e0ee4a2c2..181146e8e9 100644
--- a/doc/model/train-se-e2-r.md
+++ b/doc/model/train-se-e2-r.md
@@ -7,7 +7,7 @@ A complete training input script of this example can be found in the directory
 $deepmd_source_dir/examples/water/se_e2_r/input.json
 ```
 
-The training input script is very similar to that of [`se_e2_a`](train-se-e2-a.md). The only difference lies in the `descriptor` section
+The training input script is very similar to that of [`se_e2_a`](train-se-e2-a.md). The only difference lies in the {ref}`descriptor <model/descriptor>` section
 ```json
 	"descriptor": {
 	    "type":		"se_e2_r",
@@ -20,4 +20,4 @@ The training input script is very similar to that of [`se_e2_a`](train-se-e2-a.m
 	    "_comment": " that's all"
 	},
 ```
-The type of the descriptor is set by the key `"type"`.
+The type of the descriptor is set by the key {ref}`type <model/descriptor/type>`.
diff --git a/doc/model/train-se-e3.md b/doc/model/train-se-e3.md
index a47d1680b9..d59f11b264 100644
--- a/doc/model/train-se-e3.md
+++ b/doc/model/train-se-e3.md
@@ -7,7 +7,7 @@ A complete training input script of this example can be found in the directory
 $deepmd_source_dir/examples/water/se_e3/input.json
 ```
 
-The training input script is very similar to that of [`se_e2_a`](train-se-e2-a.md). The only difference lies in the `descriptor` section
+The training input script is very similar to that of [`se_e2_a`](train-se-e2-a.md). The only difference lies in the {ref}`descriptor <model/descriptor>` section
 ```json
 	"descriptor": {
 	    "type":		"se_e3",
@@ -20,4 +20,4 @@ The training input script is very similar to that of [`se_e2_a`](train-se-e2-a.m
 	    "_comment":		" that's all"
 	},
 ```
-The type of the descriptor is set by the key `"type"`.
+The type of the descriptor is set by the key {ref}`type <model/descriptor/type>`.
diff --git a/doc/train/parallel-training.md b/doc/train/parallel-training.md
index 7fecd364c2..c8d3d29aad 100644
--- a/doc/train/parallel-training.md
+++ b/doc/train/parallel-training.md
@@ -5,11 +5,11 @@ Depend on the number of training processes (according to MPI context) and number
 
 ## Tuning learning rate
 
-Horovod works in the data-parallel mode, resulting in a larger global batch size. For example, the real batch size is 8 when `batch_size` is set to 2 in the input file and you launch 4 workers. Thus, `learning_rate` is automatically scaled by the number of workers for better convergence. Technical details of such heuristic rule are discussed at [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
+Horovod works in the data-parallel mode, resulting in a larger global batch size. For example, the real batch size is 8 when {ref}`batch_size <training/training_data/batch_size>` is set to 2 in the input file and you launch 4 workers. Thus, {ref}`learning_rate <learning_rate>` is automatically scaled by the number of workers for better convergence. Technical details of such heuristic rule are discussed at [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
 
 The number of decay steps required to achieve same accuracy can decrease by the number of cards (e.g., 1/2 of steps in the above case), but needs to be scaled manually in the input file.
 
-In some cases, it won't work well when scale learning rate by worker count in a `linear` way. Then you can try `sqrt` or `none` by setting argument `scale_by_worker` like below.
+In some cases, scaling the learning rate by worker count in a `linear` way does not work well. Then you can try `sqrt` or `none` by setting the argument {ref}`scale_by_worker <learning_rate/scale_by_worker>` like below.
 ```json
     "learning_rate" :{
         "scale_by_worker": "none",
diff --git a/doc/train/tensorboard.md b/doc/train/tensorboard.md
index aa92bfaaab..17b0384d66 100644
--- a/doc/train/tensorboard.md
+++ b/doc/train/tensorboard.md
@@ -19,7 +19,7 @@ DeePMD-kit can now use most of the interesting features enabled by tensorboard!
 ## How to use Tensorboard with DeePMD-kit
 
 Before running TensorBoard, make sure you have generated summary data in a log
-directory by modifying the the input script, set "tensorboard" true in training
+directory by modifying the input script. Setting {ref}`tensorboard <training/tensorboard>` to `true` in the training
 subsection will enable the tensorboard data analysis. eg. **water_se_a.json**.
 
 ```json
diff --git a/doc/train/training-advanced.md b/doc/train/training-advanced.md
index 74998f82a7..98e12b4773 100644
--- a/doc/train/training-advanced.md
+++ b/doc/train/training-advanced.md
@@ -4,7 +4,7 @@ In this section, we will take `$deepmd_source_dir/examples/water/se_e2_a/input.j
 
 ## Learning rate
 
-The `learning_rate` section in `input.json` is given as follows
+The {ref}`learning_rate <learning_rate>` section in `input.json` is given as follows
 ```json
     "learning_rate" :{
 	"type":		"exp",
@@ -14,13 +14,13 @@ The `learning_rate` section in `input.json` is given as follows
 	"_comment":	"that's all"
     }
 ```
-* `start_lr` gives the learning rate at the beginning of the training.
-* `stop_lr` gives the learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge. 
-* During the training, the learning rate decays exponentially from `start_lr` to `stop_lr` following the formula:
+* {ref}`start_lr <learning_rate[exp]/start_lr>` gives the learning rate at the beginning of the training.
+* {ref}`stop_lr <learning_rate[exp]/stop_lr>` gives the learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge. 
+* During the training, the learning rate decays exponentially from {ref}`start_lr <learning_rate[exp]/start_lr>` to {ref}`stop_lr <learning_rate[exp]/stop_lr>` following the formula:
 
 $$ \alpha(t) = \alpha_0 \lambda ^ { t / \tau } $$
 
-where $t$ is the training step, $\alpha$ is the learning rate, $\alpha_0$ is the starting learning rate (set by `start_lr`), $\lambda$ is the decay rate, and $\tau$ is the decay steps, i.e.
+where $t$ is the training step, $\alpha$ is the learning rate, $\alpha_0$ is the starting learning rate (set by {ref}`start_lr <learning_rate[exp]/start_lr>`), $\lambda$ is the decay rate, and $\tau$ is the decay steps, i.e.
 
     ```
     lr(t) = start_lr * decay_rate ^ ( t / decay_steps )
@@ -28,7 +28,7 @@ where $t$ is the training step, $\alpha$ is the learning rate, $\alpha_0$ is the
 
 ## Training parameters
 
-Other training parameters are given in the `training` section.
+Other training parameters are given in the {ref}`training <training>` section.
 ```json
     "training": {
  	"training_data": {
@@ -45,18 +45,18 @@ Other training parameters are given in the `training` section.
 	    "compute_prec":     "float16"
 	},
 
-	"numb_step":	1000000,
+	"numb_steps":	1000000,
 	"seed":		1,
 	"disp_file":	"lcurve.out",
 	"disp_freq":	100,
 	"save_freq":	1000
     }
 ```
-The sections `"training_data"` and `"validation_data"` give the training dataset and validation dataset, respectively. Taking the training dataset for example, the keys are explained below:
-* `systems` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`.
-    * `list`: `systems` gives the training data systems.
-    * `str`: `systems` should be a valid path. DeePMD-kit will recursively search all data systems in this path.
-* At each training step, DeePMD-kit randomly pick `batch_size` frame(s) from one of the systems. The probability of using a system is by default in proportion to the number of batches in the system. More optional are available for automatically determining the probability of using systems. One can set the key `auto_prob` to
+The sections {ref}`training_data <training/training_data>` and {ref}`validation_data <training/validation_data>` give the training dataset and validation dataset, respectively. Taking the training dataset for example, the keys are explained below:
+* {ref}`systems <training/training_data/systems>` provides the paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`.
+    * `list`: {ref}`systems <training/training_data/systems>` gives the training data systems.
+    * `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path.
+* At each training step, DeePMD-kit randomly picks {ref}`batch_size <training/training_data/batch_size>` frame(s) from one of the systems. The probability of using a system is by default in proportion to the number of batches in the system. More options are available for automatically determining the probability of using systems. One can set the key {ref}`auto_prob <training/training_data/auto_prob>` to
     * `"prob_uniform"` all systems are used with the same probability.
     * `"prob_sys_size"` the probability of using a system is in proportional to its size (number of frames).
     * `"prob_sys_size; sidx_0:eidx_0:w_0; sidx_1:eidx_1:w_1;..."` the `list` of systems are divided into blocks. The block `i` has systems ranging from `sidx_i` to `eidx_i`. The probability of using a system from block `i` is in proportional to `w_i`. Within one block, the probability of using a system is in proportional to its size.
@@ -68,34 +68,34 @@ The sections `"training_data"` and `"validation_data"` give the training dataset
 	    "batch_size":	"auto"
 	}
 ```
-* The probability of using systems can also be specified explicitly with key `"sys_prob"` that is a list having the length of the number of systems. For example
+* The probability of using systems can also be specified explicitly with the key {ref}`sys_probs <training/training_data/sys_probs>`, a list whose length equals the number of systems. For example
 ```json
  	"training_data": {
 	    "systems":		["../data_water/data_0/", "../data_water/data_1/", "../data_water/data_2/"],
-	    "sys_prob":	[0.5, 0.3, 0.2],
+	    "sys_probs":	[0.5, 0.3, 0.2],
 	    "batch_size":	"auto:32"
 	}
 ```
-* The key `batch_size` specifies the number of frames used to train or validate the model in a training step. It can be set to
-    * `list`: the length of which is the same as the `systems`. The batch size of each system is given by the elements of the list.
+* The key {ref}`batch_size <training/training_data/batch_size>` specifies the number of frames used to train or validate the model in a training step. It can be set to
+    * `list`: the length of which is the same as the {ref}`systems <training/training_data/systems>`. The batch size of each system is given by the elements of the list.
     * `int`: all systems use the same batch size.
     * `"auto"`: the same as `"auto:32"`, see `"auto:N"`
-    * `"auto:N"`: automatically determines the batch size so that the `batch_size` times the number of atoms in the system is no less than `N`.
-* The key `numb_batch` in `validate_data` gives the number of batches of model validation. Note that the batches may not be from the same system
+    * `"auto:N"`: automatically determines the batch size so that the {ref}`batch_size <training/training_data/batch_size>` times the number of atoms in the system is no less than `N`.
+* The key {ref}`numb_btch <training/validation_data/numb_btch>` in {ref}`validation_data <training/validation_data>` gives the number of batches used for model validation. Note that the batches may not be from the same system.
 
-The section `mixed_precision` specifies the mixed precision settings, which will enable the mixed precision training workflow for deepmd-kit. The keys are explained below:
-* `output_prec`  precision used in the output tensors, only `float32` is supported currently.
-* `compute_prec` precision used in the computing tensors, only `float16` is supported currently.
+The section {ref}`mixed_precision <training/mixed_precision>` specifies the mixed precision settings, which will enable the mixed precision training workflow for deepmd-kit. The keys are explained below:
+* {ref}`output_prec <training/mixed_precision/output_prec>` The precision used in the output tensors. Only `float32` is currently supported.
+* {ref}`compute_prec <training/mixed_precision/compute_prec>` The precision used in the computing tensors. Only `float16` is currently supported.
 Note there are severial limitations about the mixed precision training:
-* Only 'se_e2_a' type descriptor is supported by the mixed precision training workflow.
+* Only {ref}`se_e2_a <model/descriptor[se_e2_a]>` type descriptor is supported by the mixed precision training workflow.
 * The precision of embedding net and fitting net are forced to be set to `float32`.
 
-Other keys in the `training` section are explained below:
-* `numb_step` The number of training steps.
-* `seed` The random seed for getting frames from the training data set.
-* `disp_file` The file for printing learning curve.
-* `disp_freq` The frequency of printing learning curve. Set in the unit of training steps
-* `save_freq` The frequency of saving check point.
+Other keys in the {ref}`training <training>` section are explained below:
+* {ref}`numb_steps <training/numb_steps>` The number of training steps.
+* {ref}`seed <training/seed>` The random seed for getting frames from the training data set.
+* {ref}`disp_file <training/disp_file>` The file to which the learning curve is printed.
+* {ref}`disp_freq <training/disp_freq>` The frequency of printing the learning curve, in units of training steps.
+* {ref}`save_freq <training/save_freq>` The frequency of saving checkpoints, in units of training steps.
 
 ## Options and environment variables
 
diff --git a/doc/train/training.md b/doc/train/training.md
index 5010740538..1183e03b81 100644
--- a/doc/train/training.md
+++ b/doc/train/training.md
@@ -26,9 +26,9 @@ DEEPMD INFO                                        system  natoms  bch_sz   n_bc
 DEEPMD INFO                          ../data_water/data_3     192       1      80  1.000    T
 DEEPMD INFO    --------------------------------------------------------------------------------------
 ```
-The DeePMD-kit prints detailed information on the training and validation data sets. The data sets are defined by `"training_data"` and `"validation_data"` defined in the `"training"` section of the input script. The training data set is composed by three data systems, while the validation data set is composed by one data system. The number of atoms, batch size, number of batches in the system and the probability of using the system are all shown on the screen. The last column presents if the periodic boundary condition is assumed for the system. 
+The DeePMD-kit prints detailed information on the training and validation data sets. The data sets are defined by {ref}`training_data <training/training_data>` and {ref}`validation_data <training/validation_data>` in the {ref}`training <training>` section of the input script. The training data set is composed of three data systems, while the validation data set is composed of one data system. The number of atoms, batch size, number of batches in the system, and the probability of using the system are all shown on the screen. The last column indicates whether the periodic boundary condition is assumed for the system.
 
-During the training, the error of the model is tested every `disp_freq` training steps with the batch used to train the model and with `numb_btch` batches from the validating data. The training error and validation error are printed correspondingly in the file `disp_file` (default is `lcurve.out`). The batch size can be set in the input script by the key `batch_size` in the corresponding sections for training and validation data set. An example of the output 
+During training, the error of the model is tested every {ref}`disp_freq <training/disp_freq>` training steps with the batch used to train the model and with {ref}`numb_btch <training/validation_data/numb_btch>` batches from the validation data. The training error and validation error are printed to the file {ref}`disp_file <training/disp_file>` (default is `lcurve.out`). The batch size can be set in the input script by the key {ref}`batch_size <training/training_data/batch_size>` in the corresponding sections for the training and validation data sets. An example of the output
 ```bash
 #  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
       0      3.33e+01    3.41e+01      1.03e+01    1.03e+01      8.39e-01    8.72e-01    1.0e-03
@@ -56,7 +56,7 @@ plt.grid()
 plt.show()
 ```
 
-Checkpoints will be written to files with prefix `save_ckpt` every `save_freq` training steps. 
+Checkpoints will be written to files with prefix {ref}`save_ckpt <training/save_ckpt>` every {ref}`save_freq <training/save_freq>` training steps. 
 
 ## Warning
 It is warned that the example water data (in folder `examples/water/data`) is of very limited amount, is provided only for testing purpose, and should not be used to train a production model.