diff --git a/README.md b/README.md
index 995ac6e914..9d525630f2 100644
--- a/README.md
+++ b/README.md
@@ -86,7 +86,8 @@ A full [document](doc/train/train-input-auto.rst) on options in the training inp
     - [Install GROMACS](doc/install/install-gromacs.md)
     - [Building conda packages](doc/install/build-conda.md)
 - [Data](doc/data/index.md)
-    - [Data conversion](doc/data/data-conv.md)
+    - [System](doc/data/system.md)
+    - [Formats of a system](doc/data/data-conv.md)
     - [Prepare data with dpdata](doc/data/dpdata.md)
 - [Model](doc/model/index.md)
     - [Overall](doc/model/overall.md)
diff --git a/doc/data/data-conv.md b/doc/data/data-conv.md
index 6f25e36ba4..d3c0632464 100644
--- a/doc/data/data-conv.md
+++ b/doc/data/data-conv.md
@@ -1,26 +1,21 @@
-# Data conversion
+# Formats of a system
 
-One needs to provide the following information to train a model: the atom type, the simulation box, the atom coordinate, the atom force, system energy and virial. A snapshot of a system that contains these information is called a **frame**. We use the following convention of units:
+Two binary formats, NumPy and HDF5, are supported for training. The raw format is not directly supported, but a tool is provided to convert data from the raw format to the NumPy format.
+## NumPy format
 
-Property | Unit
----|---
-Time | ps
-Length | Å
-Energy | eV
-Force | eV/Å
-Virial | eV
-Pressure | Bar
-
-
-The frames of the system are stored in two formats. A raw file is a plain text file with each information item written in one file and one frame written on one line. The default files that provide box, coordinate, force, energy and virial are `box.raw`, `coord.raw`, `force.raw`, `energy.raw` and `virial.raw`, respectively. *We recommend you use these file names*. Here is an example of force.raw:
-```bash
-$ cat force.raw
--0.724 2.039 -0.951 0.841 -0.464 0.363
- 6.737 1.554 -5.587 -2.803 0.062 2.222
--1.968 -0.163 1.020 -0.225 -0.789 0.343
+In a system with the NumPy format, the system properties are stored as text files ending with `.raw`, such as `type.raw` and `type_map.raw`, under the system directory. If one needs to train a non-periodic system, an empty `nopbc` file should be put under the system directory. Both input and labeled frame properties are saved as [NumPy binary data (NPY) files](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#npy-format) ending with `.npy` in each of the `set.*` directories. For example, a system may contain the following files:
+```
+type.raw
+type_map.raw
+nopbc
+set.000/coord.npy
+set.000/energy.npy
+set.000/force.npy
+set.001/coord.npy
+set.001/energy.npy
+set.001/force.npy
 ```
-This `force.raw` contains 3 frames with each frame having the forces of 2 atoms, thus it has 3 lines and 6 columns. Each line provides all the 3 force components of 2 atoms in 1 frame. The first three numbers are the 3 force components of the first atom, while the second three numbers are the 3 force components of the second atom. The coordinate file `coord.raw` is organized similarly. In `box.raw`, the 9 components of the box vectors should be provided on each line in the order `XX XY XZ YX YY YZ ZX ZY ZZ`. In `virial.raw`, the 9 components of the virial tensor should be provided on each line in the order `XX XY XZ YX YY YZ ZX ZY ZZ`. The number of lines of all raw files should be identical.
 
 We assume that the atom types do not change in all frames. It is provided by `type.raw`, which has one line with the types of atoms written one by one. The atom types should be integers. For example the `type.raw` of a system that has 2 atoms with 0 and 1:
 ```bash
@@ -35,7 +30,30 @@ O H
 ```
 The type `0` is named by `"O"` and the type `1` is named by `"H"`.
-The second format is the data sets of `numpy` binary data that are directly used by the training program. User can use the script `$deepmd_source_dir/data/raw/raw_to_set.sh` to convert the prepared raw files to data sets. For example, if we have a raw file that contains 6000 frames,
+## HDF5 format
+
+A system with the HDF5 format has the same structure as one with the NumPy format, but in an HDF5 file, a system is organized as an [HDF5 group](https://docs.h5py.org/en/stable/high/group.html). The file name of a NumPy file is used as the key in the HDF5 file, and the data is the value of that key. One needs to use `#` in a DP path to separate the path to the HDF5 file from the HDF5 key:
+```
+/path/to/data.hdf5#H2O
+```
+Here, `/path/to/data.hdf5` is the path and `H2O` is the key. There should be some data in the `H2O` group, such as `H2O/type.raw` and `H2O/set.000/force.npy`.
+
+An HDF5 file with a large number of systems has better performance than multiple NumPy files in a large cluster.
+
+## Raw format and data conversion
+
+A raw file is a plain text file with each information item written in one file and one frame written on one line. **The raw format is not directly supported**, but a tool is provided to convert raw files to the NumPy format.
+
+In the raw format, each property is stored in a file ending with `.raw`, with one frame per line. For example, the default files that provide box, coordinate, force, energy and virial are `box.raw`, `coord.raw`, `force.raw`, `energy.raw` and `virial.raw`, respectively. Here is an example of `force.raw`:
+```bash
+$ cat force.raw
+-0.724 2.039 -0.951 0.841 -0.464 0.363
+ 6.737 1.554 -5.587 -2.803 0.062 2.222
+-1.968 -0.163 1.020 -0.225 -0.789 0.343
+```
+This `force.raw` contains 3 frames, each holding the forces of 2 atoms, so it has 3 lines and 6 columns. Each line provides all 3 force components of the 2 atoms in 1 frame. The first three numbers are the 3 force components of the first atom, while the last three numbers are the 3 force components of the second atom. Other files are organized similarly. The number of lines of all raw files should be identical.
+
+One can use the script `$deepmd_source_dir/data/raw/raw_to_set.sh` to convert the prepared raw files to the NumPy format. For example, if we have a raw file that contains 6000 frames,
 ```bash
 $ ls
 box.raw coord.raw energy.raw force.raw type.raw virial.raw
@@ -49,7 +67,4 @@ making set 2 ...
 $ ls
 box.raw coord.raw energy.raw force.raw set.000 set.001 set.002 type.raw virial.raw
 ```
-It generates three sets `set.000`, `set.001` and `set.002`, with each set contains 2000 frames. One do not need to take care of the binary data files in each of the `set.*` directories. The path containing `set.*` and `type.raw` is called a *system*.
-
-If one needs to train a non-periodic system, an empty `nopbc` file should be put under the system directory. `box.raw` is not necessary in a non-periodic system.
-
+It generates three sets, `set.000`, `set.001` and `set.002`, each containing 2000 frames in the NumPy format.
diff --git a/doc/data/index.md b/doc/data/index.md
index d54f52cd8e..3e3582abf6 100644
--- a/doc/data/index.md
+++ b/doc/data/index.md
@@ -4,5 +4,6 @@ In this section, we will introduce how to convert the DFT labeled data into the
 
 The DeePMD-kit organize data in `systems`. Each `system` is composed by a number of `frames`. One may roughly view a `frame` as a snap short on an MD trajectory, but it does not necessary come from an MD simulation. A `frame` records the coordinates and types of atoms, cell vectors if the periodic boundary condition is assumed, energy, atomic forces and virial. It is noted that the `frames` in one `system` share the same number of atoms with the same type.
 
-- [Data conversion](data-conv.md)
+- [System](system.md)
+- [Formats of a system](data-conv.md)
 - [Prepare data with dpdata](dpdata.md)
diff --git a/doc/data/index.rst b/doc/data/index.rst
index 0631727546..d5fa62648a 100644
--- a/doc/data/index.rst
+++ b/doc/data/index.rst
@@ -7,5 +7,6 @@ The DeePMD-kit organize data in :code:`systems`. Each :code:`system` is composed
 .. toctree::
    :maxdepth: 1
 
+   system
    data-conv
    dpdata
diff --git a/doc/data/system.md b/doc/data/system.md
new file mode 100644
index 0000000000..b8d318f255
--- /dev/null
+++ b/doc/data/system.md
@@ -0,0 +1,45 @@
+# System
+
+DeePMD-kit uses a **system** as its data structure. A snapshot of a system is called a **frame**. A system may contain multiple frames with the same atom types and numbers, i.e. the same formula (like `H2O`). To handle data with different formulas, one needs to divide the data into multiple systems.
+
+A system should contain system properties, input frame properties, and labeled frame properties. The system properties are as follows:
+
+ID       | Property               | Raw file     | Required/Optional    | Shape                   | Description
+-------- | ---------------------- | ------------ | -------------------- | ----------------------- | -----------
+type     | Atom type indexes      | type.raw     | Required             | Natoms                  | Integers that start with 0
+type_map | Atom type names        | type_map.raw | Optional             | Ntypes                  | Atom names that map to atom types; they need not be elements of the periodic table
+nopbc    | Non-periodic system    | nopbc        | Optional             | 1                       | If True, this system is non-periodic; otherwise it's periodic
+
+The input frame properties are as follows; the first axis of each is the number of frames:
+
+ID       | Property                | Raw file       | Unit | Required/Optional    | Shape                    | Description
+-------- | ----------------------- | -------------- | ---- | -------------------- | ------------------------ | -----------
+coord    | Atomic coordinates      | coord.raw      | Å    | Required             | Nframes \* Natoms \* 3   |
+box      | Boxes                   | box.raw        | Å    | Required if periodic | Nframes \* 3 \* 3        | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
+fparam   | Extra frame parameters  | fparam.raw     | Any  | Optional             | Nframes \* Any           |
+aparam   | Extra atomic parameters | aparam.raw     | Any  | Optional             | Nframes \* Natoms \* Any |
+
+The labeled frame properties are listed below; each is used for training if and only if the loss function contains that property:
+
+ID                     | Property                 | Raw file                  | Unit | Shape                   | Description
+---------------------- | ------------------------ | ------------------------- | ---- | ----------------------- | -----------
+energy                 | Frame energies           | energy.raw                | eV   | Nframes                 |
+force                  | Atomic forces            | force.raw                 | eV/Å | Nframes \* Natoms \* 3  |
+virial                 | Frame virial             | virial.raw                | eV   | Nframes \* 9            | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
+atom_ener              | Atomic energies          | atom_ener.raw             | eV   | Nframes \* Natoms       |
+atom_pref              | Weights of atomic forces | atom_pref.raw             | 1    | Nframes \* Natoms       |
+dipole                 | Frame dipole             | dipole.raw                | Any  | Nframes \* 3            |
+atomic_dipole          | Atomic dipole            | atomic_dipole.raw         | Any  | Nframes \* Natoms \* 3  |
+polarizability         | Frame polarizability     | polarizability.raw        | Any  | Nframes \* 9            | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
+atomic_polarizability  | Atomic polarizability    | atomic_polarizability.raw | Any  | Nframes \* Natoms \* 9  | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
+
+In general, we always use the following convention of units:
+
+Property | Unit
+---------| ----
+Time     | ps
+Length   | Å
+Energy   | eV
+Force    | eV/Å
+Virial   | eV
+Pressure | Bar
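
Review note: the raw layout (one frame per line) and the `#`-separated DP path described in this diff can be illustrated with a minimal sketch. It uses plain Python only and is not how DeePMD-kit or `raw_to_set.sh` is implemented; `parse_raw_frames` and `split_dp_path` are hypothetical helper names introduced here for illustration, and splitting at the first `#` is an assumption.

```python
# Illustrative sketch only: these helpers are hypothetical, not DeePMD-kit APIs.


def parse_raw_frames(text, n_atoms, width=3):
    """Parse raw-format text (one frame per line) into frames[frame][atom][component]."""
    frames = []
    for line in text.splitlines():
        values = [float(tok) for tok in line.split()]
        if not values:
            continue  # skip blank lines
        if len(values) != n_atoms * width:
            raise ValueError(
                f"expected {n_atoms * width} values per frame, got {len(values)}"
            )
        # Group the flat row into one [x, y, z] triple per atom.
        frames.append([values[i:i + width] for i in range(0, len(values), width)])
    return frames


def split_dp_path(dp_path):
    """Split 'file.hdf5#key' at the first '#'; the key is '' when no '#' is present."""
    file_path, _, key = dp_path.partition("#")
    return file_path, key


# The force.raw example from the documentation: 3 frames, 2 atoms, 6 columns.
FORCE_RAW = """\
-0.724 2.039 -0.951 0.841 -0.464 0.363
 6.737 1.554 -5.587 -2.803 0.062 2.222
-1.968 -0.163 1.020 -0.225 -0.789 0.343
"""

forces = parse_raw_frames(FORCE_RAW, n_atoms=2)
print(len(forces), len(forces[0]), len(forces[0][0]))  # Nframes, Natoms, 3
print(forces[0][1])  # force on the second atom in the first frame
print(split_dp_path("/path/to/data.hdf5#H2O"))
```

The nested-list result mirrors the `Nframes * Natoms * 3` shape listed for `force` in the tables above; the same reading applies to `coord.raw`.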