
Commit ca0fa35

upload mofdiff source code.
1 parent 1a45b12 commit ca0fa35


71 files changed (+321607 −24 lines)

README.md

Lines changed: 186 additions & 24 deletions

# MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design

`mofdiff` is a diffusion model for generating coarse-grained MOF structures. This codebase also contains the code for deconstructing/reconstructing all-atom MOF structures to train MOFDiff, and for assembling the CG structures generated by MOFDiff.

[paper](https://arxiv.org/abs/2310.10732) | [data and pretrained models](https://zenodo.org/uploads/10467288)

If you find this code useful, please consider referencing our paper:

```
@article{fu2023mofdiff,
  title={MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design},
  author={Fu, Xiang and Xie, Tian and Rosen, Andrew S and Jaakkola, Tommi and Smith, Jake},
  journal={arXiv preprint arXiv:2310.10732},
  year={2023}
}
```

## Table of Contents

- [Installation](#installation)
- [Download data](#download-data)
- [Training](#training)
- [Generating CG MOF structures](#generating-cg-mof-structures)
- [Assemble all-atom MOFs](#assemble-all-atom-mofs)
- [Relax MOFs and compute structural properties](#relax-mofs-and-compute-structural-properties)
- [GCMC simulation for gas adsorption](#gcmc-simulation-for-gas-adsorption)

## Installation

We recommend using [mamba](https://mamba.readthedocs.io/en/latest/) (much faster than conda) to install the dependencies. First, install `mamba` following the instructions in the [mamba repository](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html).

Install dependencies via `mamba`:

```
mamba env create -f env.yml
```

Then install `mofdiff` as a package:

```
pip install -e .
```

We use [MOFid](https://github.com/snurr-group/mofid) for preprocessing and analysis. Install MOFid following the instructions in the [MOFid repository](https://github.com/snurr-group/mofid/blob/master/compiling.md). The generative modeling part of this codebase does not depend on MOFid.
## Download data

You can download the preprocessed data from [Zenodo](https://zenodo.org/uploads/10467288) (recommended).

Alternatively, you can download the `BW-DB` raw data from [Materials Cloud](https://archive.materialscloud.org/record/2018.0016/v3) and preprocess it with the following commands (assuming the data is downloaded to `${raw_path}`; this step requires MOFid):

```
python preprocessing/extract_mofid.py --df_path ${raw_path}/all_MOFs_screening_data.csv --cif_path ${raw_path}/cifs --save_path ${raw_path}/mofid
python preprocessing/preprocess.py --dataset_path
python preprocessing/save_to_lmdb.py
```

The preprocessing involves 3 steps:

1. Extract the MOFid for all structures (CPU).
2. Construct CG MOF data objects from the MOFid deconstruction results (CPU or GPU).
3. Save the CG MOF objects to an LMDB database (relatively fast).

The entire preprocessing process for `BW-DB` may take several days, depending on the available CPU/GPU resources.
## Training

First, configure the `.env` file to set the correct paths to various directories. An [example](./.env) `.env` file is provided in the repository.
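For reference, a minimal `.env` might look like the sketch below. The variable names are the ones referenced by the configs in this commit (`PROJECT_ROOT`, `DATASET_DIR`, `HYDRA_JOBS`, `WANDB_DIR`, `LOG_DIR`); all paths are placeholders to adapt to your own setup:

```
# all paths below are placeholders -- point them at your own directories
export PROJECT_ROOT="/home/user/mofdiff"
export DATASET_DIR="/home/user/mofdiff_data"
export HYDRA_JOBS="/home/user/mofdiff_jobs"
export WANDB_DIR="/home/user/mofdiff_logs"
export LOG_DIR="/home/user/mofdiff_logs"
```

Whether the `export` keyword is needed depends on how the `.env` file is loaded.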
### Training the building block encoder

Before training the diffusion model, we need to train the building block encoder: a graph neural network that encodes the building blocks of MOFs. Train it with the following command:

```
python mofdiff/scripts/train.py --config-name=bb
```

The default output directory is `${oc.env:HYDRA_JOBS}/bb/${expname}/`. `oc.env:HYDRA_JOBS` is configured in `.env`; `expname` is configured in `conf/bb.yaml`. We use [hydra](https://hydra.cc/) for config management, and all configs are stored in `conf/`. You can override the default output directory with command line arguments. For example:

```
python mofdiff/scripts/train.py --config-name=bb expname=bwdb_bb_dim_64 model.latent_dim=64
```

Logging is done with [wandb](https://wandb.ai/site) by default, so you need to log in with `wandb login` before training. The training logs will be saved to the wandb project `mofdiff`. You can override the wandb project with command line arguments, or disable wandb logging by removing the `wandb` entry in the [logging config](./conf/logging/default.yaml).
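As a sketch of another option (assuming the standard hydra override syntax and that the logging group is mounted under `logging`, as in `conf/logging/default.yaml`), wandb can also be switched to offline mode for a single run:

```
python mofdiff/scripts/train.py --config-name=bb logging.wandb.mode=offline
```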
### Training the coarse-grained diffusion model for MOFs

Training the diffusion model requires `bb_encoder_path`: the output directory where the building block encoder was saved. With the building block encoder trained to convergence, train the CG diffusion model with the following command:

```
python mofdiff/scripts/train.py data.bb_encoder_path=${bb_encoder_path}
```

For `BW-DB`, training the building block encoder takes roughly 3 days and training the diffusion model takes roughly 5 days on a single NVIDIA V100 GPU.
## Generating CG MOF structures

Pretrained models can be found [here](https://zenodo.org/record/10467288).

With a trained CG diffusion model `${diffusion_model_path}`, generate random CG MOF structures with the following command:

```
python mofdiff/scripts/sample.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path}
```

`${bb_cache_path}` is the path to the building block embedding space, which is saved at the beginning of CG diffusion model training. To optimize MOF structures for a property (e.g., CO2 adsorption working capacity), use the following command:

```
python mofdiff/scripts/optimize.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path} --data_path ${data_path}
```

The available arguments for `sample.py` and `optimize.py` can be found in the respective files. The generated CG MOF structures will be saved in `${sample_path}=${diffusion_model_path}/${sample_tag}` as `samples.pt`.

The CG structures generated by the diffusion model are not guaranteed to be realizable: we need to assemble the CG structures to recover the all-atom MOF structures. The following sections describe how to assemble the CG MOF structures; none of the subsequent steps require a GPU.
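Putting the stages together, one pass over a batch of generated samples can be sketched as follows; each command is described in detail in the sections below, with `${sample_path}` the sample directory defined above:

```
# assemble all-atom structures from the generated CG samples
python mofdiff/scripts/assemble.py --input ${sample_path}/samples.pt
# relax with UFF/LAMMPS and compute structural properties
python mofdiff/scripts/uff_relax.py --input ${sample_path}
# assign charges, then screen gas adsorption with GCMC
python mofdiff/scripts/calculate_charges.py --input ${sample_path}
python mofdiff/scripts/gcmc_screen.py --input ${sample_path}/mepo_qeq_charges
```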
## Assemble all-atom MOFs

Assemble the CG MOF structures with the following command:

```
python mofdiff/scripts/assemble.py --input ${sample_path}/samples.pt
```

This command will assemble the CG MOF structures in `${sample_path}` and save the assembled MOFs in `${sample_path}/assembled.pt`. The cif files of the assembled MOFs will be saved in `${sample_path}/cif`. If the assembled MOFs came from property-driven optimization, the optimization arguments are saved to `${sample_path}/opt_args.json`.
## Relax MOFs and compute structural properties

The assembled structures may not be physically plausible, so they are relaxed using the UFF force field with LAMMPS. LAMMPS is already installed if you have followed the installation instructions in this README. The relaxation script also computes structural properties (e.g., pore volume, surface area, etc.) with [Zeo++](https://www.zeoplusplus.org/download.html) and the MOFids of the generated MOFs with [MOFid](https://github.com/snurr-group/mofid/tree/master). Install both packages following the instructions in their respective repositories, and add the corresponding paths to `.env` before running the command below. Each step should take no more than a few minutes to complete on a single CPU; we use multiprocessing to parallelize the computation.

Relax MOFs and compute structural properties with the following command:

```
python mofdiff/scripts/uff_relax.py --input ${sample_path}
```

This command will relax the assembled MOFs in `${sample_path}/cif` and save the relaxed MOFs in `${sample_path}/relaxed`. The structural properties of the relaxed MOFs will be saved in `${sample_path}/relaxed/zeo_props_relax.json`, and their MOFids in `${sample_path}/mofid`.
## GCMC simulation for gas adsorption

To run GCMC simulations, first install RASPA2 (simulation software) and eGULP (charge calculation software).

RASPA2 can be installed with `pip`:

```
pip install "RASPA2==2.0.4"
```

You may need to install the following Linux dependencies first:

```
apt-get update
apt-get install -yq libgsl0-dev pkg-config libxrender-dev
```

Install [eGULP](https://github.com/danieleongari/egulp) following the instructions in the repository. The following commands install eGULP in `/usr/local/bin/egulp`:

```
mkdir /usr/local/bin/egulp && tar -xf egulp.tar -C /usr/local/bin/egulp
cd /usr/local/bin/egulp/src && make && cd -
```

Then, decompress the [force field parameters](./mofdiff/gcmc/UFF-TraPPe-scaled.tar) into the RASPA directory with the following command (assuming RASPA2 was installed by `pip` into `RASPA_PATH=PYTHONPATH/site-packages/RASPA2`):

```
tar -xf UFF-TraPPe-scaled.tar -C RASPA_PATH/share/raspa/forcefield/UFF-TraPPe
```
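If you are unsure where `pip` placed RASPA2, one way to locate `RASPA_PATH` (assuming a standard `pip` install) is:

```
# the "Location:" field is the site-packages directory;
# RASPA_PATH is then <that directory>/RASPA2
pip show RASPA2 | grep Location
```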
Calculate charges for the relaxed samples in `${sample_path}` with the following command:

```
python mofdiff/scripts/calculate_charges.py --input ${sample_path}
```

This command will output cif files with charge information under `${sample_path}/mepo_qeq_charges`.

Run GCMC simulations with the following command:

```
python mofdiff/scripts/gcmc_screen.py --input ${sample_path}/mepo_qeq_charges
```

The GCMC simulation results will be saved in `${sample_path}/gcmc/screening_results.json`.
## Acknowledgement

This codebase is based on several existing repositories:

- [CDVAE](https://github.com/txie-93/cdvae)
- [Open Catalyst Project](https://github.com/Open-Catalyst-Project/ocp)
- [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric)
- [PyTorch](https://github.com/pytorch/pytorch)
- [Lightning](https://github.com/Lightning-AI/pytorch-lightning/)
- [Hydra](https://github.com/facebookresearch/hydra)

conf/bb.yaml

Lines changed: 28 additions & 0 deletions

```
expname: bwdb_bb
workdir: ${oc.env:HYDRA_JOBS}/bb_models/${expname}
config_for: bb

core:
  version: 0.0.1
  tags:
    - ${now:%Y-%m-%d}

hydra:
  run:
    dir: ${oc.env:HYDRA_JOBS}/bb/${expname}/

  sweep:
    dir: ${oc.env:HYDRA_JOBS}/bb/${expname}/
    subdir: ${hydra.job.num}_${hydra.job.id}

  job:
    env_set:
      WANDB_START_METHOD: thread
      WANDB_DIR: ${oc.env:WANDB_DIR}

defaults:
  - data: bwdb_bb
  - logging: default
  - model: bb
  - optim: default
  - train: default
```

conf/data/bwdb_bb.yaml

Lines changed: 50 additions & 0 deletions

```
name: bwdb_bb
root_path: ${oc.env:DATASET_DIR}
use_type_mapper: true

# max num bbs
max_bbs: 20
max_atoms: 200
max_cps: 20

train_max_steps: 1500000
early_stopping_patience: 100
patience: 10

data_cache_path: ${oc.env:DATASET_DIR}
load_cached: true
save_cached: true

datamodule:
  _target_: mofdiff.data.datamodule.DataModule

  datasets:
    train:
      _target_: mofdiff.data.dataset.BBDataset
      name: ${data.name}_train
      path: ${data.root_path}
      max_bbs: ${data.max_bbs}
      max_atoms: ${data.max_atoms}
      max_cps: ${data.max_cps}
      split_file: ${oc.env:PROJECT_ROOT}/splits/train_split.txt

    val:
      _target_: mofdiff.data.dataset.BBDataset
      name: ${data.name}_val
      path: ${data.root_path}
      max_bbs: ${data.max_bbs}
      max_atoms: ${data.max_atoms}
      max_cps: ${data.max_cps}
      split_file: ${oc.env:PROJECT_ROOT}/splits/val_split.txt

  num_workers:
    train: 0
    val: 0
    test: 0

  batch_size:
    train: 1024
    val: 1024
    test: 1024

data_transforms: None
```

conf/data/bwdb_mof.yaml

Lines changed: 66 additions & 0 deletions

```
name: bwdb
root_path: ${oc.env:DATASET_DIR}
prop_list:
  - working_capacity_vacuum_swing [mmol/g]
  - working_capacity_temperature_swing [mmol/g]
logmod: true
num_targets: 2
use_type_mapper: false
bb_encoder_path: ???

lattice_scale_method: scale_length

max_bbs: 20
max_atoms: 200
max_cps: 20
otf_graph: false

train_max_steps: 2000000
early_stopping_patience: 1000
teacher_forcing_max_epoch: 300
patience: 50

data_cache_path: ${oc.env:DATASET_DIR}
load_cached: true
save_cached: true

datamodule:
  _target_: mofdiff.data.datamodule.DataModule
  bb_encoder_path: ${data.bb_encoder_path}

  datasets:
    train:
      _target_: mofdiff.data.dataset.MOFDataset
      name: ${data.name}_train
      path: ${data.root_path}
      prop_list: ${data.prop_list}
      transforms: ${data.data_transforms}
      max_bbs: ${data.max_bbs}
      max_atoms: ${data.max_atoms}
      max_cps: ${data.max_cps}
      logmod: ${data.logmod}
      split_file: ${oc.env:PROJECT_ROOT}/splits/train_split.txt

    val:
      _target_: mofdiff.data.dataset.MOFDataset
      name: ${data.name}_val
      path: ${data.root_path}
      prop_list: ${data.prop_list}
      transforms: ${data.data_transforms}
      max_bbs: ${data.max_bbs}
      max_atoms: ${data.max_atoms}
      max_cps: ${data.max_cps}
      logmod: ${data.logmod}
      split_file: ${oc.env:PROJECT_ROOT}/splits/val_split.txt

  num_workers:
    train: 0
    val: 0
    test: 0

  batch_size:
    train: 128
    val: 128
    test: 128

data_transforms: None
```

conf/logging/default.yaml

Lines changed: 22 additions & 0 deletions

```
# log frequency
val_check_interval: 3
progress_bar_refresh_rate: 10

wandb:
  name: ${expname}
  project: mofdiff
  entity: null
  log_model: True
  mode: 'online'
  group: ${expname}

tensorboard:
  save_dir: ${oc.env:LOG_DIR}/tensorboard

wandb_watch:
  log: 'all'
  log_freq: 5000

lr_monitor:
  logging_interval: "step"
  log_momentum: False
```
