|
1 | | -# Project |
| 1 | +# MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design |
2 | 2 |
|
3 | | -> This repo has been populated by an initial template to help get you started. Please |
4 | | -> make sure to update the content to build a great experience for community-building. |
| 3 | +`mofdiff` is a diffusion model for generating coarse-grained (CG) MOF structures. This codebase also contains the code for deconstructing all-atom MOF structures into CG data for training MOFDiff, and for assembling the CG structures generated by MOFDiff back into all-atom MOFs. |
5 | 4 |
|
6 | | -As the maintainer of this project, please make a few updates: |
| 5 | +[paper](https://arxiv.org/abs/2310.10732) | [data and pretrained models](https://zenodo.org/uploads/10467288) |
7 | 6 |
|
8 | | -- Improving this README.MD file to provide a great experience |
9 | | -- Updating SUPPORT.MD with content about this project's support experience |
10 | | -- Understanding the security reporting process in SECURITY.MD |
11 | | -- Remove this section from the README |
| 7 | +If you find this code useful, please consider referencing our paper: |
12 | 8 |
|
13 | | -## Contributing |
| 9 | +``` |
| 10 | +@article{fu2023mofdiff, |
| 11 | + title={MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design}, |
| 12 | + author={Fu, Xiang and Xie, Tian and Rosen, Andrew S and Jaakkola, Tommi and Smith, Jake}, |
| 13 | + journal={arXiv preprint arXiv:2310.10732}, |
| 14 | + year={2023} |
| 15 | +} |
| 16 | +``` |
14 | 17 |
|
15 | | -This project welcomes contributions and suggestions. Most contributions require you to agree to a |
16 | | -Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us |
17 | | -the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. |
| 18 | +## Table of Contents |
18 | 19 |
|
19 | | -When you submit a pull request, a CLA bot will automatically determine whether you need to provide |
20 | | -a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions |
21 | | -provided by the bot. You will only need to do this once across all repos using our CLA. |
| 20 | +- [Installation](#installation) |
| 21 | +- [Download data](#download-data) |
| 22 | +- [Training](#training) |
| 23 | +- [Generating CG MOF structures](#generating-cg-mof-structures) |
| 24 | +- [Assemble all-atom MOFs](#assemble-all-atom-mofs) |
| 25 | +- [Relax MOFs and compute structural properties](#relax-mofs-and-compute-structural-properties) |
22 | 26 |
|
23 | | -This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). |
24 | | -For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or |
25 | | -contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. |
| 27 | +## Installation |
26 | 28 |
|
27 | | -## Trademarks |
| 29 | +We recommend using [mamba](https://mamba.readthedocs.io/en/latest/) (much faster than conda) to install the dependencies. First install `mamba` following the instructions in the [mamba repository](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html). |
28 | 30 |
|
29 | | -This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft |
30 | | -trademarks or logos is subject to and must follow |
31 | | -[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). |
32 | | -Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. |
33 | | -Any use of third-party trademarks or logos are subject to those third-party's policies. |
| 31 | + |
| 32 | +Install dependencies via `mamba`: |
| 33 | + |
| 34 | +``` |
| 35 | +mamba env create -f env.yml |
| 36 | +``` |
| 37 | + |
| 38 | +Then install `mofdiff` as a package: |
| 39 | + |
| 40 | +``` |
| 41 | +pip install -e . |
| 42 | +``` |
| 43 | + |
| 44 | +We use [MOFid](https://github.com/snurr-group/mofid) for preprocessing and analysis. Install MOFid following the instructions in the [MOFid repository](https://github.com/snurr-group/mofid/blob/master/compiling.md). The generative modeling part of this codebase does not depend on MOFid. |
| 45 | + |
| 46 | +## Download data |
| 47 | + |
| 48 | +You can download the preprocessed data from [Zenodo](https://zenodo.org/uploads/10467288) (recommended). |
| 49 | + |
| 50 | +Alternatively, you can download the `BW-DB` raw data from [Materials Cloud](https://archive.materialscloud.org/record/2018.0016/v3) and preprocess it with the following commands. They assume the data has been downloaded to `${raw_path}`, and this step requires MOFid: |
| 51 | + |
| 52 | +``` |
| 53 | +python preprocessing/extract_mofid.py --df_path ${raw_path}/all_MOFs_screening_data.csv --cif_path ${raw_path}/cifs --save_path ${raw_path}/mofid |
| 54 | +python preprocessing/preprocess.py --dataset_path |
| 55 | +python preprocessing/save_to_lmdb.py |
| 56 | +``` |
| 57 | + |
| 58 | +The preprocessing involves three steps: |
| 59 | +1. Extract the MOFid for all structures (CPU). |
| 60 | +2. Construct CG MOF data objects from MOFid deconstruction results (CPU or GPU). |
| 61 | +3. Save the CG MOF objects to an LMDB database (relatively fast). |
| 62 | + |
| 63 | +The entire preprocessing process for `BW-DB` may take several days (depending on the CPU/GPU resources). |
| 64 | + |
| 65 | +## Training |
| 66 | + |
| 67 | +First, configure the `.env` file to set correct paths to various directories. An [example](./.env) `.env` file is provided in the repository. |
| 68 | + |
| 69 | +### Training the building block encoder |
| 70 | + |
| 71 | +Before training the diffusion model, we need to train the building block encoder, a graph neural network that encodes the building blocks of MOFs. Train it with the following command: |
| 72 | + |
| 73 | +``` |
| 74 | +python mofdiff/scripts/train.py --config-name=bb |
| 75 | +``` |
| 76 | + |
| 77 | +The default output directory is `${oc.env:HYDRA_JOBS}/bb/${expname}/`. `oc.env:HYDRA_JOBS` is configured in `.env`. `expname` is configured in `configs/bb.yaml`. We use [hydra](https://hydra.cc/) for config management. All configs are stored in `configs/`. You can override config values, including `expname` (and therefore the output directory), with command line arguments. For example: |
| 78 | + |
| 79 | +``` |
| 80 | +python mofdiff/scripts/train.py --config-name=bb expname=bwdb_bb_dim_64 model.latent_dim=64 |
| 81 | +``` |
| 82 | + |
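The override syntax shown above extends to small sweeps. The following is a minimal sketch, not part of the repository: it only prints one training command per latent dimension so the commands can be inspected before launching. The helper name `bb_train_cmd` and the list of dimensions are illustrative assumptions; the `expname` pattern follows the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the encoder training command
# for a given latent dimension, reusing the override syntax from the README.
bb_train_cmd() {
  local dim="$1"
  printf 'python mofdiff/scripts/train.py --config-name=bb expname=bwdb_bb_dim_%s model.latent_dim=%s\n' "$dim" "$dim"
}

# Illustrative values only; choose dimensions that suit your experiment.
for dim in 32 64 128; do
  bb_train_cmd "$dim"
done
```

Because each run gets its own `expname`, the outputs land in separate directories under `${oc.env:HYDRA_JOBS}/bb/`.
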
| 83 | +Logging is done with [wandb](https://wandb.ai/site) by default. You need to log in to wandb with `wandb login` before training. The training logs will be saved to the wandb project `mofdiff`. You can override the wandb project with command line arguments, or disable wandb logging by removing the `wandb` entry in the [config](./conf/logging/default.yaml). |
| 84 | + |
| 85 | +### Training the coarse-grained diffusion model for MOFs |
| 86 | + |
| 87 | +Training the diffusion model requires `bb_encoder_path`, the output directory where the building block encoder was saved. Once the building block encoder has been trained to convergence, train the CG diffusion model with the following command: |
| 88 | + |
| 89 | +``` |
| 90 | +python mofdiff/scripts/train.py data.bb_encoder_path=${bb_encoder_path} |
| 91 | +``` |
| 92 | + |
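If the encoder was trained with the default output location, `bb_encoder_path` can be composed from the pieces described earlier (`${oc.env:HYDRA_JOBS}/bb/${expname}/`). The sketch below is illustrative only: the helper name `bb_encoder_dir` is hypothetical, and it assumes the `HYDRA_JOBS` shell variable holds the same value as the entry in `.env`. It prints the resulting training command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join HYDRA_JOBS and expname into the default
# encoder output directory, dropping one trailing slash if present.
bb_encoder_dir() {
  local hydra_jobs="$1" expname="$2"
  printf '%s/bb/%s\n' "${hydra_jobs%/}" "$expname"
}

# Assumption: HYDRA_JOBS is exported in the shell; the fallback is a placeholder.
bb_encoder_path="$(bb_encoder_dir "${HYDRA_JOBS:-/path/to/hydra_jobs}" "bwdb_bb_dim_64")"
echo "python mofdiff/scripts/train.py data.bb_encoder_path=${bb_encoder_path}"
```
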
| 93 | +For BW-DB, training the building block encoder takes roughly 3 days and training the diffusion model takes roughly 5 days on a single NVIDIA V100 GPU. |
| 94 | + |
| 95 | +## Generating CG MOF structures |
| 96 | + |
| 97 | +Pretrained models can be found [here](https://zenodo.org/record/10467288). |
| 98 | + |
| 99 | +With a trained CG diffusion model `${diffusion_model_path}`, generate random CG MOF structures with the following command: |
| 100 | + |
| 101 | +``` |
| 102 | +python mofdiff/scripts/sample.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path} |
| 103 | +``` |
| 104 | + |
| 105 | +`${bb_cache_path}` is the path to the building block embedding space, saved at the beginning of CG diffusion model training. To optimize MOF structures for a property (e.g., CO2 adsorption working capacity), use the following command: |
| 106 | + |
| 107 | +``` |
| 108 | +python mofdiff/scripts/optimize.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path} --data_path ${data_path} |
| 109 | +``` |
| 110 | + |
| 111 | +Available arguments for `sample.py` and `optimize.py` can be found in the respective files. The generated CG MOF structures will be saved in `${sample_path}=${diffusion_model_path}/${sample_tag}` as `samples.pt`. |
| 112 | + |
| 113 | +The CG structures generated with the diffusion model are not guaranteed to be realizable. We need to assemble the CG structures to recover the all-atom MOF structures. The following sections describe how to assemble the CG MOF structures; none of the subsequent steps require a GPU. |
| 114 | + |
| 115 | +## Assemble all-atom MOFs |
| 116 | + |
| 117 | +Assemble the CG MOF structures with the following command: |
| 118 | + |
| 119 | +``` |
| 120 | +python mofdiff/scripts/assemble.py --input ${sample_path}/samples.pt |
| 121 | +``` |
| 122 | + |
| 123 | +This command will assemble the CG MOF structures in `${sample_path}` and save the assembled MOFs in `${sample_path}/assembled.pt`. The cif files of the assembled MOFs will be saved in `${sample_path}/cif`. If the assembled MOFs came from property-driven optimization, the optimization arguments are saved to `${sample_path}/opt_args.json`. |
| 124 | + |
| 125 | +## Relax MOFs and compute structural properties |
| 126 | + |
| 127 | +The assembled structures may not be physically plausible, so they are relaxed using the UFF force field with LAMMPS. LAMMPS is already installed if you have followed the installation instructions in this README. The relaxation script also computes structural properties (e.g., pore volume, surface area) with [Zeo++](https://www.zeoplusplus.org/download.html) and the MOFids of the generated MOFs with [MOFid](https://github.com/snurr-group/mofid/tree/master). Both packages should be installed following the instructions in their respective repositories, and the corresponding paths should be added to `.env` before running the following command. Each step should take no more than a few minutes to complete on a single CPU. We use multiprocessing to parallelize the computation. |
| 128 | + |
| 129 | +Relax MOFs and compute structural properties with the following command: |
| 130 | + |
| 131 | +``` |
| 132 | +python mofdiff/scripts/uff_relax.py --input ${sample_path} |
| 133 | +``` |
| 134 | + |
| 135 | +This command will relax the assembled MOFs in `${sample_path}/cif` and save the relaxed MOFs in `${sample_path}/relaxed`. The structural properties of the relaxed MOFs will be saved in `${sample_path}/relaxed/zeo_props_relax.json`. The mofids of the relaxed MOFs will be saved in `${sample_path}/mofid`. |
| 136 | + |
| 137 | + |
| 138 | +## GCMC simulation for gas adsorption |
| 139 | + |
| 140 | +To run GCMC simulations, first install RASPA2 (simulation software) and eGULP (charge calculation software). |
| 141 | + |
| 142 | +RASPA2 can be installed with `pip`: |
| 143 | + |
| 144 | +``` |
| 145 | +pip install "RASPA2==2.0.4" |
| 146 | +``` |
| 147 | + |
| 148 | +You may need to install the following Linux dependencies first: |
| 149 | + |
| 150 | +``` |
| 151 | +apt-get update |
| 152 | +apt-get install -yq libgsl0-dev pkg-config libxrender-dev |
| 153 | +``` |
| 154 | + |
| 155 | +Install [eGULP](https://github.com/danieleongari/egulp) following the instructions in the repository. The following commands install eGULP in `/usr/local/bin/egulp`: |
| 156 | + |
| 157 | +``` |
| 158 | +mkdir /usr/local/bin/egulp && tar -xf egulp.tar -C /usr/local/bin/egulp |
| 159 | +cd /usr/local/bin/egulp/src && make && cd - |
| 160 | +``` |
| 161 | + |
| 162 | +Then, decompress the [force field parameters](./mofdiff/gcmc/UFF-TraPPe-scaled.tar) into the RASPA directory using the following command (assuming RASPA2 was installed with `pip` in `RASPA_PATH=PYTHONPATH/site-packages/RASPA2`): |
| 163 | + |
| 164 | +``` |
| 165 | +tar -xf UFF-TraPPe-scaled.tar -C RASPA_PATH/share/raspa/forcefield/UFF-TraPPe |
| 166 | +``` |
| 167 | + |
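Note that `tar -C` expects the target directory to exist already. The sketch below shows one way to build the target path from a site-packages directory and create it first. The helper name `forcefield_dir` is hypothetical, and the `RASPA2/share/raspa/forcefield/UFF-TraPPe` layout is taken from the command above rather than verified against a particular RASPA2 release. The commented lines are a usage suggestion, not something the repository ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the force field directory from the
# site-packages directory that contains the pip-installed RASPA2 package.
forcefield_dir() {
  local site_packages="$1"
  printf '%s/RASPA2/share/raspa/forcefield/UFF-TraPPe\n' "${site_packages%/}"
}

# Possible usage (assumes `python` is the interpreter RASPA2 was installed into):
#   site="$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
#   target="$(forcefield_dir "$site")"
#   mkdir -p "$target" && tar -xf UFF-TraPPe-scaled.tar -C "$target"
```
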
| 168 | +Calculate charges for relaxed samples in `${sample_path}` with the following command: |
| 169 | + |
| 170 | +``` |
| 171 | +python mofdiff/scripts/calculate_charges.py --input ${sample_path} |
| 172 | +``` |
| 173 | + |
| 174 | +This command will output cif files with charge information under `${sample_path}/mepo_qeq_charges`. |
| 175 | + |
| 176 | + |
| 177 | +Run GCMC simulations with the following command: |
| 178 | + |
| 179 | + |
| 180 | +``` |
| 181 | +python mofdiff/scripts/gcmc_screen.py --input ${sample_path}/mepo_qeq_charges |
| 182 | +``` |
| 183 | + |
| 184 | +The GCMC simulation results will be saved in `${sample_path}/gcmc/screening_results.json`. |
| 185 | + |
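Each stage above writes its output to a fixed location under `${sample_path}`, so a quick check of which files exist shows how far a batch of samples has progressed. The following sketch is an illustration only: the function name `pipeline_status` is hypothetical, and the list of paths is copied from the sections above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which pipeline outputs exist under a sample path.
# Entries follow the README: sampling, assembly, relaxation, MOFid,
# charge calculation, and GCMC screening.
pipeline_status() {
  local sample_path="$1" entry
  for entry in samples.pt assembled.pt cif relaxed \
               relaxed/zeo_props_relax.json mofid \
               mepo_qeq_charges gcmc/screening_results.json; do
    if [ -e "${sample_path}/${entry}" ]; then
      printf '%s: present\n' "$entry"
    else
      printf '%s: missing\n' "$entry"
    fi
  done
}

# Falls back to the current directory if sample_path is unset.
pipeline_status "${sample_path:-.}"
```
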
| 186 | +## Acknowledgement |
| 187 | + |
| 188 | +This codebase is based on several existing repositories: |
| 189 | + |
| 190 | +- [CDVAE](https://github.com/txie-93/cdvae) |
| 191 | +- [Open catalyst project](https://github.com/Open-Catalyst-Project/ocp) |
| 192 | +- [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric) |
| 193 | +- [PyTorch](https://github.com/pytorch/pytorch) |
| 194 | +- [Lightning](https://github.com/Lightning-AI/pytorch-lightning/) |
| 195 | +- [Hydra](https://github.com/facebookresearch/hydra) |