Commit d9eafe0

committed
first commit
0 parents  commit d9eafe0

43 files changed

+303615
-0
lines changed

LICENSE

Lines changed: 21 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,21 @@
1+
Copyright (c) 2024 Merck & Co., Inc., Rahway, NJ, USA and its affiliates. All rights reserved.
2+
3+
Armen Beck (MSD)
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

LICENSES_THIRD_PARTY

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
Below are listed all the third-party libraries, versions, and licenses
2+
that are utilized by Dedenser, to the best of our knowledge.
3+
4+
Name Version License
5+
Bottleneck 1.3.7 BSD License
6+
PyQt5 5.15.10 GPL v3
7+
PyQt5-sip 12.13.0 SIP
8+
Rtree 1.2.0 MIT License
9+
alphashape 1.3.1 MIT License
10+
click 8.1.7 BSD License
11+
click-log 0.4.0 MIT License
12+
colorama 0.4.6 BSD License
13+
contourpy 1.2.0 BSD License
14+
cycler 0.11.0 BSD License
15+
et-xmlfile 1.1.0 MIT License
16+
fonttools 4.25.0 MIT License
17+
future 1.0.0 MIT License
18+
joblib 1.2.0 BSD License
19+
kiwisolver 1.4.4 BSD License
20+
llvmlite 0.42.0 BSD
21+
matplotlib 3.8.0 Python Software Foundation License
22+
mkl-fft 1.3.8 BSD
23+
mkl-random 1.2.4 BSD
24+
mkl-service 2.4.0 BSD
25+
mordred 1.2.0 BSD-3-Clause
26+
munkres 1.1.4 Apache Software License
27+
networkx 2.8.8 BSD License
28+
numba 0.59.0 BSD License
29+
numexpr 2.8.7 MIT License
30+
numpy 1.23.5 BSD License
31+
numpy 1.26.4 BSD License
32+
openpyxl 3.0.10 MIT License
33+
packaging 23.2 Apache Software License; BSD License
34+
pandas 2.1.4 BSD License
35+
pillow 10.2.0 Historical Permission Notice and Disclaimer (HPND)
36+
ply 3.11 BSD
37+
point-cloud-utils 0.30.4 MIT License
38+
pynndescent 0.5.10 BSD
39+
pyparsing 3.0.9 MIT License
40+
python-dateutil 2.8.2 Apache Software License; BSD License
41+
pytz 2023.3.post1 MIT License
42+
rdkit 2023.9.5 BSD-3-Clause
43+
scikit-learn 1.3.0 BSD License
44+
scipy 1.11.4 BSD License
45+
shapely 2.0.3 BSD License
46+
sip 6.7.12 SIP
47+
six 1.16.0 MIT License
48+
threadpoolctl 2.2.0 BSD License
49+
tornado 6.3.3 Apache Software License
50+
tqdm 4.65.0 MIT License; Mozilla Public License 2.0 (MPL 2.0)
51+
trimesh 4.2.0 MIT License
52+
tzdata 2023.3 Apache Software License
53+
umap-learn 0.5.4 BSD

README.md

Lines changed: 192 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,192 @@
1+
# Dedenser
2+
A Python tool for creating and downsampling chemical point clouds.
3+
4+
## Overview
5+
6+
## Dependencies
7+
We recommend installing the necessary packages individually if running Dedenser from source. Otherwise, YMLs with conda environments are provided in `envs`.
8+
9+
* alphashape
10+
* matplotlib
11+
* mordred
12+
* numpy
13+
* openpyxl
14+
* pandas
15+
* point-cloud-utils
16+
* rdkit
17+
* scikit-learn
18+
* scipy
19+
* umap-learn
20+
21+
## Using Dedenser
22+
Dedenser is packaged and written with the intent of being used as a command line interface tool. Although users may utilise the code as they see fit, this tutorial is intended to assist those using the command line interface functions.
23+
24+
### Generating Chemical Point Clouds
25+
26+
Users can generate chemical point clouds from files with the command:
27+
28+
```
29+
python -m dedenser mkcloud -o <path to output> <path of input>
30+
```
31+
32+
However, users may want to use, or may be provided with, a list of SMILES. For this we provide the command to make a chemical point cloud, using umap-learn to embed chemical descriptors generated by Mordred/RDKit.
33+
34+
With a subset of ZINC, this can be done with the following command:
35+
36+
```
37+
python -m dedenser mkcloud -o data/ZINC_short_cloud data/ZINC_short.txt
38+
```
39+
```
40+
Loading Scikit-learn, RDKit, and Mordred...
41+
Finished loading dependencies, featurizing SMILES...
42+
Loading SMILES...
43+
Converting to Mols...
44+
Calculating 2D descriptors from Mols...
45+
100%|█████████████████████████████████████████████████████████████████████████████| 2000/2000 [01:07<00:00, 29.82it/s]
46+
Finished 2D descriptor calculations.
47+
Loading UMAP and embedding chemical point cloud...
48+
Done! Saved chemical point cloud at 'data/ZINC_short_cloud.npy'.
49+
Saved 2D descriptors at 'data/ZINC_short_cloud.csv'.
50+
```
51+
52+
The default column index for SMILES is 0, but can be user defined with the '-p' or '--pos' flags as such:
53+
```
54+
python -m dedenser mkcloud -p 3 -o <path to output> <path of input.txt>
55+
```
56+
For those not familiar with zero indexing, an index of 3 would indicate the 4th column in the datasheet.
57+
58+
59+
If users need to use delimiters other than the default of ',', they can specify one with the '-s' or '--sep' flag as such:
60+
```
61+
python -m dedenser mkcloud -p 3 -s \t -o <path to output> <path of input.tsv>
62+
```
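To make the zero-indexed column position and the delimiter concrete, here is a minimal Python sketch using only the standard library. It is not Dedenser's own parsing code: the function `read_smiles_column`, its parameter names, and the sample data are hypothetical, and only mirror the roles of the '-p', '-s', and '-H' flags described above.

```python
import csv
import io


def read_smiles_column(text, pos=0, sep=",", header=False):
    """Return the entries of column `pos` (zero-indexed) from delimited text.

    Hypothetical helper for illustration; not part of Dedenser.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    if header:
        rows = rows[1:]  # ignore the header row
    # Skip empty rows, then pick the requested column from each remaining row.
    return [row[pos] for row in rows if row]


# Made-up tab-delimited sample: SMILES sit in the 4th column, i.e. index 3.
sample = (
    "id\tname\tsource\tsmiles\n"
    "1\tethanol\tdemo\tCCO\n"
    "2\tbenzene\tdemo\tc1ccccc1\n"
)

smiles = read_smiles_column(sample, pos=3, sep="\t", header=True)
print(smiles)
```

Here `pos=3` selects the fourth column, matching the zero-indexing note above.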
63+
64+
Additionally, if dealing with Excel sheets, the '-x' or '-excel' flags can be used (and will also save Excel sheets for other commands with outputs).
65+
```
66+
python -m dedenser mkcloud -x -p 3 -o <path to output> <path of input.xlsx>
67+
```
68+
For Excel sheets the specification of delimiters should not be needed.
69+
70+
Lastly, if headers are present, they can be ignored with the '-H' or '--header' flags.
71+
72+
### Visualizing a Chemical Point Cloud Natively
73+
74+
To simply visualize a chemical point cloud, use the 'vis' command:
75+
```
76+
python -m dedenser vis data/ZINC_short_cloud.npy
77+
```
78+
![native cloud](data/ZINC_sc_vis.svg)
79+
80+
To save the figure, the 'vis' command requires both the '-f' or '--fig' flag and the '-o' or '--path_out' flag with an output path:
81+
```
82+
python -m dedenser vis -f -o data/ZINC_sc_vis data/ZINC_short_cloud.npy
83+
```
84+
85+
### Downsampling with Dedenser
86+
87+
To downsample with Dedenser, the dedense command is used with the '-t' or '--targ' flags to specify the target percentage to be downsampled to:
88+
```
89+
python -m dedenser dedense -o data/ZINC_sc_d30 -t 0.3 data/ZINC_short_cloud.npy
90+
```
91+
```
92+
Loading dedenser...
93+
Dedensing...
94+
Target of 600 molecules
95+
Downsampled to 602 molecules
96+
Done! Saved dedensed index at: data/ZINC_sc_d30.npy
97+
```
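The numbers in this log can be checked with a little arithmetic: the example cloud holds 2000 molecules (see the featurization progress bar earlier), so a target of 0.3 corresponds to 600 molecules, and the 602 actually retained is about 30.1%. The sketch below only reproduces that arithmetic. The function names are hypothetical, and the use of `round` is an assumption, since the README does not state how Dedenser rounds the target.

```python
def target_count(n_molecules, target_fraction):
    """Number of molecules corresponding to a target fraction.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    return round(n_molecules * target_fraction)


def retained_fraction(n_retained, n_molecules):
    """Fraction of the original cloud that was actually kept."""
    return n_retained / n_molecules


target = target_count(2000, 0.3)          # value reported as "Target of ..."
achieved = retained_fraction(602, 2000)   # from "Downsampled to 602 molecules"
print(target, f"{achieved:.1%}")
```

The small gap between target and result is consistent with per-cluster targets being whole numbers of molecules, although the README does not spell out the exact cause.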
98+
99+
Additionally, the '-a' or '--alpha' flags can be used to employ alpha shapes (concave hulls) instead of convex hulls when calculating the volumes of clusters. The '-S' or '--strict' flags completely drop clusters with calculated membership retentions below 1 (which would otherwise be brought up to 1). The difference in outputs resulting from these flags is highly dependent on the initial chemical point cloud being downsampled and on the downsampling target, and may not be significant.
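The effect of the strict option can be illustrated with a small sketch. This is not Dedenser's implementation: the function `apply_minimum_retention`, the cluster labels, and the retention values are hypothetical, and rounding retentions of 1 or more to the nearest integer is an assumption. It only shows the behaviour described above, where a cluster whose calculated retention falls below 1 is either raised to 1 or dropped entirely.

```python
def apply_minimum_retention(retentions, strict=False):
    """Turn calculated per-cluster retentions into whole-molecule counts.

    retentions: dict mapping a cluster label to the calculated number of
    members to keep (may be fractional). Hypothetical helper, not Dedenser code.
    """
    counts = {}
    for label, value in retentions.items():
        if value < 1:
            if strict:
                continue          # strict: drop the cluster completely
            counts[label] = 1     # default: bring the cluster up to one member
        else:
            counts[label] = round(value)  # rounding rule is an assumption
    return counts


# Made-up calculated retentions for three clusters.
calculated = {"A": 12.4, "B": 0.6, "C": 3.0}

default_counts = apply_minimum_retention(calculated)
strict_counts = apply_minimum_retention(calculated, strict=True)
print(default_counts)
print(strict_counts)
```

In this toy case the two modes differ only in whether cluster "B" keeps a single representative, which is why the practical difference depends on how many very small retentions a given cloud and target produce.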
100+
101+
### Visualizing Downsampled Chemical Point Clouds
102+
103+
When visualizing a chemical point cloud that has been downsampled, the '-d' or '--down' flags should be used to specify the pathing for the indexes generated during downsampling.
104+
```
105+
python -m dedenser vis -f -d data/ZINC_sc_d30.npy -o data/ZINC_sc_d30_vis data/ZINC_short_cloud.npy
106+
```
107+
![dedensed cloud](data/ZINC_sc_d30_vis.svg)
108+
109+
### Making a Results Sheet
110+
111+
To make a sheet with the SMILES and chemical point cloud coordinates of the downsampled result, the mksheet command is used. The '-c' or '--cloud' flags specify the file path of the original chemical point cloud, while '-d' or '--down' is used the same way as with the vis command.
112+
```
113+
python -m dedenser mksheet -c data/ZINC_short_cloud.npy -d data/ZINC_sc_d30.npy -o data/ZINC_sc_d30_sheet.csv data/ZINC_short.txt
114+
```
115+
```
116+
Completed with no errors, wrote results to data/ZINC_sc_d30_sheet.csv
117+
```
118+
119+
We can then open the sheet with our results:
120+
![sheet](data/resulting_sheet.gif)
121+
122+
123+
Note that this is the only time the file extension of the output file should (or can) be specified!
124+
125+
Excel sheets cannot be specified as the output type if the input is not an Excel sheet. However, all files generated are comma delimited and can be read and rendered by Excel.
126+
127+
### Other Downsampling Options
128+
129+
The downsampling done earlier greatly reduced some dense regions in the chemical point cloud. To visualize the HDBSCAN clusters both before and
130+
after downsampling, the --SHOW flag can be used.
131+
132+
```
133+
python -m dedenser dedense --SHOW -o data/ZINC_sc_d30 -t 0.3 data/ZINC_short_cloud.npy
134+
```
135+
```
136+
Loading dedenser...
137+
Dedensing...
138+
```
139+
![clust](data/sc_clust.svg)
140+
```
141+
Target of 600 molecules
142+
Downsampled to 602 molecules
143+
```
144+
![clust](data/sc_clust_down.svg)
145+
```
146+
Done! Saved dedensed index at: data/ZINC_sc_d30.npy
147+
```
148+
149+
The number of clusters is quite low, and can be increased by lowering or decreased by increasing the 'min_size' HDBSCAN parameter.
150+
'min_size' has a default value of 5, and can be specified using the '-m' or '-min' flags.
151+
152+
```
153+
python -m dedenser dedense -m 15 --SHOW -o data/ZINC_sc_d30m15 -t 0.3 data/ZINC_short_cloud.npy
154+
```
155+
```
156+
Loading dedenser...
157+
Dedensing...
158+
```
159+
![clust](data/sc_clust_m15.svg)
160+
```
161+
Target of 600 molecules
162+
Downsampled to 594 molecules
163+
```
164+
![clust](data/sc_clust_down_m15.svg)
165+
```
166+
Done! Saved dedensed index at: data/ZINC_sc_d30.npy
167+
```
168+
169+
Here we can see that by increasing the minimum number of members for a group to be considered a cluster, the number of clusters is decreased. Further details are described in the scikit-learn documentation for [HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html) with key aspects surrounding minimum cluster size [here](https://scikit-learn.org/stable/modules/clustering.html#hdbscan).
170+
171+
One last key feature for Dedenser is the ability to downsample based on the density of clusters.
172+
This is done using weight-parameterized exponentials that calculate normalized density coefficients ($D_x$). The weighting term can be increased, or decreased to negative values, to prioritize dense or sparse clusters respectively (eq. 1).
173+
174+
(1) $D_{x}=\dfrac{e^{w(d_x/d_t-1)}}{\displaystyle\sum_{i=1}^{n}e^{w(d_i/d_t-1)}}$, where $w$ is the weight, $n$ is the number of clusters, $d_i$ is the density of cluster $i$, and $d_t = \displaystyle\sum_{i=1}^{n}d_i$
175+
176+
Density coefficients are multiplied by the remaining target number of molecules ($R$) to be retained, giving the target value ($T_x$) that each cluster is downsampled to (eq. 2).
177+
178+
(2) $T_x = R*D_x$
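The two equations can be evaluated directly. The following is a minimal sketch in plain Python, not Dedenser's own code: the function names and the example densities are hypothetical, and it assumes the numerator of eq. 1 uses the density of the cluster being scored while the denominator sums over all clusters.

```python
import math


def density_coefficients(densities, weight):
    """Normalized density coefficients D_x (eq. 1) for a list of cluster densities."""
    d_total = sum(densities)  # d_t: sum of all cluster densities
    exps = [math.exp(weight * (d / d_total - 1.0)) for d in densities]
    norm = sum(exps)          # denominator of eq. 1
    return [e / norm for e in exps]


def cluster_targets(densities, weight, remaining):
    """Per-cluster targets T_x = R * D_x (eq. 2), left as floats."""
    return [remaining * c for c in density_coefficients(densities, weight)]


# Made-up densities: one dense cluster and two sparse ones.
densities = [8.0, 1.5, 0.5]

coeffs_dense = density_coefficients(densities, 1.0)      # favors dense clusters
coeffs_even = density_coefficients(densities, 0.0)       # all clusters equal
coeffs_sparse = density_coefficients(densities, -200.0)  # favors sparse clusters
targets = cluster_targets(densities, 1.0, remaining=100)

for name, c in [("w=1", coeffs_dense), ("w=0", coeffs_even), ("w=-200", coeffs_sparse)]:
    print(name, [round(x, 4) for x in c])
print("targets", [round(t, 2) for t in targets])
```

Because the coefficients are normalized, they always sum to 1, so the targets sum to $R$. A weight of 0 spreads the remaining molecules evenly across clusters, positive weights shift them toward dense clusters, and negative weights shift them toward sparse ones.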
179+
180+
This density based weighting can recover the downsampled clusters with high density from earlier:
181+
```
182+
python -m dedenser dedense -dw 1 --SHOW -o data/ZINC_sc_d30w1 -t 0.3 data/ZINC_short_cloud.npy
183+
```
184+
![clust](data/sc_clust_down_w1.svg)
185+
186+
The favoring of low density clusters can also be somewhat recovered by using negative weights:
187+
```
188+
python -m dedenser dedense -dw -200 --SHOW -o data/ZINC_sc_d30w-200 -t 0.3 data/ZINC_short_cloud.npy
189+
```
190+
![clust](data/sc_clust_down_w-200.svg)
191+
192+
The weighting may require some manual tuning depending on what is desired by the user.
