
Commit e509ad8

Expand ZINC sample from 1k to 10k and regenerate gXTB predictions
- Update plot_zinc_vs_cawkwell.py: sample 10,000 ZINC molecules, remove stale cawkwell_si_atom_counts.csv reference, set 8 threads
- Regenerate zinc_gxtb_predictions.csv (9,990 successful predictions)
- Regenerate zinc_vs_cawkwell_gxtb_histogram.png with 10k sample
- Update README with 10k sample description
1 parent 840cbe7 commit e509ad8

5 files changed

Lines changed: 20006 additions & 1011 deletions


analysis/README.md

Lines changed: 3 additions & 3 deletions
@@ -18,14 +18,14 @@ The training data (`deltahf/data/training_data.csv`) contains 531 molecules acro
 
 ---
 
-## ZINC vs Cawkwell Comparison
+## Comparing typical drug-like molecules with energetic CHNO molecules
 
 **Script:** `plot_zinc_vs_cawkwell.py`
 
 This script compares predicted ΔHf° distributions for two molecule sets:
 
 - **Cawkwell energetic set** (531 molecules) — energetic CHNO molecules from Cawkwell et al. (2021)
-- **ZINC drug-like sample** (1,000 molecules) — randomly sampled from the ZINC 250k drug-like dataset, filtered to supported elements and neutralised
+- **ZINC drug-like sample** (10,000 molecules) — randomly sampled from the ZINC 250k drug-like dataset, filtered to supported elements and neutralised
 
 The comparison assesses how the predicted ΔHf° distributions differ between energetic CHNO molecules and typical drug-like molecules.
 
@@ -60,7 +60,7 @@ The same comparison using xTB + `bondorder_ext` shows a similar pattern. The dis
 | File | Description |
 |------|-------------|
 | `250k_rndm_zinc_drugs_clean_3.csv` | ZINC 250k drug-like dataset (source data) |
-| `zinc_sample_1000.csv` | 1,000-molecule random sample (neutralised, supported elements only) |
+| `zinc_sample_10000.csv` | 10,000-molecule random sample (neutralised, supported elements only) |
 | `cawkwell_energetic.csv` | 531 energetic CHNO molecules from Cawkwell et al. (2021) |
 | `cawkwell_gxtb_predictions.csv` | gXTB + bondorder_ext predictions for Cawkwell energetic set |
 | `zinc_gxtb_predictions.csv` | gXTB + bondorder_ext predictions for ZINC sample |
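
The commit message reports 9,990 successful predictions from the 10,000-molecule sample. Below is a minimal sketch of how that count could be checked from the sample and prediction CSVs. It assumes, without confirmation from the diff, that molecules whose prediction failed are simply absent from the output file and that each file has a single header row. `count_rows` and `summarize_predictions` are hypothetical helpers, not part of the repository.

```python
import csv
from pathlib import Path


def count_rows(csv_path: Path) -> int:
    """Count non-empty data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for row in reader if row)


def summarize_predictions(n_input: int, n_output: int) -> str:
    """Format a one-line summary of how many inputs produced a prediction."""
    if n_input <= 0:
        raise ValueError("n_input must be positive")
    missing = n_input - n_output
    rate = 100.0 * n_output / n_input
    return f"{n_output}/{n_input} predicted ({rate:.1f}%), {missing} missing"
```

For the files in this commit, the call would look like `summarize_predictions(count_rows(Path("zinc_sample_10000.csv")), count_rows(Path("zinc_gxtb_predictions.csv")))`.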

analysis/plot_zinc_vs_cawkwell.py

Lines changed: 4 additions & 10 deletions
@@ -15,9 +15,8 @@
 PLOTS_DIR = Path(__file__).parent
 REPO_ROOT = PLOTS_DIR.parent
 
-CAWKWELL_CSV = PLOTS_DIR / "cawkwell_si_atom_counts.csv"
 ZINC_CSV = PLOTS_DIR / "250k_rndm_zinc_drugs_clean_3.csv"
-ZINC_SAMPLE_CSV = PLOTS_DIR / "zinc_sample_1000.csv"
+ZINC_SAMPLE_CSV = PLOTS_DIR / "zinc_sample_10000.csv"
 CAWKWELL_INPUT_CSV = PLOTS_DIR / "cawkwell_energetic.csv"
 CAWKWELL_OUT_CSV = PLOTS_DIR / "cawkwell_gxtb_predictions.csv"
 ZINC_OUT_CSV = PLOTS_DIR / "zinc_gxtb_predictions.csv"
@@ -36,12 +35,7 @@ def neutralize_smiles(smiles: str) -> str:
 
 
 def prepare_inputs(seed: int = 42):
-    # Cawkwell: extract smiles + name
-    cawk = pd.read_csv(CAWKWELL_CSV)[["smiles", "name"]]
-    cawk.to_csv(CAWKWELL_INPUT_CSV, index=False)
-    print(f"Cawkwell input: {len(cawk)} molecules -> {CAWKWELL_INPUT_CSV}")
-
-    # ZINC: filter to supported elements, sample 1000, then neutralize
+    # ZINC: filter to supported elements, sample 10000, then neutralize
     zinc = pd.read_csv(ZINC_CSV)
     def has_supported_elements(smi):
         mol = Chem.MolFromSmiles(smi.strip())
@@ -51,7 +45,7 @@ def has_supported_elements(smi):
     zinc["smiles"] = zinc["smiles"].str.strip()
     supported = zinc[zinc["smiles"].apply(has_supported_elements)]
     print(f"ZINC: {len(supported)}/{len(zinc)} molecules have supported elements")
-    sample = supported.sample(n=1000, random_state=seed)[["smiles"]]
+    sample = supported.sample(n=10000, random_state=seed)[["smiles"]]
     sample["smiles"] = sample["smiles"].apply(neutralize_smiles)
     print(f"ZINC sample: {len(sample)} molecules (neutralized) -> {ZINC_SAMPLE_CSV}")
     sample.to_csv(ZINC_SAMPLE_CSV, index=False)
@@ -66,7 +60,7 @@ def run_predict(input_csv: Path, output_csv: Path, label: str):
         "-i", str(input_csv),
         "--model", "bondorder_ext",
         "--use-gxtb",
-        "--xtb-threads", "16",
+        "--xtb-threads", "8",
         "--cache-dir", str(CACHE_DIR / label),
         "-o", str(output_csv),
     ]
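
The last hunk lowers the hardcoded `--xtb-threads` value from 16 to 8. One possible alternative to a fixed number is to cap a requested thread count at the CPU count the machine reports. The sketch below is only an illustration: `pick_xtb_threads` and `thread_args` are hypothetical helpers that do not exist in the repository, and only the `--xtb-threads` flag name is taken from the diff.

```python
import os
from typing import List


def pick_xtb_threads(requested: int = 8) -> int:
    """Return the requested thread count, capped at the reported CPU count.

    os.cpu_count() can return None, so fall back to a single thread.
    The result is never less than 1.
    """
    cpus = os.cpu_count() or 1
    return max(1, min(requested, cpus))


def thread_args(requested: int = 8) -> List[str]:
    """Build the command-line fragment for the thread flag seen in the diff."""
    return ["--xtb-threads", str(pick_xtb_threads(requested))]
```

The returned list could be spliced into the command list in `run_predict` in place of the literal `"--xtb-threads", "8"` pair, so the same script runs unchanged on machines with fewer cores.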
