PLINDER - The Protein Ligand INteractions Dataset and Evaluation Resource
2
+
Copyright (c) 2024, Plinder Development Team
3
+
4
+
The PLINDER project is a collaboration between the
5
+
University of Basel, SIB Swiss Institute of Bioinformatics,
6
+
VantAI, NVIDIA, and MIT CSAIL.
7
+
8
+
If you find this software useful, please cite:
9
+
10
+
Durairaj, Janani, Yusuf Adeshina, Zhonglin Cao, Xuejin Zhang, Vladas Oleinikovas, Thomas Duignan, Zachary McClure, et al. “PLINDER: The Protein-Ligand Interactions Dataset and Evaluation Resource.” bioRxiv, July 17, 2024, 2024.07.17.603955. https://doi.org/10.1101/2024.07.17.603955.
**plinder**, short for **p**rotein **l**igand **in**teractions **d**ataset and **e**valuation **r**esource,
18
+
is a dataset and resource for training and evaluation of protein-ligand docking algorithms.
19
+
It is a comprehensive, annotated, high-quality dataset:
20
+
21
+
- \> 400k PLI systems across > 11k SCOP domains and > 50k unique small molecules
22
+
- 500+ annotations for each system, including protein and ligand properties, quality, matched molecular series and more
23
+
- Automated curation pipeline to keep up with the PDB
24
+
- 14 PLI metrics and over 20 billion similarity scores
25
+
- Unbound (_apo_) and _predicted_ AlphaFold2 structures linked to _holo_ systems
26
+
- `train-val-test` splits and the ability to tune splitting based on the learning task
27
+
- Robust evaluation harness to simplify and standardize performance comparison between models
13
28
14
29
# 📢 Notice
15
30
@@ -19,32 +34,16 @@ VantAI, NVIDIA, MIT CSAIL, and the community at large.
19
34
If you find `plinder` useful,
20
35
please see the citation file for details on how to cite.
21
36
22
-
# 🚧 Under construction
23
-
24
-
Please bear with us as we migrate the `plinder` project to
25
-
open source as we work to share it with the world. There are
26
-
some gaps in the code and documentation, which will be fixed
27
-
as soon as possible. The dataset itself is complete, but the
28
-
code to interact with some parts of the dataset is still under
29
-
development.
30
-
31
-
# 📚 About
32
-
33
-
**plinder**, short for **p**rotein **l**igand **in**teractions **d**ataset and **e**valuation **r**esource,
34
-
is a dataset and resource for training and evaluation of protein-ligand docking algorithms.
35
-
36
37
# 👨‍💻 Getting Started
37
38
38
39
Please use a virtual environment for the `plinder` project.
39
40
We recommend the [miniforge](https://github.com/conda-forge/miniforge) environment manager.
40
41
41
-
42
42
**NOTE**: We currently only support a Linux environment. `plinder`
43
43
uses `openstructure` for some of its functionality, which is available
44
44
from the `aivant` conda channel using `conda install aivant::openstructure`, but it is only built targeting Linux architectures.
45
45
For macOS users, please see the relevant [docker](#package-publishing) resources below.
46
46
47
-
48
47
## Install plinder
49
48
50
49
The `plinder` package can be obtained from GitHub:
@@ -60,8 +59,7 @@ Or with a development installation:
60
59
cd plinder
61
60
pip install -e '.[dev]'
62
61
63
-
64
-
# ⬇️ Getting the dataset
62
+
# ⬇️ Getting the dataset
65
63
66
64
Using the `plinder.core` API, you can transparently and lazily
67
65
download and interact with most of the components of the dataset.
@@ -109,21 +107,37 @@ with the dataset.
109
107
110
108
## 🏅 Gold standard benchmark sets
111
109
112
-
Discuss stratification efforts
110
+
As part of the `plinder` resource, we also provide train, validation, and test splits that are curated to minimize information leakage based on protein-ligand interaction similarity. In addition, we have prioritized systems that have a linked experimental `apo` structure or matched molecular series, to support realistic inference scenarios for hit discovery and optimization.
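
To illustrate the general idea of a leakage-aware split, here is a minimal sketch that assigns whole similarity clusters to a single split, so that no cluster spans train and test. It is not the `plinder` splitting code: the function name `split_by_cluster`, the `cluster_of` mapping, and the split fractions are all hypothetical, and the clusters are assumed to come from some protein-ligand interaction similarity threshold computed elsewhere.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test (hypothetical helper).

    cluster_of: dict mapping system id -> cluster id. Systems sharing a
    cluster id are assumed to be similar and must stay in the same split.
    """
    # Group system ids by their cluster id.
    clusters = defaultdict(list)
    for system_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(system_id)

    # Shuffle clusters deterministically so the split is reproducible.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    names = ("train", "val", "test")
    total = len(cluster_of)
    splits = {name: [] for name in names}
    idx = 0
    for cluster_id in cluster_ids:
        # Move on to the next split once the current one reached its share.
        while idx < len(names) - 1 and len(splits[names[idx]]) >= fractions[idx] * total:
            idx += 1
        splits[names[idx]].extend(clusters[cluster_id])
    return splits


if __name__ == "__main__":
    # Toy data: 20 systems in 10 clusters of two systems each.
    toy = {f"sys{i}": f"c{i // 2}" for i in range(20)}
    for name, ids in split_by_cluster(toy).items():
        print(name, len(ids))
```

Because clusters are assigned as units, the realized split sizes only approximate the requested fractions; that is the price of keeping similar systems together.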
111
+
Finally, particular care is taken with the test set, which is further prioritized to contain high-quality structures that provide unambiguous ground truths for performance benchmarking.
Moreover, as we anticipate this resource being used to benchmark a wide range of methods, including those that simultaneously predict protein structure (a.k.a. co-folding) and those that generate novel ligand structures, we further stratified the test set (by novel ligand, pocket, protein, or all) to cover a wide range of tasks.
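
The stratification described above can be pictured with a small sketch that labels each test system by which of its components is novel relative to the training set. This is only an illustration under stated assumptions: the `novelty_label` function, the boolean "seen in train" flags, and the handling of combinations such as `ligand+pocket` are hypothetical and do not reflect the actual `plinder` stratification criteria.

```python
from collections import defaultdict


def novelty_label(ligand_seen, pocket_seen, protein_seen):
    """Label a test system by its novel components (hypothetical helper).

    Each flag is True if a similar ligand, pocket, or protein is assumed
    to exist in the training set, and False if it is novel.
    """
    flags = (("ligand", ligand_seen), ("pocket", pocket_seen), ("protein", protein_seen))
    novel = [name for name, seen in flags if not seen]
    if len(novel) == 3:
        return "all"
    # Combined labels like "ligand+pocket" are an assumption of this sketch.
    return "+".join(novel) if novel else "none"


def stratify(test_systems):
    """Group system ids by novelty label.

    test_systems: dict mapping system id -> (ligand_seen, pocket_seen, protein_seen).
    """
    strata = defaultdict(list)
    for system_id, flags in test_systems.items():
        strata[novelty_label(*flags)].append(system_id)
    return dict(strata)


if __name__ == "__main__":
    # Toy system ids and flags, made up for illustration only.
    toy = {
        "sysA": (False, True, True),    # novel ligand only
        "sysB": (False, False, False),  # everything novel
        "sysC": (True, True, True),     # nothing novel
    }
    print(stratify(toy))
```

Reporting metrics per stratum, rather than over the whole test set, lets methods with different goals (for example, ligand generation versus co-folding) be compared on the subset that matches their task.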
116
+
117
+
Our latest test split [#TODO] contains:
118
+
119
+
| Novel | # of systems | # of high quality | stratification criteria |