pdb_cpp is a structural bioinformatics toolkit with a C++ core and Python API for fast PDB/mmCIF parsing, atom selection, sequence/structure alignment, TM-score, and DockQ evaluation.
- Read/write
.pdb,.cif,.pqr, and.grofiles - Atom/residue/chain selections (including geometric
withinqueries) - Sequence extraction and pairwise sequence alignment
- Sequence-based structural superposition and chain-permutation alignment
- TM-align/TM-score through the bundled USalign/TM-align core
- DockQ metrics (
DockQ,Fnat,Fnonnat,LRMS,iRMS,rRMS) - Hydrogen bond detection (Baker & Hubbard geometric method, no explicit H required)
- Solvent-accessible surface area (Shrake-Rupley) on
Model, plus buried protein-protein surface helper - Secondary structure assignment
- Core geometric helpers (e.g., distance matrix)
python -m pip install pdb-cppgit clone https://github.com/samuelmurail/pdb_cpp
cd pdb_cpp
python -m pip install -e .For development:
python -m pip install -r requirements.txt
pytestfrom pdb_cpp import Coor
# Load from local file
coor = Coor("tests/input/1y0m.cif")
# Or fetch by PDB ID (mmCIF is downloaded and cached)
coor_pdb = Coor(pdb_id="1y0m")
# Or use the RCSB helper for explicit structure choices
from pdb_cpp import rcsb
bio_assembly = rcsb.load("5a9z", structure="biological_assembly", assembly_id=1)
asym_unit = rcsb.load("5a9z", structure="asymmetric_unit")
print(coor.model_num) # number of models
print(coor.get_aa_seq()) # chain -> sequence
# Write selection/structure back to disk
coor.write("out_structure.pdb")select_atoms() supports boolean logic, numeric comparisons, and spatial queries:
# Backbone residues 6..58 from chain A
sel_1 = coor.select_atoms("backbone and chain A and residue >= 6 and residue <= 58")
# Interface-like query: atoms of chain A within 5 Å of chain B
sel_2 = coor.select_atoms("chain A and within 5.0 of chain B")
# Combination with negation
sel_3 = coor.select_atoms("name CA and not within 5.0 of resname HOH")
# Numeric filters
sel_4 = coor.select_atoms("resname HOH and x >= 20.0")Common keywords: name, resname, chain, resid, residue, x, y, z, beta, occ, protein, backbone, noh, within.
from pdb_cpp import Coor, alignment, core, analysis
coor_1 = Coor("tests/input/1u85.pdb")
coor_2 = Coor("tests/input/1ubd.pdb")
seq_1 = coor_1.get_aa_seq()["A"]
seq_2 = coor_2.get_aa_seq()["C"]
aln_1, aln_2, score = alignment.align_seq(seq_1, seq_2)
alignment.print_align_seq(aln_1, aln_2)
# Get atom correspondences and align coordinates in-place
idx_1, idx_2 = core.get_common_atoms(coor_1, coor_2, chain_1=["A"], chain_2=["C"])
core.coor_align(coor_1, coor_2, idx_1, idx_2, frame_ref=0)
# RMSD after alignment
rmsd_values = analysis.rmsd(coor_1, coor_2, index_list=[idx_1, idx_2])
print(rmsd_values[0])For multi-chain complexes with uncertain chain mapping, use chain permutation:
rmsds, index_mappings = alignment.align_chain_permutation(coor_1, coor_2)from pdb_cpp import Coor
from pdb_cpp.core import tmalign_ca
ref = Coor("tests/input/1y0m.cif")
mob = Coor("tests/input/1ubd.pdb")
result = tmalign_ca(ref, mob, chain_1=["A"], chain_2=["C"], mm=1)
print(result.L_ali) # aligned length
print(result.rmsd) # RMSD on aligned residues
print(result.TM1) # TM-score normalized by structure 1
print(result.TM2) # TM-score normalized by structure 2If you use the USalign/TM-align functionality in pdb_cpp, please cite:
- Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang (2022) Nat Methods. 19(9), 1109-1115.
- Chengxin Zhang, Anna Marie Pyle (2022) iScience. 25(10), 105218.
Secondary structure helper:
from pdb_cpp import TMalign
ss = TMalign.compute_secondary_structure(ref)
print(ss[0]["A"]) # DSSP-like secondary structure string for chain Afrom pdb_cpp import Coor, analysis
model = Coor("tests/input/1rxz_colabfold_model_1.pdb")
native = Coor("tests/input/1rxz.pdb")
# Chain roles can be inferred automatically,
# or provided explicitly with rec_chains/lig_chains arguments.
scores = analysis.dockQ(model, native)
print(scores["DockQ"][0])
print(scores["Fnat"][0], scores["Fnonnat"][0])
print(scores["LRMS"][0], scores["iRMS"][0], scores["rRMS"][0])Command-line usage is available through the installed pdb_cpp_dockq bin,
which uses analysis.dockQ_multimer() and therefore supports the
automatic native-to-model chain mapping:
pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb
# Optional explicit native:model mapping
pdb_cpp_dockq tests/input/1a2k_model.pdb tests/input/1a2k.pdb --chain-map A:B,B:A,C:CIf you use DockQ scoring in pdb_cpp, please cite:
- DockQ, DOI: 10.1093/bioinformatics/btae586
pdb_cpp.hbond identifies hydrogen bonds using the Baker & Hubbard
geometric criteria. Hydrogen positions are reconstructed algebraically when
not present in the file, so no pre-processing step is required.
from pdb_cpp import Coor
from pdb_cpp import hbond
coor = Coor("tests/input/2rri.cif")
# One list of HBond objects per model frame
all_bonds = hbond.hbonds(coor)
print(f"Model 0: {len(all_bonds[0])} H-bonds")
# Inspect bond geometry
b = all_bonds[0][0]
print(f"Donor : {b.donor_chain}{b.donor_resid} {b.donor_heavy_name}")
print(f"Acceptor: {b.acceptor_chain}{b.acceptor_resid} {b.acceptor_name}")
print(f"d(D..A) = {b.dist_DA:.2f} Å ∠DHA = {b.angle_DHA:.1f}°")
# Cross-selection: protein donors to nucleic-acid acceptors
rna_bonds = hbond.hbonds(coor, donor_sel="protein", acceptor_sel="nucleic")Default cutoffs follow Baker & Hubbard (1984):
dist_DA_cutoff=3.5 Å, dist_HA_cutoff=2.5 Å, angle_cutoff=90°.
from pdb_cpp import Coor, analysis
coor = Coor("tests/input/1a2k.pdb")
# SASA returns one result per model
chain_a = analysis.sasa(coor, selection="chain A", by_residue=True)[0]
print(chain_a["total"])
print(chain_a["residue_areas"])
# Protein-protein buried surface from two selections
metrics = analysis.buried_surface_area(coor, "chain A", "chain B", by_residue=True)[0]
print(metrics["buried_surface"])
print(metrics["interface_area"])
print(metrics["residue_buried_surface"])For a protein-protein interface, call analysis.buried_surface_area() on the two
partner selections. The returned values follow:
buried_surface = sasa_receptor + sasa_ligand - sasa_complexinterface_area = buried_surface / 2
Example:
from pdb_cpp import Coor, analysis
coor = Coor("tests/input/1a2k.pdb")
interface = analysis.buried_surface_area(
coor,
receptor_sel="chain A",
ligand_sel="chain B",
by_residue=True,
)[0]
print(f"Complex SASA : {interface['complex_sasa']:.2f} A^2")
print(f"Receptor SASA : {interface['receptor_sasa']:.2f} A^2")
print(f"Ligand SASA : {interface['ligand_sasa']:.2f} A^2")
print(f"Buried surface : {interface['buried_surface']:.2f} A^2")
print(f"Interface area : {interface['interface_area']:.2f} A^2")
for residue in interface["residue_buried_surface"]:
print(
residue["partner"],
residue["chain"],
residue["resname"],
residue["resid"],
residue["buried_area"],
)from pdb_cpp import Coor, geom
coor = Coor("tests/input/1y0m.cif")
ca = coor.select_atoms("name CA")
dist = geom.distance_matrix(ca, ca)
print(dist.shape)- API and examples: https://samuelmurail.github.io/pdb_cpp/
- Docs sources:
docs/source/
When adding C++ features:
- Add implementation files in
src/pdb_cpp/_core/ - Register sources in
setup.py - Expose bindings in
src/pdb_cpp/_core/pybind.cpp - Reinstall extension (
pip install -e . --no-build-isolation) and run tests
