Multi-cohort cerebrospinal fluid proteomics identifies robust molecular signatures across the Alzheimer disease continuum
This repository contains the code for bioinformatics analyses described in the article "Multi-cohort cerebrospinal fluid proteomics identifies robust molecular signatures across the Alzheimer disease continuum".
This project investigated CSF proteomics data from the SomaScan 7K platform to identify proteins associated with Alzheimer disease (AD). The identified proteins were then used to build an AD-specific prediction model and to perform pseudo-trajectory, biological pathway, and cell type enrichment analyses to better understand the underlying AD biology.
The code covers the following main analysis steps:
- Data pre-processing: Proteomics data preparation and surrogate variable (SV) computation
- Differential expression analysis (Discovery, Replication, and Meta-analyses)
- AD prediction model development using Lasso regression
- Survival analysis to identify individuals who will convert to AD
- AD progression analysis to distinguish between slow and fast progressors
- Pseudo-trajectory analysis to cluster proteins based on their expression across the AT continuum (A-T-, A+T-, A+T+)
- Network and pathway enrichment analyses
- Cell type enrichment analysis
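As an illustration of the prediction-modelling step listed above, the following is a minimal sketch of a Lasso-penalised logistic regression fitted with glmnet and evaluated with pROC. It is not the code from prediction_modeling.R: the data are simulated, the object names (`x`, `y`, `selected`, `test_auc`) are hypothetical, and the settings (70/30 split, 10-fold cross-validation, `lambda.min`) are assumptions chosen for the example.

```r
library(glmnet)  # Lasso-penalised regression
library(pROC)    # ROC curves and AUC

set.seed(1)

# Simulated stand-in for a protein matrix: 200 samples x 50 proteins.
n <- 200
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("protein_", seq_len(p))))

# Simulated binary outcome driven by the first three proteins only.
linear_predictor <- 1.5 * x[, 1] - 1.2 * x[, 2] + 1.0 * x[, 3]
y <- rbinom(n, size = 1, prob = plogis(linear_predictor))

# Hold out 30% of the samples for evaluation (assumed split).
test_idx <- sample(seq_len(n), size = 0.3 * n)
train_idx <- setdiff(seq_len(n), test_idx)

# alpha = 1 selects the Lasso penalty; lambda is tuned by 10-fold cross-validation.
cv_fit <- cv.glmnet(x[train_idx, ], y[train_idx],
                    family = "binomial", alpha = 1, nfolds = 10)

# Proteins with non-zero coefficients at the cross-validated lambda.
coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")

# Predicted probabilities and AUC on the held-out samples.
prob <- as.vector(predict(cv_fit, newx = x[test_idx, ],
                          s = "lambda.min", type = "response"))
roc_obj <- roc(response = y[test_idx], predictor = prob, quiet = TRUE)
test_auc <- as.numeric(auc(roc_obj))
```

The Lasso penalty shrinks most coefficients to exactly zero, so `selected` holds the subset of proteins retained by the model, and `test_auc` summarises discrimination on samples not used for fitting.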
The proteomics data analysed in this study are available from:
- ADNI: http://adni.loni.usc.edu/
- Knight-ADRC: https://dss.niagads.org/ (Accession: ng00130)
- FACE and Barcelona-1 cohorts: http://www.fundacioace.com/
- PPMI: https://www.ppmi-info.org/
- Stanford-ADRC: https://web.stanford.edu/group/adrc/cgi-bin/web-proj/datareq.php
The code was written in R (version 4.3.0) and relies on multiple R and Bioconductor packages, including:
- sva
- clusterProfiler
- scran
- glmnet
- nlme
- pROC
- igraph
- survminer
- mclust
- Additional packages listed at the beginning of each R script
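A quick way to see which of the packages listed above are missing from a local installation is sketched below. The variable names are hypothetical, and the list covers only the packages named in this README, not the additional ones loaded by individual scripts.

```r
# Packages named in this README; individual scripts may load more.
required_pkgs <- c("sva", "clusterProfiler", "scran", "glmnet", "nlme",
                   "pROC", "igraph", "survminer", "mclust")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1),
                       quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # sva, clusterProfiler and scran are distributed via Bioconductor and the
  # others via CRAN; BiocManager::install() can install from both:
  # install.packages("BiocManager"); BiocManager::install(missing_pkgs)
} else {
  message("All listed packages are installed.")
}
```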
The code is available under the MIT License.
The code was tested on R 4.3.0 on Linux operating systems, but should be compatible with later versions of R installed on current Linux, Mac, or Windows systems.
Before running the code, set the working directory at the beginning of each R script to the folder containing the input data. No other changes are needed; the scripts can otherwise be run as-is.
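One possible way to set the working directory defensively is sketched below. The helper `set_input_dir()` and the path `path/to/input_data` are hypothetical and are not part of the repository scripts.

```r
# Hypothetical helper: switch to the input-data folder, failing early
# with a clear message if the folder does not exist.
set_input_dir <- function(data_dir) {
  if (!dir.exists(data_dir)) {
    stop("Input directory not found: ", data_dir)
  }
  setwd(data_dir)
  invisible(normalizePath(data_dir))
}

# Example call; replace the placeholder with the actual data location.
# set_input_dir("path/to/input_data")
```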
The scripts should be run in the following order:
data_preparation.R
differential_expression_analysis.R
prediction_modeling.R
survival_analysis.R
progression_analysis.R
clustering_pseudo_trajectories.R
network_and_pathway_analysis.R
cell_type_enrichment.R
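The run order above could also be automated with a small driver, sketched here under the assumption that all scripts sit in one folder and can be sourced in a single R session. The function `run_pipeline()` and its arguments are hypothetical and do not exist in the repository.

```r
# Script names in the order given in this README.
script_order <- c(
  "data_preparation.R",
  "differential_expression_analysis.R",
  "prediction_modeling.R",
  "survival_analysis.R",
  "progression_analysis.R",
  "clustering_pseudo_trajectories.R",
  "network_and_pathway_analysis.R",
  "cell_type_enrichment.R"
)

# Hypothetical driver: check that every script exists, then source each in turn.
run_pipeline <- function(script_dir = ".", scripts = script_order) {
  paths <- file.path(script_dir, scripts)
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("Scripts not found: ", paste(not_found, collapse = ", "))
  }
  for (path in paths) {
    message("Running ", path)
    source(path)  # evaluated in the global environment, as when run by hand
  }
  invisible(paths)
}

# Example call from the folder that contains the scripts:
# run_pipeline(".")
```

Checking for all files before sourcing anything avoids a long run that fails partway through because a later script is missing.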