An immune cell classification framework utilizing advanced supervised and unsupervised learning techniques
Transforming single-cell RNA sequencing data into precise immune cell classification
ML-ImmuneProfiler is a sophisticated machine learning framework designed for accurate classification of immune cell types from single-cell RNA sequencing data. This project synthesizes various ML techniques to deliver robust cell type identification, with applications in immunology, precision medicine, and computational biology.
Key Components
- Advanced Data Processing: Robust preprocessing pipelines for high-dimensional scRNA-seq data
- Multi-model Architecture: Implementation of diverse ML algorithms including Random Forests, Boosted Random Forests, Support Vector Machines, Logistic Regression, and clustering techniques
- Comprehensive Evaluation: Rigorous assessment using ROC curves, precision-recall curves, confusion matrices, silhouette scores, and accuracy metrics
- Interpretability Framework: Feature importance analysis and model explanation techniques
- Reproducible Environment: Complete configuration for both CPU and GPU environments
- BRF_RF.yaml - Configuration for Boosted Random Forest and Random Forest models
- env.yml - Environment configuration file for the EDA and unsupervised notebook
- gpu_env.yml - GPU-specific environment configuration for the SVM notebook
- System_Info_BRF_RF.txt - System information for BRF/RF models
- System_Info_unspv_spv.txt - System information for supervised/unsupervised models
- scRNA-seq_raw_data_curation.csv - Tabulated list of raw single-cell RNA-seq datasets curated from the 10X Genomics database
- batch_corrected_expression_with_celltypes.tsv.gz - Tab-separated expression matrix produced by processing the raw scRNA-seq datasets; used for EDA and ML modelling
- spv_split_dataset_100hvg.pkl - Split supervised dataset with 100 highly variable genes
- supervised_data_100hvg_metadata.pkl - Metadata for supervised dataset
- supervised_data_100hvg.pkl - Supervised dataset with 100 highly variable genes
- Linear_ridge_classifiers - Directory containing linear ridge classifier models
- logistic_regression - Directory containing logistic regression models
- BRF_hypersmurf_ensemble.pkl - Boosted Random Forest with hyperSMURF ensemble
- Random_Forest_Model.pkl - Random Forest model
- svm_cuML_model.pkl - SVM model using cuML (GPU-accelerated)
- scRNA-seq_data_preprocessing.ipynb - Notebook for scRNA-seq data pre-processing using Seurat
- ml_eda_unspv_nb01.ipynb - Notebook for EDA on unsupervised data
- ml_spv_data_process_nb02.ipynb - Notebook for supervised data processing
- ml_spv_LC_LR_nb03.ipynb - Notebook for Linear Classifier and Logistic Regression
- ml_spv_rf_brf_nb05.ipynb - Notebook for Random Forest and Boosted Random Forest
- ml_spv_svm_nb04.ipynb - Notebook for SVM model
- BRF - Plots for Boosted Random Forest models
- DIMRED_CLUST - Dimensionality reduction and clustering plots
- EDA - Exploratory Data Analysis plots
- LR - Logistic Regression plots
- LRC - Linear Ridge Classifier plots
- RF - Random Forest plots
- SVM - Support Vector Machine plots
- Reports/Method_Info - Method documentation in Markdown
- Results - Project results (evaluation metrics, predictions, etc.)
The project follows a structured workflow to ensure clarity and reproducibility:
- Data Preparation: Preprocessing and cleaning of scRNA-seq datasets (scRNA-seq_data_preprocessing.ipynb)
- Exploratory Data Analysis (EDA): Initial analysis to understand data distributions and relationships
- Data exploration and unsupervised analysis (ml_eda_unspv_nb01)
- Supervised data processing (ml_spv_data_process_nb02)
- Model training and evaluation using different algorithms:
  - Linear Classifier and Logistic Regression (ml_spv_LC_LR_nb03)
  - Support Vector Machine (ml_spv_svm_nb04)
  - Random Forest and Boosted Random Forest (ml_spv_rf_brf_nb05)
- Results visualization stored in the Plots directory
- Final reports generation
Our approach follows a rigorous research pipeline:
- Data Preprocessing
  - Quality control, batch correction, and normalization (done in R using Seurat)
  - Feature selection through highly variable gene identification
  - Dimensionality reduction via PCA/t-SNE/UMAP
  - Train-validation-test split with stratification
- Model Development
  - Supervised Learning: Linear Ridge Classifier, Logistic Regression, SVM, Random Forests, Boosted Random Forests
  - Unsupervised Learning: Clustering algorithms for cell type identification
  - Hyperparameter Optimization: Grid search and Bayesian optimization
  - Model Evaluation: Cross-validation and performance metrics analysis
- Evaluation Framework
  - Classification metrics: accuracy, precision, recall, F1-score
  - ROC and Precision-Recall curves
  - Clustering quality metrics: silhouette scores, adjusted Rand index
- Visualization: Comprehensive visualizations for model performance and data exploration
- Documentation: Detailed method info and documentation for reproducibility
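The supervised part of the pipeline above can be sketched end to end in scikit-learn. This is a minimal illustration under stated assumptions, not the notebooks' actual code: the data are synthetic, the feature selection is a plain variance ranking computed on the training set (a stand-in for Seurat's highly variable gene procedure), the split is roughly 60/20/20, and the grid of `C` values is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cells x genes matrix with three cell types.
X, y = make_classification(
    n_samples=600, n_features=200, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Stratified train / validation / test split (roughly 60/20/20).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)

# Simplified feature selection: keep the 100 highest-variance columns,
# ranked on the training set only (assumption: not Seurat's HVG method).
n_top = 100
top_idx = np.argsort(X_train.var(axis=0))[::-1][:n_top]
X_train, X_val, X_test = X_train[:, top_idx], X_val[:, top_idx], X_test[:, top_idx]

# Grid search over the regularization strength with 5-fold cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

# Report held-out performance of the refitted best model.
val_acc = accuracy_score(y_val, search.predict(X_val))
test_acc = accuracy_score(y_test, search.predict(X_test))
print("best C:", search.best_params_["clf__C"])
print(f"validation accuracy: {val_acc:.3f}, test accuracy: {test_acc:.3f}")
print(classification_report(y_test, search.predict(X_test)))
```

Ranking variance on the training portion only keeps the held-out sets out of the feature selection step; swapping `LogisticRegression` for another estimator leaves the rest of the sketch unchanged.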
- Seurat and associated libraries and packages for preprocessing raw scRNA-seq datasets
- Python for machine learning implementation (scikit-learn, pandas, NumPy, matplotlib, seaborn)
- Jupyter notebooks for interactive development
- GPU acceleration for some models (cuML for SVM)
- Various ML algorithms (SVM, Random Forest, Boosted Random Forest, Linear/Logistic Regression)
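For the unsupervised side of this stack, the clustering quality metrics named in the evaluation framework (silhouette score and adjusted Rand index) can be computed with scikit-learn. The sketch below is illustrative only: the data are synthetic blobs, and PCA followed by KMeans is an assumed combination that may differ from what the notebooks use.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for expression profiles drawn from four cell populations.
X, true_labels = make_blobs(n_samples=400, n_features=50, centers=4, random_state=0)

# Reduce to a few principal components before clustering.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

# KMeans is an illustrative choice of clustering algorithm.
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)

# Silhouette needs no ground truth; the adjusted Rand index compares
# the clusters against known labels.
sil = silhouette_score(X_pca, cluster_labels)
ari = adjusted_rand_score(true_labels, cluster_labels)
print(f"silhouette: {sil:.3f}, adjusted Rand index: {ari:.3f}")
```

The silhouette score can be used when no annotations exist, while the adjusted Rand index only applies when reference cell type labels are available.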
To ensure reproducibility across platforms, we provide comprehensive environment configurations:
- Clone the Repository
  git clone https://github.com/Birendra-Kumar-S/ML-ImmuneProfiler.git
  cd ML-ImmuneProfiler
- Environment Configuration
  Choose the appropriate configuration based on your computational resources and type of notebook:
  - Standard Environment (for EDA and unsupervised learning, Notebooks 1, 2, and 3):
    conda env create -f Config/env.yml
    conda activate ml_project
  - GPU-Accelerated Environment (for SVM, Notebook 4):
    conda env create -f Config/gpu_env.yml
    conda activate ml_gpu
  - RF and BRF Environment (for Random Forest and Boosted Random Forest, Notebook 5):
    conda env create -f Config/BRF_RF.yaml
    conda activate brf_dnn
- Navigate to the appropriate notebook based on your analytical needs:
  - scRNA-seq data pre-processing: Notebooks/scRNA-seq_data_preprocessing.ipynb
  - Exploratory Analysis: Notebooks/ml_eda_unspv_nb01.ipynb
  - Supervised Learning:
    - Notebooks/ml_spv_data_process_nb02.ipynb - Data processing for supervised learning
    - Notebooks/ml_spv_LC_LR_nb03.ipynb - Linear Classifier and Logistic Regression
    - Notebooks/ml_spv_svm_nb04.ipynb - Support Vector Machine implementation
    - Notebooks/ml_spv_rf_brf_nb05.ipynb - Random Forest and Boosted Random Forest
Pre-trained models are available in the Models directory and can be loaded using Python's pickle module.
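A minimal sketch of loading a pickled model is shown below. The helper name `load_model` and the path `Models/Random_Forest_Model.pkl` in the comment are assumptions for illustration; the runnable part uses a stand-in object so it works without the repository files. Unpickling generally requires the library that defined the object (for example scikit-learn or cuML) to be installed in a compatible version, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Load a pickled object from ``path`` (only use with trusted files)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical usage inside the repository (the path is an assumption):
#   model = load_model("Models/Random_Forest_Model.pkl")
#   predictions = model.predict(X_test)

# Self-contained round trip with a stand-in object, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_model.pkl"
    with open(demo_path, "wb") as fh:
        pickle.dump({"name": "demo", "classes": ["B cell", "T cell"]}, fh)
    restored = load_model(demo_path)

print(restored["classes"])  # prints ['B cell', 'T cell']
```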
Our comprehensive visualization suite can be accessed in multiple formats:
- Static Visualizations: Review performance metrics and comparative analysis in the Plots/ directory, organized by model type (BRF, RF, SVM, LR, LRC, EDA, DIMRED_CLUST)
- MD Documentation: Printer-friendly versions of all methodology documentation are available in the Reports/Method_Info directory
To validate our results:
- Load the pre-trained models from the Models directory
- Execute the test scripts to reproduce prediction files in Results
- Compare outputs against our benchmark results
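The comparison step can be as simple as checking label agreement between a reproduced prediction list and a benchmark list. The sketch below uses a hypothetical helper, `compare_predictions`, and made-up labels; how the prediction files in Results are actually formatted is not specified here, so reading them is left out.

```python
def compare_predictions(reproduced, benchmark):
    """Return the agreement rate and the indices where two label lists differ."""
    if len(reproduced) != len(benchmark):
        raise ValueError("prediction lists must have the same length")
    mismatches = [i for i, (a, b) in enumerate(zip(reproduced, benchmark)) if a != b]
    if not benchmark:
        return 1.0, mismatches
    agreement = (len(benchmark) - len(mismatches)) / len(benchmark)
    return agreement, mismatches


# Stand-in labels; in practice these would come from the prediction files.
benchmark = ["T cell", "B cell", "NK cell", "T cell", "Monocyte"]
reproduced = ["T cell", "B cell", "T cell", "T cell", "Monocyte"]

agreement, mismatches = compare_predictions(reproduced, benchmark)
print(f"agreement: {agreement:.2f}, mismatched rows: {mismatches}")
# prints "agreement: 0.80, mismatched rows: [2]"
```

Listing the mismatched row indices, rather than only a summary score, makes it easier to see whether disagreements cluster in particular cell types.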
