A streamlined, reproducible Scanpy workflow for scRNA-seq: QC → normalize → cluster → annotate
Repository · Notebook · Outputs
- What is scanaflow?
- Repository structure
- Quickstart
- Workflow overview
- Input data
- Outputs
- Key parameters to tune
- Reproducibility
- Troubleshooting
- Roadmap
- Citation
- Contact
scanaflow is an end-to-end Scanpy workflow for single-cell RNA-seq analysis, packaged as a clean, reusable reference pipeline:
✅ Quality control (QC)
✅ Filtering (cells/genes, mitochondrial fraction)
✅ Normalization + log-transform
✅ Highly-variable genes (HVGs)
✅ PCA → kNN graph → UMAP
✅ Leiden clustering
✅ Marker discovery & annotation helpers
This repo includes a ready-to-run PBMC example dataset and a complete notebook so you can reproduce results and adapt the workflow to your own data.
scanaflow/
├─ scanpy_PBMC_pipeline.ipynb
├─ 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
├─ Scanpy_output/
├─ .gitignore
└─ .gitattributes
- Notebook: scanpy_PBMC_pipeline.ipynb
- Example dataset: 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
- Outputs folder: Scanpy_output/ (plots/tables/results generated by the notebook)
git clone https://github.com/Birendra-Kumar-S/scanaflow.git
cd scanaflow

Option A: Conda (recommended)
conda create -n scanaflow python=3.10 -y
conda activate scanaflow
pip install --upgrade pip
pip install scanpy anndata leidenalg igraph numpy pandas scipy matplotlib jupyter

Option B: venv + pip
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows PowerShell
pip install --upgrade pip
pip install scanpy anndata leidenalg igraph numpy pandas scipy matplotlib jupyter

Launch Jupyter:

jupyter notebook

Open and run: scanpy_PBMC_pipeline.ipynb
All generated artifacts should be written under: • Scanpy_output/
This notebook follows the standard Scanpy clustering workflow commonly used for scRNA-seq:
1. Load 10x input into AnnData
2. Compute QC metrics
3. Filter low-quality cells/genes and high-mito cells
4. Normalize total counts and log1p
5. Select HVGs
6. Scale and run PCA
7. Build the neighbors graph
8. Embed with UMAP
9. Cluster using Leiden
10. Identify marker genes per cluster to support annotation
Reference: • Scanpy clustering tutorial: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html
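The ten steps above map onto Scanpy calls roughly as follows. This is a condensed sketch, not the notebook itself: the thresholds (min_genes=200, pct_counts_mt < 20, 2000 HVGs, resolution=1.0, etc.) are illustrative defaults to tune per dataset, and the scanpy import is deferred inside the function so the file can be read and imported without scanpy installed.

```python
def run_pbmc_pipeline(h5_path, out_h5ad="Scanpy_output/processed.h5ad"):
    """Condensed QC -> normalize -> cluster sketch with illustrative thresholds."""
    import scanpy as sc  # deferred: sketch is readable without scanpy installed

    # 1. Load 10x input into AnnData
    adata = sc.read_10x_h5(h5_path)
    adata.var_names_make_unique()

    # 2. QC metrics (human mitochondrial genes start with "MT-")
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, inplace=True)

    # 3. Filter low-quality cells/genes and high-mito cells
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

    # 4. Normalize total counts per cell, then log1p
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # 5. HVGs; 6. scale + PCA
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata.raw = adata
    adata = adata[:, adata.var["highly_variable"]].copy()
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, n_comps=50)

    # 7-9. Neighbors graph, UMAP embedding, Leiden clustering
    sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40)
    sc.tl.umap(adata, random_state=0)
    sc.tl.leiden(adata, resolution=1.0, random_state=0)

    # 10. Marker genes per cluster to support annotation
    sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")

    adata.write(out_h5ad)
    return adata
```

Point it at the bundled .h5 file (or your own) and compare the resulting clusters against the notebook's output.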
Included example
This repo includes a 10x-formatted .h5 matrix: • 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
Use your own dataset
If you have a 10x .h5
import scanpy as sc
adata = sc.read_10x_h5("path/to/filtered_feature_bc_matrix.h5")

If you have a 10x mtx/ directory
import scanpy as sc
adata = sc.read_10x_mtx(
"path/to/filtered_feature_bc_matrix/",
var_names="gene_symbols",
cache=True
)

If you have an .h5ad
import scanpy as sc
adata = sc.read_h5ad("path/to/data.h5ad")

All outputs are stored in:
📁 Scanpy_output/ https://github.com/Birendra-Kumar-S/scanaflow/tree/main/Scanpy_output
What you should expect in Scanpy_output/
Depending on notebook execution, common outputs include:
- QC plots (e.g., counts/genes/mito distributions)
- PCA variance plots / PCA embedding
- UMAP plots (colored by clusters, QC metrics, markers)
- Marker gene rankings (tables)
- Optional processed objects (e.g., .h5ad)
Most datasets require threshold tuning. Common knobs include:
• min_genes, min_counts
• mitochondrial gene prefix (human: MT-)
• max % mitochondrial
• removal of extreme high-count outliers (potential doublets)
• normalization target sum (often 1e4)
• HVGs: n_top_genes, selection method
• scaling cap: max_value
• n_pcs, n_neighbors
• Leiden resolution (cluster granularity)
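To make the first few knobs concrete, here is a numpy-only illustration of how the cell-level filters and normalization interact on a toy count matrix (the gene names, matrix size, and thresholds are invented for the example; in the notebook, Scanpy's `pp.calculate_qc_metrics` and `pp.filter_cells` do this work):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(500, 100))                   # 500 cells x 100 genes
gene_names = np.array([f"MT-{i}" if i < 5 else f"GENE{i}" for i in range(100)])

# Per-cell QC metrics
n_genes = (counts > 0).sum(axis=1)                           # genes detected per cell
n_counts = counts.sum(axis=1)                                # total counts per cell
mt_mask = np.char.startswith(gene_names.astype(str), "MT-")  # human mito prefix
pct_mt = 100 * counts[:, mt_mask].sum(axis=1) / n_counts

# Apply the knobs: min_genes, max % mito, high-count outlier removal
keep = (n_genes >= 50) & (pct_mt < 10) & (n_counts < np.percentile(n_counts, 99))
filtered = counts[keep]

# Normalize each cell to a target sum (often 1e4), then log1p
normalized = filtered / filtered.sum(axis=1, keepdims=True) * 1e4
logged = np.log1p(normalized)
print(f"kept {keep.sum()} / {len(keep)} cells")
```

Tightening any one threshold changes which cells survive, which is why these values should be chosen by inspecting the QC distributions for your dataset rather than copied verbatim.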
To make results stable across machines:
- Pin versions in requirements.txt or environment.yml
- Use random_state where applicable (e.g., UMAP, Leiden)
- Log package versions in the notebook output
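A lightweight, stdlib-only way to log package versions at the top of the notebook (Scanpy also ships a helper for this, `sc.logging.print_header()`):

```python
from importlib import metadata

packages = ["scanpy", "anndata", "numpy", "pandas", "scipy", "leidenalg"]
lines = []
for pkg in packages:
    try:
        lines.append(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        lines.append(f"{pkg}: not installed")
print("\n".join(lines))
```

The `pkg==version` lines can be pasted directly into a pinned requirements.txt.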
Example requirements.txt template:
scanpy
anndata
numpy
scipy
pandas
matplotlib
leidenalg
igraph
jupyter

No module named 'leidenalg' / clustering fails
pip install leidenalg igraph

Figures not saving to Scanpy_output/
- Run the notebook from the repo root
- Confirm Scanpy_output/ exists
pwd
ls
ls Scanpy_output

Memory issues on large data
- Filter cells/genes earlier
- Reduce the number of HVGs
- Lower n_pcs
- Consider backed AnnData for large objects
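For the backed-AnnData suggestion, a minimal sketch (it assumes you already have a processed .h5ad on disk; the scanpy import is deferred so the snippet parses without scanpy installed):

```python
def peek_backed(h5ad_path, n_cells=1000):
    """Open an .h5ad without loading X into RAM, then materialize one slice."""
    import scanpy as sc  # deferred: sketch only

    adata = sc.read_h5ad(h5ad_path, backed="r")  # X stays on disk, read-only
    subset = adata[:n_cells].to_memory()         # load just the first n_cells rows
    adata.file.close()
    return subset
```

This lets you inspect obs/var metadata and work on slices of very large objects without ever holding the full matrix in memory.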
If you want to evolve scanaflow into a portfolio-grade pipeline repo:
- Add pinned requirements.txt / environment.yml
- Add a config file (YAML) for QC + clustering params
- Convert notebook → script + CLI (python -m ...)
- Add CI (GitHub Actions) to run a small test dataset
- Add HTML report export (nbconvert / Quarto)
- Add automated annotation (marker scoring / reference mapping)
If you use this workflow in your work, please cite: • Scanpy documentation / methodology: https://scanpy.readthedocs.io/
Saye Birendra Kumar • GitHub: https://github.com/Birendra-Kumar-S • Repo: https://github.com/Birendra-Kumar-S/scanaflow