scanaflow

A streamlined, reproducible Scanpy workflow for scRNA-seq: QC → normalize → cluster → annotate


What is scanaflow?

scanaflow is an end-to-end Scanpy workflow for single-cell RNA-seq analysis, packaged as a clean, reusable reference pipeline:

✅ Quality control (QC)
✅ Filtering (cells/genes, mitochondrial fraction)
✅ Normalization + log-transform
✅ Highly-variable genes (HVGs)
✅ PCA → kNN graph → UMAP
✅ Leiden clustering
✅ Marker discovery & annotation helpers

This repo includes a ready-to-run PBMC example dataset and a complete notebook so you can reproduce results and adapt the workflow to your own data.


Repository structure

scanaflow/
├─ scanpy_PBMC_pipeline.ipynb
├─ 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
├─ Scanpy_output/
├─ .gitignore
└─ .gitattributes
  • Notebook: scanpy_PBMC_pipeline.ipynb
  • Example dataset: 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
  • Outputs folder: Scanpy_output/ (plots/tables/results generated by the notebook)

Quickstart

1) Clone the repo

git clone https://github.com/Birendra-Kumar-S/scanaflow.git
cd scanaflow

2) Create an environment

Option A — Conda (recommended)

conda create -n scanaflow python=3.10 -y
conda activate scanaflow

pip install --upgrade pip
pip install scanpy anndata leidenalg igraph numpy pandas scipy matplotlib jupyter

Option B — venv + pip

python -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows PowerShell

pip install --upgrade pip
pip install scanpy anndata leidenalg igraph numpy pandas scipy matplotlib jupyter

3) Run the notebook

jupyter notebook

Open and run scanpy_PBMC_pipeline.ipynb.

All generated artifacts should be written under Scanpy_output/.

Workflow overview

This notebook follows the standard Scanpy clustering workflow commonly used for scRNA-seq:

1. Load 10x input into AnnData
2. Compute QC metrics
3. Filter low-quality cells/genes and high-mito cells
4. Normalize total counts and log1p
5. Select HVGs
6. Scale and run PCA
7. Build neighbors graph
8. Embed with UMAP
9. Cluster using Leiden
10. Identify marker genes per cluster to support annotation

Reference: Scanpy clustering tutorial, https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html
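The ten steps above can be sketched as a single function. This is a minimal sketch, not the notebook's actual code: the function name run_workflow is hypothetical, and the thresholds (min_genes=200, min_cells=3, 5% mitochondrial, 2000 HVGs, 15 neighbors, 40 PCs) are common starting values assumed here for illustration and should be checked against the notebook.

```python
def run_workflow(h5_path, max_pct_mt=5.0, n_top_genes=2000, resolution=1.0, seed=0):
    """Run the ten workflow steps on a 10x .h5 file (illustrative defaults)."""
    # Imported inside the function so this sketch can be loaded without Scanpy.
    import scanpy as sc

    # 1. Load 10x input into AnnData
    adata = sc.read_10x_h5(h5_path)
    adata.var_names_make_unique()

    # 2. Compute QC metrics, flagging human mitochondrial genes by the MT- prefix
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
    )

    # 3. Filter low-quality cells/genes and high-mito cells
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    adata = adata[adata.obs["pct_counts_mt"] < max_pct_mt].copy()

    # 4. Normalize total counts and log1p
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # 5. Select HVGs; keep the log-normalized values of all genes in .raw
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    adata.raw = adata
    adata = adata[:, adata.var["highly_variable"]].copy()

    # 6. Scale and run PCA
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata)

    # 7-9. Neighbors graph, UMAP embedding, Leiden clustering
    sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40)
    sc.tl.umap(adata, random_state=seed)
    sc.tl.leiden(adata, resolution=resolution, random_state=seed)

    # 10. Rank marker genes per cluster to support annotation
    sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
    return adata
```

Calling run_workflow("10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5") from the repo root would process the bundled example, provided the packages from the Quickstart are installed.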

Input data

Included example

This repo includes a 10x-formatted .h5 matrix:

• 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5

Use your own dataset

If you have a 10x .h5

import scanpy as sc
adata = sc.read_10x_h5("path/to/filtered_feature_bc_matrix.h5")

If you have a 10x mtx/ directory

import scanpy as sc
adata = sc.read_10x_mtx(
    "path/to/filtered_feature_bc_matrix/",
    var_names="gene_symbols",
    cache=True
)

If you have an .h5ad

import scanpy as sc
adata = sc.read_h5ad("path/to/data.h5ad")
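
If you switch between these formats often, the three readers above can be wrapped in one helper. The sketch below is an assumption about how that might look, not part of the notebook: detect_input_kind and load_adata are hypothetical names, and the format is guessed only from whether the path is a directory and from its file extension.

```python
from pathlib import Path


def detect_input_kind(path):
    """Guess the input format from the path: a directory or a file suffix."""
    p = Path(path)
    if p.is_dir():
        return "10x_mtx"
    if p.suffix == ".h5ad":
        return "h5ad"
    if p.suffix == ".h5":
        return "10x_h5"
    raise ValueError(f"Unrecognized input format: {path}")


def load_adata(path):
    """Load an AnnData object with the reader that matches the input format."""
    # Imported inside the function so this sketch can be loaded without Scanpy.
    import scanpy as sc

    kind = detect_input_kind(path)
    if kind == "10x_mtx":
        return sc.read_10x_mtx(path, var_names="gene_symbols", cache=True)
    if kind == "h5ad":
        return sc.read_h5ad(path)
    return sc.read_10x_h5(path)
```

For example, load_adata("path/to/data.h5ad") dispatches to sc.read_h5ad, while an existing mtx directory goes to sc.read_10x_mtx.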

Outputs

All outputs are stored in:

📁 Scanpy_output/ https://github.com/Birendra-Kumar-S/scanaflow/tree/main/Scanpy_output

What you should expect in Scanpy_output/

Depending on notebook execution, common outputs include:

• QC plots (e.g., counts/genes/mito distributions)
• PCA variance plots / PCA embedding
• UMAP plots (colored by clusters/QC metrics/markers)
• Marker gene rankings (tables)
• Optional processed objects (e.g., .h5ad)

To see the exact files a run produced, list the folder from the repo root:

ls -1 Scanpy_output

Key parameters to tune

Most datasets require threshold tuning. Common knobs include:

QC & filtering

•	min_genes, min_counts
•	mitochondrial gene prefix (human: MT-)
•	max % mitochondrial
•	removal of extreme high-count outliers (potential doublets)
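
One common way to pick a cutoff for extreme high-count cells is a median absolute deviation (MAD) rule. The sketch below is an assumption for illustration, not necessarily what the notebook does: mad_upper_bound is a hypothetical helper, and the multiplier of 5 MADs is a tunable choice. High total counts only hint at doublets, so treat flagged cells as candidates.

```python
from statistics import median


def mad_upper_bound(values, n_mads=5.0):
    """Return median + n_mads * MAD, a robust upper cutoff for outliers."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med + n_mads * mad


# Toy per-cell total counts; in the notebook these would typically come
# from adata.obs["total_counts"] after computing QC metrics.
total_counts = [900, 1000, 1100, 1050, 950, 20000]
cutoff = mad_upper_bound(total_counts)             # 1400.0
flagged = [c for c in total_counts if c > cutoff]  # [20000]
```

Because the median and MAD are barely affected by the outlier itself, the cutoff stays close to the bulk of the distribution, unlike a mean plus standard deviation rule.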

Preprocessing

•	normalization target sum (often 1e4)
•	HVGs: n_top_genes, selection method
•	scaling cap: max_value
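
To make the normalization target sum concrete, the sketch below reproduces the arithmetic for one cell in plain Python: scale the counts so they sum to the target, then take log(1 + x). It only illustrates the idea behind sc.pp.normalize_total followed by sc.pp.log1p; normalize_and_log is a hypothetical helper, not a Scanpy function.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


# A toy cell with 10 counts across four genes; after scaling, the counts
# sum to 10,000 before the log transform is applied.
cell = [1, 3, 0, 6]
transformed = normalize_and_log(cell)
```

Genes with zero counts stay at exactly zero, which keeps the matrix sparse, and cells with very different sequencing depths end up on a comparable scale.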

Graph/embedding/clustering

•	n_pcs, n_neighbors
•	Leiden resolution (cluster granularity)
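
Since the Leiden resolution controls cluster granularity, it can help to run several resolutions side by side and compare cluster counts. The sketch below assumes an AnnData object on which sc.pp.neighbors has already been run; sweep_leiden, leiden_key, and the resolution values are hypothetical choices for illustration.

```python
def leiden_key(resolution):
    """Build the adata.obs column name used for one resolution."""
    return f"leiden_res_{resolution:.2f}"


def sweep_leiden(adata, resolutions=(0.25, 0.5, 1.0, 2.0), seed=0):
    """Cluster at several resolutions and return the cluster count for each."""
    # Imported inside the function so this sketch can be loaded without Scanpy.
    import scanpy as sc

    n_clusters = {}
    for res in resolutions:
        key = leiden_key(res)
        # Each run is stored in its own column so the results can be compared.
        sc.tl.leiden(adata, resolution=res, key_added=key, random_state=seed)
        n_clusters[key] = adata.obs[key].nunique()
    return n_clusters
```

Higher resolutions generally give more, smaller clusters; coloring the UMAP by each of the resulting columns shows where clusters split.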

Reproducibility

To make results stable across machines:

• Pin versions in requirements.txt or environment.yml
• Use random_state where applicable (UMAP, some steps)
• Log versions in notebook output

Example requirements.txt template:

scanpy
anndata
numpy
scipy
pandas
matplotlib
leidenalg
igraph
jupyter
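
To log versions in the notebook output and turn the template above into pinned entries, the standard library's importlib.metadata can report what is installed. This is a minimal sketch; installed_versions and as_requirements are hypothetical helper names, and the package list simply mirrors the template.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = (
    "scanpy", "anndata", "numpy", "scipy", "pandas",
    "matplotlib", "leidenalg", "igraph", "jupyter",
)


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def as_requirements(versions):
    """Format the installed packages as pinned requirements.txt lines."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]
```

Running print("\n".join(as_requirements(installed_versions()))) in the last notebook cell records the environment next to the results. Scanpy's own sc.logging.print_header() prints similar version information.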

Troubleshooting

No module named 'leidenalg' / clustering fails

pip install leidenalg igraph

Figures not saving to Scanpy_output/

• Run the notebook from the repo root
• Confirm that Scanpy_output/ exists

pwd
ls
ls Scanpy_output
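
The same check can be done from inside the notebook, creating the folder if it is missing. This is a sketch under the assumption that the notebook saves figures through Scanpy's figure directory setting; ensure_output_dir is a hypothetical helper.

```python
from pathlib import Path


def ensure_output_dir(name="Scanpy_output", root="."):
    """Create the output folder under root if needed and return its path."""
    out = Path(root).resolve() / name
    out.mkdir(parents=True, exist_ok=True)
    return out


# In the notebook (requires Scanpy), point figure saving at that folder:
#   import scanpy as sc
#   sc.settings.figdir = ensure_output_dir()
```

Resolving the path first makes it obvious in the notebook output which directory the figures will land in, which helps when the notebook was started from somewhere other than the repo root.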

Memory issues on large data

• Filter earlier
• Reduce the number of HVGs
• Lower n_pcs
• Consider backed AnnData for large objects

Roadmap

If you want to evolve scanaflow into a portfolio-grade pipeline repo:

• Add pinned requirements.txt / environment.yml
• Add a config file (YAML) for QC + clustering params
• Convert notebook → script + CLI (python -m ...)
• Add CI (GitHub Actions) to run a small test dataset
• Add HTML report export (nbconvert / Quarto)
• Add automated annotation (marker scoring / reference mapping)
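
As a starting point for the script + CLI item, a command-line interface could expose the main tuning parameters through argparse. Nothing below exists in the repo yet: the flag names and default values are hypothetical placeholders, and main only parses and reports the arguments.

```python
import argparse


def build_parser():
    """Build a parser exposing a few QC and clustering parameters."""
    parser = argparse.ArgumentParser(
        prog="scanaflow",
        description="Run the Scanpy workflow on a 10x dataset (sketch).",
    )
    parser.add_argument("--input", required=True, help="10x .h5, mtx dir, or .h5ad")
    parser.add_argument("--outdir", default="Scanpy_output", help="output folder")
    parser.add_argument("--min-genes", type=int, default=200)
    parser.add_argument("--max-pct-mt", type=float, default=5.0)
    parser.add_argument("--resolution", type=float, default=1.0)
    return parser


def main(argv=None):
    """Parse the arguments; a real script would run the workflow here."""
    args = build_parser().parse_args(argv)
    print(f"Would process {args.input} into {args.outdir}")
    return args
```

Calling main(["--input", "data.h5", "--resolution", "0.5"]) parses the flags without touching any data, which also makes the parser easy to cover in a CI test.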

Citation

If you use this workflow in your work, please cite:

• Scanpy documentation / methodology: https://scanpy.readthedocs.io/

Contact

Saye Birendra Kumar

• GitHub: https://github.com/Birendra-Kumar-S
• Repo: https://github.com/Birendra-Kumar-S/scanaflow
