A streamlined, reproducible Scanpy workflow for scRNA-seq: QC → normalize → cluster → annotate
Repository · Notebook · Outputs
- What is scanaflow?
- Repository structure
- Quickstart
- Workflow overview
- Input data
- Outputs
- Key parameters to tune
- Reproducibility
- Troubleshooting
- Roadmap
- Citation
- Contact
scanaflow is an end-to-end Scanpy workflow for single-cell RNA-seq analysis, packaged as a clean, reusable reference pipeline:
✅ Quality control (QC)
✅ Filtering (cells/genes, mitochondrial fraction)
✅ Normalization + log-transform
✅ Highly-variable genes (HVGs)
✅ PCA → kNN graph → UMAP
✅ Leiden clustering
✅ Marker discovery & annotation helpers
This repo includes a ready-to-run PBMC example dataset and a complete notebook so you can reproduce results and adapt the workflow to your own data.
scanaflow/
├─ scanpy_PBMC_pipeline.ipynb
├─ 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
├─ Scanpy_output/
├─ .gitignore
└─ .gitattributes
- Notebook: scanpy_PBMC_pipeline.ipynb
- Example dataset: 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
- Outputs folder: Scanpy_output/ (plots/tables/results generated by the notebook)
git clone https://github.com/Birendra-Kumar-S/scanaflow.git
cd scanaflow

Option A: Conda (recommended)
conda create -n scanaflow python=3.10 -y
conda activate scanaflow
pip install --upgrade pip
pip install scanpy anndata leidenalg igraph numpy pandas scipy matplotlib jupyter

Option B: venv + pip
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows PowerShell
pip install --upgrade pip
pip install scanpy anndata leidenalg igraph numpy pandas scipy matplotlib jupyter

Launch Jupyter:

jupyter notebook

Open and run: scanpy_PBMC_pipeline.ipynb
All generated artifacts should be written under: • Scanpy_output/
This notebook follows the standard Scanpy clustering workflow commonly used for scRNA-seq:
1. Load 10x input into AnnData
2. Compute QC metrics
3. Filter low-quality cells/genes and high-mito cells
4. Normalize total counts and log1p
5. Select HVGs
6. Scale and run PCA
7. Build the neighbors graph
8. Embed with UMAP
9. Cluster using Leiden
10. Identify marker genes per cluster to support annotation
Reference: • Scanpy clustering tutorial: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html
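The ten steps above map onto Scanpy calls roughly as follows. This is a condensed sketch, not the notebook itself: the thresholds (min_genes=200, pct_counts_mt < 20, 2000 HVGs, resolution=1.0, etc.) are illustrative defaults to tune per dataset, and the scanpy import is deferred inside the function so the file can be read and imported without scanpy installed.

```python
def run_pbmc_pipeline(h5_path, out_h5ad="Scanpy_output/processed.h5ad"):
    """Condensed QC -> normalize -> cluster sketch with illustrative thresholds."""
    import scanpy as sc  # deferred: sketch is readable without scanpy installed

    # 1. Load 10x input into AnnData
    adata = sc.read_10x_h5(h5_path)
    adata.var_names_make_unique()

    # 2. QC metrics (human mitochondrial genes start with "MT-")
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, inplace=True)

    # 3. Filter low-quality cells/genes and high-mito cells
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

    # 4. Normalize total counts per cell, then log1p
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # 5. HVGs; 6. scale + PCA
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata.raw = adata
    adata = adata[:, adata.var["highly_variable"]].copy()
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, n_comps=50)

    # 7-9. Neighbors graph, UMAP embedding, Leiden clustering
    sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40)
    sc.tl.umap(adata, random_state=0)
    sc.tl.leiden(adata, resolution=1.0, random_state=0)

    # 10. Marker genes per cluster to support annotation
    sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")

    adata.write(out_h5ad)
    return adata
```

Point it at the bundled .h5 file (or your own) and compare the resulting clusters against the notebook's output.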
Included example
This repo includes a 10x-formatted .h5 matrix: • 10k_PBMC_3p_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
Use your own dataset
If you have a 10x .h5
import scanpy as sc
adata = sc.read_10x_h5("path/to/filtered_feature_bc_matrix.h5")

If you have a 10x mtx/ directory
import scanpy as sc
adata = sc.read_10x_mtx(
"path/to/filtered_feature_bc_matrix/",
var_names="gene_symbols",
cache=True
)

If you have an .h5ad
import scanpy as sc
adata = sc.read_h5ad("path/to/data.h5ad")

All outputs are stored in:
📁 Scanpy_output/ https://github.com/Birendra-Kumar-S/scanaflow/tree/main/Scanpy_output
What you should expect in Scanpy_output/
Depending on notebook execution, common outputs include:
- QC plots (e.g., counts/genes/mito distributions)
- PCA variance plots / PCA embedding
- UMAP plots (colored by clusters, QC metrics, markers)
- Marker gene rankings (tables)
- Optional processed objects (e.g., .h5ad)
Most datasets require threshold tuning. Common knobs include:
• min_genes, min_counts
• mitochondrial gene prefix (human: MT-)
• max % mitochondrial
• removal of extreme high-count outliers (potential doublets)
• normalization target sum (often 1e4)
• HVGs: n_top_genes, selection method
• scaling cap: max_value
• n_pcs, n_neighbors
• Leiden resolution (cluster granularity)
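To make the first few knobs concrete, here is a numpy-only illustration of how the cell-level filters and normalization interact on a toy count matrix (the gene names, matrix size, and thresholds are invented for the example; in the notebook, Scanpy's `pp.calculate_qc_metrics` and `pp.filter_cells` do this work):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(500, 100))                   # 500 cells x 100 genes
gene_names = np.array([f"MT-{i}" if i < 5 else f"GENE{i}" for i in range(100)])

# Per-cell QC metrics
n_genes = (counts > 0).sum(axis=1)                           # genes detected per cell
n_counts = counts.sum(axis=1)                                # total counts per cell
mt_mask = np.char.startswith(gene_names.astype(str), "MT-")  # human mito prefix
pct_mt = 100 * counts[:, mt_mask].sum(axis=1) / n_counts

# Apply the knobs: min_genes, max % mito, high-count outlier removal
keep = (n_genes >= 50) & (pct_mt < 10) & (n_counts < np.percentile(n_counts, 99))
filtered = counts[keep]

# Normalize each cell to a target sum (often 1e4), then log1p
normalized = filtered / filtered.sum(axis=1, keepdims=True) * 1e4
logged = np.log1p(normalized)
print(f"kept {keep.sum()} / {len(keep)} cells")
```

Tightening any one threshold changes which cells survive, which is why these values should be chosen by inspecting the QC distributions for your dataset rather than copied verbatim.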
To make results stable across machines:
- Pin versions in requirements.txt or environment.yml
- Use random_state where applicable (e.g., UMAP, Leiden)
- Log package versions in the notebook output
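A lightweight, stdlib-only way to log package versions at the top of the notebook (Scanpy also ships a helper for this, `sc.logging.print_header()`):

```python
from importlib import metadata

packages = ["scanpy", "anndata", "numpy", "pandas", "scipy", "leidenalg"]
lines = []
for pkg in packages:
    try:
        lines.append(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        lines.append(f"{pkg}: not installed")
print("\n".join(lines))
```

The `pkg==version` lines can be pasted directly into a pinned requirements.txt.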
Example requirements.txt template:
scanpy
anndata
numpy
scipy
pandas
matplotlib
leidenalg
igraph
jupyter

No module named 'leidenalg' / clustering fails
pip install leidenalg igraph

Figures not saving to Scanpy_output/
- Run the notebook from the repo root
- Confirm Scanpy_output/ exists
pwd
ls
ls Scanpy_output

Memory issues on large data
- Filter cells/genes earlier
- Reduce the number of HVGs
- Lower n_pcs
- Consider backed AnnData for large objects
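For the backed-AnnData suggestion, a minimal sketch (it assumes you already have a processed .h5ad on disk; the scanpy import is deferred so the snippet parses without scanpy installed):

```python
def peek_backed(h5ad_path, n_cells=1000):
    """Open an .h5ad without loading X into RAM, then materialize one slice."""
    import scanpy as sc  # deferred: sketch only

    adata = sc.read_h5ad(h5ad_path, backed="r")  # X stays on disk, read-only
    subset = adata[:n_cells].to_memory()         # load just the first n_cells rows
    adata.file.close()
    return subset
```

This lets you inspect obs/var metadata and work on slices of very large objects without ever holding the full matrix in memory.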
If you want to evolve scanaflow into a portfolio-grade pipeline repo:
- Add pinned requirements.txt / environment.yml
- Add a config file (YAML) for QC + clustering params
- Convert notebook → script + CLI (python -m ...)
- Add CI (GitHub Actions) to run a small test dataset
- Add HTML report export (nbconvert / Quarto)
- Add automated annotation (marker scoring / reference mapping)
If you use this workflow in your work, please cite: • Scanpy documentation / methodology: https://scanpy.readthedocs.io/
Saye Birendra Kumar • GitHub: https://github.com/Birendra-Kumar-S • Repo: https://github.com/Birendra-Kumar-S/scanaflow