FrancisCrickInstitute/spc-data-explorer

SPC Data Explorer

An interactive dashboard for exploring Cell Painting phenotypic screening data, built with Dash and Plotly. Supports both SPC (Spherical Phenotype Clustering) and CellProfiler analysis pipelines with unified visualisation capabilities.

Overview

This application provides an interactive web-based interface for exploring high-content screening data from Cell Painting assays. It was developed to:

  • Visualise compound phenotypes in reduced dimensionality space (UMAP/t-SNE)
  • Compare analysis methods by switching between SPC and CellProfiler pipelines
  • Identify phenotypic neighbours through landmark-based distance analysis
  • Explore compound metadata including MOA, target annotations, and chemical structures

The dashboard integrates multiple data sources including morphological features, compound annotations, chemoproteomics data, and microscopy images to provide a comprehensive view of phenotypic screening results.


Features

Dual Pipeline Support

The application supports data from two distinct analysis pipelines:

  • SPC (Spherical Phenotype Clustering): A machine learning approach using ResNet-based feature extraction and cosine similarity metrics
  • CellProfiler: Traditional morphological profiling with standardised feature sets

Each pipeline has its own column naming conventions, which the app handles transparently through configurable column mappings.
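As a rough illustration of what such a mapping can look like, the sketch below renames CellProfiler columns to their SPC-style equivalents with pandas. The column names are the ones listed in this README; the `CP_TO_UNIFIED` dictionary and the `harmonise_columns` helper are hypothetical names, not taken from the app's code.

```python
import pandas as pd

# Hypothetical mapping from CellProfiler column names to SPC-style names.
# The column names themselves are the ones listed in this README.
CP_TO_UNIFIED = {
    "Metadata_plate_barcode": "plate",
    "Metadata_well": "well",
    "Metadata_PP_ID": "PP_ID",
    "Metadata_PP_ID_uM": "PP_ID_uM",
    "Metadata_library": "library",
}


def harmonise_columns(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Return a copy of df with mapped columns renamed; other columns are untouched."""
    present = {old: new for old, new in mapping.items() if old in df.columns}
    return df.rename(columns=present)


# Made-up one-row frame, for illustration only.
cp_df = pd.DataFrame(
    {"Metadata_plate_barcode": ["plate1"], "Metadata_well": ["A01"], "UMAP1": [0.5]}
)
unified = harmonise_columns(cp_df, CP_TO_UNIFIED)
print(list(unified.columns))
```

Keeping the mapping in one dictionary means the plotting and hover code only needs to know one set of column names.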

Interactive Visualisation

  • UMAP/t-SNE scatter plots for both SPC and CellProfiler datasets
  • Dynamic colour mapping by:
    • Library source (GSK, JUMP, SGC, etc.)
    • Mechanism of Action (MOA)
    • Landmark proximity status
    • Plate/well location
    • Various distance metrics
  • Compound search with autocomplete supporting:
    • PP_ID (e.g., PPXXXX@1.0)
    • Treatment names (e.g., CompoundXXXX@0.1)
    • MOA/gene names (e.g., UNG@0.1 (CR000023@0.1))
  • Visual highlighting of selected compounds on the plot
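A search across several identifier columns can be sketched as a case-insensitive substring match. This is an assumption about the general approach, not the app's actual search callback; `search_compounds` is a hypothetical helper and the rows below are made-up illustrative values.

```python
import pandas as pd


def search_compounds(df: pd.DataFrame, query: str, columns: list) -> pd.DataFrame:
    """Return rows where any of the given columns contains query (case-insensitive)."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        if col in df.columns:
            # regex=False treats the query literally, so '@' and '.' are safe.
            mask |= df[col].astype(str).str.contains(query, case=False, regex=False)
    return df[mask]


# Made-up rows, for illustration only.
compounds = pd.DataFrame(
    {
        "PP_ID_uM": ["PP0001@1.0", "PP0002@0.1"],
        "treatment": ["CompoundA@1.0", "CompoundB@0.1"],
        "moa_compound_uM": ["KinaseX@1.0", "UNG@0.1"],
    }
)
hits = search_compounds(compounds, "ung", ["PP_ID_uM", "treatment", "moa_compound_uM"])
print(hits["PP_ID_uM"].tolist())
```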

Microscopy Image Integration

  • Hover preview: See microscopy thumbnails instantly when hovering over data points
  • Click for details: Full compound information panel with larger image
  • Multiple scaling modes: Fixed (comparable across images) or auto-scaled (per-image optimisation)
  • Text overlays: Optional treatment or MOA labels on images
  • Multi-site support: Random site selection from available fields of view

Landmark Analysis

Reference compounds ("landmarks") with known mechanisms serve as anchors for phenotypic interpretation:

  • Distance calculations to three closest landmarks for each compound
  • Validity indicators showing if compounds fall within meaningful distance thresholds
  • Detailed landmark information including:
    • MOA/target annotations
    • PP_ID identifiers
    • Cosine distances
    • Broad Institute annotations
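The distance step can be sketched with NumPy as below. This is a minimal sketch assuming one feature vector per compound and per landmark; `closest_landmarks` and the 0.2 threshold are hypothetical, since the dashboard reads precomputed landmark distances from the upstream pipelines and the real cut-off is not stated here.

```python
import numpy as np


def closest_landmarks(features: np.ndarray, landmarks: np.ndarray, k: int = 3):
    """Return (indices, distances) of the k nearest landmarks by cosine distance.

    features:  (n_compounds, n_features) array
    landmarks: (n_landmarks, n_features) array
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    lm = landmarks / np.linalg.norm(landmarks, axis=1, keepdims=True)
    dist = 1.0 - f @ lm.T                    # cosine distance, shape (n, m)
    idx = np.argsort(dist, axis=1)[:, :k]    # k smallest distances per row
    return idx, np.take_along_axis(dist, idx, axis=1)


# Toy 2-D vectors, for illustration only.
features = np.array([[1.0, 0.0], [0.0, 2.0]])
landmarks = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
idx, dist = closest_landmarks(features, landmarks)

# Hypothetical validity threshold on the distance to the nearest landmark.
is_valid = dist[:, 0] <= 0.2
```

Cosine distance ignores vector length, so the second compound (`[0.0, 2.0]`) is at distance 0 from the landmark `[0.0, 1.0]`.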

Rich Metadata Display

Hover and click interactions reveal comprehensive compound information:

  • Basic info: Treatment, plate, well, concentration, library
  • Annotations: MOA, target description, Broad annotation
  • Chemical structure: Rendered from SMILES using RDKit
  • Chemoproteomics: Protein targets from pulldown experiments
  • Gene descriptions: Functional annotations for target genes
  • Distance metrics: MAD cosine, variance, standard deviation measures

Data Sources

This visualisation app displays data generated by two upstream analysis pipelines:

SPC Analysis Pipeline

  • Repository: spc-cosine-analysis (TBD - update link)
  • Description: Spherical Phenotype Clustering using ResNet feature extraction and cosine similarity analysis
  • Output: Parquet files with UMAP/t-SNE coordinates, landmark distances, and unified metrics

CellProfiler Analysis Pipeline

  • Repository: cellprofiler_processing (TBD - update link)
  • Description: Traditional CellProfiler morphological profiling with MAD normalisation
  • Output: Parquet files with Metadata_ prefixed columns and landmark analysis results

Project Structure

spc-data-explorer/
└── scripts/
    ├── main.py                      # Application entry point
    ├── config_loader.py             # Configuration management (singleton pattern)
    ├── environment.yml              # Conda environment specification
    ├── requirements.txt             # Pip dependencies (alternative to conda)
    │
    ├── callbacks/
    │   ├── __init__.py
    │   ├── plot_callbacks.py        # Main scatter plot generation and updates
    │   ├── image_callbacks.py       # Hover/click image display with metadata
    │   ├── search_callbacks.py      # MOA-based compound search functionality
    │   ├── detailed_search_callbacks.py  # Advanced search with multiple criteria
    │   └── landmark_callbacks.py    # Landmark analysis modal and calculations
    │
    ├── components/
    │   ├── __init__.py
    │   ├── layout.py                # Dashboard layout structure
    │   ├── controls.py              # UI control panels (dropdowns, sliders)
    │   └── search.py                # Search component builders
    │
    ├── config/
    │   └── config_20251118_TEST_INPUTS.py  # RECOMMENDED: latest config
    │
    ├── data/
    │   ├── __init__.py
    │   ├── loader.py                # Main data loading with column harmonisation
    │   └── landmark_loader.py       # Landmark data processing and validation
    │
    ├── utils/
    │   ├── __init__.py
    │   ├── color_utils.py           # Colour palette management for categories
    │   ├── image_utils.py           # Thumbnail finding and image processing
    │   └── smiles_utils.py          # Chemical structure rendering via RDKit
    │
    └── generate_thumbnails/
        ├── scripts/
        │   └── generate_thumbnails_perc_and_auto_thresh_V1.py  # Thumbnail generator
        └── submit/
            └── thumbnails_*.sh      # SLURM submission scripts for each dataset

Key Configuration Note

Use config_20251118_TEST_INPUTS.py as your starting point.

This is the latest configuration file that includes:

  • Separate loading logic for SPC and CellProfiler datasets
  • Correct column name mappings for both pipelines (e.g., plate vs Metadata_plate_barcode)
  • Hover column definitions for each data type
  • Plot type configurations for all four views (SPC UMAP/t-SNE, CP UMAP/t-SNE)

Thumbnail Generation

The generate_thumbnails/ directory contains scripts for creating RGB thumbnail images from multi-channel Cell Painting microscopy data. These thumbnails are displayed in the dashboard when hovering over or clicking on data points.

Overview

Cell Painting assays typically acquire 4-5 fluorescent channels per field of view. The thumbnail generator combines these channels into false-colour RGB thumbnails (500×500 pixels) suitable for quick visual inspection.

Scaling Modes

The script produces two versions of each thumbnail:

| Mode | Directory | Description | Best For |
| --- | --- | --- | --- |
| Fixed | fixed/ | Pre-defined intensity limits based on dataset-wide percentiles | Comparing phenotypes across treatments, identifying outliers |
| Auto | auto/ | Per-image 1st-99th percentile scaling | Examining morphological details, dim images, QC checking |
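The difference between the two modes can be sketched as follows. This is a minimal illustration rather than the generator script's code: the function names and the fixed limits are hypothetical, and only the 1st-99th percentile rule for auto mode comes from the description above.

```python
import numpy as np


def scale_to_uint8(channel: np.ndarray, low: float, high: float) -> np.ndarray:
    """Linearly map [low, high] to [0, 255], clipping values outside the range."""
    if high <= low:
        return np.zeros(channel.shape, dtype=np.uint8)
    scaled = (channel.astype(np.float64) - low) / (high - low)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)


def auto_scale(channel: np.ndarray) -> np.ndarray:
    """Auto mode: limits are this image's own 1st and 99th percentiles."""
    low, high = np.percentile(channel, [1, 99])
    return scale_to_uint8(channel, low, high)


def fixed_scale(channel: np.ndarray, limits: tuple) -> np.ndarray:
    """Fixed mode: the same pre-computed limits are applied to every image."""
    return scale_to_uint8(channel, *limits)


rng = np.random.default_rng(0)
dim = rng.integers(100, 200, size=(64, 64))      # synthetic dim image
bright = rng.integers(100, 4000, size=(64, 64))  # synthetic bright image
limits = (100.0, 4000.0)                         # hypothetical dataset-wide limits
```

With fixed limits the dim image stays dim relative to the bright one, which is what makes intensities comparable across treatments. Auto scaling stretches each image to the full range, which helps with dim images but hides differences in absolute intensity.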

Channel Mapping

Fluorescent channels are mapped to RGB colours:

  • Blue: Nuclear stains (HOECHST 33342, DAPI)
  • Green: Alexa 488, FITC (ER, actin, cytoplasmic markers)
  • Red: Alexa 568, MitoTracker Deep Red, Cy5 (mitochondria, membrane)
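One way to turn this mapping into an RGB image is sketched below. The `CHANNEL_TO_RGB` lookup and `compose_rgb` helper are hypothetical, and taking the per-pixel maximum when several channels share a colour is an assumption; the generator script may blend channels differently.

```python
import numpy as np

# Hypothetical lookup from channel name to RGB plane index (0=R, 1=G, 2=B),
# following the colour assignments listed above.
CHANNEL_TO_RGB = {
    "HOECHST 33342": 2, "DAPI": 2,
    "Alexa 488": 1, "FITC": 1,
    "Alexa 568": 0, "MitoTracker Deep Red": 0, "Cy5": 0,
}


def compose_rgb(channels: dict) -> np.ndarray:
    """Merge already-scaled uint8 channel images into one (H, W, 3) RGB image.

    When several channels map to the same colour, the per-pixel maximum is
    kept (an assumption made for this sketch).
    """
    shape = next(iter(channels.values())).shape
    rgb = np.zeros((*shape, 3), dtype=np.uint8)
    for name, image in channels.items():
        plane = CHANNEL_TO_RGB.get(name)
        if plane is None:
            continue  # channels without a colour assignment are skipped
        rgb[..., plane] = np.maximum(rgb[..., plane], image)
    return rgb


# Synthetic constant images, for illustration only.
nuclei = np.full((4, 4), 200, dtype=np.uint8)
er = np.full((4, 4), 50, dtype=np.uint8)
rgb = compose_rgb({"DAPI": nuclei, "Alexa 488": er})
```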

Usage

Basic usage:

python generate_thumbnails_perc_and_auto_thresh_V1.py \
    /path/to/max_projected_images \
    /path/to/output/thumbnails \
    --scaling both

Scan directories first (planning mode):

python generate_thumbnails_perc_and_auto_thresh_V1.py \
    /path/to/images \
    /path/to/thumbnails \
    --scan-only \
    --input-dirs /other/path1 /other/path2

SLURM Submission

For HPC environments, use the submission scripts in submit/:

sbatch thumbnails_20251020_HaCaT_HTC_V1_V2_cell_paint.sh

Example SLURM configuration:

#SBATCH --job-name=thumbnails
#SBATCH --time=168:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --partition=ncpu

Output Structure

thumbnails/
├── fixed/                    # Fixed intensity scaling
│   ├── {plate_barcode}/
│   │   ├── {plate}_{well}_{site}.png
│   │   └── ...
│   └── ...
└── auto/                     # Auto-scaled per image
    ├── {plate_barcode}/
    │   └── ...
    └── ...

Installation

Prerequisites

  • Python 3.11+
  • Conda (recommended) or pip
  • Access to data files (parquet) and thumbnail images

Setup with Conda (Recommended)

# Clone the repository
git clone https://github.com/YOUR_USERNAME/spc-data-explorer.git
cd spc-data-explorer/scripts

# Create environment from file
conda env create -f environment.yml

# Activate environment
conda activate spc_visualisation

Setup with Pip (Alternative)

# Clone the repository
git clone https://github.com/YOUR_USERNAME/spc-data-explorer.git
cd spc-data-explorer/scripts

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

RDKit Note

RDKit is required for chemical structure rendering. It's easiest to install via conda:

conda install -c conda-forge rdkit

Configuration

1. Create Your Configuration File

# Copy the recommended config
cp config/config_20251118_TEST_INPUTS.py config/config_myproject.py

2. Update Paths

Edit your configuration file to point to your data:

from pathlib import Path

class Config:
    # Data paths - update these!
    SPC_DATA_PATH = Path("/path/to/spc_analysis_output.parquet")
    CP_DATA_PATH = Path("/path/to/cellprofiler_output.parquet")

    # Thumbnail directory (should contain 'fixed/' and 'auto/' subdirectories)
    THUMBNAIL_DIRS = Path("/path/to/thumbnails")

3. Environment-Specific Paths

The template supports automatic path switching based on username:

import os
from pathlib import Path

if os.environ.get("USER", "") == "your_cluster_username":
    # Cluster paths (e.g., /nemo/...)
    ANALYSIS_DIR = Path("/nemo/path/to/analysis")
else:
    # Local paths (e.g., mounted volumes)
    ANALYSIS_DIR = Path("/Volumes/path/to/analysis")

Usage

Starting the Application

cd scripts/

# Interactive config selection (prompts you to choose)
python main.py

# Or specify config directly
python main.py --config config_myproject

# Or use environment variable
export SPC_CONFIG=config_myproject
python main.py

The app will start at http://127.0.0.1:8090 (or the port specified in your config).
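The three ways of choosing a config shown above could be resolved roughly as in the sketch below. The precedence (command-line flag, then `SPC_CONFIG`, then the interactive prompt) is an assumption, and `resolve_config_name` is a hypothetical helper rather than a function from `main.py`.

```python
import argparse
import os
from typing import Optional


def resolve_config_name(argv: list, environ: dict) -> Optional[str]:
    """Pick a config module name: --config first, then SPC_CONFIG, else None.

    None signals that the caller should fall back to the interactive prompt.
    """
    parser = argparse.ArgumentParser(description="SPC Data Explorer")
    parser.add_argument("--config", default=None, help="config module name")
    args = parser.parse_args(argv)
    return args.config or environ.get("SPC_CONFIG") or None


print(resolve_config_name(["--config", "config_myproject"], dict(os.environ)))
```

Passing `argv` and `environ` in explicitly, instead of reading `sys.argv` and `os.environ` inside the function, keeps the logic easy to test.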

Dashboard Navigation

  1. Select Plot Type: Choose from:
    • SPC UMAP / t-SNE
    • CellProfiler UMAP / t-SNE
    • Custom axes
  2. Colour By: Select metadata column for point colouring:
    • Library, MOA, landmark status
    • Plate, well location
    • Distance metrics (continuous colour scales)
  3. Search Compounds: Type to search by:
    • Compound ID: PPXXXX
    • Treatment: CompoundXXXX
    • Gene/MOA: UNG → shows UNG@0.1 (CR000023@0.1)
  4. Interact with Plot:
    • Hover: See microscopy image preview + key metadata
    • Click: Open detailed compound panel with full information
    • Zoom/Pan: Standard Plotly interactions
  5. Adjust Settings:
    • Point size slider
    • Image scaling mode (fixed/auto)
    • Optional text labels on images

Input Data Format

SPC Dataset Required Columns

| Column | Description |
| --- | --- |
| UMAP1, UMAP2 | UMAP coordinates |
| TSNE1, TSNE2 | t-SNE coordinates |
| plate, well | Plate and well identifiers |
| treatment | Treatment identifier |
| PP_ID, PP_ID_uM | Compound ID, and compound ID with concentration |
| library | Source library |
| moa_first, moa_compound_uM | Mechanism of action |
| closest_landmark_* | Landmark distance data |

CellProfiler Dataset Required Columns

| Column | Description |
| --- | --- |
| UMAP1, UMAP2 | UMAP coordinates |
| TSNE1, TSNE2 | t-SNE coordinates |
| Metadata_plate_barcode | Plate identifier |
| Metadata_well | Well identifier |
| Metadata_PP_ID, Metadata_PP_ID_uM | Compound identifiers |
| Metadata_library | Source library |
| Metadata_annotated_target_first | MOA/target |
| closest_landmark_Metadata_* | Landmark data |
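Before launching the app, it can help to check a dataset against these lists. The sketch below is one way to do that; `missing_columns` is a hypothetical helper, and the wildcard landmark columns are left out because their full names are not given here.

```python
from typing import Iterable

# Required columns as listed in the tables above. Wildcard entries such as
# closest_landmark_* are omitted because their full names are not given here.
SPC_REQUIRED = [
    "UMAP1", "UMAP2", "TSNE1", "TSNE2", "plate", "well", "treatment",
    "PP_ID", "PP_ID_uM", "library", "moa_first", "moa_compound_uM",
]
CP_REQUIRED = [
    "UMAP1", "UMAP2", "TSNE1", "TSNE2", "Metadata_plate_barcode",
    "Metadata_well", "Metadata_PP_ID", "Metadata_PP_ID_uM",
    "Metadata_library", "Metadata_annotated_target_first",
]


def missing_columns(columns: Iterable, required: list) -> list:
    """Return the required column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in required if name not in present]


print(missing_columns(["UMAP1", "UMAP2", "plate", "well"], SPC_REQUIRED))
```

With pandas, the first argument can be `df.columns` from a loaded parquet file. An empty result means all the named columns are present.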

Thumbnail Directory Structure

thumbnails/
├── fixed/           # Fixed intensity scaling (comparable)
│   ├── plate1/
│   │   ├── plate1_A01_01.png
│   │   ├── plate1_A01_02.png
│   │   └── ...
│   └── plate2/
└── auto/            # Auto-scaled per image
    ├── plate1/
    └── plate2/
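Given this layout, a thumbnail lookup with random site selection might look like the sketch below. `find_thumbnail` is a hypothetical helper, not the function in utils/image_utils.py; it only assumes the `{plate}_{well}_{site}.png` naming shown above.

```python
import random
from pathlib import Path
from typing import Optional


def find_thumbnail(root: Path, mode: str, plate: str, well: str,
                   rng: Optional[random.Random] = None) -> Optional[Path]:
    """Return one thumbnail for a plate/well, picking a site at random.

    mode is 'fixed' or 'auto'; returns None if no matching file exists.
    """
    if mode not in ("fixed", "auto"):
        raise ValueError(f"unknown scaling mode: {mode!r}")
    # Files follow {plate}_{well}_{site}.png inside {root}/{mode}/{plate}/
    candidates = sorted((root / mode / plate).glob(f"{plate}_{well}_*.png"))
    if not candidates:
        return None
    return (rng or random).choice(candidates)


thumb = find_thumbnail(Path("/path/to/thumbnails"), "fixed", "plate1", "A01")
print(thumb)  # None unless the placeholder path exists and holds a match
```

Passing a seeded `random.Random` makes the site choice reproducible, which is useful when comparing screenshots or writing tests.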

Development

Adding New Colour Options

Edit utils/color_utils.py to add new colour column configurations:

import plotly.express as px

# Each entry: (column_name, is_continuous, display_label, colour_palette)
color_columns = [
    ('new_column', False, 'Display Name', px.colors.qualitative.Set1),
]

Adding New Hover Fields

Update your config file's get_hover_columns() and get_hover_display() methods to include additional fields in the hover template.

Customising the Layout

The dashboard layout is defined in components/layout.py. Modify this file to add new panels or rearrange existing components.


Troubleshooting

Common Issues

"No data available" error

  • Check that your data paths in the config file are correct
  • Verify the parquet files exist and are readable
  • Ensure required columns are present in your data

Images not displaying

  • Verify thumbnail directory path is correct
  • Check that fixed/ and auto/ subdirectories exist
  • Confirm image naming convention: {plate}_{well}_{site}.png

Slow performance with large datasets

  • Consider filtering data before loading
  • Reduce the number of hover columns
  • Use server-side pagination for very large datasets

RDKit import errors

  • Install RDKit via conda: conda install -c conda-forge rdkit
  • If using pip, RDKit installation can be complex - conda is recommended

About

Dashboard for exploring Cell Painting data from CellProfiler and AI/ML models
