dhglab/scalebcparse
Scale Barcode Parsing Pipeline

A Python-based pipeline for processing long-read sequencing data to identify, orient, trim, and parse barcode elements from FASTQ files.

Overview

This script processes long-read sequencing data from FASTQ files to:

  • Identify and orient reads based on primer sequences
  • Detect and trim polyA/T tails
  • Parse barcode elements (barcodes, UMIs, adapters, etc.)
  • Segregate reads into passing and failing files
  • Generate detailed processing logs

Installation

1. Install Conda Environment

Create the conda environment from the provided YAML file:

conda env create -f scalebcparser.yaml

2. Activate the Environment

conda activate bcparser

3. Verify Installation

Check that all dependencies are installed:

python -c "import parasail, pyfastx, yaml, Levenshtein, rich; print('All dependencies installed successfully!')"

Usage

Basic Usage

python scale_bc_parser.py \
  -i input.fastq.gz \
  -o output_prefix \
  -c config.yaml

Full Command-Line Options

Required Arguments:
  -i, --input           Comma-separated list of input FASTQ files (can be gzipped)
  -o, --output_prefix   Output prefix for FASTQ files (outputs will be gzipped)
  -c, --config          YAML configuration file defining barcode structure

Optional Arguments:
  --header_template     Override header template from YAML configuration
                        (default: uses template from config file)
  --trim_polyA          Enable trimming of polyA/T tails from transcript sequence
                        (default: False - polyA/T tails are NOT trimmed)
  --test_run            Process and print only the first 50 reads with colorized output (for testing)
                        (default: False - processes all reads)
  --verbose             Turn on verbose logging for interactive runs
                        (default: False - minimal output)

Default Parameters

The following default parameters are used unless overridden:

  • Minimum polyA/T length: 20 bp (set in YAML config file)
  • Levenshtein distance threshold: 2 (for barcode matching)
  • Hamming distance threshold: 2 (for tie-breaking)
  • Base tolerance: 5 bp (for linker-to-3' primer distance validation)
  • Max alignment hits: 3 for primers, 5 for linkers
  • Alignment scoring: Match = 5, Mismatch = -1, N-to-base = 2
  • Gap penalties: Open = 10, Extend = 1

Example Commands

1. Standard Processing

Process a single FASTQ file with default settings:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o processed_sample \
  -c barcode_config.yaml

2. Multiple Input Files

Process multiple FASTQ files in a single run:

python scale_bc_parser.py \
  -i file1.fastq.gz,file2.fastq.gz,file3.fastq.gz \
  -o combined_output \
  -c barcode_config.yaml

3. With PolyA Trimming

Enable polyA/T tail trimming:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o processed_sample \
  -c barcode_config.yaml \
  --trim_polyA

4. Test Run

Process only the first 50 reads, with colorized alignment output, for testing:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o test_output \
  -c barcode_config.yaml \
  --test_run

5. Custom Header Template

Override the header template defined in the config file:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o processed_sample \
  -c barcode_config.yaml \
  --header_template "{existing_read_name} BC={barcode_id} UMI={umi}"

Header Formatting

Available Template Variables

The header template uses Python string formatting with curly braces {} to insert parsed barcode information. Available variables include:

  • {existing_read_name} - Original read ID (without spaces or description)
  • {existing_header} - Full original header line
  • Any element name from your config - e.g., {umi}, {rt_scale}, {lig_bc_scale}
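Under the hood this is ordinary Python string formatting; the following sketch shows how a template and a dictionary of parsed fields combine (the field values here are hypothetical examples, not output of the script):

```python
# Sketch of header-template filling via Python's str.format.
# The parsed values below are hypothetical examples.
template = "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"
fields = {
    "existing_read_name": "@read_001",
    "umi": "CAAAACTG",         # 8 bp UMI
    "rt_scale": "TACTATTAGA",  # 10 bp RT barcode
    "lig_bc_scale": "AAAGGCTGG",  # 9 bp ligation barcode
}
header = template.format(**fields)
print(header)  # @read_001 UB:Z:CAAAACTG CB:Z:TACTATTAGAAAAGGCTGG
```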

Header Template Examples

Example 1: SAM/BAM compatible format with cell barcode and UMI

header_template: "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"

Output: @read_001 UB:Z:CAAAACTG CB:Z:TACTATTAGAAAAGGCTGG

Example 2: Simple space-separated format

header_template: "{existing_read_name} BC={rt_scale}{lig_bc_scale} UMI={umi}"

Output: @read_001 BC=TACTATTAGAAAAGGCTGG UMI=CAAAACTG

Example 3: Tab-separated format (use \t)

header_template: "{existing_read_name}\tBC:{rt_scale}{lig_bc_scale}\tUMI:{umi}"

Output: @read_001 BC:TACTATTAGAAAAGGCTGG UMI:CAAAACTG (the separators are literal tab characters, rendered here as spaces)

Output Modes

Each element in the configuration can specify how it should appear in the output header:

  • sequence - Output the actual nucleotide sequence
  • header - Output the name/ID from the whitelist (for whitelisted elements)
  • skip - Don't include this element in the output header

Output Files

The pipeline generates three output files:

  1. {output_prefix}-passing.fastq.gz - Reads that successfully passed all filters with parsed barcodes in headers
  2. {output_prefix}-failing.fastq.gz - Reads that failed orientation or barcode parsing
  3. {output_prefix}.log - Detailed processing log with per-read statistics and summary

Log File Format

The log file contains:

  • Individual read processing results (PASS/FAIL status)
  • Orientation information or failure reasons
  • Parsed barcode assignments for each read
  • Summary statistics at the end

Configuration File

The script requires a YAML configuration file that defines the expected sequence structure, barcode whitelists, and output formatting. The configuration consists of two main sections: sequence_structure and output_format.

YAML Structure Overview

sequence_structure:
  - name: element_name          # Unique identifier
    type: restricted/unrestricted
    restriction_type: whitelist/length  # For restricted elements
    options: [SEQ1, SEQ2]       # Inline sequences (OR)
    options_file: path/to/file  # FASTA file with sequences
    length: N                   # Expected length in bp

output_format:
  include_fields:
    - name: element_name
      output_mode: sequence/header/skip
  header_template: "format string"

# Optional
min_poly_length: 20  # Minimum polyA/T tail length (default: 20)

Sequence Structure Elements

The sequence_structure section defines the expected order of elements in your reads, from 5' to 3'.

Element Types

1. Restricted Elements (type: restricted)

  • Have a known sequence or set of sequences
  • Can use whitelists or length constraints

2. Unrestricted Elements (type: unrestricted)

  • Variable sequence (e.g., cDNA transcript)
  • Must have exactly ONE transcript element in your structure

Restriction Types

1. Whitelist (restriction_type: whitelist)

  • Element must match one of the provided sequences
  • Sequences can be provided inline with options: or from a file with options_file:
  • Uses Levenshtein and Hamming distance for fuzzy matching

2. Length (restriction_type: length)

  • Only checks that the element is the expected length
  • Accepts any sequence of that length (e.g., for UMIs)
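The whitelist matching rule (minimum Levenshtein distance, Hamming distance for tie-breaking, both thresholded at the default of 2) can be sketched as follows. This is a self-contained illustration, not the script's actual implementation, which uses the Levenshtein package:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def hamming(a, b):
    # Positional mismatches, counting any length difference as mismatches.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def match_whitelist(seq, whitelist, lv_threshold=2, ham_threshold=2):
    # Find whitelist entries at minimal Levenshtein distance from seq.
    dists = {name: levenshtein(seq, wl) for name, wl in whitelist.items()}
    best = min(dists.values())
    if best > lv_threshold:
        return None  # min_lv_dist exceeds threshold -> read fails
    ties = [name for name, d in dists.items() if d == best]
    if len(ties) == 1:
        return ties[0]
    # Break ties on Hamming distance.
    ham = {name: hamming(seq, whitelist[name]) for name in ties}
    best_h = min(ham.values())
    winners = [n for n, h in ham.items() if h == best_h]
    if len(winners) == 1 and best_h <= ham_threshold:
        return winners[0]
    return None  # ambiguous match -> read fails
```

For example, `match_whitelist("AAAAAAAATA", {"bc1": "AAAAAAAAAA", "bc2": "TTTTTTTTTT"})` returns `"bc1"` (one substitution away), while a sequence far from every whitelist entry returns `None`.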

Example Configuration: ont_scale.yaml

Here's a complete example for Scale Biosciences + ONT:

sequence_structure:
  # 5' TSO primer (known sequence)
  - name: tso
    type: restricted
    restriction_type: whitelist
    options:
      - AAGCAGTGGTATCAACGCAGAGTACATGGG
    length: 30

  # cDNA transcript (variable sequence)
  - name: transcript
    type: unrestricted

  # RT barcode from whitelist file
  - name: rt_scale
    type: restricted
    restriction_type: whitelist
    options_file: "rt_scale.5p_3p.fasta"
    length: 10

  # UMI (any 8bp sequence)
  - name: umi
    type: restricted
    restriction_type: length
    length: 8

  # Linker (known sequence)
  - name: linker
    type: restricted
    restriction_type: whitelist
    options:
      - GCTCTGA
    length: 7

  # Ligation barcode from whitelist file
  - name: lig_bc_scale
    type: restricted
    restriction_type: whitelist
    options_file: "lig_bc_scale.5p_3p.fasta"
    length: 9

  # 3' RTP primer (known sequence)
  - name: rtp
    type: restricted
    restriction_type: whitelist
    options:
      - ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    length: 33

output_format:
  include_fields:
    - name: umi
      output_mode: sequence     # Output nucleotide sequence

    - name: rt_scale
      output_mode: sequence     # Output nucleotide sequence

    - name: lig_bc_scale
      output_mode: sequence     # Output nucleotide sequence

  # SAM/BAM compatible format
  header_template: "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"

Key Points

  1. Element Order Matters: List elements in their 5' to 3' order in the read
  2. Transcript Element Required: Must have exactly one transcript element
  3. First and Last Elements: Should be your 5' and 3' primers (used for orientation)
  4. Linker Element: If present, used for additional validation of structure
  5. Options vs Options File: Use options: for a few sequences, options_file: for many
  6. Output Only What You Need: Elements not listed in include_fields are still validated but not included in the output header
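The structural rules above can be expressed as a short validation pass. This is illustrative only; field names follow this README's schema, but it is not the script's actual validation code:

```python
# Illustrative check of the "Key Points" rules on a parsed sequence_structure list.
# Not the script's actual code; field names follow this README's schema.
def validate_structure(elements):
    transcripts = [e for e in elements if e.get("type") == "unrestricted"]
    if len(transcripts) != 1:
        raise ValueError("exactly one unrestricted (transcript) element is required")
    for e in elements:
        if e.get("type") == "restricted" and e.get("restriction_type") == "whitelist":
            if "options" not in e and "options_file" not in e:
                raise ValueError(f"{e['name']}: provide options or options_file")
    return True
```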

Barcode Whitelist Files

If using options_file:, provide a FASTA file with barcode sequences in 5' to 3' orientation:

>barcode_01
AAAAAAAAAA
>barcode_02
TTTTTTTTTT
>barcode_03
CCCCCCCCCC

Important Notes:

  • All sequences must be in 5' to 3' orientation
  • The pipeline automatically handles reverse complementation during orientation detection
  • When output_mode: sequence - outputs the nucleotide sequence
  • When output_mode: header - outputs the FASTA header (e.g., "barcode_01")
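Reverse complementation here is the standard base-pair mapping; a minimal sketch (not the pipeline's actual helper), shown on the linker sequence from the example config:

```python
# Minimal reverse-complement helper (illustrative, not the pipeline's actual code).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    # Complement each base, then reverse the sequence.
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("GCTCTGA"))  # TCAGAGC
```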

Requirements

Python Version

  • Python 3.11

Dependencies

  • parasail-python - Smith-Waterman alignment
  • pyfastx - Fast FASTQ parsing
  • pyyaml - YAML configuration parsing
  • levenshtein - Sequence distance calculations
  • rich - Colored terminal output
  • ipykernel - Jupyter notebook support (optional)

All dependencies are automatically installed via the conda environment.

How It Works

The pipeline processes each read through the following steps:

1. Orientation Detection

  • Aligns 5' and 3' primers (and their reverse complements) to the read
  • Detects polyA and polyT tails
  • Identifies linker sequences
  • Determines if the read is in forward or reverse orientation
  • Validates the expected distance between linker and 3' primer
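PolyA/T tail detection amounts to finding a long homopolymer run; the sketch below uses an exact-match run with the min_poly_length default of 20 described above (a simplification; the actual pipeline's detection may tolerate mismatches):

```python
import re

# Illustrative polyA/polyT tail detection (a sketch, not the pipeline's code).
# min_len mirrors the min_poly_length default of 20.
def find_poly_tail(seq, min_len=20):
    for base, label in (("A", "polyA"), ("T", "polyT")):
        m = re.search(f"{base}{{{min_len},}}", seq)
        if m:
            return label, m.start(), m.end()
    return None  # no qualifying tail found

read = "ACGT" + "A" * 25 + "GATTACA"
print(find_poly_tail(read))  # ('polyA', 4, 29)
```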

2. Sequence Trimming

  • Extracts the cDNA transcript between primers
  • Optionally trims polyA/T tails (if --trim_polyA is enabled)
  • Extracts the barcode region for parsing

3. Barcode Parsing

  • Identifies each barcode element using Smith-Waterman alignment
  • For whitelisted elements: matches against known sequences using:
    • Levenshtein distance (edit distance) for fuzzy matching
    • Hamming distance for tie-breaking when multiple matches exist
  • For length-only elements: extracts sequence of expected length

4. Read Classification

Reads are classified as PASS or FAIL:

PASS criteria:

  • Successfully oriented (forward or reverse)
  • All barcode elements identified
  • All whitelisted barcodes match within threshold

FAIL reasons:

  • ambiguous_orientation - Cannot determine read orientation
  • {element_name}-min_lv_dist_{N}_exceeds_threshold_{T} - Barcode doesn't match whitelist
  • {element_name}-ambiguous_lv_match_and_ham_dist_exceeds_threshold_{T} - Multiple equally-good matches

5. Output Generation

  • Passing reads: Written with parsed barcodes in header (trimmed sequence)
  • Failing reads: Written with original sequence and failure reason
  • Log file: Detailed per-read results and summary statistics
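The gzipped FASTQ output step boils down to writing four-line records through gzip; a self-contained sketch (illustrative file naming from this README, not the script's actual code):

```python
import gzip
import os
import tempfile

# Sketch of the gzipped FASTQ output step (illustrative, not the script's code).
def write_fastq_record(fh, header, seq, qual):
    # One FASTQ record: header line, sequence, separator, qualities.
    fh.write(f"{header}\n{seq}\n+\n{qual}\n".encode())

out = os.path.join(tempfile.mkdtemp(), "processed_sample-passing.fastq.gz")
with gzip.open(out, "wb") as passing:
    write_fastq_record(passing, "@read_001 UB:Z:CAAAACTG", "ACGTACGT", "IIIIIIII")

with gzip.open(out, "rt") as fh:
    print(fh.readline().strip())  # @read_001 UB:Z:CAAAACTG
```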

Troubleshooting

Low passing rate

  • Use --test_run flag to visualize alignment results for the first 50 reads
  • Check that your YAML config matches your actual library design
  • Verify primer sequences in the configuration match your protocol
  • Check that barcode whitelist files are accessible and formatted correctly

Reads failing orientation

  • Ensure both 5' and 3' primers are correct
  • Check that the linker sequence is accurate
  • Verify min_poly_length is appropriate for your data

Reads failing barcode parsing

  • Check that barcode lengths in config match actual lengths
  • Verify whitelist files contain the expected sequences
  • Consider if Levenshtein/Hamming thresholds need adjustment (hardcoded at 2)

ImportError for dependencies

Ensure the conda environment is activated:

conda activate bcparser

Memory issues with large files

The pipeline processes reads sequentially and should handle large files efficiently. If issues persist, process files individually rather than in batch.

Citation

If you use this pipeline in your research, please cite appropriately.

License

[Add your license information here]

Contact

[Add your contact information here]

About

Python-based toolkit for flexible analysis of Scale Biosciences single-cell long-read sequencing data
