dhglab/scalebcparse
Scale Barcode Parsing Pipeline

A Python-based pipeline for processing long-read sequencing data to identify, orient, trim, and parse barcode elements from FASTQ files.

Overview

This script processes long-read sequencing data from FASTQ files to:

  • Identify and orient reads based on primer sequences
  • Detect and trim polyA/T tails
  • Parse barcode elements (barcodes, UMIs, adapters, etc.)
  • Segregate reads into passing and failing files
  • Generate detailed processing logs

Installation

1. Install Conda Environment

Create the conda environment from the provided YAML file:

conda env create -f scalebcparser.yaml

2. Activate the Environment

conda activate bcparser

3. Verify Installation

Check that all dependencies are installed:

python -c "import parasail, pyfastx, yaml, Levenshtein, rich; print('All dependencies installed successfully!')"

Usage

Basic Usage

python scale_bc_parser.py \
  -i input.fastq.gz \
  -o output_prefix \
  -c config.yaml

Full Command-Line Options

Required Arguments:
  -i, --input           Comma-separated list of input FASTQ files (can be gzipped)
  -o, --output_prefix   Output prefix for FASTQ files (outputs will be gzipped)
  -c, --config          YAML configuration file defining barcode structure

Optional Arguments:
  --header_template     Override header template from YAML configuration
                        (default: uses template from config file)
  --trim_polyA          Enable trimming of polyA/T tails from transcript sequence
                        (default: False - polyA/T tails are NOT trimmed)
  --test_run            Process and print only the first 50 reads with colorized output (for testing)
                        (default: False - processes all reads)
  --verbose             Turn on verbose logging for interactive runs
                        (default: False - minimal output)

Default Parameters

The following default parameters are used unless overridden:

  • Minimum polyA/T length: 20 bp (set in YAML config file)
  • Levenshtein distance threshold: 2 (for barcode matching)
  • Hamming distance threshold: 2 (for tie-breaking)
  • Base tolerance: 5 bp (for linker-to-3' primer distance validation)
  • Max alignment hits: 3 for primers, 5 for linkers
  • Alignment scoring: Match = 5, Mismatch = -1, N-to-base = 2
  • Gap penalties: Open = 10, Extend = 1

Example Commands

1. Standard Processing

Process a single FASTQ file with default settings:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o processed_sample \
  -c barcode_config.yaml

2. Multiple Input Files

Process multiple FASTQ files in a single run:

python scale_bc_parser.py \
  -i file1.fastq.gz,file2.fastq.gz,file3.fastq.gz \
  -o combined_output \
  -c barcode_config.yaml

3. With PolyA Trimming

Enable polyA/T tail trimming:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o processed_sample \
  -c barcode_config.yaml \
  --trim_polyA

4. Test Run

Process only the first 50 reads, with colorized alignment output, for testing:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o test_output \
  -c barcode_config.yaml \
  --test_run

5. Custom Header Template

Override the header template defined in the config file:

python scale_bc_parser.py \
  -i sample_reads.fastq.gz \
  -o processed_sample \
  -c barcode_config.yaml \
  --header_template "{existing_read_name} BC={barcode_id} UMI={umi}"

Header Formatting

Available Template Variables

The header template uses Python string formatting with curly braces {} to insert parsed barcode information. Available variables include:

  • {existing_read_name} - Original read ID (without spaces or description)
  • {existing_header} - Full original header line
  • Any element name from your config - e.g., {umi}, {rt_scale}, {lig_bc_scale}
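Under the hood this is ordinary Python string formatting; the following sketch shows how a template and a dictionary of parsed fields combine (the field values here are hypothetical examples, not output of the script):

```python
# Sketch of header-template filling via Python's str.format.
# The parsed values below are hypothetical examples.
template = "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"
fields = {
    "existing_read_name": "@read_001",
    "umi": "CAAAACTG",         # 8 bp UMI
    "rt_scale": "TACTATTAGA",  # 10 bp RT barcode
    "lig_bc_scale": "AAAGGCTGG",  # 9 bp ligation barcode
}
header = template.format(**fields)
print(header)  # @read_001 UB:Z:CAAAACTG CB:Z:TACTATTAGAAAAGGCTGG
```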

Header Template Examples

Example 1: SAM/BAM compatible format with cell barcode and UMI

header_template: "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"

Output: @read_001 UB:Z:CAAAACTG CB:Z:TACTATTAGAAAAGGCTGG

Example 2: Simple space-separated format

header_template: "{existing_read_name} BC={rt_scale}{lig_bc_scale} UMI={umi}"

Output: @read_001 BC=TACTATTAGAAAAGGCTGG UMI=CAAAACTG

Example 3: Tab-separated format (use \t)

header_template: "{existing_read_name}\tBC:{rt_scale}{lig_bc_scale}\tUMI:{umi}"

Output: @read_001 BC:TACTATTAGAAAAGGCTGG UMI:CAAAACTG (the separators are literal tab characters, rendered here as spaces)

Output Modes

Each element in the configuration can specify how it should appear in the output header:

  • sequence - Output the actual nucleotide sequence
  • header - Output the name/ID from the whitelist (for whitelisted elements)
  • skip - Don't include this element in the output header

Output Files

The pipeline generates three output files:

  1. {output_prefix}-passing.fastq.gz - Reads that successfully passed all filters with parsed barcodes in headers
  2. {output_prefix}-failing.fastq.gz - Reads that failed orientation or barcode parsing
  3. {output_prefix}.log - Detailed processing log with per-read statistics and summary

Log File Format

The log file contains:

  • Individual read processing results (PASS/FAIL status)
  • Orientation information or failure reasons
  • Parsed barcode assignments for each read
  • Summary statistics at the end

Configuration File

The script requires a YAML configuration file that defines the expected sequence structure, barcode whitelists, and output formatting. The configuration consists of two main sections: sequence_structure and output_format.

YAML Structure Overview

sequence_structure:
  - name: element_name          # Unique identifier
    type: restricted/unrestricted
    restriction_type: whitelist/length  # For restricted elements
    options: [SEQ1, SEQ2]       # Inline sequences (OR)
    options_file: path/to/file  # FASTA file with sequences
    length: N                   # Expected length in bp

output_format:
  include_fields:
    - name: element_name
      output_mode: sequence/header/skip
  header_template: "format string"

# Optional
min_poly_length: 20  # Minimum polyA/T tail length (default: 20)

Sequence Structure Elements

The sequence_structure section defines the expected order of elements in your reads, from 5' to 3'.

Element Types

1. Restricted Elements (type: restricted)

  • Have a known sequence or set of sequences
  • Can use whitelists or length constraints

2. Unrestricted Elements (type: unrestricted)

  • Variable sequence (e.g., cDNA transcript)
  • Must have exactly ONE transcript element in your structure

Restriction Types

1. Whitelist (restriction_type: whitelist)

  • Element must match one of the provided sequences
  • Sequences can be provided inline with options: or from a file with options_file:
  • Uses Levenshtein and Hamming distance for fuzzy matching

2. Length (restriction_type: length)

  • Only checks that the element is the expected length
  • Accepts any sequence of that length (e.g., for UMIs)
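The whitelist matching rule (minimum Levenshtein distance, Hamming distance for tie-breaking, both thresholded at the default of 2) can be sketched as follows. This is a self-contained illustration, not the script's actual implementation, which uses the Levenshtein package:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def hamming(a, b):
    # Positional mismatches, counting any length difference as mismatches.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def match_whitelist(seq, whitelist, lv_threshold=2, ham_threshold=2):
    # Find whitelist entries at minimal Levenshtein distance from seq.
    dists = {name: levenshtein(seq, wl) for name, wl in whitelist.items()}
    best = min(dists.values())
    if best > lv_threshold:
        return None  # min_lv_dist exceeds threshold -> read fails
    ties = [name for name, d in dists.items() if d == best]
    if len(ties) == 1:
        return ties[0]
    # Break ties on Hamming distance.
    ham = {name: hamming(seq, whitelist[name]) for name in ties}
    best_h = min(ham.values())
    winners = [n for n, h in ham.items() if h == best_h]
    if len(winners) == 1 and best_h <= ham_threshold:
        return winners[0]
    return None  # ambiguous match -> read fails
```

For example, `match_whitelist("AAAAAAAATA", {"bc1": "AAAAAAAAAA", "bc2": "TTTTTTTTTT"})` returns `"bc1"` (one substitution away), while a sequence far from every whitelist entry returns `None`.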

Example Configuration: ont_scale.yaml

Here's a complete example for Scale Biosciences + ONT:

sequence_structure:
  # 5' TSO primer (known sequence)
  - name: tso
    type: restricted
    restriction_type: whitelist
    options:
      - AAGCAGTGGTATCAACGCAGAGTACATGGG
    length: 30

  # cDNA transcript (variable sequence)
  - name: transcript
    type: unrestricted

  # RT barcode from whitelist file
  - name: rt_scale
    type: restricted
    restriction_type: whitelist
    options_file: "rt_scale.5p_3p.fasta"
    length: 10

  # UMI (any 8bp sequence)
  - name: umi
    type: restricted
    restriction_type: length
    length: 8

  # Linker (known sequence)
  - name: linker
    type: restricted
    restriction_type: whitelist
    options:
      - GCTCTGA
    length: 7

  # Ligation barcode from whitelist file
  - name: lig_bc_scale
    type: restricted
    restriction_type: whitelist
    options_file: "lig_bc_scale.5p_3p.fasta"
    length: 9

  # 3' RTP primer (known sequence)
  - name: rtp
    type: restricted
    restriction_type: whitelist
    options:
      - ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    length: 33

output_format:
  include_fields:
    - name: umi
      output_mode: sequence     # Output nucleotide sequence

    - name: rt_scale
      output_mode: sequence     # Output nucleotide sequence

    - name: lig_bc_scale
      output_mode: sequence     # Output nucleotide sequence

  # SAM/BAM compatible format
  header_template: "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"

Key Points

  1. Element Order Matters: List elements in their 5' to 3' order in the read
  2. Transcript Element Required: Must have exactly one transcript element
  3. First and Last Elements: Should be your 5' and 3' primers (used for orientation)
  4. Linker Element: If present, used for additional validation of structure
  5. Options vs Options File: Use options: for a few sequences, options_file: for many
  6. Output Only What You Need: Elements not listed in include_fields are still validated but not included in the output header
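The structural rules above can be expressed as a short validation pass. This is illustrative only; field names follow this README's schema, but it is not the script's actual validation code:

```python
# Illustrative check of the "Key Points" rules on a parsed sequence_structure list.
# Not the script's actual code; field names follow this README's schema.
def validate_structure(elements):
    transcripts = [e for e in elements if e.get("type") == "unrestricted"]
    if len(transcripts) != 1:
        raise ValueError("exactly one unrestricted (transcript) element is required")
    for e in elements:
        if e.get("type") == "restricted" and e.get("restriction_type") == "whitelist":
            if "options" not in e and "options_file" not in e:
                raise ValueError(f"{e['name']}: provide options or options_file")
    return True
```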

Barcode Whitelist Files

If using options_file:, provide a FASTA file with barcode sequences in 5' to 3' orientation:

>barcode_01
AAAAAAAAAA
>barcode_02
TTTTTTTTTT
>barcode_03
CCCCCCCCCC

Important Notes:

  • All sequences must be in 5' to 3' orientation
  • The pipeline automatically handles reverse complementation during orientation detection
  • When output_mode: sequence - outputs the nucleotide sequence
  • When output_mode: header - outputs the FASTA header (e.g., "barcode_01")
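Reverse complementation here is the standard base-pair mapping; a minimal sketch (not the pipeline's actual helper), shown on the linker sequence from the example config:

```python
# Minimal reverse-complement helper (illustrative, not the pipeline's actual code).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    # Complement each base, then reverse the sequence.
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("GCTCTGA"))  # TCAGAGC
```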

Requirements

Python Version

  • Python 3.11

Dependencies

  • parasail-python - Smith-Waterman alignment
  • pyfastx - Fast FASTQ parsing
  • pyyaml - YAML configuration parsing
  • levenshtein - Sequence distance calculations
  • rich - Colored terminal output
  • ipykernel - Jupyter notebook support (optional)

All dependencies are automatically installed via the conda environment.

How It Works

The pipeline processes each read through the following steps:

1. Orientation Detection

  • Aligns 5' and 3' primers (and their reverse complements) to the read
  • Detects polyA and polyT tails
  • Identifies linker sequences
  • Determines if the read is in forward or reverse orientation
  • Validates the expected distance between linker and 3' primer
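PolyA/T tail detection amounts to finding a long homopolymer run; the sketch below uses an exact-match run with the min_poly_length default of 20 described above (a simplification; the actual pipeline's detection may tolerate mismatches):

```python
import re

# Illustrative polyA/polyT tail detection (a sketch, not the pipeline's code).
# min_len mirrors the min_poly_length default of 20.
def find_poly_tail(seq, min_len=20):
    for base, label in (("A", "polyA"), ("T", "polyT")):
        m = re.search(f"{base}{{{min_len},}}", seq)
        if m:
            return label, m.start(), m.end()
    return None  # no qualifying tail found

read = "ACGT" + "A" * 25 + "GATTACA"
print(find_poly_tail(read))  # ('polyA', 4, 29)
```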

2. Sequence Trimming

  • Extracts the cDNA transcript between primers
  • Optionally trims polyA/T tails (if --trim_polyA is enabled)
  • Extracts the barcode region for parsing

3. Barcode Parsing

  • Identifies each barcode element using Smith-Waterman alignment
  • For whitelisted elements: matches against known sequences using:
    • Levenshtein distance (edit distance) for fuzzy matching
    • Hamming distance for tie-breaking when multiple matches exist
  • For length-only elements: extracts sequence of expected length

4. Read Classification

Reads are classified as PASS or FAIL:

PASS criteria:

  • Successfully oriented (forward or reverse)
  • All barcode elements identified
  • All whitelisted barcodes match within threshold

FAIL reasons:

  • ambiguous_orientation - Cannot determine read orientation
  • {element_name}-min_lv_dist_{N}_exceeds_threshold_{T} - Barcode doesn't match whitelist
  • {element_name}-ambiguous_lv_match_and_ham_dist_exceeds_threshold_{T} - Multiple equally-good matches

5. Output Generation

  • Passing reads: Written with parsed barcodes in header (trimmed sequence)
  • Failing reads: Written with original sequence and failure reason
  • Log file: Detailed per-read results and summary statistics
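The gzipped FASTQ output step boils down to writing four-line records through gzip; a self-contained sketch (illustrative file naming from this README, not the script's actual code):

```python
import gzip
import os
import tempfile

# Sketch of the gzipped FASTQ output step (illustrative, not the script's code).
def write_fastq_record(fh, header, seq, qual):
    # One FASTQ record: header line, sequence, separator, qualities.
    fh.write(f"{header}\n{seq}\n+\n{qual}\n".encode())

out = os.path.join(tempfile.mkdtemp(), "processed_sample-passing.fastq.gz")
with gzip.open(out, "wb") as passing:
    write_fastq_record(passing, "@read_001 UB:Z:CAAAACTG", "ACGTACGT", "IIIIIIII")

with gzip.open(out, "rt") as fh:
    print(fh.readline().strip())  # @read_001 UB:Z:CAAAACTG
```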

Troubleshooting

Low passing rate

  • Use --test_run flag to visualize alignment results for the first 50 reads
  • Check that your YAML config matches your actual library design
  • Verify primer sequences in the configuration match your protocol
  • Check that barcode whitelist files are accessible and formatted correctly

Reads failing orientation

  • Ensure both 5' and 3' primers are correct
  • Check that the linker sequence is accurate
  • Verify min_poly_length is appropriate for your data

Reads failing barcode parsing

  • Check that barcode lengths in config match actual lengths
  • Verify whitelist files contain the expected sequences
  • Consider if Levenshtein/Hamming thresholds need adjustment (hardcoded at 2)

ImportError for dependencies

Ensure the conda environment is activated:

conda activate bcparser

Memory issues with large files

The pipeline processes reads sequentially and should handle large files efficiently. If issues persist, process files individually rather than in batch.

Citation

If you use this pipeline in your research, please cite appropriately.

License

[Add your license information here]

Contact

[Add your contact information here]

About

Python-based toolkit for flexible analysis of Scale Biosciences single-cell long-read sequencing data
