A Python-based pipeline for processing long-read sequencing data to identify, orient, trim, and parse barcode elements from FASTQ files.
This script processes long-read sequencing data from FASTQ files to:
- Identify and orient reads based on primer sequences
- Detect and trim polyA/T tails
- Parse barcode elements (barcodes, UMIs, adapters, etc.)
- Segregate reads into passing and failing files
- Generate detailed processing logs
Create the conda environment from the provided YAML file:
conda env create -f scalebcparser.yamlconda activate bcparserCheck that all dependencies are installed:
python -c "import parasail, pyfastx, yaml, Levenshtein, rich; print('All dependencies installed successfully!')"python scale_bc_parser.py \
-i input.fastq.gz \
-o output_prefix \
-c config.yamlRequired Arguments:
-i, --input Comma-separated list of input FASTQ files (can be gzipped)
-o, --output_prefix Output prefix for FASTQ files (outputs will be gzipped)
-c, --config YAML configuration file defining barcode structure
Optional Arguments:
--header_template Override header template from YAML configuration
(default: uses template from config file)
--trim_polyA Enable trimming of polyA/T tails from transcript sequence
(default: False - polyA/T tails are NOT trimmed)
--test_run Process and print only 50 reads with colorful outputs (for testing)
(default: False - processes all reads)
--verbose Turn on verbose logging for interactive runs
(default: False - minimal output)
The following default parameters are used unless overridden:
- Minimum polyA/T length: 20 bp (set in YAML config file)
- Levenshtein distance threshold: 2 (for barcode matching)
- Hamming distance threshold: 2 (for tie-breaking)
- Base tolerance: 5 bp (for linker-to-3' primer distance validation)
- Max alignment hits: 3 for primers, 5 for linkers
- Alignment scoring: Match = 5, Mismatch = -1, N-to-base = 2
- Gap penalties: Open = 10, Extend = 1
Process a single FASTQ file with default settings:
python scale_bc_parser.py \
-i sample_reads.fastq.gz \
-o processed_sample \
-c barcode_config.yamlProcess multiple FASTQ files in a single run:
python scale_bc_parser.py \
-i file1.fastq.gz,file2.fastq.gz,file3.fastq.gz \
-o combined_output \
-c barcode_config.yamlEnable polyA/T tail trimming:
python scale_bc_parser.py \
-i sample_reads.fastq.gz \
-o processed_sample \
-c barcode_config.yaml \
--trim_polyARun on a subset of reads (50) with verbose output for testing:
python scale_bc_parser.py \
-i sample_reads.fastq.gz \
-o test_output \
-c barcode_config.yaml \
--test_runOverride the header template defined in the config file:
python scale_bc_parser.py \
-i sample_reads.fastq.gz \
-o processed_sample \
-c barcode_config.yaml \
--header_template "{existing_read_name} BC={barcode_id} UMI={umi}"The header template uses Python string formatting with curly braces {} to insert parsed barcode information. Available variables include:
{existing_read_name}- Original read ID (without spaces or description){existing_header}- Full original header line- Any element name from your config - e.g.,
{umi},{rt_scale},{lig_bc_scale}
Example 1: SAM/BAM compatible format with cell barcode and UMI
header_template: "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"Output: @read_001 UB:Z:CAAAACTG CB:Z:TACTATTAGAAAAGGCTGG
Example 2: Simple space-separated format
header_template: "{existing_read_name} BC={rt_scale}{lig_bc_scale} UMI={umi}"Output: @read_001 BC=TACTATTAGAAAAGGCTGG UMI=CAAAACTG
Example 3: Tab-separated format (use \t)
header_template: "{existing_read_name}\tBC:{rt_scale}{lig_bc_scale}\tUMI:{umi}"Output: @read_001 BC:TACTATTAGAAAAGGCTGG UMI:CAAAACTG
Each element in the configuration can specify how it should appear in the output header:
sequence- Output the actual nucleotide sequenceheader- Output the name/ID from the whitelist (for whitelisted elements)skip- Don't include this element in the output header
The pipeline generates three output files:
{output_prefix}-passing.fastq.gz- Reads that successfully passed all filters with parsed barcodes in headers{output_prefix}-failing.fastq.gz- Reads that failed orientation or barcode parsing{output_prefix}.log- Detailed processing log with per-read statistics and summary
The log file contains:
- Individual read processing results (PASS/FAIL status)
- Orientation information or failure reasons
- Parsed barcode assignments for each read
- Summary statistics at the end
The script requires a YAML configuration file that defines the expected sequence structure, barcode whitelists, and output formatting. The configuration consists of two main sections: sequence_structure and output_format.
sequence_structure:
- name: element_name # Unique identifier
type: restricted/unrestricted
restriction_type: whitelist/length # For restricted elements
options: [SEQ1, SEQ2] # Inline sequences (OR)
options_file: path/to/file # FASTA file with sequences
length: N # Expected length in bp
output_format:
include_fields:
- name: element_name
output_mode: sequence/header/skip
header_template: "format string"
# Optional
min_poly_length: 20 # Minimum polyA/T tail length (default: 20)The sequence_structure section defines the expected order of elements in your reads, from 5' to 3'.
1. Restricted Elements (type: restricted)
- Have a known sequence or set of sequences
- Can use whitelists or length constraints
2. Unrestricted Elements (type: unrestricted)
- Variable sequence (e.g., cDNA transcript)
- Must have exactly ONE
transcriptelement in your structure
1. Whitelist (restriction_type: whitelist)
- Element must match one of the provided sequences
- Sequences can be provided inline with
options:or from a file withoptions_file: - Uses Levenshtein and Hamming distance for fuzzy matching
2. Length (restriction_type: length)
- Only checks that the element is the expected length
- Accepts any sequence of that length (e.g., for UMIs)
Here's a complete example for Scale Bioscience + ONT:
sequence_structure:
# 5' TSO primer (known sequence)
- name: tso
type: restricted
restriction_type: whitelist
options:
- AAGCAGTGGTATCAACGCAGAGTACATGGG
length: 30
# cDNA transcript (variable sequence)
- name: transcript
type: unrestricted
# RT barcode from whitelist file
- name: rt_scale
type: restricted
restriction_type: whitelist
options_file: "rt_scale.5p_3p.fasta"
length: 10
# UMI (any 8bp sequence)
- name: umi
type: restricted
restriction_type: length
length: 8
# Linker (known sequence)
- name: linker
type: restricted
restriction_type: whitelist
options:
- GCTCTGA
length: 7
# Ligation barcode from whitelist file
- name: lig_bc_scale
type: restricted
restriction_type: whitelist
options_file: "lig_bc_scale.5p_3p.fasta"
length: 9
# 3' RTP primer (known sequence)
- name: rtp
type: restricted
restriction_type: whitelist
options:
- ACACTCTTTCCCTACACGACGCTCTTCCGATCT
length: 33
output_format:
include_fields:
- name: umi
output_mode: sequence # Output nucleotide sequence
- name: rt_scale
output_mode: sequence # Output nucleotide sequence
- name: lig_bc_scale
output_mode: sequence # Output nucleotide sequence
# SAM/BAM compatible format
header_template: "{existing_read_name} UB:Z:{umi} CB:Z:{rt_scale}{lig_bc_scale}"- Element Order Matters: List elements in their 5' to 3' order in the read
- Transcript Element Required: Must have exactly one
transcriptelement - First and Last Elements: Should be your 5' and 3' primers (used for orientation)
- Linker Element: If present, used for additional validation of structure
- Options vs Options File: Use
options:for a few sequences,options_file:for many - Output Only What You Need: Elements not listed in
include_fieldsare still validated but not included in the output header
If using options_file:, provide a FASTA file with barcode sequences in 5' to 3' orientation:
>barcode_01
AAAAAAAAAA
>barcode_02
TTTTTTTTTT
>barcode_03
CCCCCCCCCC
Important Notes:
- All sequences must be in 5' to 3' orientation
- The pipeline automatically handles reverse complementation during orientation detection
- When
output_mode: sequence- outputs the nucleotide sequence - When
output_mode: header- outputs the FASTA header (e.g., "barcode_01")
- Python 3.11
- parasail-python - Smith-Waterman alignment
- pyfastx - Fast FASTQ parsing
- pyyaml - YAML configuration parsing
- levenshtein - Sequence distance calculations
- rich - Colored terminal output
- ipykernel - Jupyter notebook support (optional)
All dependencies are automatically installed via the conda environment.
The pipeline processes each read through the following steps:
- Aligns 5' and 3' primers (and their reverse complements) to the read
- Detects polyA and polyT tails
- Identifies linker sequences
- Determines if the read is in forward or reverse orientation
- Validates the expected distance between linker and 3' primer
- Extracts the cDNA transcript between primers
- Optionally trims polyA/T tails (if
--trim_polyAis enabled) - Extracts the barcode region for parsing
- Identifies each barcode element using Smith-Waterman alignment
- For whitelisted elements: matches against known sequences using:
- Levenshtein distance (edit distance) for fuzzy matching
- Hamming distance for tie-breaking when multiple matches exist
- For length-only elements: extracts sequence of expected length
Reads are classified as PASS or FAIL:
PASS criteria:
- Successfully oriented (forward or reverse)
- All barcode elements identified
- All whitelisted barcodes match within threshold
FAIL reasons:
ambiguous_orientation- Cannot determine read orientation{element_name}-min_lv_dist_{N}_exceeds_threshold_{T}- Barcode doesn't match whitelist{element_name}-ambiguous_lv_match_and_ham_dist_exceeds_threshold_{T}- Multiple equally-good matches
- Passing reads: Written with parsed barcodes in header (trimmed sequence)
- Failing reads: Written with original sequence and failure reason
- Log file: Detailed per-read results and summary statistics
- Use
--test_runflag to visualize alignment results for the first 50 reads - Check that your YAML config matches your actual library design
- Verify primer sequences in the configuration match your protocol
- Check that barcode whitelist files are accessible and formatted correctly
- Ensure both 5' and 3' primers are correct
- Check that the linker sequence is accurate
- Verify
min_poly_lengthis appropriate for your data
- Check that barcode lengths in config match actual lengths
- Verify whitelist files contain the expected sequences
- Consider if Levenshtein/Hamming thresholds need adjustment (hardcoded at 2)
Ensure the conda environment is activated:
conda activate bcparserThe pipeline processes reads sequentially and should handle large files efficiently. If issues persist, process files individually rather than in batch.
If you use this pipeline in your research, please cite appropriately.
[Add your license information here]
[Add your contact information here]