Decomposes satellite DNA arrays into monomers within telomere-to-telomere (T2T) assemblies. Ideal for analyzing centromeric and pericentromeric regions at the monomer level.
Status: Production ready. Successfully handles arrays from kilobase to megabase scale.
Key Features:
- De novo monomer identification without prior knowledge
- Autocorrelation-based period detection for robust periodicity analysis
- Automatic orientation to canonical form (A>T, C>G)
- Deterministic output sorted by genomic coordinates
- Multi-threaded processing
Performance: The CHM13v2.0 assembly (~1300 alpha satellite arrays) is processed in ~3.5 minutes on 16 threads.
    pip install arraysplitter

Or build from source:

    cd src/rust/arraysplitter
    cargo build --release

Usage:

    # Basic decomposition
    arraysplitter -i arrays.fa -o output_prefix -t 16
    # With predefined cut sequences
    arraysplitter -i arrays.fa -o output_prefix -c ATG,CGCG -t 16
    # Show version
    arraysplitter --version

All output is deterministically sorted by chromosome and genomic position (chr1 → chr22 → chrX → chrY → chrM).
| File | Description |
|---|---|
| `.decomposed.fasta` | Monomers with orientation info in headers |
| `.hors.tsv` | HOR-level decomposition (16 columns) |
| `.monomers.tsv` | Base-level monomers from recursive decomposition (17 columns) |
| `.summary.tsv` | One-row-per-array summary with HOR and monomer statistics (23 columns) |
| `.lengths` | Fragment lengths for each array |
The `.summary.tsv` file contains one row per array, combining HOR-level and monomer-level statistics. It is useful for overview analysis.
| Column | Description |
|---|---|
| `array_id` | Array identifier (chr_start_end_len_period_type) |
| `array_length` | Total array length in bp |
| `orientation` | fwd or rev (reverse complemented to canonical) |
| `method` | Detection method used (autocorr, classic) |
| HOR-level stats | |
| `hor_period` | Detected HOR period in bp |
| `hor_autocorr` | Autocorrelation at HOR period |
| `hor_n_monomers` | Number of HOR-level monomers |
| `hor_mean_ed_tmpl` | Mean edit distance to HOR consensus |
| `hor_mean_ed_prev` | Mean edit distance between adjacent HORs |
| `hor_cv` | Coefficient of variation for HOR lengths |
| `hor_consensus` | Consensus sequence at HOR level |
| `hor_iupac` | IUPAC ambiguity codes (bases ≥20% frequency) |
| `hor_quality` | Per-position support (digit 0-9, 9=90-100%) |
| Monomer-level stats | |
| `mono_period` | Median base monomer period |
| `mono_autocorr` | Mean autocorrelation at monomer level |
| `mono_n_monomers` | Total number of base monomers |
| `mono_mean_ed_tmpl` | Mean edit distance to monomer consensus |
| `mono_mean_ed_prev` | Mean edit distance between adjacent monomers |
| `mono_cv` | Mean coefficient of variation |
| `mono_consensus` | Consensus sequence at monomer level |
| `mono_iupac` | IUPAC ambiguity codes |
| `mono_quality` | Per-position support |
| `cut_sequence` | Anchor k-mer used for splitting |
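
To make the `hor_iupac` and `hor_quality` columns more concrete, here is a minimal Python sketch of how one alignment column could be summarized. It is an illustration only, not ArraySplitter's code: the function name `column_summary`, the assumption that monomers are already aligned so that a column is a plain string of bases, the alphabetical tie-breaking, and the extrapolation of the quality digit to bins below 90% are all assumptions. The 20% threshold and the "9 = 90-100%" bin come from the table above.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they represent.
_IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def column_summary(column, min_freq=0.20):
    """Return (consensus_base, iupac_code, quality_digit) for one column.

    `column` is a string holding the base seen at one position in every
    monomer (hypothetical input format, assumed for this sketch).
    """
    counts = Counter(b for b in column.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N", "N", "0"
    # Most frequent base; ties resolved alphabetically (assumption).
    consensus, top = max(sorted(counts.items()), key=lambda kv: kv[1])
    # Keep every base that reaches the frequency threshold.
    kept = "".join(sorted(b for b, c in counts.items() if c / total >= min_freq))
    # Digit d means support in [10*d %, 10*d + 10 %); 100% is folded into 9.
    quality = str(min(9, (top * 10) // total))
    return consensus, _IUPAC[kept], quality
```

For example, a column with six A's and four G's would yield the consensus base `A`, the ambiguity code `R`, and the quality digit `6` under these assumptions.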
The `.hors.tsv` file contains the primary decomposition into HOR (Higher Order Repeat) monomers, with multiple rows per array.
Row types (in order):
- `pred_array` - Array-level prediction/header row
- `flank` - Terminal fragments <70% of period
- `monomer` - Full HOR monomers (sorted by idx)
- `array` - Summary statistics row
- `consensus` - Consensus sequence row
| Column | Description |
|---|---|
| `array_id` | Array identifier (chr_start_end_len_period_type) |
| `type` | pred_array, monomer, flank, array, consensus |
| `idx` | Monomer index within array (0-based) |
| `length` | Sequence length in bp |
| `source` | Detection method: anchor, split_2x, split_3x, left_flank, right_flank |
| `ed_tmpl` | Edit distance to consensus template |
| `ed_prev` | Edit distance to previous monomer |
| `ed_next` | Edit distance to next monomer |
| `period` | Detected repeat period in bp |
| `autocorr` | Autocorrelation value at detected period |
| `n_expected` | Expected count of monomers (array_len / period) |
| `ed_per_bp` | Normalized edit distance (ed / length) |
| `cv` | Coefficient of variation for lengths |
| `cut_sequence` | Anchor sequence used for splitting |
| `orientation` | fwd or rev (reverse complemented) |
| `sequence` | Actual DNA sequence (or `-` for pred_array/array rows) |
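
The distance and variation columns above can be illustrated with a short Python sketch. It assumes that "edit distance" means plain Levenshtein distance, that `ed_per_bp` normalizes `ed_tmpl`, and that `cv` uses the population standard deviation; none of these details are stated in this document, and the helper names (`edit_distance`, `monomer_rows`, `length_cv`) are hypothetical.

```python
from statistics import mean, pstdev


def edit_distance(a, b):
    """Levenshtein distance using two rolling rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def monomer_rows(monomers, consensus):
    """Per-monomer ed_tmpl / ed_prev / ed_next / ed_per_bp values."""
    rows = []
    for i, seq in enumerate(monomers):
        ed_tmpl = edit_distance(seq, consensus)
        rows.append({
            "idx": i,
            "length": len(seq),
            "ed_tmpl": ed_tmpl,
            "ed_prev": edit_distance(seq, monomers[i - 1]) if i > 0 else None,
            "ed_next": edit_distance(seq, monomers[i + 1]) if i + 1 < len(monomers) else None,
            "ed_per_bp": ed_tmpl / len(seq) if seq else 0.0,
        })
    return rows


def length_cv(lengths):
    """Coefficient of variation: standard deviation divided by the mean."""
    return pstdev(lengths) / mean(lengths)
```

In this sketch the first and last monomers have no previous or next neighbor, so those fields are `None`; how the real output encodes that case is not specified here.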
The `.monomers.tsv` file contains base-level monomers after recursive HOR decomposition. It uses the same unified format as `.hors.tsv`, plus a `parent_idx` column.
Each HOR is recursively decomposed until:
- No further periodicity detected (autocorrelation ≤ 0.5)
- Minimum length (5bp) reached
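
The two stopping rules can be sketched in Python as follows. This is a simplified illustration, not the tool's implementation: it cuts each unit into fixed-size chunks of the detected period instead of using anchor-based splitting, it applies the 0.5 threshold to the raw match fraction, and the function names are hypothetical.

```python
def autocorr(seq, offset):
    """Fraction of positions i where seq[i] == seq[i + offset]."""
    comparisons = len(seq) - offset
    if offset <= 0 or comparisons <= 0:
        return 0.0
    matches = sum(seq[i] == seq[i + offset] for i in range(comparisons))
    return matches / comparisons


def strongest_period(seq, min_period):
    """Offset with the highest autocorrelation; ties go to the smallest offset."""
    best_period, best_score = 0, 0.0
    for offset in range(min_period, len(seq) // 2 + 1):
        score = autocorr(seq, offset)
        if score > best_score:
            best_period, best_score = offset, score
    return best_period, best_score


def decompose(seq, min_len=5, threshold=0.5):
    """Recursively split seq until no period scores above the threshold.

    Periods shorter than min_len are never tried, so recursion also stops
    once a unit is too short to contain two copies of a min_len repeat.
    """
    period, score = strongest_period(seq, min_len)
    if period == 0 or score <= threshold:
        return [seq]  # base unit: no further periodicity detected
    pieces = []
    for start in range(0, len(seq), period):
        pieces.extend(decompose(seq[start:start + period], min_len, threshold))
    return pieces
```

On a toy array built from four copies of an 18 bp unit made of three similar 6 bp monomers, the first level finds the 18 bp period and the second level splits each unit into its three monomers.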
Row types (in order):
- `pred_array` - Array-level summary row
- `base_monomer` - Base-level monomers from recursive decomposition
- `monomer` - Non-decomposable monomers (e.g., telomeres)
| Column | Description |
|---|---|
| `array_id` | Array identifier |
| `type` | pred_array, base_monomer, monomer |
| `idx` | Global index within array (0-based) |
| `length` | Sequence length in bp |
| `source` | recursive_anchor, recursive_split, base, recursive_flank |
| `ed_tmpl` | Edit distance to submonomer consensus |
| `ed_prev` | Edit distance to previous base monomer |
| `ed_next` | Edit distance to next base monomer |
| `period` | Detected period at this level (0 if base) |
| `autocorr` | Autocorrelation value |
| `n_expected` | Always 1 for individual monomers |
| `ed_per_bp` | Normalized edit distance |
| `cv` | Coefficient of variation within parent group |
| `cut_sequence` | Inherited anchor sequence |
| `orientation` | Inherited from array (fwd/rev) |
| `parent_idx` | Index of parent HOR from .hors.tsv |
| `sequence` | Actual DNA sequence |
For a typical α-satellite HOR (512bp → 3×171bp monomers):
`.hors.tsv` - 10 HOR monomers (~512bp each):

    array_id         type        idx  length  period  ...
    chr1_centromere  pred_array  10   5120    512     ...
    chr1_centromere  monomer     0    512     512     ...
    chr1_centromere  monomer     1    512     512     ...
    ...
    chr1_centromere  array       10   5120    512     ...
    chr1_centromere  consensus   10   512     512     ... [consensus seq]

`.monomers.tsv` - 30 base monomers (~171bp each):

    array_id         type          idx  length  parent_idx  ...
    chr1_centromere  pred_array    30   5120    -           ...
    chr1_centromere  base_monomer  0    171     0           ...
    chr1_centromere  base_monomer  1    171     0           ...
    chr1_centromere  base_monomer  2    170     0           ...
    chr1_centromere  base_monomer  3    171     1           ...
    ...

`.summary.tsv` - Single row with both levels:

    array_id         length  hor_period  hor_n_monomers  mono_period  mono_n_monomers  ...
    chr1_centromere  5120    512         10              171          30               ...
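
As a usage illustration, the following Python sketch groups base monomers by their parent HOR. It assumes the `.monomers.tsv` file is tab-delimited with a header row carrying the column names shown above, and the function name `monomers_per_hor` is hypothetical.

```python
import csv
from collections import defaultdict


def monomers_per_hor(handle):
    """Map (array_id, parent_idx) to the list of base monomer lengths.

    `handle` is any open text file or file-like object. Rows that are not
    of type base_monomer (for example pred_array, whose parent_idx is "-")
    are skipped before any numeric conversion.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["type"] != "base_monomer":
            continue
        key = (row["array_id"], int(row["parent_idx"]))
        groups[key].append(int(row["length"]))
    return dict(groups)
```

Calling it as `monomers_per_hor(open("output_prefix.monomers.tsv"))` would, for the example above, report three base monomers of 171, 171 and 170 bp under parent HOR 0.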
ArraySplitter employs an autocorrelation-based algorithm for detecting repeat periods and decomposing satellite DNA arrays.
Arrays are oriented to canonical form:
- Primary rule: A > T (more A's than T's)
- Secondary rule: C > G (if A=T)
- Non-canonical arrays are reverse complemented
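
The orientation rules can be written as a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the behavior when both A=T and C=G (keep the sequence as given) is a guess, since the rules above do not cover that case.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_canonical(seq):
    """Return (oriented_sequence, 'fwd' or 'rev')."""
    s = seq.upper()
    a, t, c, g = (s.count(base) for base in "ATCG")
    if a != t:
        is_canonical = a > t    # primary rule: more A's than T's
    else:
        is_canonical = c >= g   # secondary rule; a full tie stays fwd (assumption)
    if is_canonical:
        return seq, "fwd"
    return reverse_complement(seq), "rev"
```

Reverse complementing swaps the A and T counts (and the C and G counts), so a sequence that fails a rule always passes it after flipping.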
The algorithm computes sequence autocorrelation to detect periodicity:
autocorr(offset) = matches / comparisons
Where matches counts identical nucleotides at positions i and i + offset.
Key innovations:
- Random expectation correction: Subtracts expected random match rate based on nucleotide composition
- Refined period search: Uses FFT-like peak detection to find true period vs harmonics
- Confidence scoring: Autocorrelation excess over random indicates detection confidence
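
A minimal Python sketch of the autocorrelation score and the random expectation correction is shown below. It is not the tool's implementation: the baseline here is the probability that two independently drawn bases match given the sequence composition, harmonics are handled only by preferring the smallest offset on ties, and the refined peak search is not reproduced. All function names are hypothetical.

```python
from collections import Counter


def autocorr(seq, offset):
    """matches / comparisons for positions i and i + offset."""
    comparisons = len(seq) - offset
    if offset <= 0 or comparisons <= 0:
        return 0.0
    matches = sum(seq[i] == seq[i + offset] for i in range(comparisons))
    return matches / comparisons


def random_match_rate(seq):
    """Expected match rate for two random positions: sum of squared base frequencies."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum((count / n) ** 2 for count in Counter(seq).values())


def best_period(seq, min_period=2, max_period=None):
    """Return (period, excess) with the largest autocorrelation excess over random."""
    if max_period is None:
        max_period = len(seq) // 2
    baseline = random_match_rate(seq)
    best = (0, 0.0)
    for offset in range(min_period, max_period + 1):
        excess = autocorr(seq, offset) - baseline
        if excess > best[1]:  # strict comparison keeps the smallest offset on ties
            best = (offset, excess)
    return best
```

For a perfect tandem repeat the autocorrelation is 1.0 at the true period and at every multiple of it, which is why the tie-breaking rule matters even in this simplified form.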
For the detected period, finds optimal anchor (cut sequence) using:
- K-mer enumeration: Extract all k-mers (k=10 by default) from the sequence
- Position analysis: For each k-mer, record all occurrence positions
- Scoring metrics:
  - Uniqueness: Fraction of occurrences exactly `period` apart
  - Regularity: How evenly spaced the occurrences are
- Combined score: `uniqueness × regularity`
- Deterministic selection: K-mers sorted lexicographically for reproducible tie-breaking
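
The anchor search can be sketched in Python as follows. The exact formulas are not given in this document, so both metrics here are assumptions: uniqueness is taken as the fraction of consecutive occurrence gaps equal to the period, and regularity as `1 / (1 + CV of the gaps)`. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def kmer_positions(seq, k=10):
    """Map every k-mer to the sorted list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def score_kmer(positions, period):
    """Combined score uniqueness * regularity (both formulas are assumptions)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0  # a k-mer seen once cannot define a period
    uniqueness = sum(1 for g in gaps if g == period) / len(gaps)
    regularity = 1.0 / (1.0 + pstdev(gaps) / mean(gaps))
    return uniqueness * regularity


def select_anchor(seq, period, k=10):
    """Return (best_kmer, score); ties go to the lexicographically smallest k-mer."""
    positions = kmer_positions(seq, k)
    best_kmer, best_score = None, 0.0
    for kmer in sorted(positions):  # sorted order makes tie-breaking reproducible
        score = score_kmer(positions[kmer], period)
        if score > best_score:
            best_kmer, best_score = kmer, score
    return best_kmer, best_score
```

Iterating over the k-mers in sorted order and accepting only strictly better scores means that the same input always yields the same anchor, regardless of dictionary ordering or thread scheduling.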
Using the selected anchor:
- Split array at all anchor occurrences
- First fragment → left flank (if < 70% of period)
- Middle fragments → monomers
- Last fragment → right flank (if < 70% of period)
- Apply heuristics for multiplet splitting (doublets, triplets, etc.)
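
The splitting step, without the multiplet heuristics, can be sketched in Python as follows. The function name and the label strings are illustrative, and the sketch assumes the anchor marks the start of each fragment; only the 70% flank threshold comes from the list above.

```python
def split_at_anchor(seq, anchor, period, flank_fraction=0.7):
    """Cut seq before every anchor occurrence and label the fragments.

    Returns a list of (label, fragment) pairs, where label is one of
    'left_flank', 'monomer' or 'right_flank'. Doublet and triplet
    splitting is not attempted in this sketch.
    """
    if not seq or not anchor:
        return []
    cuts = []
    pos = seq.find(anchor)
    while pos != -1:
        cuts.append(pos)
        pos = seq.find(anchor, pos + 1)
    # Fragment boundaries: sequence start, every cut after position 0, sequence end.
    bounds = [0] + [c for c in cuts if c > 0] + [len(seq)]
    fragments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    last = len(fragments) - 1
    labelled = []
    for i, frag in enumerate(fragments):
        short = len(frag) < flank_fraction * period
        if i == 0 and short:
            label = "left_flank"
        elif i == last and short:
            label = "right_flank"
        else:
            label = "monomer"
        labelled.append((label, frag))
    return labelled
```

A fragment roughly two or three periods long would indicate a missed anchor; that is the case the multiplet heuristics address, and it is left out here.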
Results are:
- Sorted by chromosome (natural order: 1, 2, ..., 22, X, Y, M)
- Within chromosome, sorted by start position
- Fully deterministic across runs
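
A natural chromosome ordering of this kind can be expressed as a sort key. The sketch below is an assumption about how such a key might look, not the tool's code: the function names are hypothetical, and the placement of unrecognized names (after M, sorted alphabetically) is a guess.

```python
def chrom_key(name):
    """Sort key giving 1, 2, ..., 22, X, Y, M, then any other names."""
    core = name[3:] if name.lower().startswith("chr") else name
    if core.isdecimal():
        return (0, int(core), "")  # numeric chromosomes, in numeric order
    special = {"X": 0, "Y": 1, "M": 2, "MT": 2}
    if core.upper() in special:
        return (1, special[core.upper()], "")
    return (2, 0, core)            # unplaced contigs etc. (assumption)


def sort_arrays(records):
    """Sort (chrom, start, ...) tuples by chromosome, then by start position."""
    return sorted(records, key=lambda r: (chrom_key(r[0]), r[1]))
```

Because the key is a pure function of the record, the resulting order does not depend on the order in which worker threads finish.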
The `autocorr` method (default) uses autocorrelation for period detection. Best for:
- Regular tandem repeats
- Alpha satellite arrays
- HOR (Higher Order Repeat) structures
The `classic` method uses a frequency suffix tree approach. Better for:
- Irregular or degenerate repeats
- Very short arrays
- Arrays with high mutation rates
The `both` method tries autocorrelation first and falls back to `classic` if autocorrelation fails.
    arraysplitter --help

    Options:
      -i, --input <FILE>      Input FASTA file
      -o, --output <PREFIX>   Output prefix
      -t, --threads <N>       Number of threads [default: all cores]
      -c, --cuts <SEQ,SEQ>    Predefined cut sequences (comma-separated)
      -d, --depth <N>         Max depth for cut search [default: 100]
          --method <METHOD>   Detection method: autocorr, classic, both [default: autocorr]
          --max-ed-len <N>    Max monomer length for edit distance [default: 10000]
          --stats             Print detailed statistics
          --top-outliers <N>  Number of outliers to show [default: 10]
      -V, --version           Print version
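
For scripted runs, the options above can be assembled from Python. The helper below is a hypothetical convenience wrapper, not part of ArraySplitter; it only builds an argument list from the flags listed in the help text and does not run anything itself.

```python
def build_command(input_fasta, output_prefix, threads=None, method="autocorr", cuts=None):
    """Build an arraysplitter argument list from the documented options."""
    if method not in ("autocorr", "classic", "both"):
        raise ValueError(f"unknown method: {method!r}")
    cmd = ["arraysplitter", "-i", input_fasta, "-o", output_prefix, "--method", method]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if cuts:
        cmd += ["-c", ",".join(cuts)]  # comma-separated, as in the usage example
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with file names.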
If you use ArraySplitter in your research, please cite: [Publication pending]
For questions or support: ad3002@gmail.com