Skip to content

Commit 5c5c819

Browse files
finish draft of BioAlignments package explanation
1 parent 3fd96b9 commit 5c5c819

File tree

1 file changed

+61
-29
lines changed

1 file changed

+61
-29
lines changed

cookbook/02-alignments.md

Lines changed: 61 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -5,52 +5,83 @@ rss_descr = "Align a gene against a reference genome using BioAlignments.jl"
55

66
# Pairwise Alignment
77

8-
On the most basic level, aligners take two sequences and use algorithms to try and "line them up"
8+
On the most basic level, aligners take two sequences and use algorithms to try to "line them up"
99
and look for regions of similarity.
1010

11-
Pairwise alignment differs from multiple sequence alignment (MSA) because.
12-
it only aligns two sequences, while MSA's align three or more.
13-
In a pairwise alignment, there is one reference sequence, and one query sequence,
11+
Pairwise alignment differs from multiple sequence alignment (MSA) because
12+
it only aligns two sequences, while MSAs align three or more.
13+
In a pairwise alignment, there is one reference sequence and one query sequence,
1414
though this may not always be specified by the user.
1515

1616

1717
### Running the Alignment
18-
There are two main parameters for determining how we wish to perform our alignment:
18+
There are two main parameters for determining how we want to perform our alignment:
1919
the alignment type and score/cost model.
2020

21-
The alignment type specifies the alignment range (is the alignment local or global?)
21+
The alignment type specifies the alignment range (local vs global alignment)
2222
and the score/cost model explains how to score insertions and deletions.
2323

2424
#### Alignment Types
2525
Currently, four types of alignments are supported:
26-
- GlobalAlignment: global-to-global alignment
26+
- `GlobalAlignment`: global-to-global alignment
2727
- Aligns sequences end-to-end
2828
- Best for sequences that are already very similar
29-
- SemiGlobalAlignment: local-to-global alignment
30-
- a modification of global alignment that allows the user to specify that gaps will be penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)).
31-
- LocalAlignment: local-to-local alignment
29+
- All of query is aligned to all of reference
30+
- `SemiGlobalAlignment`: local-to-global alignment
31+
- A modification of global alignment that allows the user to specify that gaps are penalty-free at the beginning of one of the sequences and/or at the end of one of the sequences (more information can be found [here](https://www.cs.cmu.edu/~durand/03-711/2023/Lectures/20231001_semi-global.pdf)).
32+
- `LocalAlignment`: local-to-local alignment
3233
- Identifies high-similarity, conserved sub-regions within divergent sequences
3334
- Can occur anywhere in the alignment matrix
34-
- OverlapAlignment: end-free alignment
35-
- a modification of global alignment where gaps at the beginning or end of sequences are permitted
35+
- Maps the query sequence to the most similar region on the reference
36+
- `OverlapAlignment`: end-free alignment
37+
- A modification of global alignment where gaps at the beginning or end of sequences are permitted
3638

37-
Alignment type can also be a distance of two sequences:
38-
- EditDistance
39-
- LevenshteinDistance
40-
- HammingDistance
39+
The alignment type should be selected based on what is already known about the sequences the user is comparing:
40+
- Are the two sequences very similar and we're looking for a couple of small differences?
41+
- Is the query expected to be a nearly exact match within the reference?
42+
- Are we looking at two sequences from wildly divergent organisms?
4143

42-
The alignment type should be selected based on what is already known about the sequences the user is comparing
43-
(Are they very similar and we're looking for a couple of small differences?
44-
Are we expecting the query to be a nearly exact match within the reference?).
45-
and what you may be optimizing for
46-
(Speed for a quick and dirty analysis?
47-
Or do we want to use more resources to do a fine-grained comparison?).
4844

49-
Now that we have a good understanding of how `pairalign` works,
45+
### Cost Model
46+
47+
The cost model provides a way to calculate penalties for differences between the two sequences,
48+
and then finds the alignment that minimizes the total penalty.
49+
`AffineGapScoreModel` is the scoring model currently supported by `BioAlignments.jl`.
50+
It imposes an affine gap penalty for insertions and deletions,
51+
which means that it penalizes the opening of a gap more than a gap extending.
52+
This aligns (pun intended!!) with the biological principle that creating a gap is a rare event,
53+
while extending an already existing gap is less so.
54+
55+
A user can also define their own `CostModel` instead of using `AffineGapScoreModel`.
56+
This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions.
57+
58+
After the cost model is defined, a distance metric is used to quantify and minimize the "cost" (difference) between the two sequences.
59+
60+
These distance metrics are currently supported:
61+
- `EditDistance`
62+
- `LevenshteinDistance`
63+
- `HammingDistance`
64+
65+
This is a complicated topic, and more information can be found in the BioAlignments documentation about the cost model [here](https://biojulia.dev/BioAlignments.jl/stable/pairalign/).
66+
67+
Just like alignment type, the cost model should be selected based on what the user is optimizing for
68+
and what is known about the two sequences.
69+
70+
71+
### Calling BioAlignments to Run the Alignment
72+
73+
Now that we have a good understanding of how `pairalign` works, let's run an example!
5074

5175
```julia
76+
using BioAlignments
77+
78+
s1 =
79+
s2 =
80+
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);
81+
5282
res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment
5383

84+
res
5485
```
5586

5687

@@ -59,17 +90,18 @@ The output of an alignment is a series of `AlignmentAnchor` objects.
5990
This data structure gives information on the position of the start of the alignment,
6091
sections where nucleotides match, as well as where there may be deletions or insertions.
6192

62-
Below is an example Alignment:
93+
Below is an example alignment:
6394
```julia
6495
julia> Alignment([
6596
AlignmentAnchor(0, 4, 0, OP_START),
6697
AlignmentAnchor(4, 8, 4, OP_MATCH),
6798
AlignmentAnchor(4, 12, 8, OP_DELETE)
6899
])
69100
```
70-
In this example, the alignment starts at the 0 position for the query sequence and 4th position for the reference sequence.
71-
The next nucleotides are a match in the query and reference sequence.
72-
The last 8 nucleotides in the alignment are missing/deleted in the query sequence.
101+
In this example, the alignment starts at position 0 for the query sequence and position 4 for the reference sequence.
102+
Although the Julia programming language typically uses 1-based indexing,
103+
this package uses position 0 to refer to the first position.
104+
The next nucleotides are a match in the query and reference sequences.
105+
The last 8 nucleotides in the alignment are deleted in the query sequence.
73106

74-
To understand more about the output of the alignment created using BioAlignments.jl,
75-
more information can be found [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/).
107+
To learn more about the output of the alignment created using BioAlignments.jl, see [here](https://biojulia.dev/BioAlignments.jl/stable/alignments/).

0 commit comments

Comments
 (0)