Consensus string tutorial #21
Merged
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,227 @@ | ||||||
| +++ | ||||||
| using Dates | ||||||
| date = Date("2026-03-02") | ||||||
| title = "Problem 10: Consensus and Profile" | ||||||
| rss_descr = "Solving Rosalind problem CONS — finding a consensus string from a collection of DNA strings — using base Julia, DataFrames, and matrix operations" | ||||||
| +++ | ||||||
|
|
||||||
| # Consensus and Profile | ||||||
|
|
||||||
| 🤔 [Problem link](https://rosalind.info/problems/cons/) | ||||||
|
|
||||||
| > **The Problem** | ||||||
| > | ||||||
| > A matrix is a rectangular table of values divided into rows and columns. | ||||||
| > An m×n matrix has m rows and n columns. | ||||||
| > Given a matrix A, we write Ai,j | ||||||
| > to indicate the value found at the intersection of row i and column j. | ||||||
|
|
||||||
| > Say that we have a collection of DNA strings, | ||||||
| > all having the same length n. | ||||||
| > Their profile matrix is a 4×n matrix P in which P1,j represents | ||||||
| > the number of times that 'A' occurs in the jth position of one of the strings, | ||||||
| > P2,j represents the number of times that 'C' occurs in the jth position, | ||||||
| > and so on (see below). | ||||||
|
|
||||||
| > A consensus string c is a string of length n | ||||||
| > formed from our collection by taking the most common symbol at each position; | ||||||
| > the jth symbol of c therefore corresponds to the symbol having the maximum value | ||||||
| > in the jth column of the profile matrix. | ||||||
| > Of course, there may be more than one most common symbol, | ||||||
| > leading to multiple possible consensus strings. | ||||||
| > | ||||||
| > ### DNA Strings | ||||||
| > ``` | ||||||
| > A T C C A G C T | ||||||
| > G G G C A A C T | ||||||
| > A T G G A T C T | ||||||
| > A A G C A A C C | ||||||
| > T T G G A A C T | ||||||
| > A T G C C A T T | ||||||
| > A T G G C A C T | ||||||
| > ``` | ||||||
| > | ||||||
| > ### Profile | ||||||
| > ``` | ||||||
| > A 5 1 0 0 5 5 0 0 | ||||||
| > C 0 0 1 4 2 0 6 1 | ||||||
| > G 1 1 6 3 0 1 0 0 | ||||||
| > T 1 5 0 0 0 1 1 6 | ||||||
| > ``` | ||||||
| > | ||||||
| > ### Consensus | ||||||
| > ``` | ||||||
| > A T G C A A C T | ||||||
| > ``` | ||||||
| > | ||||||
| > **Given:** | ||||||
| > A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format. | ||||||
| > | ||||||
| > **Return:** | ||||||
| > A consensus string and profile matrix for the collection. | ||||||
| > (If several possible consensus strings exist, | ||||||
| > then you may return any one of them.) | ||||||
| > | ||||||
| > **Sample Dataset** | ||||||
| > | ||||||
| > ``` | ||||||
| > >Rosalind_1 | ||||||
| > ATCCAGCT | ||||||
| > >Rosalind_2 | ||||||
| > GGGCAACT | ||||||
| > >Rosalind_3 | ||||||
| > ATGGATCT | ||||||
| > >Rosalind_4 | ||||||
| > AAGCAACC | ||||||
| > >Rosalind_5 | ||||||
| > TTGGAACT | ||||||
| > >Rosalind_6 | ||||||
| > ATGCCATT | ||||||
| > >Rosalind_7 | ||||||
| > ATGGCACT | ||||||
| > ``` | ||||||
| > | ||||||
| > **Sample Output** | ||||||
| > ``` | ||||||
| > ATGCAACT | ||||||
| > A: 5 1 0 0 5 5 0 0 | ||||||
| > C: 0 0 1 4 2 0 6 1 | ||||||
| > G: 1 1 6 3 0 1 0 0 | ||||||
| > T: 1 5 0 0 0 1 1 6 | ||||||
| > ``` | ||||||
|
|
||||||
|
|
||||||
| The first step is to read in the input FASTA. | ||||||
| In this case, we are not reading an actual FASTA file, | ||||||
| but a set of strings in FASTA format. | ||||||
| If we were reading an actual FASTA file, | ||||||
| the [FASTX.jl](https://github.com/BioJulia/FASTX.jl) package could handle that for us. | ||||||
|
|
||||||
| Parsing this input was already demonstrated in the [GC-content tutorial](./05-gc.md), | ||||||
| so we can borrow the function from that tutorial. | ||||||
|
|
||||||
| ```julia | ||||||
|
|
||||||
| fake_file = IOBuffer(""" | ||||||
| >Rosalind_1 | ||||||
| ATCCAGCT | ||||||
| >Rosalind_2 | ||||||
| GGGCAACT | ||||||
| >Rosalind_3 | ||||||
| ATGGATCT | ||||||
| >Rosalind_4 | ||||||
| AAGCAACC | ||||||
| >Rosalind_5 | ||||||
| TTGGAACT | ||||||
| >Rosalind_6 | ||||||
| ATGCCATT | ||||||
| >Rosalind_7 | ||||||
| ATGGCACT | ||||||
| """ | ||||||
| ) | ||||||
|
|
||||||
| function parse_fasta(buffer) | ||||||
| records = [] # this is a Vector of type `Any` | ||||||
| record_name = "" | ||||||
| sequence = "" | ||||||
| for line in eachline(buffer) | ||||||
| if startswith(line, ">") | ||||||
| !isempty(record_name) && push!(records, (record_name, sequence)) | ||||||
| record_name = lstrip(line, '>') | ||||||
| sequence = "" | ||||||
| else | ||||||
| sequence *= line | ||||||
| end | ||||||
| end | ||||||
| push!(records, (record_name, sequence)) | ||||||
| return records | ||||||
| end | ||||||
|
|
||||||
| records = parse_fasta(fake_file) | ||||||
| ``` | ||||||
|
|
||||||
| Once the FASTA is read in, we can iterate over the records and store each nucleotide sequence as a row of a data matrix. | ||||||
|
|
||||||
| From there, we can generate the profile matrix. | ||||||
| For each nucleotide, we count how many times it appears in each column of the data matrix. | ||||||
|
|
||||||
| Then, we can identify the most common nucleotide in each column of the data matrix, | ||||||
| which gives the character at the corresponding position of the consensus string. | ||||||
| After doing this for all columns of the data matrix, | ||||||
| we can join those characters into the consensus string. | ||||||
|
|
||||||
|
|
||||||
| ```julia | ||||||
| using DataFrames | ||||||
|
|
||||||
| function consensus(fasta_string) | ||||||
|
|
||||||
| # extract strings from fasta | ||||||
| records = parse_fasta(fasta_string) | ||||||
|
|
||||||
| # make a vector of sequence strings | ||||||
| data_vector = last.(records) | ||||||
|
|
||||||
| # convert data_vector to matrix where each column is a character position and each row is a string | ||||||
| data_matrix = reduce(vcat, permutedims.(collect.(data_vector))) | ||||||
|
|
||||||
| # make profile matrix | ||||||
| consensus_matrix_list = Vector{Int64}[] | ||||||
| for nuc in ['A', 'C', 'G', 'T'] | ||||||
| nuc_count = vec(sum(x->x==nuc, data_matrix, dims=1)) | ||||||
|
||||||
Suggested change:
- nuc_count = vec(sum(x->x==nuc, data_matrix, dims=1))
+ nuc_count = vec(sum(==(nuc), data_matrix, dims=1))
Lots of functions are defined so that you don't need an anonymous function like that. It works because `==(thing)` returns a function that's essentially `x -> x == thing`:

```julia
julia> f = ==(1)
(::Base.Fix2{typeof(==), Int64}) (generic function with 2 methods)

julia> f(1)
true

julia> f(2)
false
```
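For what it's worth, here is a minimal standalone sketch of how the `==(nuc)` form could produce the whole profile matrix and a consensus string for the sample dataset. The names `seqs`, `nucs`, `profile`, and `consensus_string` are made up for illustration and are not taken from the PR. The sketch also skips FASTA parsing and uses only base Julia rather than DataFrames, so treat it as one possible shape, not the tutorial's implementation.

```julia
# Sample sequences from the problem statement (FASTA headers omitted for brevity).
seqs = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
        "TTGGAACT", "ATGCCATT", "ATGGCACT"]

# One row per sequence, one column per position.
data_matrix = permutedims(reduce(hcat, collect.(seqs)))

# Row order of the profile matrix (hypothetical name, chosen for this sketch).
nucs = ['A', 'C', 'G', 'T']

# `==(nuc)` is the partially applied form of `x -> x == nuc`.
# Summing it over dims=1 counts the matches in each column,
# and stacking the four 1×n results gives the 4×n profile matrix.
profile = reduce(vcat, [sum(==(nuc), data_matrix, dims=1) for nuc in nucs])

# `argmax` returns the first maximum, so ties go to the earlier row (A, C, G, T).
consensus_string = join(nucs[argmax(col)] for col in eachcol(profile))

println(consensus_string)
for (nuc, row) in zip(nucs, eachrow(profile))
    println(nuc, ": ", join(row, " "))
end
```

Because the problem accepts any consensus string when there are ties, the first-maximum tie-break is sufficient.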
danielle-pinto marked this conversation as resolved.