Skip to content
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
227 changes: 227 additions & 0 deletions rosalind/10-cons.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,227 @@
+++
using Dates
date = Date("2026-03-02")
title = "Problem 10: Consensus and Profile"
rss_descr = "Solving Rosalind problem CONS — finding a consensus string from a collection of DNA strings — using base Julia, DataFrames, and matrix operations"
+++

# Consensus and Profile

🤔 [Problem link](https://rosalind.info/problems/cons/)

> **The Problem**
>
> A matrix is a rectangular table of values divided into rows and columns.
> An m×n matrix has m rows and n columns.
> Given a matrix A, we write Ai,j.
> to indicate the value found at the intersection of row i and column j.

> Say that we have a collection of DNA strings,
> all having the same length n.
> Their profile matrix is a 4×n matrix P in which P1,
> j represents the number of times that 'A' occurs in the jth position of one of the strings,
> P2,j represents the number of times that C occurs in the jth position,
> and so on (see below).

> A consensus string c is a string of length n
> formed from our collection by taking the most common symbol at each position;
> the jth symbol of c therefore corresponds to the symbol having the maximum value
> in the j-th column of the profile matrix.
> Of course, there may be more than one most common symbol,
> leading to multiple possible consensus strings.
>
> ### DNA Strings
> ```
> A T C C A G C T
> G G G C A A C T
> A T G G A T C T
> A A G C A A C C
> T T G G A A C T
> A T G C C A T T
> A T G G C A C T
> ```
>
> ### Profile
> ```
> A 5 1 0 0 5 5 0 0
> C 0 0 1 4 2 0 6 1
> G 1 1 6 3 0 1 0 0
> T 1 5 0 0 0 1 1 6
> ```
>
> ### Consensus
> ```A T G C A A C T```
>
> **Given:**
> A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
>
> **Return:**
> A consensus string and profile matrix for the collection.
> (If several possible consensus strings exist,
> then you may return any one of them.)
>
> **Sample Dataset***
>
> ```
> >Rosalind_1
> ATCCAGCT
> >Rosalind_2
> GGGCAACT
> >Rosalind_3
> ATGGATCT
> >Rosalind_4
> AAGCAACC
> >Rosalind_5
> TTGGAACT
> >Rosalind_6
> ATGCCATT
> >Rosalind_7
> ATGGCACT
> ```
>
> **Sample Output**
> ```
> ATGCAACT
> A: 5 1 0 0 5 5 0 0
> C: 0 0 1 4 2 0 6 1
> G: 1 1 6 3 0 1 0 0
> T: 1 5 0 0 0 1 1 6
> ```


The first thing we will need to do is read in the input fasta.
In this case, we will not be reading in an actual fasta file,
but a set of strings in fasta format.
If we were reading in an actual fasta file,
we could use the [FASTX.jl](https://github.com/BioJulia/FASTX.jl) package to help us with that.

Since the task required here is something that was already demonstrated in the [GC-content tutorial](./05-gc.md),
we can borrow the function from that tutorial.

```julia

fake_file = IOBuffer("""
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
"""
)

function parse_fasta(buffer)
records = [] # this is a Vector of type `Any`
record_name = ""
sequence = ""
for line in eachline(buffer)
if startswith(line, ">")
!isempty(record_name) && push!(records, (record_name, sequence))
record_name = lstrip(line, '>')
sequence = ""
else
sequence *= line
end
end
push!(records, (record_name, sequence))
return records
end

records = parse_fasta(fake_file)
```

Once the fasta is read in, we can iterate over each sequence/record and store its nucleotide sequence in a data matrix.

From there, we can generate the profile matrix.
We'll need to sum the number of times each nucleotide appears at a particular column of the data matrix.

Then, we can identify the most common nucleotide at each column of the data matrix,
which represent each index of the consensus string.
After doing this for all columns of the data matrix,
we can generate the consensus string.


```julia
using DataFrames

function consensus(fasta_string)

# extract strings from fasta
records = parse_fasta(fasta_string)

# make a vector of sequence strings
data_vector = last.(records)

# convert data_vector to matrix where each column is a character position and each row is a string
data_matrix = reduce(vcat, permutedims.(collect.(data_vector)))

# make profile matrix
consensus_matrix_list = Vector{Int64}[]
for nuc in ['A', 'C', 'G', 'T']
nuc_count = vec(sum(x->x==nuc, data_matrix, dims=1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nuc_count = vec(sum(x->x==nuc, data_matrix, dims=1))
nuc_count = vec(sum(==(nuc), data_matrix, dims=1))

Lots of functions are defined so as not to need an anonymous function like that. It works because ==(thing) returns a function that's essentially x-> x == thing

julia> f = ==(1)
(::Base.Fix2{typeof(==), Int64}) (generic function with 2 methods)

julia> f(1)
true

julia> f(2)
false

push!(consensus_matrix_list, nuc_count)
end

consensus_matrix = vcat(consensus_matrix_list)

# convert matrix to DF and add row names for nucleotides
consensus_df = DataFrame(consensus_matrix, ["A", "C", "G", "T"])


# make column with nucleotide with the max value
# argmax returns the index or key of the first one encountered
nuc_max_df = transform(consensus_df, AsTable(:) => ByRow(argmax) => :MaxColName)

# return consensus string
return join(nuc_max_df.MaxColName)

end

consensus(fake_file)
```

As mentioned in the problem description above,
it is possible that there can be multiple consensus strings,
as some nucleotides may appear the same number of times
in each column of the data matrix.

If this is the case,
the function we are using (`argmax`) returns the nucleotide with the most occurrences that it first encounters.

The way our function is written,
we first scan for 'A', 'C', then 'G' and 'T',
so the final consensus string will be biased towards more A's, then C's, G's and T's.
This is simply based on which nucleotide counts it will encounter first in the profile matrix.

In the example below, there are equal number of sequences that are all `A`'s and `G`'s,
so the consensus string could be either `AAAAAAAA` or `GGGGGGGG`.
However, because our solution scans for `A` first,
the consensus string returned will be `AAAAAAAA`.

```julia
fake_file2 = IOBuffer("""
>Rosalind_1
AAAAAAAA
>Rosalind_2
AAAAAAAA
>Rosalind_3
AAAAAAAA
>Rosalind_4
GGGGGGGG
>Rosalind_5
GGGGGGGG
>Rosalind_6
GGGGGGGG
"""
)

consensus(fake_file2)
```