Skip to content

Commit f4cb9c9

Browse files
add first pass at solution
1 parent 13a1e62 commit f4cb9c9

File tree

1 file changed

+94
-19
lines changed

1 file changed

+94
-19
lines changed

docs/src/rosalind/10-cons.md

Lines changed: 94 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,10 @@
22

33
🤔 [Problem link](https://rosalind.info/problems/cons/)
44

5-
!!! warning "The Problem"
5+
!!! warning "The Problem".
6+
67
A matrix is a rectangular table of values divided into rows and columns.
7-
An m×n matrix has m rows and ncolumns.
8+
An m×n matrix has m rows and n columns.
89
Given a matrix A, we write Ai,j.
910
to indicate the value found at the intersection of row i and column j.
1011

@@ -45,7 +46,8 @@
4546

4647
Return:
4748
A consensus string and profile matrix for the collection.
48-
(If several possible consensus strings exist, then you may return any one of them.)
49+
(If several possible consensus strings exist,
50+
then you may return any one of them.)
4951

5052
Sample Dataset
5153
>Rosalind_1
@@ -72,44 +74,117 @@
7274

7375

7476
The first thing we will need to do is read in the input fasta.
75-
In this case, we will not be reading in a fasta file,
76-
but a set of strings in fasta format.
77-
Once it is read in, we can iterate over the strings and store the strings in a data matrix.
77+
In this case, we will not be reading in an actual fasta file,
78+
but a set of strings in fasta format.
79+
If we were reading in an actual fasta file,
80+
we could use the [FASTX.jl](https://github.com/BioJulia/FASTX.jl) package to help us with that.
81+
82+
Since the task required here is something that was already demonstrated in the [GC-content tutorial](./05-gc.md),
83+
we can borrow the function from that tutorial.
84+
85+
```julia
86+
87+
fake_file = IOBuffer("""
88+
>Rosalind_1
89+
ATCCAGCT
90+
>Rosalind_2
91+
GGGCAACT
92+
>Rosalind_3
93+
ATGGATCT
94+
>Rosalind_4
95+
AAGCAACC
96+
>Rosalind_5
97+
TTGGAACT
98+
>Rosalind_6
99+
ATGCCATT
100+
>Rosalind_7
101+
ATGGCACT
102+
"""
103+
)
104+
105+
function parse_fasta(buffer)
106+
records = [] # this is a Vector of type `Any`
107+
record_name = ""
108+
sequence = ""
109+
for line in eachline(buffer)
110+
if startswith(line, ">")
111+
!isempty(record_name) && push!(records, (record_name, sequence))
112+
record_name = lstrip(line, '>')
113+
sequence = ""
114+
else
115+
sequence *= line
116+
end
117+
end
118+
push!(records, (record_name, sequence))
119+
return records
120+
end
121+
122+
records = parse_fasta(fake_file)
123+
```
124+
125+
Once the fasta is read in, we can iterate over each read and store its nucleotide sequence in a data matrix.
78126

79127
From there, we can generate the profile matrix.
80128
We'll need to sum the number of times each nucleotide appears at a particular row of the data matrix.
81129

82-
Then, we can identify the most common nucleotide at each column of the data matrix.
130+
Then, we can identify the most common nucleotide at each column of the data matrix,
131+
which represents each index of the consensus string.
83132
After we have done this for all columns of the data matrix,
84133
we can generate the consensus string.
85134

86-
It is possible that there can be multiple consensus strings,
87-
as some nucleotides may appear the same number of times
88-
in each column of the data matrix.
89-
If this is the case, we can return multiple consensus strings.
90-
91135

92136
```julia
137+
using DataFrames
93138

94-
function consensus(fasta)
95-
# read in strings in fasta format
139+
function consensus(fasta_string)
140+
141+
# extract strings from fasta
142+
records = parse_fasta(fasta_string)
96143

97-
data_matrix = []
98-
# iterate over strings and store in matrix
144+
# make a vector of just strings
145+
data_vector = last.(records)
99146

100-
# make consensus matrix
147+
# convert data_vector to matrix where each column is a char and each row is a string
148+
data_matrix = reduce(vcat, permutedims.(collect.(data_vector)))
101149

150+
# make profile matrix
102151

103-
# make consensus string
152+
## Is it possible to do this in a more efficient vectorized way? I wanted to see if we could do countmap() for each column in a simple way that would involve looping over each column. I think this ended up being more efficient since we are just looping over each of the nucleotides
104153

154+
consensus_matrix_list = Vector{Int64}[]
155+
for nuc in ['A', 'C', 'G', 'T']
156+
nuc_count = vec(sum(x->x==nuc, data_matrix, dims=1))
157+
push!(consensus_matrix_list, nuc_count)
158+
end
105159

160+
consensus_matrix = vcat(consensus_matrix_list)
106161

162+
# convert matrix to DF and add row names for nucleotides
163+
consensus_df = DataFrame(consensus_matrix, ["A", "C", "G", "T"])
107164

108165

166+
# make column with nucleotide with max value
167+
# argmax returns the index or key of the first one encountered
168+
nuc_max_df = transform(consensus_df, AsTable(:) => ByRow(argmax) => :MaxColName)
109169

170+
# return consensus string
171+
return join(nuc_max_df.MaxColName)
110172

173+
end
111174

175+
consensus(fake_file)
176+
```
177+
178+
As mentioned in the problem description above,
179+
it is possible that there can be multiple consensus strings,
180+
as some nucleotides may appear the same number of times
181+
in each column of the data matrix.
112182

183+
If this is the case,
184+
the function we are using (`argmax`) returns the nucleotide with the most occurences that it first encounters.
113185

186+
The way our function is written,
187+
we first scan for 'A', 'C', then 'G' and 'T',
188+
so the final consensus string will be biased towards more A's, then C's, G's and T's.
189+
This simply based on which nucleotide counts it will encounter first in the profile matrix.
114190

115-
```

0 commit comments

Comments
 (0)