Counting occurrences of a specific character in a cell array

Tony

on 19 Jan 2013

Hi guys,

I want to count repeated occurrences of characters in a cell array,

e.g.

AAA AAT AAG AAT AGC ACG

I want something that automatically identifies and counts the occurrences.

Could anyone give me some help?

Walter Roberson

on 19 Jan 2013

So 'AAA' would be 3 'A's? Or do you mean that you want to count the number of 'AAA', the number of 'AAT', and so on?

The cell arrays: is there one entry per cell, or are they blank-separated strings that need to be broken up?

Cedric Wannaz

on 19 Jan 2013
Edited by Cedric Wannaz

on 26 Jan 2013

Assuming that these are amino acids/codons (3 uppercase letters), here are three "not-very-orthodox" solutions, just for fun. Keep in mind that, with bioinformatics being a hot topic, there are quite a few very specialized libraries out there (e.g. http://www.mathworks.com/help/bioinfo/functionlist.html) that would do the job much better. You might also get a more orthodox version from someone else once you answer Walter's comment.

For the example, assume the following (the solutions work for any cell array of 3-letter uppercase codes):

```
C = {'AAA','AAT','AAG','AAT','AGC','ACG'} ;
n = numel(C) ;
```

1. Probably the most efficient of these non-orthodox solutions (~0.58s for processing 1 million codons on my poor laptop):

` D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]) ;`

2. Closely followed by a "sparse" version:

` D = sparse([C{:}]-64, reshape([1;1;1]*(1:n), 1, []), ones(1,3*n), 26, n) ;`

3. And finally a much less efficient cell2mat/cellfun version:

```
D = cell2mat(cellfun(@(code)accumarray(code.'-64, 1, [26,1]), C, ...
    'UniformOutput', false)) ;
```
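For readers who find the one-liner in solution 1 hard to parse, it can be unpacked roughly as follows. This is only a sketch: the intermediate names `letters`, `rowIdx` and `colIdx` are introduced here for illustration, and it assumes that every code has exactly 3 uppercase letters A to Z.

```matlab
C = {'AAA','AAT','AAG','AAT','AGC','ACG'};
n = numel(C);

% Concatenate all codes into one 1-by-3n char row: 'AAAAATAAG...'
letters = [C{:}];

% Letter ID of each character: 'A' -> 1, ..., 'Z' -> 26
rowIdx = double(letters) - 64;

% Index of the code each character belongs to: 1 1 1 2 2 2 ...
colIdx = reshape(repmat(1:n, 3, 1), 1, []);

% Add 1 at (letter ID, code index) for every character
D = accumarray([rowIdx; colIdx].', 1, [26 n]);
```

The sparse version uses the same two index vectors and passes them to `sparse` as row and column subscripts, relying on the fact that `sparse` sums the values given for repeated subscripts.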

All three produce a 26-by-n matrix (n = number of codes) whose columns are the distributions of the 26 letters of the alphabet for each code, with row index = letter ID, A=1,..,Z=26 (the sparse version produces a sparse matrix):

```
>> D
D =
3     2     2     2     1     1
0     0     0     0     0     0
0     0     0     0     1     1
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     1     0     1     1
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     1     0     1     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
```
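As a possible way of reading this matrix (a sketch only; the variable names `countA`, `totals` and `present` are made up for this illustration), the counts for one letter are a single row of `D`, and summing along the rows gives the totals over all codes:

```matlab
C = {'AAA','AAT','AAG','AAT','AGC','ACG'};
n = numel(C);
D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]);

% Occurrences of 'A' in each code (row 1, since 'A'-64 == 1)
countA = D('A'-64, :);

% Total occurrences of each letter over all codes (26-by-1)
totals = sum(D, 2);

% Print only the letters that actually occur
present = find(totals > 0);
for k = present.'
    fprintf('%c: %d\n', char(k + 64), totals(k));
end
```

For the example data, `countA` is the first row of the matrix shown above, and only A, C, G and T have nonzero totals.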

Note that the 3rd version doesn't assume 3-letter codes and would work with arbitrary code lengths. The first 2 versions could be adapted to have this flexibility.
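One way the first version might be adapted is sketched below. It assumes that `C` is a row cell array of non-empty uppercase codes; the example data and the names `len`, `starts`, `rowIdx` and `colIdx` are hypothetical. The only real change is that the code index is built from the actual code lengths instead of a fixed block of three.

```matlab
C = {'AAA','AT','AAGC','T'};   % hypothetical codes of mixed length
n = numel(C);

len    = cellfun(@numel, C);            % number of letters in each code
rowIdx = double([C{:}]) - 64;           % letter IDs, 'A' -> 1, ..., 'Z' -> 26

% Code index of each letter: mark the first letter of every code with a 1,
% then take the cumulative sum (requires every code to be non-empty).
starts = cumsum([1, len(1:end-1)]);
colIdx = zeros(1, sum(len));
colIdx(starts) = 1;
colIdx = cumsum(colIdx);

D = accumarray([rowIdx; colIdx].', 1, [26 n]);
```

The same `rowIdx` and `colIdx` could be passed to `sparse` to adapt the second version, with `ones(1, sum(len))` as the values.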

Cheers,

Cedric