## Counting occurrences of a specific character in a cell array

Tony

on 19 Jan 2013

Hi guys,

I want to count repeated occurrences of characters in a cell array,

e.g.

AAA AAT AAG AAT AGC ACG

I want something that automatically identifies and counts the occurrences.

Could anyone give me some help?

Walter Roberson

on 19 Jan 2013

So 'AAA' would be 3 'A's? Or do you mean that you want to count the number of 'AAA', the number of 'AAT', and so on?

The cell arrays: is there one entry per cell, or are they blank-separated strings that need to be broken up?

Cedric Wannaz

on 19 Jan 2013
Edited by Cedric Wannaz on 26 Jan 2013

Assuming that these are amino acids/codons (3 uppercase letters), here are three "not-very-orthodox" solutions, just for fun. But keep in mind that with bioinformatics being a hot topic, there are quite a few very specialized libs out there (e.g. http://www.mathworks.com/help/bioinfo/functionlist.html) that would do the job in a much better fashion. You might also get a more orthodox version from someone else once you answer Walter's comment.

Take the following as an example (the same code works for any cell array of 3-uppercase-letter codes):

```
C = {'AAA','AAT','AAG','AAT','AGC','ACG'} ;
n = numel(C) ;
```

1. Probably the most efficient of these non-orthodox solutions (~0.58s for processing 1 million codons on my poor laptop):

` D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]) ;`

2. Closely followed by a "sparse" version:

` D = sparse([C{:}]-64, reshape([1;1;1]*(1:n), 1, []), ones(1,3*n), 26, n) ;`

3. And finally a much less efficient cell2mat/cellfun:

```
D = cell2mat(cellfun(@(code)accumarray(code.'-64, 1, [26,1]), C, ...
    'UniformOutput', false)) ;
```

All three produce a 26 x #codes matrix whose columns are the distributions of the 26 letters of the alphabet for each code, with row index = letter ID (A=1, ..., Z=26). The sparse version returns the same values stored as a sparse matrix:

```
>> D
D =
3     2     2     2     1     1
0     0     0     0     0     0
0     0     0     0     1     1
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     1     0     1     1
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     1     0     1     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
0     0     0     0     0     0
```
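Since the accumarray one-liner is fairly dense, here is a rough step-by-step sketch of the same idea. The intermediate names (`letters`, `rowIdx`, `colIdx`, `totals`) are only illustrative and do not appear in the solutions above; the input is assumed to contain uppercase A-Z only.

```matlab
C = {'AAA','AAT','AAG','AAT','AGC','ACG'} ;
n = numel(C) ;

letters = [C{:}] ;                          % 1 x 3n char row: 'AAAAATAAG...'
rowIdx  = double(letters) - 64 ;            % letter ID: 'A' -> 1, ..., 'Z' -> 26
colIdx  = reshape([1;1;1]*(1:n), 1, []) ;   % code index: 1 1 1 2 2 2 ... n n n

% Each (rowIdx, colIdx) pair adds 1 to the matching cell of a 26 x n matrix.
D = accumarray([rowIdx; colIdx].', 1, [26 n]) ;

% Optional: total count of each letter over all codes (26 x 1).
totals = sum(D, 2) ;
```

Row 1 of `D` then holds the number of A's in each code, and `totals` collapses the per-code counts into one count per letter if that is what is needed.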

Note that the 3rd version doesn't assume 3-letter codes and would work with arbitrary code lengths. The first 2 versions could be adapted to have this flexibility.
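As one possible way to adapt the accumarray version to codes of different lengths, the fixed `[1;1;1]*(1:n)` index can be replaced by an index built from the actual length of each code. This is only a sketch: it assumes uppercase A-Z input, uses `repelem` (which needs R2015a or later, so it was not available when this thread was written), and the sample cell array and the names `len`, `rowIdx`, `colIdx` are made up for illustration.

```matlab
C = {'AAA','AT','AAGC','T'} ;       % hypothetical codes of different lengths
n = numel(C) ;

len    = cellfun(@numel, C) ;       % length of each code: [3 2 4 1]
rowIdx = double([C{:}]) - 64 ;      % letter IDs, A=1, ..., Z=26
colIdx = repelem(1:n, len) ;        % code index k repeated len(k) times

% Same accumulation as before, now with one column per variable-length code.
D = accumarray([rowIdx; colIdx].', 1, [26 n]) ;
```

Each column of `D` sums to the length of the corresponding code, which is a quick sanity check. The same `rowIdx`/`colIdx` pair could be passed to `sparse` together with `ones(1, numel(rowIdx))` to adapt the second version.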

Cheers,

Cedric