Counting occurrences of a specific character in a cell array

Asked by Tony on 19 Jan 2013

Hi guys,

I want to count repeated occurrences of characters in a cell array, e.g.

AAA AAT AAG AAT AGC ACG

Is there something that will automatically identify and count the occurrences?

Could anyone give me some help?

1 Comment

Walter Roberson on 19 Jan 2013

So 'AAA' would be 3 'A's? Or do you mean that you want to count the number of 'AAA', the number of 'AAT', and so on?

The cell arrays: is there one entry per cell, or are they blank-separated strings that need to be broken up?
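
To make the two readings above concrete, here is a minimal sketch. It assumes the input arrives as a single blank-separated string; the names `s`, `codes`, `counts` and `nA` are only illustrative, and `strsplit` needs R2013a or later. If there is already one code per cell, the splitting step can be skipped.

```matlab
% Assumed input form: one blank-separated string.
s = 'AAA AAT AAG AAT AGC ACG';
C = strsplit(s, ' ');              % 1x6 cell array, one code per cell

% Reading 1: count how often each whole code occurs.
[codes, ~, idx] = unique(C);       % codes come back sorted alphabetically
counts = accumarray(idx(:), 1);    % counts(k) = occurrences of codes{k}

% Reading 2: count a single letter over all codes.
nA = sum([C{:}] == 'A');           % number of 'A' characters in total
```

Here `idx(:)` forces a column vector, since `accumarray` treats a row vector of subscripts as a single multi-dimensional subscript.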


1 Answer

Answer by Cedric Wannaz on 19 Jan 2013
Edited by Cedric Wannaz on 26 Jan 2013

Assuming that these are amino acids/codons (3 uppercase letters), here are three "not-very-orthodox" solutions, just for fun. Keep in mind that, bioinformatics being a hot topic, there are quite a few specialized libraries out there (e.g. http://www.mathworks.com/help/bioinfo/functionlist.html) that would do the job much better. You might also get a more orthodox version from someone else once you answer Walter's comment.

Take the following as the example input (the solutions work for any cell array of 3-letter uppercase codes):

 C = {'AAA','AAT','AAG','AAT','AGC','ACG'} ;
 n = numel(C) ;

1. Probably the most efficient of these non-orthodox solutions (~0.58s for processing 1 million codons on my poor laptop):

 D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]) ;

2. Closely followed by a "sparse" version:

 D = sparse([C{:}]-64, reshape([1;1;1]*(1:n), 1, []), ones(1,3*n), 26, n) ;

3. And finally a much less efficient cell2mat/cellfun:

 D = cell2mat(cellfun(@(code)accumarray(code.'-64, 1, [26,1]), C, ...  
                      'UniformOutput', false)) ;

All three produce a 26 x n matrix (n = number of codes) whose columns give the distribution of the 26 letters of the alphabet for each code, with row index = letter ID (A=1, ..., Z=26). The sparse version returns a sparse matrix:

 >> D
 D =
     3     2     2     2     1     1
     0     0     0     0     0     0
     0     0     0     0     1     1
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     1     0     1     1
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     1     0     1     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
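
For readers who find the accumarray one-liner hard to parse, here is the same computation unpacked step by step, followed by two ways of reading the result. This is only a sketch; the intermediate names `letters`, `rowIdx`, `colIdx`, `countT` and `totals` are mine.

```matlab
C = {'AAA','AAT','AAG','AAT','AGC','ACG'};
n = numel(C);

letters = [C{:}];                          % 1x(3n) char: all codes joined
rowIdx  = double(letters) - 64;            % 'A' -> 1, ..., 'Z' -> 26 (ASCII 'A' is 65)
colIdx  = reshape([1;1;1]*(1:n), 1, []);   % 1 1 1 2 2 2 ... n n n
D = accumarray([rowIdx; colIdx].', 1, [26 n]);   % add 1 at each (letter, code) pair

% Reading the result:
countT = D('T' - 64, :);   % number of 'T' in each code (row 20)
totals = sum(D, 2);        % 26x1 totals per letter over all codes
```

Each letter contributes one (row, column) subscript pair, and `accumarray` simply sums the ones that land in the same cell.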

Note that the 3rd version doesn't assume 3-letter codes and works with codes of arbitrary length. The first two versions could be adapted to have the same flexibility.
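
One possible way to adapt the accumarray version to codes of varying length is sketched below. It is an assumption on my part rather than a tested, timed solution: it relies on `repelem` (R2015a or later) to build the column indices, and the example codes are hypothetical.

```matlab
C = {'AAA','AT','GATTACA','C'};     % hypothetical codes of varying length
n = numel(C);

len    = cellfun(@numel, C);        % length of each code, 1xn
rowIdx = double([C{:}]) - 64;       % letter IDs, 'A' -> 1, ..., 'Z' -> 26
colIdx = repelem(1:n, len);         % code k repeated len(k) times
D = accumarray([rowIdx; colIdx].', 1, [26 n]);
```

The only change from the fixed-length version is that the hard-coded `[1;1;1]*(1:n)` pattern is replaced by a repetition count per code, so each column of `D` sums to the length of its code.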

Cheers,

Cedric
