Counting occurrences of a specific character in a cell array

Asked by Tony

on 19 Jan 2013

Hi guys,

I want to count repeated occurrences of characters in a cell array,

e.g.

AAA AAT AAG AAT AGC ACG

I want something that automatically identifies and counts the occurrences.

Could anyone give me some help?

1 Comment

Walter Roberson

on 19 Jan 2013

So 'AAA' would be 3 'A's? Or do you mean that you want to count the number of 'AAA', the number of 'AAT', and so on?

The cell arrays: is there one entry per cell, or are they blank-separated strings that need to be broken up?

1 Answer

Answer by Cedric Wannaz

on 19 Jan 2013
Edited by Cedric Wannaz

on 26 Jan 2013

Assuming that these are codons (three uppercase letters), here are three "not-very-orthodox" solutions, just for fun. Keep in mind that, with bioinformatics being a hot topic, there are quite a few specialized libraries out there (e.g. http://www.mathworks.com/help/bioinfo/functionlist.html) that would do the job much better. You might also get a more orthodox version from someone else once you answer Walter's comment.

For the example, assume the following (the solutions work for any cell array of three-letter uppercase codes):

 C = {'AAA','AAT','AAG','AAT','AGC','ACG'} ;
 n = numel(C) ;
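If the data arrive as one blank-separated string instead (the case Walter asks about), a minimal sketch for getting to the cell array assumed above could look like this. The variable `s` is a hypothetical input, not something from the question:

```matlab
% Hypothetical input: all codes in one blank-separated string.
s = 'AAA AAT AAG AAT AGC ACG';

% Trim the ends, then split on runs of whitespace into a 1 x n cell array.
C = regexp(strtrim(s), '\s+', 'split');
n = numel(C);
```

After this, `C` and `n` have the same form as in the example and can be fed to any of the three solutions.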

1. Probably the most efficient of these non-orthodox solutions (~0.58s for processing 1 million codons on my poor laptop):

 D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]) ;

2. Closely followed by a "sparse" version:

 D = sparse([C{:}]-64, reshape([1;1;1]*(1:n), 1, []), ones(1,3*n), 26, n) ;

3. And finally a much less efficient cell2mat/cellfun:

 D = cell2mat(cellfun(@(code)accumarray(code.'-64, 1, [26,1]), C, ...  
                      'UniformOutput', false)) ;

All three produce a 26 x #codes matrix whose columns are the distributions of the 26 letters of the alphabet for each code, with row index = letter ID, A=1,...,Z=26 (the sparse version produces a sparse matrix):

 >> D
 D =
     3     2     2     2     1     1
     0     0     0     0     0     0
     0     0     0     0     1     1
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     1     0     1     1
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     1     0     1     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
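If what you need is one total per letter over the whole cell array, or a count of each whole code (the other reading in Walter's comment), the sketch below may help. It rebuilds `D` with the first solution so that it runs on its own; the names `total`, `present`, `letters`, `counts`, `codes` and `codeCounts` are just illustrative:

```matlab
C = {'AAA','AAT','AAG','AAT','AGC','ACG'};
n = numel(C);
D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]);

% Total count of each letter over the whole cell array.
total   = sum(D, 2);                 % 26 x 1, row index = letter ID
present = find(total > 0);           % IDs of the letters that actually occur
letters = char(present.' + 64);      % 'ACGT' for this example
counts  = total(present).';          % [11 2 3 2] for this example

% Alternative reading: count whole codes rather than single letters.
[codes, ~, idx] = unique(C);         % distinct codes, sorted
codeCounts = accumarray(idx(:), 1).';% [1 1 2 1 1] for this example
```

Here `codes` comes out as {'AAA','AAG','AAT','ACG','AGC'}, so 'AAT' is the only code that appears twice.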

Note that the 3rd version doesn't assume three-letter codes and works with arbitrary code lengths. The first two versions could be adapted to have this flexibility.
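One possible way to adapt the accumarray version to codes of unequal length is sketched here. It assumes `repelem` is available (R2015a or later), and the cell array is a made-up example rather than data from the question:

```matlab
% Hypothetical codes of unequal length (uppercase letters only).
C = {'AAA','AT','AAGC','T'};
n = numel(C);

len = cellfun(@numel, C);            % number of letters in each code
col = repelem(1:n, len);             % column index for every single letter
row = double([C{:}]) - 64;           % letter IDs, A=1,...,Z=26

% Same accumulation as before, with per-letter column indices.
D = accumarray([row; col].', 1, [26 n]);
```

The only change from the fixed-length version is that the column indices are built from the actual code lengths instead of a hard-coded block of three. The same `row` and `col` vectors could be passed to `sparse(row, col, 1, 26, n)` for the sparse variant.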

Cheers,

Cedric
