How to correct nt2aa to skip codons with gaps?

Asked by Kendall
on 19 Jan 2013


I am using MATLAB to analyze a large number of gene sequences in a .fasta file. Part of my analysis then requires the amino acid sequences coded by the genes. I am using the nt2aa function in MATLAB. However, at least one of the sequences has a gap in at least one of its codons (A-A). As such, I am receiving the following error:

"Error using nt2aa (line 116) The sequence includes a codon A-A containing a gap. Gaps are supported only when a complete codon is made up of gaps (---)."

Any suggestions as to how I may be able to get around this? I am very hesitant to start messing with MATLAB's nt2aa function.

1 Answer

Answer by Cedric Wannaz
on 19 Jan 2013
Edited by Cedric Wannaz
on 21 Jan 2013

I don't know nt2aa, but I just had a quick look. Do you want to:

  • Modify nt2aa so it eliminates codons with gaps? Not sure what the license says about it, but I guess that it could be done.
  • Find a specialist who could tell you how to do it correctly with the Bioinformatics Toolbox? In that case, you might want to check what folks from the newsgroup have to say. It is certainly possible, maybe even with nt2aa itself, as it seems to have features for managing ambiguous sequences.
  • Build some solution by yourself to pre-process or post-process your codons/AA chains?

If you are game for the last option, we can discuss some solution a bit in the style of this post.

For example, if you have your codons in a cell array like

 NT = {'AAA','AAT','AAG','A-T','AGC','--G'} ;
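If your sequence is still one long char vector (as it would be after reading the .fasta file), here is a minimal sketch of how to cut it into such a cell array. The example sequence is made up, and I am assuming the reading frame starts at the first nucleotide:

```matlab
% Hypothetical example sequence; in practice it would come from your .fasta file.
seq = 'AAAAATAAGA-TAGC--G';

% Drop trailing nucleotides that do not form a complete codon.
nFull = 3 * floor(numel(seq) / 3);
seq = seq(1:nFull);

% reshape gives a 3-by-N char matrix with one codon per column; transposing
% puts one codon per row, and cellstr turns each row into a cell.
NT = cellstr(reshape(seq, 3, []).').';
```

The final transpose only makes NT a row cell array, so that it matches the layout used in the rest of this answer.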

you can easily find cells that contain a codon with one or more '-':

 >> hasDash = cellfun(@(x)any(x=='-'), NT)
 hasDash =  0     0     0     1     0     1
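One refinement, going by the error message quoted in the question: nt2aa supports a codon made up entirely of gaps (---), so you may only want to flag codons that mix gaps and nucleotides. A minimal sketch (the name isPartialGap and the example cell array are mine, for illustration only):

```matlab
% Made-up example that includes one complete-gap codon.
NT = {'AAA','---','A-T','AGC','--G'};

% Flag a codon only if it contains a gap but is not made up of gaps alone.
isPartialGap = cellfun(@(x) any(x == '-') && ~all(x == '-'), NT);
```

Here only 'A-T' and '--G' are flagged; '---' is left in place so that nt2aa can deal with it.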

and remove these cells:

 >> NTclean = NT ;                               % In case you want to keep 
 >> NTclean(hasDash) = []                        % the original cell array.
 NTclean = 'AAA'    'AAT'    'AAG'    'AGC'

Then you can feed nt2aa with the 'cleaned' version of NT:

 >> AAclean = nt2aa(NTclean)
 AAclean = 'K'    'N'    'K'    'S'
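I have not checked how every version of nt2aa handles a cell array as input. If yours complains, here are two alternatives that rely only on nt2aa accepting a char vector (a sketch, it requires the Bioinformatics Toolbox):

```matlab
NTclean = {'AAA','AAT','AAG','AGC'};

% Option 1: join the codons into one char vector and translate in one call.
AAstr = nt2aa([NTclean{:}]);

% Option 2: translate codon by codon, keeping one cell per codon.
AAclean = cellfun(@nt2aa, NTclean, 'UniformOutput', false);
```

Option 1 returns a single char vector of amino acids; option 2 keeps the one-cell-per-codon layout that the next step relies on.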

If you wanted to insert empty cells in AAclean afterwards at locations where there were codons with gaps (to have a record), you could do as follows:

 >> validId = find(~hasDash) ;                   % Indices of gap-free codons.
 >> AA = cell(1, numel(NT)) ;                    % One (empty) cell per original codon.
 >> AA(validId) = AAclean(:)
 AA = 'K'    'N'    'K'    []    'S'    []
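If you would rather end up with a single char vector than with a cell array, you could mark the skipped codons with a placeholder character instead of an empty cell. A minimal sketch, where the choice of 'X' as placeholder is my assumption (pick whatever your downstream analysis expects) and AAclean is typed in by hand so that the snippet runs on its own:

```matlab
NT      = {'AAA','AAT','AAG','A-T','AGC','--G'};
hasDash = cellfun(@(x) any(x == '-'), NT);
AAclean = {'K','N','K','S'};            % Translation of the gap-free codons.

% Start with placeholders everywhere, then overwrite the valid positions.
AAchar = repmat('X', 1, numel(NT));
AAchar(~hasDash) = [AAclean{:}];
```

This gives 'KNKXSX', which keeps the amino acids aligned with the original codon positions.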




