MATLAB Answers


Replacing characters with integers in a very long string

Asked by Paolo Binetti on 17 Dec 2016
Latest activity: commented on by Star Strider on 18 Dec 2016
I have a string of a few million characters and want to replace it with a vector of integers according to simple rules, such as 'C' = -1 and so forth. My implementation works, but it takes forever and uses gigabytes of memory, mainly because of the str2num function, as far as I understand. Is there a more efficient way to do this?
sequence = fileread('sourcefile.txt');
sequence_num = strrep(sequence, 'A', '0 ');
sequence_num = strrep(sequence_num,'C','-1 ');
sequence_num = strrep(sequence_num,'G', '1 ');
sequence_num = strrep(sequence_num,'T', '0 ');
sequence_num = regexprep(sequence_num,'\r\n','');
sequence_num = str2num(sequence_num);
sequence_num = int32(sequence_num);


1 Answer

Answer by Star Strider
on 17 Dec 2016
 Accepted Answer

I don’t know what structure ‘sequence’ has. I created it as a cell array here:
bases = {'A','C','T','G'}; % Cell Array
sequence = bases(randi(4, 1, 20)); % Create Data
skew = zeros(1, length(sequence)+1,'int32'); % Preallocate
Cix = find(ismember(sequence, 'C')); % Indices of 'C' (find returns positions, not a logical vector)
Gix = find(ismember(sequence, 'G')); % Indices of 'G'
skew(Cix+1) = -1; % Replace With Integer
skew(Gix+1) = +1; % Replace With Integer

  7 Comments

@Paolo: strrep is much faster than regexprep:
sequence = strrep(sequence, sprintf('\r\n'), '');
Another simplification:
bases = ['A','C','T','G'];
sequence = bases(randi(4, 1, 5000000));
skew = zeros(1, length(sequence), 'int32');
Cix = (sequence == 'C');
Gix = (sequence == 'G');
skew(Cix) = -1;
skew(Gix) = +1;
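A further variation on the same idea, offered only as a sketch and not taken from the thread: when more than two characters need their own values, a lookup table indexed by character code avoids one comparison per character class. The table name lut and the short sample string are illustrative assumptions, and the approach assumes every character in the input has a code between 1 and 256 (true for plain ASCII genome text).

```matlab
% Sketch only: "lut" and the sample string are illustrative assumptions,
% not code from the thread.
sequence = sprintf('ACGT\r\nGGCA\r\n');   % small stand-in for fileread output

% Drop carriage returns (13) and line feeds (10) in one pass with logical indexing.
sequence(sequence == char(13) | sequence == char(10)) = [];

% Build a 256-entry table indexed by character code.
% Characters that are not listed (here 'A' and 'T') map to 0.
lut = zeros(1, 256, 'int32');
lut(double('C')) = -1;
lut(double('G')) = 1;

% Use the character codes directly as indices into the table.
sequence_num = lut(double(sequence));
```

Because lut is preallocated as int32, sequence_num comes out as int32 without a separate conversion step, and adding a rule for another character is a single extra assignment into the table.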
Thank you @Star and @Jan. All in all, your help sped up my code about 700 times; it now takes 0.17 s for a bacterial genome. About 250 times is thanks to @Star's suggestions, and 3 more times thanks to @Jan's final simplification.
Our pleasure!
It is always more gratifying to help with real-world research. We wish you well!
