MATLAB Answers

0

searching through RNA sequence for a string of characters

Asked by John Jamison on 24 Apr 2017
Latest activity Edited by dpb
on 25 Apr 2017
Hi all,
I am creating a program that gets a user input (txt file) of DNA nucleotides (random lists of A, G, C, T), and gives an output file of the RNA complement (got this done), the percentages of RNA nucleotides (got this done), and asks the user if they would like to search through the RNA sequence for a particular string of nucleotides—if the string exists in the RNA file, it will tell the location in the sequence and repeat that given string.
example: the user inputs "UACG," and my program says that string of basepairs occurred at location 59.
How can I do this? My current program operates by searching through the file on a per character basis, and converts each DNA basepair to an RNA basepair.
Thank you

  0 Comments

Sign in to comment.

3 Answers

Answer by Joseph Cheng
on 24 Apr 2017

so, do you store the full compliment data in an array such that you can
%gen dummy data
comp = 'AGCT';
list = comp(randi(4,1,100))
position = strfind(list,'CAT')
where position will be the index in the string that has CAT in it?

  2 Comments

I dont believe so. Below is some of what I do:
[base, num] = fread(dnaFile, 1, 'char');
fwrite(rnaFile, out);
well then i would suggest that you do, unless the array is too large. if you read in the whole text file into an array you should be able to then use strfind to find the position, otherwise its a tricky swapping in and out of a buffer as you search for consecutive matches by looking one by one

Sign in to comment.


Answer by dpb
on 24 Apr 2017
Edited by dpb
on 24 Apr 2017

Again, all the reason in the world to read the whole string at once as recommended earlier.
But, the answer to the question posed, given what you did previously is simply
idx=strfind(out,ASKEDFORSTRING);
AFTER the loop is complete. Fuggettabout trying to do this on a character-at-a-time basis.
idx will be empty if the sequence doesn't exist, or a vector of 1 to N position(s) where the sequence is found.
Done.
I can't emphasize enough how you should convert this from the looping you're doing to processing the input string as a vector...you'll find things will go much faster after you do, the code will become almost trivial by comparison and much simpler to develop/debug/maintain.

  6 Comments

AGAIN!! THAT DOES NOT HELP!
Post the code and attach the data file--I posted results that DID work with the input string so the problem is either you have a coding error in implementation or the data file isn't just an 8-bit character stream.
But, I can't operate your console to see, you've got to supply the requisite information if you want help.
Again, ATTACH! the m-file and the input file that created the above and I'll look at it. Otherwise, I'm continuing to just waste time/effort trying to help. :(
My code is the same as from the previous post. Here is my dna.txt file that I use for my file
"My code is the same as from the previous post"
Can't be--you had to have processed the input string instead of continuing to try to read a file character-by-character or you would have had an FEOF error on the first next read.
We need to see the changes you did make...for the NaN result, it looks like you never collected a single count so the result was 0/0 for all.
ADDENDUM
OK, the input file is OK; one hurdle cleared.
What you need to do is after the above read statement, if you're still hung up on looping solution, replace the while loop with a counted for loop
for i=1:length(base)
and remove the fread statement at the end of the loop entirely.
Oh, just dawned on me--if you left the code alone, since base now refers to the entire string not just a single character, the various if clauses will never be satisfied. You must address each character in the string in turn:
for i=1:length(base)
if base(i)=='C'
...
etc., etc., etc., ...
But, of course, the thing to do when you read the full string is to use the vectorized solution that I showed you...there's where the real simplification comes from.

Sign in to comment.


Answer by dpb
on 25 Apr 2017
Edited by dpb
on 25 Apr 2017

OK, while this isn't really answering the question of the thread (I did that up above), I fixed your script to operate on the full string. This isn't fully vectorized yet but goes a fair step along the way eliminating the unnecessary big loop and switch block for the lookup table approach.
% set up lookup table inputs
DNA=['C','G','T','A']; % base characters in DNA sequence
RNA=['G','C','A','U']; % corresponding characters in RNA
% get input file, do bookkeeping for output file, etc., ..
filename=uigetfile('*.txt','Select DNA sequence file');
if filename==0, error('User Canceled'),end
[dnaFile, message] = fopen(filename, 'r');
% %if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
outputFile = [filename(1:end-4) '.rna'];
else
outputFile = [filename '.rna'];
end
[rnaFile, message] = fopen(outputFile, 'w');
% now read full string into char array
dna=fread(dnaFile,'*char').'; % orient as row vector
fclose(dnaFile); clear dnaFile % we're done with input file
% and translate DNA --> RNA
rna=RNA(arrayfun(@(c) find(c==DNA),dna));
% compute counts, frequencies
counts=zeros(size(RNA)).';
for i=1:length(RNA)
counts(i)=sum(rna==RNA(i));
end
freqs=100*counts/sum(counts);
% output results
fprintf('Base pair frequencies: \n')
for i=1:length(RNA)
fprintf('%c: %6.2f\n',RNA(i),freqs(i));
end
fwrite(rnaFile, rna);
rnaFile=fclose(rnaFile); clear rnaFile
Sample run from your dna.txt--
> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
>>
The searching would be on the variable rna as outlined above after getting the desired sequence for which to search from the user.
I attached the m-file as well...HTH.
ADDENDUM
OK, I added the lookup; using the search string you typed in the original question the nul result is the expected answer...there is no sequence like that in the output string. I ran one and picked one I knew would work and results were:
>> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
Enter RNA sequence of interest: aauu
Sequence AAUU locations (empty report-->not found)
Occurrence Location
1 3
2 16
3 28
>>
So, the lookup does work if there is anything to be found.

  2 Comments

Thank you. How can I get it to search for a user input of RNA basepairs?
I have the following, and thought it would work, but it always returns a " [] ", which means it is finding nothing...
inputBase = input('Enter RNA sequence of interest: ', 's');
location = strfind(out, inputBase);
Well, syntax looks ok but again we can't debug what we can't see, sorry.
Same old song, umpteenth verse. SHOW us enough we can see what you did and what you used to do it with (both code and data)...we can't see your terminal from here.
This can be done simply from command line with the output string you operated on and the search string as a minimum...
ADDENDUM
You did notice if you used my updated code that the output variable is now a more expressive name rna, not out, I hope?
If you're still stuck on your old code, need what said above...show us the string you're searching and for what...remember strfind is case-sensitive.

Sign in to comment.