searching through RNA sequence for a string of characters

Question

0 votes

Hi all,

I am creating a program that gets a user input (txt file) of DNA nucleotides (random lists of A, G, C, T), and gives an output file of the RNA complement (got this done), the percentages of RNA nucleotides (got this done), and asks the user if they would like to search through the RNA sequence for a particular string of nucleotides—if the string exists in the RNA file, it will tell the location in the sequence and repeat that given string.

example: the user inputs "UACG," and my program says that string of basepairs occurred at location 59.

How can I do this? My current program operates by searching through the file on a per character basis, and converts each DNA basepair to an RNA basepair.

Thank you

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Joseph Cheng on 24 Apr 2017

Open in MATLAB Online

0 votes

so, do you store the full compliment data in an array such that you can

 %gen dummy data
    comp = 'AGCT';
list = comp(randi(4,1,100))
position = strfind(list,'CAT')

where position will be the index in the string that has CAT in it?

2 Comments
Show None Hide None

John Jamison on 24 Apr 2017

Open in MATLAB Online

I dont believe so. Below is some of what I do:

[base, num] = fread(dnaFile, 1, 'char'); 
fwrite(rnaFile, out);

Joseph Cheng on 24 Apr 2017

well then i would suggest that you do, unless the array is too large. if you read in the whole text file into an array you should be able to then use strfind to find the position, otherwise its a tricky swapping in and out of a buffer as you search for consecutive matches by looking one by one

Sign in to comment.

Answer 2

dpb on 24 Apr 2017

Edited: dpb on 24 Apr 2017

Open in MATLAB Online

0 votes

Again, all the reason in the world to read the whole string at once as recommended earlier.

But, the answer to the question posed, given what you did previously is simply

idx=strfind(out,ASKEDFORSTRING);

AFTER the loop is complete. Fuggettabout trying to do this on a character-at-a-time basis.

idx will be empty if the sequence doesn't exist, or a vector of 1 to N position(s) where the sequence is found.

Done.

I can't emphasize enough how you should convert this from the looping you're doing to processing the input string as a vector...you'll find things will go much faster after you do, the code will become almost trivial by comparison and much simpler to develop/debug/maintain.

6 Comments
Show 4 older comments Hide 4 older comments

dpb on 24 Apr 2017

Edited: dpb on 24 Apr 2017

Open in MATLAB Online

I beg your pardon???

It certainly worked here on a test case; you never provided sufficient data to test with one of your files and never posted any of your attempts with the actual code and whatever errors you experienced in context to be able to debug your attempts, either.

dna=fread(dnaFile,'*char');

as shown at <Calculating-nucleotide-percentages> will give you the full DNA character string in the file opened with file handle dnaFile as a character array dna if the data are series of characters in the file as indicated, coded as 8-bit characters. If the input file encoding is different, then will need to know that, but again, you never provided the requisite information to know for certain.

It may be that the reason you needed the else clause in the switch construct was that the input was actually from a double-byte unicode string so every other character you're reading is null as just one possibility; just don't know and can't tell without an actual file (one of which I've requested you post a number of times with no response to date).

John Jamison on 24 Apr 2017

dna.txt

My code is the same as from the previous post. Here is my dna.txt file that I use for my file

dpb on 24 Apr 2017

Edited: dpb on 25 Apr 2017

Open in MATLAB Online

"My code is the same as from the previous post"

Can't be--you had to have processed the input string instead of continuing to try to read a file character-by-character or you would have had an FEOF error on the first next read.

We need to see the changes you did make...for the NaN result, it looks like you never collected a single count so the result was 0/0 for all.

ADDENDUM

OK, the input file is OK; one hurdle cleared.

What you need to do is after the above read statement, if you're still hung up on looping solution, replace the while loop with a counted for loop

for i=1:length(base)

and remove the fread statement at the end of the loop entirely.

Oh, just dawned on me--if you left the code alone, since base now refers to the entire string not just a single character, the various if clauses will never be satisfied. You must address each character in the string in turn:

for i=1:length(base)
  if base(i)=='C'
   ...

etc., etc., etc., ...

But, of course, the thing to do when you read the full string is to use the vectorized solution that I showed you...there's where the real simplification comes from.

Sign in to comment.

Answer 3

dpb on 25 Apr 2017

Edited: dpb on 25 Apr 2017

Open in MATLAB Online

0 votes

dna2rna.m

OK, while this isn't really answering the question of the thread (I did that up above), I fixed your script to operate on the full string. This isn't fully vectorized yet but goes a fair step along the way eliminating the unnecessary big loop and switch block for the lookup table approach.

% set up lookup table inputs
DNA=['C','G','T','A'];  % base characters in DNA sequence
RNA=['G','C','A','U'];  % corresponding characters in RNA
% get input file, do bookkeeping for output file, etc., ..
filename=uigetfile('*.txt','Select DNA sequence file');
if filename==0, error('User Canceled'),end
[dnaFile, message] = fopen(filename, 'r');
% %if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
  outputFile = [filename(1:end-4) '.rna'];
else 
  outputFile = [filename '.rna'];
end
[rnaFile, message] = fopen(outputFile, 'w');
% now read full string into char array
dna=fread(dnaFile,'*char').';            % orient as row vector
fclose(dnaFile); clear dnaFile           % we're done with input file
% and translate DNA --> RNA
rna=RNA(arrayfun(@(c) find(c==DNA),dna));
% compute counts, frequencies
counts=zeros(size(RNA)).';
for i=1:length(RNA)
  counts(i)=sum(rna==RNA(i));
end
freqs=100*counts/sum(counts);
% output results
fprintf('Base pair frequencies: \n')
for i=1:length(RNA)
  fprintf('%c: %6.2f\n',RNA(i),freqs(i));
end
fwrite(rnaFile, rna);
rnaFile=fclose(rnaFile); clear rnaFile

Sample run from your dna.txt--

> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies: 
G:  14.63
C:  31.71
A:  26.83
U:  26.83
>>

The searching would be on the variable rna as outlined above after getting the desired sequence for which to search from the user.

I attached the m-file as well...HTH.

ADDENDUM

OK, I added the lookup; using the search string you typed in the original question the nul result is the expected answer...there is no sequence like that in the output string. I ran one and picked one I knew would work and results were:

>> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies: 
G:  14.63
C:  31.71
A:  26.83
U:  26.83
Enter RNA sequence of interest: aauu
Sequence AAUU locations (empty report-->not found)
Occurrence  Location
    1           3
    2          16
    3          28
>>

So, the lookup does work if there is anything to be found.

2 Comments
Show None Hide None

John Jamison on 25 Apr 2017

Open in MATLAB Online

Thank you. How can I get it to search for a user input of RNA basepairs?

I have the following, and thought it would work, but it always returns a " [] ", which means it is finding nothing...

inputBase = input('Enter RNA sequence of interest: ', 's');
location = strfind(out, inputBase);

dpb on 25 Apr 2017

Edited: dpb on 25 Apr 2017

Well, syntax looks ok but again we can't debug what we can't see, sorry.

Same old song, umpteenth verse. SHOW us enough we can see what you did and what you used to do it with (both code and data)...we can't see your terminal from here.

This can be done simply from command line with the output string you operated on and the search string as a minimum...

ADDENDUM

You did notice if you used my updated code that the output variable is now a more expressive name rna, not out, I hope?

If you're still stuck on your old code, need what said above...show us the string you're searching and for what...remember strfind is case-sensitive.

Sign in to comment.

searching through RNA sequence for a string of characters

0 Comments
Show -2 older comments Hide -2 older comments

Answers (3)

2 Comments
Show None Hide None

6 Comments
Show 4 older comments Hide 4 older comments

2 Comments
Show None Hide None

Categories

Products

Tags

Community Treasure Hunt

searching through RNA sequence for a string of characters

0 Comments Show -2 older comments Hide -2 older comments

Answers (3)

2 Comments Show None Hide None

6 Comments Show 4 older comments Hide 4 older comments

2 Comments Show None Hide None

Categories

Products

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

2 Comments
Show None Hide None

6 Comments
Show 4 older comments Hide 4 older comments

2 Comments
Show None Hide None