Fast Gene Sequence Search for Very Large Data File
by Binlin Wu
26 Jun 2011
(Updated 01 Jul 2011)
Search for a specific sequence and record a neighboring code sequence with an offset position.
| File Information |
| Description |
A research fellow at Harvard asked me to write a program that searches for a gene sequence, such as ‘TCC’, and records the next 4 codes. The data file was 14 GB. He had tried some MATLAB code, and the system either froze or kept running without ever finishing.
I first tested a loop-based method (V1.0). At that rate it would have taken a month to process the 14 GB of data on my 1.8 GHz Core 2 Duo/3 GB RAM PC. I then rewrote it to use matrix operations. That version takes only 1.3 hours on my 1.8 GHz PC, or 40 minutes on my 2.33 GHz Core 2 Duo/2 GB RAM PC. It was faster than any of the code he had obtained in Python or other languages.
I am posting the file here in the hope that it will be useful to people in the same situation. |
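For readers who want to see the shape of the task before downloading the file, here is a minimal sketch in Python. It is not the submitted MATLAB function and does not reproduce its matrix method. It only illustrates the described search (find a pattern, record the codes that follow) on a file too large to load at once. The function name, every parameter name, the reading of "offset" as a number of codes skipped after the match, and the assumption that line breaks are not part of the sequence are all hypothetical. In MATLAB, the built-in strfind returns all starting indices of a pattern in one call and would be the natural vectorized counterpart.

```python
def find_following(path, pattern=b"TCC", n_next=4, offset=0,
                   n_header_lines=0, chunk_size=1 << 20):
    """Yield the n_next codes that follow each occurrence of pattern.

    Hypothetical helper, not the author's code. Assumptions:
    - offset is the number of codes skipped between the match and the record
    - n_header_lines plays the role of the nHL parameter in the update notes
    - newlines and carriage returns are layout, not sequence data
    """
    need = len(pattern) + offset + n_next   # bytes needed for one full record
    tail = b""                              # unprocessed bytes carried over
    with open(path, "rb") as f:
        for _ in range(n_header_lines):     # skip header lines
            f.readline()
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk.replace(b"\n", b"").replace(b"\r", b"")
            limit = len(buf) - need         # last start index with a full record
            pos = buf.find(pattern)
            while pos != -1 and pos <= limit:
                start = pos + len(pattern) + offset
                yield buf[start:start + n_next].decode("ascii")
                pos = buf.find(pattern, pos + 1)   # +1 allows overlapping matches
            # Keep the start positions not yet examined for the next chunk.
            tail = buf[max(limit + 1, 0):]
    # A match too close to the end of the file to have n_next codes
    # after it is dropped.
```

Because the function is a generator, results can be written out or counted as they arrive instead of being held in memory, which matters for a multi-gigabyte input. For example, `for codes in find_following("data.txt", n_header_lines=1): ...` would process a file with one header line, where "data.txt" is a placeholder name.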
| MATLAB release |
MATLAB 7.11 (2010b)
|
| Updates |
| 27 Jun 2011 |
Added a parameter nHL, which specifies how many header lines to skip at the start of the data. It is 0 by default. |
| 01 Jul 2011 |
Function renamed and all names made consistent. |