Fast Gene Sequence Search for Very Large Data File

version 1.6 (3.85 KB) by

Search for a specific sequence and record a neighboring code sequence with an offset position.

1 Download

A research fellow at Harvard asked me to write a program to search for a gene sequence, such as ‘TCC’, and record the next 4 codes. The data file was 14 GB. He had tried some MATLAB code, but the system either froze or kept running without finishing.

I first tested a loop-based method (V1.0). It would have taken about a month to process the 14 GB file on my 1.8 GHz Core 2 Duo PC with 3 GB RAM. I then rewrote it to use matrix operations, which cut the run time to 1.3 hours on the 1.8 GHz PC, or 40 minutes on my 2.33 GHz Core 2 Duo PC with 2 GB RAM. It outperformed all the code he had obtained in Python or other languages.
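For readers who want to see the shape of the task, here is a minimal sketch in Python. It is not the MATLAB code from this submission and does not use its matrix method. It swaps in a different technique: the file is read in fixed-size chunks, and the library's built-in substring search replaces a per-character loop, so memory use stays bounded on very large files. The function name `find_following_codes` and its parameters are hypothetical, and the sketch assumes the sequence is stored as contiguous single-byte codes with no line breaks inside it.

```python
import io


def find_following_codes(stream, pattern=b"TCC", n_next=4, chunk_size=1 << 20):
    """Yield (offset, following) for each occurrence of pattern in a binary stream.

    offset is the 0-based byte position of the match in the stream and
    following is the n_next bytes right after it. Matches too close to the
    end of the data to have n_next following bytes are not reported.
    (Hypothetical helper for illustration only.)
    """
    span = len(pattern) + n_next  # bytes needed to report one complete hit
    buf = b""
    base = 0  # absolute stream offset of buf[0]
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        buf += chunk
        pos = 0
        while True:
            i = buf.find(pattern, pos)
            if i < 0:
                # No further match: keep only a tail that could start a match
                # once the next chunk is appended.
                pos = max(pos, len(buf) - len(pattern) + 1)
                break
            if i + span > len(buf):
                # Match found, but its following codes are in the next chunk.
                pos = i
                break
            yield base + i, buf[i + len(pattern):i + span]
            pos = i + 1  # step by one so overlapping matches are also found
        base += pos
        buf = buf[pos:]


# Small in-memory example; a real run would pass open(path, "rb") instead.
sample = io.BytesIO(b"AATCCGATTACATCCTCCAAGG")
hits = list(find_following_codes(sample))
```

Carrying the unprocessed tail of each chunk into the next one ensures that a match straddling a chunk boundary is still found and reported only once.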

I have posted the file here in the hope that it will be useful to others in the same situation.

Comments and Ratings (2)

chrish

Well done, nice work.

Binlin Wu

Any comment or feedback is welcome.

Updates

1.6

Function renamed and all names made consistent.

1.2

Added a parameter nHL, which specifies how many header lines to remove from the start of the data. It is 0 by default.
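As an illustration of what a header-line count such as nHL does, here is a minimal Python sketch. It is not the submission's MATLAB code; the function name `skip_header_lines` and the parameter `n_header_lines` are hypothetical, and the sample header text is made up for the example.

```python
import io


def skip_header_lines(stream, n_header_lines=0):
    """Advance a binary stream past its first n_header_lines lines.

    Returns the number of bytes skipped, so that match offsets can be
    reported relative to the start of the file if needed.
    (Hypothetical helper for illustration only.)
    """
    skipped = 0
    for _ in range(n_header_lines):
        line = stream.readline()
        if not line:  # the file has fewer lines than requested
            break
        skipped += len(line)
    return skipped


# Example: one header line followed by sequence data.
stream = io.BytesIO(b">header\nAATCCGATT\n")
n_skipped = skip_header_lines(stream, n_header_lines=1)
remaining = stream.read()
```

With the default of 0, the stream is left untouched, which matches the default behavior described above.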

MATLAB Release
MATLAB 7.11 (R2010b)
