File Exchange

Fast Gene Sequence Search for Very Large Data File

version (3.85 KB) by Binlin Wu
Searches for a specific sequence and records a neighboring code sequence at a given offset from each match.


Updated 01 Jul 2011

A research fellow at Harvard asked me to write a program that searches a data file for a gene sequence, such as ‘TCC’, and records the next 4 codes after each match. The data file was 14 GB. He had tried some MATLAB code, but the system either froze or kept running without ever finishing.

I first tested a loop-based method (V1.0). At that rate, processing the 14 GB file would have taken about a month on my 1.8 GHz Core 2 Duo PC with 3 GB of RAM. I then rewrote it to use matrix operations instead of a loop. That version takes only about 1.3 hours on the 1.8 GHz PC, or 40 minutes on my 2.33 GHz Core 2 Duo PC with 2 GB of RAM. It was faster than any of the code he had obtained in Python or other languages.
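The task described above (find every occurrence of a pattern in a file too large for memory and record the codes that follow it) can be illustrated with a minimal sketch. This is not the submitted MATLAB function. It is a Python illustration that reads the data in fixed-size chunks and carries the unprocessed tail of each chunk forward, so matches that straddle a chunk boundary are not lost. The function name `find_following`, its parameters, and the chunk size are all hypothetical, and the sketch assumes the sequence is a continuous run of characters with no line breaks and that `offset` is non-negative.

```python
import io


def find_following(stream, pattern, n_next=4, offset=0, chunk_size=1 << 20):
    """Yield the n_next characters located `offset` characters after each match.

    Hypothetical helper for illustration only. `stream` is any text stream
    with a read(size) method. Matches too close to the end of the data to
    have n_next following characters are skipped.
    """
    # Characters needed from the start of a match to the end of its record.
    need = len(pattern) + offset + n_next
    tail = ""
    while True:
        chunk = stream.read(chunk_size)
        buf = tail + chunk
        # Largest match start whose full record is already inside buf.
        limit = len(buf) - need
        i = buf.find(pattern)
        while i != -1 and i <= limit:
            start = i + len(pattern) + offset
            yield buf[start:start + n_next]
            i = buf.find(pattern, i + 1)  # step by 1 to allow overlapping matches
        if not chunk:  # end of data
            break
        # Keep only the part whose match starts have not been processed yet.
        tail = buf[max(limit + 1, 0):]


if __name__ == "__main__":
    demo = io.StringIO("ATCCGATTACATCCAAAATCC")
    # A tiny chunk size forces matches to straddle chunk boundaries.
    print(list(find_following(demo, "TCC", n_next=4, chunk_size=5)))
```

For a real file, the stream would come from `open(path, "r")`. A larger chunk size means fewer read calls at the cost of more memory per chunk.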

I am posting the file here in the hope that it will be useful to others in the same situation.

Comments and Ratings (2)


chrish

Well done, nice work.

Binlin Wu

Any comment or feedback is welcome.


Function renamed and all names made consistent.

Added a parameter nHL, which specifies how many header lines to remove from the start of the data. It is 0 by default.
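As a rough illustration of what a header-line count like nHL does, the Python sketch below discards a given number of leading lines from a stream before any searching begins. It is not the submitted MATLAB code. The name `skip_header_lines` and its parameter are hypothetical, and the default of 0 mirrors the default described above.

```python
import io


def skip_header_lines(stream, n_header_lines=0):
    """Advance a text stream past its first n_header_lines lines.

    Hypothetical helper for illustration only. Stops early if the
    stream has fewer lines than requested.
    """
    for _ in range(n_header_lines):
        if not stream.readline():  # empty string means end of data
            break
    return stream


if __name__ == "__main__":
    data = io.StringIO(">sample header\nATCCGATT\n")
    skip_header_lines(data, 1)
    print(data.read())  # only the sequence line remains
```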

MATLAB Release Compatibility
Created with R2010b
Compatible with any release
Platform Compatibility
Windows macOS Linux
