3 Downloads (last 30 days) File Size: 3.85 KB File ID: #31966

Fast Gene Sequence Search for Very Large Data File

by Binlin Wu

 

26 Jun 2011 (Updated 01 Jul 2011)

Search for a specific sequence and record a neighboring code sequence with an offset position.


File Information
Description

A research fellow at Harvard asked me to write a program to search for a gene sequence, such as ‘TCC’, and record the next 4 codes. The data file was 14 GB. He had tried some MATLAB code, but the system either froze or kept running without ever finishing.
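To make the task concrete, here is a minimal Python sketch of the search on an in-memory string. It is not the author's MATLAB code. The function name `record_following` and the `offset` parameter (how many codes to skip between the end of the match and the start of the recorded window) are assumptions made for illustration.

```python
def record_following(seq, motif, n_codes=4, offset=0):
    """Return the n_codes characters that follow each occurrence of motif.

    offset (hypothetical) is the number of characters skipped between the
    end of the motif and the start of the recorded window. Overlapping
    occurrences are all reported. Matches too close to the end of seq to
    yield a full window are skipped.
    """
    results = []
    start = seq.find(motif)
    while start != -1:
        begin = start + len(motif) + offset
        window = seq[begin:begin + n_codes]
        if len(window) == n_codes:
            results.append(window)
        # Resume one character later so overlapping matches are found too.
        start = seq.find(motif, start + 1)
    return results


if __name__ == "__main__":
    print(record_following("ATCCGATTTCCAAGG", "TCC"))
```

For the sample string above, the two occurrences of `TCC` are followed by `GATT` and `AAGG`.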

I first tested a loop-based method (V1.0). It turned out that it would take about a month to process the 14 GB of data on my 1.8 GHz Core 2 Duo/3 GB RAM PC. I then rewrote it to use matrix (vectorized) operations. With that change it takes only 1.3 hours on my 1.8 GHz PC, or 40 minutes on my 2.33 GHz Core 2 Duo/2 GB RAM PC. It outperformed all the code he had obtained in Python or other languages.
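The submission itself gets its speed from MATLAB matrix operations. The Python sketch below illustrates a related but different idea, under stated assumptions: read the file in fixed-size chunks so memory use stays bounded, and let the built-in `bytes.find` do the scanning instead of an interpreted per-character loop. The function name `scan_file`, the chunk size, and the assumption that the file is one continuous run of codes with no line breaks are all hypothetical.

```python
def scan_file(path, motif=b"TCC", n_codes=4, chunk_size=1 << 20):
    """Yield the n_codes bytes following each occurrence of motif in a file.

    Assumes the file holds one continuous sequence with no line breaks.
    Matches too close to the end of the file to have n_codes bytes after
    them are dropped.
    """
    need = len(motif) + n_codes  # bytes required for motif plus its record
    tail = b""                   # unprocessed bytes carried between chunks
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            # Report only matches whose full record already lies in buf.
            limit = len(buf) - need
            pos = buf.find(motif)
            while pos != -1 and pos <= limit:
                yield buf[pos + len(motif):pos + need]
                pos = buf.find(motif, pos + 1)
            # Keep the last need-1 bytes: any match starting there was not
            # reported yet and is re-examined with the next chunk.
            tail = buf[max(0, len(buf) - need + 1):]
```

Because the carried-over tail starts exactly one position past the last reportable match, records that straddle a chunk boundary are found once and never duplicated.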

I am posting the file here in the hope that it will be useful to people in the same situation.

MATLAB release: MATLAB 7.11 (R2010b)
Comments and Ratings (1)
29 Jun 2011 Binlin Wu

Any comment or feedback is welcome.

Updates
27 Jun 2011

Added a parameter nHL, which specifies how many header lines to remove from the data. The default is 0.
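As a rough Python illustration of what a header-skipping parameter like nHL does (not the author's MATLAB code; the name `skip_header_lines` is hypothetical), the reader can be advanced past the first lines before any searching starts:

```python
def skip_header_lines(fh, n_header_lines=0):
    """Advance a binary file object past its first n_header_lines lines.

    With the default of 0 the file position is left unchanged. If the file
    has fewer lines than requested, reading stops at end of file.
    """
    for _ in range(n_header_lines):
        if not fh.readline():
            break  # reached end of file early
    return fh
```

After the call, subsequent reads from `fh` start at the first line of actual sequence data.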

01 Jul 2011

Function renamed and all names made consistent.

Tag Activity for this File
Tag Applied By Date/Time
gene Binlin Wu 27 Jun 2011 09:01:48
gene sequence Binlin Wu 27 Jun 2011 09:01:48
search Binlin Wu 27 Jun 2011 09:01:48
large data Binlin Wu 27 Jun 2011 09:01:48
read data Binlin Wu 27 Jun 2011 09:42:13
statistics Binlin Wu 27 Jun 2011 15:45:45
data export Binlin Wu 27 Jun 2011 15:45:45
data import Binlin Wu 27 Jun 2011 15:45:45
biotech Binlin Wu 27 Jun 2011 15:45:45
biotech Igor 17 Jan 2012 13:49:05
