The challenge is to rapidly find matches of DNA sequences of length 6 in a DNA file 1,800,000 letters long.
At IMACST, the paper "An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern" claimed an astounding time improvement from 9.94 seconds to 7.84 seconds, a 21% time reduction, when matching six patterns of length 6 in a 1.8M long DNA file. Assuming uniformly random letters, basic probability predicts about 1.8M/4^6 * 6 = 2637 matches; the paper's test case produced 2346. The paper's method used text processing in C++. Its L=25 and L=50 cases will be later challenges.
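The match estimate above can be reproduced with a few lines of Matlab. This is only a sketch of the arithmetic: it assumes the four letters are equally likely and independent and that the six patterns are distinct, and the variable names are illustrative, not taken from the paper.

```matlab
% Expected number of matches under a uniform random model (assumption)
N = 1.8e6;   % DNA length quoted in the text
L = 6;       % pattern length
k = 6;       % number of patterns in the set

% Each start position matches one given pattern with probability 1/4^L.
% The text uses N start positions; the exact count is N-L+1, which
% changes the estimate by far less than one match.
expected = k * N / 4^L;

fprintf('Expected matches: %.1f\n', expected);   % prints 2636.7
```

The paper's observed 2346 matches is about 11% below this estimate, which is plausible for real DNA where letter frequencies are not uniform.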
Matlab can match a set of six L=6 patterns in under 15 msec (i5/16GB). Relative to the paper's 9.94 seconds, this is merely a 99.8% time reduction.
Challenge Description: DNA is made of the letters A, C, G and T (wiki: DNA), which for the purposes of this Matlab Cody Challenge are given the values 0 through 3 (ACGT = 0123).
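For readers starting from a character sequence, the ACGT = 0123 mapping can be sketched with a lookup table. The example sequence and the variable names below are hypothetical; the challenge itself supplies the DNA already encoded as numbers.

```matlab
% Map the characters A, C, G, T to the values 0, 1, 2, 3
seq = 'GATTACA';              % hypothetical example sequence

lut = zeros(1, 256);          % lookup table indexed by character code
lut(double('ACGT')) = 0:3;    % A->0, C->1, G->2, T->3

dna = lut(double(seq));       % numeric row vector
disp(dna)                     % 2 0 3 3 0 1 0
```

Any character other than A, C, G or T maps to 0 in this sketch, so real input should be validated first.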
Input: [DNA, DNA_ID, Patterns]
Output: Locations
Locations holds every start index in DNA at which any of the patterns matches.
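One possible approach, shown here as a minimal sketch and not as the reference solution, is to turn every length-L window of the DNA into a base-4 integer and compare those integers against the pattern codes. The layout of Patterns (one pattern per row) is an assumption, DNA_ID is left out because its role is not described here, and the toy data is hypothetical.

```matlab
% Toy inputs (hypothetical): values 0..3 stand for A, C, G, T
dna      = [0 1 2 3 0 1 2 3 0 1 2 3];
patterns = [0 1 2 3 0 1;      % one pattern per row (assumed layout)
            3 0 1 2 3 0];

dna      = double(dna(:).');  % force a double row vector
patterns = double(patterns);
L = size(patterns, 2);        % pattern length
n = numel(dna) - L + 1;       % number of candidate start positions

% Base-4 code of every length-L window, built with Horner's rule
codes = zeros(1, n);
for j = 1:L
    codes = 4 * codes + dna(j : j + n - 1);
end

% Base-4 code of each pattern, using the same place values
pcodes = patterns * (4 .^ (L-1:-1:0)).';

% Start indices whose window code equals any pattern code
locations = find(ismember(codes, pcodes));
disp(locations)               % 1 4 5
```

For L=6 the codes lie between 0 and 4095, so they are exact in double precision. The loop runs only L times over whole vectors, which keeps the work vectorized for a 1.8M long input.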
Scoring: Average Time (msec) for a block of L=6 patterns
Example:
Hints:
Coming soon: Genome DNA sequencing of Phage PhiX174 and Haemophilus influenzae