Problem 939. DNA Pattern Match: Performance Metric - Speed

Created by Richard Zapor

The challenge is to rapidly find all matches of DNA sequences of length 6 in a DNA file 1,800,000 bases long.

At IMACST, the paper "An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern" claimed an astounding time improvement from 9.94 seconds to 7.84 seconds, a 21% time reduction, for matching six segments of length 6 in a 1.8M-long DNA file. Basic probability predicts about 1.8M/4^6 * 6 = 2637 matches; the paper's test case produced 2346 matches. The paper's method used text processing in C++. The paper's L=25 and L=50 cases will be later challenges.

MATLAB can match a set of six L=6 patterns in under 15 msec (i5/16GB). This is merely a 99.8% time reduction.
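The figures quoted above can be checked with a few lines of arithmetic. The following is a small Python sketch, not part of the original challenge. It assumes uniformly random bases, and it assumes the 99.8% figure is measured against the paper's faster 7.84 second time, which the text does not state. All variable names are illustrative.

```python
# Expected number of matches for uniformly random bases (an assumption).
dna_length = 1_800_000   # bases in the DNA file
pattern_length = 6       # L
num_patterns = 6         # patterns searched per block
alphabet_size = 4        # A, C, G, T

# A given start position matches one given pattern with probability 1/4**6.
p_match = 1 / alphabet_size ** pattern_length

# Like the estimate in the text, this treats all 1.8M positions as starts.
expected_matches = dna_length * p_match * num_patterns
print(round(expected_matches))      # 2637

# Time reduction reported by the paper: 9.94 s -> 7.84 s.
paper_reduction = (9.94 - 7.84) / 9.94
print(f"{paper_reduction:.0%}")     # 21%

# Time reduction of a 15 msec run relative to 7.84 s (assumed baseline).
matlab_reduction = 1 - 0.015 / 7.84
print(f"{matlab_reduction:.1%}")    # 99.8%
```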

Challenge Description: DNA is made up of the letters ACGT (see the Wikipedia article on DNA), which for the purposes of this MATLAB Cody Challenge are given the values 0 through 3 (A=0, C=1, G=2, T=3).

Input: [DNA, DNA_ID, Patterns]

  • DNA is a 1.8M-long uint8 row vector of values 0 through 3.
  • DNA_ID identifies the DNA segment being processed. Multiple calls using the same DNA_ID will be performed. The first call of a DNA_ID is not timed.
  • Patterns is an Nx6 uint8 array where each row corresponds to a search pattern.
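Because the first call for a given DNA_ID is not timed, one plausible reading is that any expensive preprocessing of DNA can be done once and reused on later calls. The Python sketch below illustrates that idea only. The names `_cache`, `get_preprocessed`, and `demo_preprocess` are hypothetical, and in MATLAB the analogous mechanism would typically be a `persistent` variable.

```python
# Hypothetical store mapping a DNA_ID to data precomputed from that DNA.
_cache = {}

def get_preprocessed(dna, dna_id, preprocess):
    """Run preprocess(dna) once per dna_id and reuse the result afterwards."""
    if dna_id not in _cache:
        _cache[dna_id] = preprocess(dna)   # paid only on the first call
    return _cache[dna_id]

# Demonstration with a stand-in preprocessing step that counts its calls.
calls = []

def demo_preprocess(dna):
    calls.append(1)
    return tuple(dna)

first = get_preprocessed([0, 1, 2, 3], "demo", demo_preprocess)
second = get_preprocessed([0, 1, 2, 3], "demo", demo_preprocess)
print(len(calls))   # 1
```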

Output: Locations

Locations contains every start index in DNA at which any of the patterns matches.

Scoring: Average Time (msec) for a block of L=6 patterns

Example:

  • DNA= [0 1 2 3 3 2 1 0 1 2 3 3]
  • Pattern = [2 3 3 2 1 0]
  • Locations = [3]
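To make the expected output concrete, here is a brute-force reference in Python that reproduces the example above. It is a correctness sketch only, not a fast solution and not the challenge's MATLAB code. The function name `find_matches_naive` is illustrative, and the 1-based indexing is inferred from the example.

```python
def find_matches_naive(dna, patterns):
    """Return 1-based start indices where any pattern row matches dna."""
    wanted = {tuple(p) for p in patterns}   # set gives fast membership tests
    width = len(patterns[0])                # all patterns share one length
    return [i + 1                           # convert 0-based to 1-based
            for i in range(len(dna) - width + 1)
            if tuple(dna[i:i + width]) in wanted]

dna = [0, 1, 2, 3, 3, 2, 1, 0, 1, 2, 3, 3]
patterns = [[2, 3, 3, 2, 1, 0]]
print(find_matches_naive(dna, patterns))    # [3]
```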

Hints:

  • Vectorization (Base 4 to Base 10 of 6 character words)
  • Bitshift/Reshape to create all words
  • Logical Indexing
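The hints above can be sketched as follows: treat each 6-letter word as a 6-digit base-4 number (0 to 4095), compute that number for every start position, then mark the wanted pattern codes in a 4096-entry lookup table and keep the positions whose word is marked. This is one possible reading of the hints, written as plain Python loops rather than vectorized MATLAB. The function names `encode_words` and `find_matches` are hypothetical.

```python
def encode_words(dna, width=6):
    """Base-4 value of the width-letter word starting at each position."""
    top = 4 ** (width - 1)
    words = []
    value = 0
    for i, base in enumerate(dna):
        # Drop the oldest digit, shift left one base-4 digit, append the new one.
        value = (value % top) * 4 + base
        if i >= width - 1:
            words.append(value)
    return words

def find_matches(words, patterns, width=6):
    """Return 1-based start indices whose word equals any pattern's code."""
    wanted = [False] * (4 ** width)         # lookup table over all codes
    for pattern in patterns:
        code = 0
        for base in pattern:
            code = code * 4 + base          # base 4 to base 10
        wanted[code] = True
    return [i + 1 for i, w in enumerate(words) if wanted[w]]

dna = [0, 1, 2, 3, 3, 2, 1, 0, 1, 2, 3, 3]
words = encode_words(dna)                   # depends only on dna, so reusable
print(find_matches(words, [[2, 3, 3, 2, 1, 0]]))   # [3]
```

Since `words` depends only on the DNA and not on the patterns, it could be computed during the untimed first call, leaving only the table lookup for the timed calls.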

Coming soon: Genome DNA sequencing of Phage PhiX174 and Haemophilus influenzae

Solution Statistics

7 correct solutions, 11 incorrect solutions
Last solution submitted on Aug 25, 2013