Problem 863. Genome Sequence 001: Introductory DNA Sequencing

Created by Richard Zapor

This Challenge series will gradually increase the complexity of genome DNA sequencing. DNA sequencing and the shotgun method will be naively simplified into Cody Challenges. The wiki page on genome sizes is also of interest.

DNA is represented by the symbols ACGT, which for MATLAB will be encoded as 0123. The basic goal is to reconstruct the original serial string of ACGT given multiple short segments. Segments are gleaned from multiple copies of the virus/bacteria/chromosome, so there are overlapping and duplicated segments. The created segments may contain errors and duplicated stretches. Chromosome 20, in its 59,187,298 base pairs, has a segment of 820 that is repeated in at least two locations. Because the data are non-random, duplicated stretches are much longer than they would be in a random sequence.
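The ACGT to 0123 encoding can be sketched as follows. This is a minimal Python illustration rather than MATLAB, and the names CODE, SYMBOL, encode, and decode are hypothetical helpers that do not come from the challenge itself.

```python
# Hypothetical helpers: the challenge only states that ACGT is encoded as 0123.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
SYMBOL = "ACGT"  # SYMBOL[v] is the letter for numeric value v


def encode(dna):
    """Turn a string of A/C/G/T into a list of values 0..3."""
    return [CODE[ch] for ch in dna]


def decode(values):
    """Turn a list of values 0..3 back into a string of A/C/G/T."""
    return "".join(SYMBOL[v] for v in values)
```

For instance, encode("ACGT") gives [0, 1, 2, 3], and decode reverses the mapping.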

Example: G = ACGTCGGCCATGGACATTACG

The three overlapping pieces ACGTCGGCCA, GGCCATGGACAT, and GACATTACG can readily be seen to overlap and re-create the original (s marks padding):

ACGTCGGCCATGGACATTACG
ACGTCGGCCAsssssssssss
sssssGGCCATGGACATssss
ssssssssssssGACATTACG
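The alignment above can be reproduced with a small suffix/prefix overlap merge. This is a Python sketch under the assumption that the pieces are already listed in left-to-right order; the function names overlap and merge are illustrative only.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge(pieces):
    """Join pieces that are already in order, dropping each overlapping prefix."""
    genome = pieces[0]
    for piece in pieces[1:]:
        genome += piece[overlap(genome, piece):]
    return genome


pieces = ["ACGTCGGCCA", "GGCCATGGACAT", "GACATTACG"]
result = merge(pieces)  # "ACGTCGGCCATGGACATTACG"
```

Here each neighbouring pair overlaps by five symbols (GGCCA, then GACAT), so the merged string has 10 + 7 + 4 = 21 symbols.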

The Genome_001 Challenge is to reconstruct a genome under ideal segment-creation conditions:

  1. Length of each segment - 48
  2. Segments begin at locations 1, 33, 65, ..., 32N+1 (N=0:K, L_Genome=32K+48)
  3. All segments are provided once (Essentially two copies of a genome were cut into pieces with overlaps)
  4. Segments read left to right (no flips)
  5. No errors in the segments
  6. Genome is random (no two segments share the same first 16 or the same last 16 symbols)
  7. Segments will be scrambled
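The conditions above can be mimicked to build test data. The following Python sketch assumes a uniformly random genome and uses hypothetical names (SEG_LEN, STEP, make_segments). It does not explicitly check condition 6, which a random genome satisfies with overwhelming probability.

```python
import random

SEG_LEN = 48  # condition 1: every segment has 48 values
STEP = 32     # condition 2: starts at 1, 33, 65, ... (1-based), i.e. 0, 32, 64, ... here


def make_segments(genome, rng):
    """Cut a genome of length 32K+48 into K+1 segments and scramble them."""
    k = (len(genome) - SEG_LEN) // STEP
    if len(genome) != STEP * k + SEG_LEN:
        raise ValueError("genome length must be 32K+48")
    segs = [genome[STEP * n: STEP * n + SEG_LEN] for n in range(k + 1)]
    rng.shuffle(segs)  # condition 7: segments are scrambled
    return segs


rng = random.Random(0)  # fixed seed so the sketch is repeatable
K = 5
genome = [rng.randrange(4) for _ in range(STEP * K + SEG_LEN)]  # values 0..3
segs = make_segments(genome, rng)  # K+1 = 6 rows of 48 values
```

Consecutive segments share 48 - 32 = 16 values, which is the overlap a solver can match on.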

Input: segs, an array of M rows, each row a 48-value segment. Values are [0, 1, 2, 3].

Output: Gout, Genome vector of values [0,1,2,3]

Example: [0 1 2 2; 2 2 3 1; 3 1 1 2] creates [0 1 2 2 3 1 1 2] (M=3, W=4, Overlap=2)
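One possible way to do the reconstruction is to chain segments by matching each segment's tail to the unique head that equals it. This is a Python sketch, not a MATLAB Cody solution and not necessarily the author's method. The function name reconstruct and its overlap parameter are assumptions; for the challenge data the overlap would be 48 - 32 = 16.

```python
def reconstruct(segs, overlap):
    """Rebuild the genome from scrambled, error-free segments.

    Assumes every segment's first `overlap` values are unique among heads
    and its last `overlap` values are unique among tails.
    """
    segs = [list(s) for s in segs]
    # Map each head (first `overlap` values) to the row that starts with it.
    head = {tuple(s[:overlap]): i for i, s in enumerate(segs)}
    tails = {tuple(s[-overlap:]) for s in segs}
    # The first segment is the one whose head is not the tail of any segment.
    cur = next(i for i, s in enumerate(segs) if tuple(s[:overlap]) not in tails)
    gout = list(segs[cur])
    for _ in range(len(segs) - 1):
        cur = head[tuple(segs[cur][-overlap:])]  # segment that continues the tail
        gout.extend(segs[cur][overlap:])         # append only the new values
    return gout


example = [[3, 1, 1, 2], [0, 1, 2, 2], [2, 2, 3, 1]]  # scrambled rows
gout = reconstruct(example, overlap=2)  # [0, 1, 2, 2, 3, 1, 1, 2]
```

The dictionary lookup keeps the work roughly linear in the number of segments, instead of comparing every pair of rows.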

Future: Flipped segments, Random Position of Segment start locations, Extra Segments, Phage Phi X174, Parallel Processing Simulation(Shot Gun Approach), Haemophilus Influenza, Sequence with Segment Errors, and Chromosome 20 with its 59M length using 100K 4K-segments (Matlab - 19.2 sec single thread benchmark)

Solution Statistics

2 correct solutions 2 incorrect solutions
Last solution submitted on Sep 18, 2013