Problem 912. Genome Sequence 002: Introductory DNA Sequencing (Flipped Segments)
This challenge series will gradually increase the complexity of genome DNA sequencing. DNA Sequencing and the Shotgun Method will be naively simplified into Cody challenges. The wiki page "Genome sizes" is also worth a look.
DNA is represented by the symbols ACGT, which for MATLAB will be encoded as 0123 (A=0, C=1, G=2, T=3). The basic goal is to reconstruct the original serial string of ACGT from multiple short segments. Segments are gleaned from multiple copies of the virus, bacterium, or chromosome, so they may overlap, duplicate one another, and be flipped. The created segments may also contain errors and duplicated stretches. For example, Chromosome 20, with its 59,187,298 base pairs, has an 820-base segment that is repeated in at least two locations. Because the data are non-random, duplicated stretches are considerably longer than they would be in a random sequence.
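The encoding described above can be sketched in a few lines. This is an illustration in Python rather than a MATLAB submission, and the helper names `encode` and `decode` are hypothetical, not part of the challenge.

```python
# Mapping stated in the problem: A=0, C=1, G=2, T=3.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def encode(dna):
    """Turn a string such as 'ACGT' into a list of values 0..3."""
    return [ENCODE[base] for base in dna]


def decode(values):
    """Turn a list of values 0..3 back into an ACGT string."""
    return "".join(DECODE[v] for v in values)


print(encode("ACGT"))
print(decode([0, 1, 2, 2, 3, 1, 1, 2]))
```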
Example: Genome = ACGTCGGCCATGGACATTACG
Given three overlapping pieces, ACGTCGGCCA, TACAGGTACCGG, and GACATTACG, it is easy to see that they overlap and recreate the original once the middle piece is recognized as being flipped left-right. In the alignment below, s marks padding.
ACGTCGGCCATGGACATTACG
ACGTCGGCCAsssssssssss
sssssTACAGGTACCGGssss Middle
sssssGGCCATGGACATssss Middle Reversed
ssssssssssssGACATTACG
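The alignment above can be checked mechanically. The sketch below, in Python for illustration, assumes the genome is already known (which is not the case in the actual challenge) and simply looks for each piece in either orientation. The function name `locate` is hypothetical.

```python
def locate(genome, piece):
    """Return (offset, flipped) for where piece sits in genome.

    offset is 0-based; flipped is True when only the left-right
    reversed piece matches. Returns None if neither orientation fits.
    """
    offset = genome.find(piece)
    if offset >= 0:
        return offset, False
    offset = genome.find(piece[::-1])
    if offset >= 0:
        return offset, True
    return None


genome = "ACGTCGGCCATGGACATTACG"
for piece in ["ACGTCGGCCA", "TACAGGTACCGG", "GACATTACG"]:
    offset, flipped = locate(genome, piece)
    shown = piece[::-1] if flipped else piece
    # Pad with 's' on both sides, as in the diagram above.
    print("s" * offset + shown + "s" * (len(genome) - offset - len(shown)))
```

Note that "flipped" here means a plain left-right reversal of the symbols, as in the example, not a reverse complement.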
The Genome_002 challenge is to reconstruct a genome under near-ideal segment creation conditions. Some of the segments will be reversed, and the output may be reversed as well.
- Segments may be flipped (Genome_002 change)
- Length of each segment: 48
- Segments begin at locations 1, 33, 65, ..., 32N+1 (N=0:K, L_Genome=32K+48)
- All segments are provided once (Essentially two copies of a genome were cut into pieces with overlaps)
- No errors in the segments
- Genome is random (no two segments share the same first 16 symbols, and no two share the same last 16 symbols)
- Segment order will be scrambled
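To make the conditions above concrete, here is a minimal sketch of how such test segments could be produced. It is written in Python for illustration and is not the generator used by the challenge; the function name `make_segments`, the 50% flip probability, and the `seed` parameter are assumptions.

```python
import random


def make_segments(genome, width=48, step=32, seed=0):
    """Cut genome into overlapping segments, flip some, scramble the order.

    genome is a list of values 0..3 whose length must be step*K + width.
    Indices are 0-based here, so the 1-based start locations 1, 33, 65, ...
    become 0, 32, 64, ...
    """
    if len(genome) < width or (len(genome) - width) % step != 0:
        raise ValueError("genome length must be step*K + width")
    rng = random.Random(seed)
    segs = []
    for start in range(0, len(genome) - width + 1, step):
        seg = list(genome[start:start + width])
        if rng.random() < 0.5:  # assumed flip probability
            seg.reverse()
        segs.append(seg)
    rng.shuffle(segs)  # segment order is scrambled
    return segs


rng = random.Random(1)
genome = [rng.randrange(4) for _ in range(32 * 3 + 48)]  # K = 3
segs = make_segments(genome)
print(len(segs), len(segs[0]))
```

With a width of 48 and a step of 32, neighbouring segments share 16 symbols, which is the overlap a solver can use to chain them.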
Input: segs, an array of M rows, each a 48-value segment. Values are [0, 1, 2, 3].
Output: Gout, the Genome or fliplr(Genome), as a vector of values [0, 1, 2, 3]
Example: segs = [0 1 2 2; 1 3 2 2; 3 1 1 2] creates Gout = [0 1 2 2 3 1 1 2], with M=3, W=4, Overlap=2. The middle segment was flipped.
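One possible approach, sketched here in Python for illustration and not taken from any reference solution, is to grow a single contig greedily: take any segment as the seed, then repeatedly attach a remaining segment, in either orientation, whose leading or trailing `overlap` symbols match the current end or start of the contig. The function name `reconstruct` is hypothetical, and the sketch relies on the stated assumption that overlaps are unique and error-free.

```python
def reconstruct(segs, overlap):
    """Greedy chaining of possibly flipped segments into one genome.

    segs: list of equal-length rows with values 0..3.
    overlap: number of symbols shared by neighbouring segments
             (16 for the 48-wide segments, 2 in the small example).
    Returns the genome or its left-right reversal.
    """
    remaining = [list(s) for s in segs]
    contig = remaining.pop(0)  # seed; its orientation fixes the output's
    while remaining:
        for i, seg in enumerate(remaining):
            placed = False
            for cand in (seg, seg[::-1]):  # try as given, then flipped
                if cand[:overlap] == contig[-overlap:]:
                    contig = contig + cand[overlap:]  # extend to the right
                    placed = True
                elif cand[-overlap:] == contig[:overlap]:
                    contig = cand[:-overlap] + contig  # extend to the left
                    placed = True
                if placed:
                    break
            if placed:
                del remaining[i]
                break
        else:
            raise ValueError("no remaining segment overlaps the contig")
    return contig


print(reconstruct([[0, 1, 2, 2], [1, 3, 2, 2], [3, 1, 1, 2]], overlap=2))
```

Each pass scans all remaining segments, so the cost grows quadratically with M. That is fine at this scale, though a lookup table keyed on the overlap symbols would likely be needed for the larger genomes mentioned later in the series.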
Future: Flipped segments (002), Random Position of Segment start locations, Extra Segments, Phage Phi X174, Parallel Processing Simulation (Shotgun Approach), Haemophilus influenzae, Sequence with Segment Errors, and Chromosome 20 with its 59M length using 100K 4K-segments
Problem Comments
1 Comment
Are Mjaavatten
on 19 Jul 2021
Be aware that one or more test cases are not shown on the "Solve" page. These may have different values for the parameter L.