The genome of the budgerigar parrot, Melopsittacus undulatus, was successfully sequenced in July 2012 using long third-generation reads provided by PacBio. The Assemblathon genome contest led the team of Phillippy, Koren and Jarvis to sequence the parrot DNA using PacBio third-generation data and Illumina second-generation data.
The third-generation PacBio reads are very long, 1K-20K bases, but have a 15% error rate. The Illumina reads are 100-500 bases long with a <1% error rate. Jarvis and his team combined these data to achieve a <0.1% error rate.
Genome Challenge 004 is to correct simplified, simulated PacBio reads that have a high error rate.
Input:
Call 1: empty array, segment width, flag=0
Call 2: N PacBio DNA vectors (N x width), segment width, flag=1
Output:
Call 1: empty vector, number of requested vectors
Call 2: corrected DNA vector, number of requested vectors
Score: the number of vectors, N, used to produce the correct vector for the w=1024 case
The first call to the PacBio_fix routine returns the number of vectors requested to produce a final product. This may be a function of w.
The second call to PacBio_fix passes a DNA matrix (N x width) and flag=1.
The response to the second call is the fixed DNA sequence, a vector of width w.
Example: the first call returns N=3.
Truth:
01230123111122223333
Input example (injected errors):
01232123112122221332
01130123111122123323
11230133121122223333
Output (the truth, hopefully):
01230123111122223333
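The two-call protocol and the example above can be sketched as follows. This is a minimal illustration in Python rather than Cody's MATLAB. It assumes a column-wise majority vote as the correction strategy, which the challenge does not prescribe. The lowercase name `pacbio_fix`, its argument order, and the fixed request of three reads are assumptions made for the sketch.

```python
from collections import Counter


def pacbio_fix(reads, width, flag):
    """Hypothetical Python analogue of the two-call PacBio_fix routine."""
    n_requested = 3  # assumption: a fixed request; it could depend on width
    if flag == 0:
        # Call 1: no data yet, only report how many reads are wanted.
        return [], n_requested
    # Call 2: reads are pre-aligned, so each column can be voted on directly.
    assert all(len(row) == width for row in reads)
    # On a tie, Counter.most_common keeps the value seen first in the column.
    corrected = [Counter(col).most_common(1)[0][0] for col in zip(*reads)]
    return corrected, n_requested


truth = "01230123111122223333"
noisy = [
    "01232123112122221332",
    "01130123111122123323",
    "11230133121122223333",
]
reads = [[int(ch) for ch in row] for row in noisy]

_, n = pacbio_fix([], len(truth), 0)             # first call: ask for N
fixed, _ = pacbio_fix(reads[:n], len(truth), 1)  # second call: correct
print("".join(map(str, fixed)) == truth)         # prints True for this example
```

With only three reads, a column where two reads carry the same wrong value would defeat the vote, so a real entry has to trade a larger N (a worse score) against reliability.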
These data are simplified: the only errors are simple substitutions, and the data sets are provided pre-aligned.
Real PacBio data is considerably more complicated. Bases may be inserted, deleted, or substituted, and the reads vary in length, which causes alignment issues.
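A small Python sketch of why insertions and deletions matter, reusing the truth sequence from the example. The helper `mismatches` is a hypothetical name introduced here. A substitution disturbs one position, while a single deletion shifts every later base, so an index-by-index comparison no longer lines up.

```python
def mismatches(a, b):
    """Count positions where two sequences differ, comparing index by index."""
    return sum(x != y for x, y in zip(a, b))


truth = "01230123111122223333"
substituted = truth[:4] + "2" + truth[5:]  # one base replaced at index 4
deleted = truth[:4] + truth[5:]            # one base removed at index 4

print(mismatches(truth, substituted))  # a single position differs
print(mismatches(truth, deleted))      # several positions differ after the shift
```

This is why column-wise voting only works on the simplified, pre-aligned data, and why real reads need an alignment step first.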
Follow-up challenges: sample data from the PacBio site for Lambda Phage will be molded into various challenges. Possible challenges are correcting individual long segments and assembling multiple long segments into the full Lambda Phage genome. The parrot genome is too big for Cody to solve in 50 seconds.