An open exchange for the MATLAB and Simulink user community |
Hosted by The MathWorks |
Summary of the MATLAB Gene Sequencing Programming ContestIn the recent week long MATLAB programming contest, about 800 entries went through the contest queue, resulting in three intermediate prize winners and one final winner.In addition to looking at this analysis (which focuses on the techniques used in the winning entry), you might want to look at the discussion areas to see other interesting threads related to the MATLAB Contest.
The Scoring FunctionWe wanted the scoring function to strongly favor sequence length, and to consider time factors only when separating entries with similar performance. The scoring function looked like:
score = cputime + totalscore;
Instead of displaying "totalscore" on our Rankings page, we displayed "result", a normalized measure how how short your sequence was. "result" was computed with:
result = totalscore/Our_Length*100;
(As an aside, our experience with previous contests shows that hiding details like this doesn't always prevent the entries from "evolving" to be suited to the specific trials presented. However, it's not clear if that happened here, since the best entries did fairly well across many different types of tests.)
The Test SuiteA MAT-file containing the segments and the original piece can be found here.This MAT file contains two cell arrays. The one_answer{i} contains the original sequence used to create the fragments stored in testsuite{i}. There were 29 tests, with original lengths ranging from 10 up to 23456 characters.
Overview of the Techniques Used in the Top EntriesThe majority of the contest entries used some type of brute force method to solve this problem, which is generally thought to be NP complete.This typically meant comparing all the segments to each other to determine how to make the best match. While time wasn't weighted nearly as strongly as the scoring function, time was initially important because it was difficult to get a slow brute force algorithm to pass the testsuite without timing out. Here were a few common ideas we saw in the top entries; we've pointed out the code related to these ideas further below where we've "dissected" the code.
The Winning EntryThe first puzzle related to analyzing the winning entry was determining what GFD stood for! Francois Glineur, our winner, told us that GFD stood for Gene-Francois-Deborah (the latter being his wife's first name).The code for the winning entry is below. In some places, we've reformatted the code with "..." and newlines to make it more readable. The code begins with:
function Sequence = GFD6(Segments) if size(Segments, 1) > 200This IF statement is the cutoff swich statement to determine which algorithm to use for accumulating the final result. Next we had:
lSegs = size(Segments, 2); [SortedHash Order] = sort(Segments * randn(lSegs, 1)); startSegs = Order(diff(cat(1, SortedHash, 0)) ~= 0)';These lines remove the duplicate fragments, according to the hashing-type algorithm we discussed earlier. Notice that the sort is only performed on a 1-d vector, instead of on an lSegs-d array. The multiplication 'Segments * randn(lSegs,1)' maps lSegs dimensions into 1 dimension.
endSegs = startSegs; nSegs = length(startSegs);These lines set up some work vectors to be carried around for updating information. Next we need to set up a cell array to carry information about the combinations needed to produce our final result:
Description = cell(nSegs, 1);
for index = startSegs
Description{index} = index;
end
While this code works well, a minor problem in this code is that this
pre-allocated size may not be large enough to hold the final
result.Below we begin the routine for finding which segments have the best (greatest) overlap:
matchLength = lSegs-1; leftSegs = nSegs;Since there are no identical fragments, the longest match possible is lSegs-1. Now we begin to loop over the possible shifts to find the best match:
while matchLength > 0
RandWeight = randn(matchLength, 1);
LeftHash = Segments(endSegs, lSegs-matchLength+1:end) * RandWeight;
RightHash = (Segments(startSegs, 1:matchLength) * RandWeight)';
The RandWeight vector is the vector which maps the lSegs matching
problem into a 1-D problem, similar to what was done to remove
duplicate entries.Next we'll loop over the uncombined segments and check to see if there are any matches:
indexLeft = 1;
while indexLeft <= leftSegs
Match = findstr(LeftHash(indexLeft), RightHash);
if Match
indexRight = Match(1);
After checking for matches, we check for a special case:
if (indexRight == indexLeft) & (length(Match) > 1)
% Rare case, happens e.g. with strvcat('AAA','AAG','AGC')
indexRight = Match(2);
end
Did you remember that we're still inside an IF statement here? In this
section of the IF, we don't actually combine the segments when a match
is found. Instead we store information about how to combine the
segments and do the combination at the end of the routine.The following code updates the Description array to record the combining pairs and the amount by which they're shifted; it also updates the related working vectors:
if indexRight ~= indexLeft
Description{startSegs(indexLeft)} = cat(2,...
Description{startSegs(indexLeft)}, lSegs-matchLength, ...
Description{startSegs(indexRight)});
startSegs(indexRight) = startSegs(indexLeft);
startSegs(indexLeft) = []; endSegs(indexLeft) = [];
leftSegs = leftSegs - 1;
RightHash(indexRight) = RightHash(indexLeft);
RightHash(indexLeft) = []; LeftHash(indexLeft) = [];
If there's nothing left to track, we break out of this routine:
if leftSegs == 1
matchLength = 1;
break;
end
else
Here's the rest of the "if indexRight ~= indexLeft" conditional:
indexLeft = indexLeft + 1;
end
else
indexLeft = indexLeft + 1;
And now we end the "if Match" conditional:
end
Now we can end the "while indexLeft <= leftSegs" loop:
end
One more step to reduce matchLength:
matchLength = matchLength - 1;
Now we can end the "while matchLength > 0" conditional.
endBut we're not finished yet! Now we need to combine the segments according to the Description cell array created. We'll start by attaching all nonmatching fragments to the end of the sequence:
if length(startSegs) > 1
% If disjoint segments in the end, we have to merge them !
for index = 2:length(startSegs)
Description{startSegs(1)} = [Description{startSegs(1)},...
lSegs, Description{startSegs(index)}];
end
startSegs = startSegs(1);
end
Now we actually combine the segments - check out the line used here:
Sequence(repmat(cumsum(cat(2, 0, ...
Description{startSegs}(2:2:2*nSegs-2)))',...
1, lSegs) + repmat(1:lSegs, nSegs, 1))...
= Segments(Description{startSegs}(1:2:2*nSegs-1), :);
Pretty impressive! This code snippet is a great example of
vectorizing. REPMAT, CUMSUM, and the colon operator are all used in
this line, making for a very efficient implementation.Keeping in mind we're still inside an IF conditional, the above section of code only executes for segments of length greater than 200. The following block of code beginning with the ELSE describes the algorithm for used for shorter segments; it combines segments as soon as a match is found:
else
[nSegs, lSegs] = size(Segments);
[SortedHash Order] = sort(Segments * randn(lSegs,1)) ;
Segments = Segments(Order(diff(cat(1,SortedHash,0)) ~= 0), :);
[nSegs, lSegs] = size(Segments);
theSegments = cell(nSegs, 1);
theSegs = 1:nSegs;
for index = theSegs,
theSegments{index} = Segments(index, :);
end
copySegments = theSegments;
matchLength = lSegs-1;
while matchLength > 0
Segments(:, 1) = [];
for leftSeg = theSegs
The following line is a very computationally expensive line. If you
profile this function, you'll see that a majority of the time spent in
this routine is spend on this line:
Match = find(strncmp(Segments(leftSeg, :), copySegments, matchLength));
if Match
Select = Match(1);
rightSeg = theSegs(Select);
if rightSeg == leftSeg
if length(Match) > 1 % Rare case
temp = Match(Match ~= Select);
Select = temp(1);
rightSeg = theSegs(Select);
Here's one of the code sections that does the actual combining:
theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ...
theSegments{rightSeg}(1+matchLength:end));
i = find(theSegs == leftSeg);
theSegs(i) = [];
if length(theSegs) == 1
Sequence = theSegments{theSegs};
return;
end
copySegments(Select) = copySegments(i);
copySegments(i) = [];
end
This section of code is then followed by:
else
theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ...
theSegments{rightSeg}(1+matchLength:end));
i = find(theSegs == leftSeg);
theSegs(i) = [];
if length(theSegs) == 1
Sequence = theSegments{theSegs};
return;
end
copySegments(Select) = copySegments(i);
copySegments(i) = [];
end
end
end
matchLength = matchLength - 1;
end
Sequence = [theSegments{theSegs}];
end
Take a look at these two sections of code. They have a lot in common;
their structure is very similar. It'd be interesting to see if it's
possible to remove any redundancies between the code to improve the
performance yet again.We hope you've enjoyed this analysis of the code. If you have more questions on it, you may want to contact Francois, his contact information is listed next to his winning entry (still in the Top 20 Entries) .
|
|
|||||||
| Related Topics |
| New Products | Support | Documentation | Training | Webinars | Careers | Newsletters | RSS |
| Problems? Suggestions? Contact us at contest@mathworks.com | © 1994-${current_year} The MathWorks, Inc. Trademarks Privacy Policy |