An open exchange for the MATLAB and Simulink user community |
Hosted by The MathWorks |
Gene Splicing, Mid-Contest Analysisby Michael Weidman The contributors to this MATLAB Central contest deserve hearty congratulations for their hard work and clever strategies towards solving the gene transposition problem. The leaders' codes continue to get more sophisticated and scores continue to drop: most impressive. We thought that we would use this chance to take a step back from the immediate stresses and concerns of the contest to talk about why these optimizing algorithms for gene transposition are important to the larger scientific community, particularly the field of phylogenetics. When students lean about the classification of organisms, they are taught that every species is closely related to other species within its genus, every genus is closely related to others within its family, families are related to other families within the same order, and so on up the classification scheme. The concept of a "species" is fairly well-defined: it is usually a group of organisms that can reproduce with each other to create fertile offspring. The process becomes more difficult, however, whenever we try to identify exactly what a "genus" or a "family" is. Do we define two species as closely related if they have similar traits? Because both bats and birds fly, does that mean that they are to be lumped into the same classification? Are dolphins more closely related to sharks or to humans? Biologists have decided that these phenotypical similarities do not define the classification scheme, but rather that a shared genetic ancestry collects species into their respective groupings. Until recently, however, it has been impossible to measure a shared ancestry. We can guess at a species' relationship to other species by its physical appearance, the fossil record, and the environment in which it lives, but all of these techniques are at best limited and at worst impossible to quantify. Modern genetics represents a new tool for the field of phylogeny. Gene transposition is one of the building blocks by which an ancestor's genetic code is transformed into its descendent species. By determining how many transpositions separate the genetic codes of two related organisms, scientists can gain a clearer understanding of how they evolved and how closely related they are. Unfortunately, this is a difficult computational challenge. Given the large number of variables that control a typical gene transposition (the length of the gene segment that is removed, the location of its reinsertion, and whether its direction is reversed), there is a dauntingly large number of possibilities that allow one gene sequence to be transformed into another. Most research in computational biology has focused on one small group of possible transpositions, those that involve a reversal in the segment's order and its reinsertion into the original location. Even with this constraint on the possible operations that can be performed, finding the shortest sequence of reversals (which is the most likely sequence, from a biological perspective) has been difficult. Alberto Caprara of the University of Bologna has proven that the procedure is NP-Hard, which requires an algorithm whose number of operations increases exponentially with the length of the gene sequence to be ordered. By taking advantage of the fact that individual genes have an individual "polarity" (i.e.: a gene can be distinguished from the same gene written in reverse), computational biologists like Eric Tannier and Marie-France Sagot of the Universit Claude Bernard have been able to improve the efficiency of finding the optimum flipping sequence to an algorithm that performs less than n^2 operations. This contest treads on territory that is less covered than the flipping problem. We do not take advantage of the polarity of the constituent genes, and we allow for transpositions more complicated than a simple flip. As a result, we would expect to find shorter optimal transcription sequences, but it would be more computationally expensive to find them. Although the solution to the optimal gene transcription problem extends the possibility of quantifying the relationships between species of organisms to a precision never before imagined, it can never be a panacea to the problem of deciphering the genetic heritage of life. On the one hand, the genetic code is so large and complicated that modern computers can only look at gene transcriptions on very small portions of an organism's DNA. On the other hand, we can only examine the DNA from currently-existing or recently-extinct species. A complete understanding of evolutionary relationships may require an impossible comparison to the genetic code of a long-disappeared ancestor. Nonetheless, algorithms that can efficiently solve problems like those posed in this contest can represent a new and useful tool in the field of phylogeny and perhaps others. |
|
| Related Topics |
| New Products | Support | Documentation | Training | Webinars | Careers | Newsletters | RSS |
| Problems? Suggestions? Contact us at contest@mathworks.com | © 1994-2008 The MathWorks, Inc. Trademarks Privacy Policy |