Asked by William
on 25 May 2013

For a little background I am working on an iGEM project (genetic engineering) and I have a bunch of DNA sequences for proteins. I need to figure out the optimal order for them so that I can break them up so that the first chunk is 500 bp long and each following chunk is 20 bp from the previous part + 480 bp that are next. I need to order them so that we use our 500 bp chunks as optimally as possible and so that proteins span as few of these 500 bp chunks as possible.

I know how to write a costing function so that given an order for these proteins I can determine how good that order is but I am currently at a loss for how to get the global optimizer to be able to change this order. The only thing that matters is that I end up with something I can order so I could take 20 arguments with each one being a number for its position, a number that I could put in order, or a vector that just had the order of the proteins in it. I just don't know how to generate that structure.

If I use a number for each one the optimizer could try to put more than one at the same value and if I give the optimizer a very large cost for those so that it gets thrown out I suspect it won't be able to find a good solution since most of the solutions it tries will be not possible. As for having it just shuffle the order of a vector I have no idea how to do that.

Any help would be appreciated. Thank you

Here is my costing function The function gives a cost (where lower is better) based on the order chosen. The only input that needs to change is order and it is just a vector that contains the indexes into the genes cell array.

For the data I have right now I would need a vector that is 17 elements consisting of the integers 1 to 17 in any order but without duplication and allow the optimizer to try various permutations to come up with an optimal solution.

function [ cost ] = gene_cost( genes, order ) %UNTITLED Summary of this function goes here % Detailed explanation goes here % genes is a cell array where column 1 is the gene name and column 2 is the % DNA sequence, and column 3 is the length of the gene there is one row for each gene %order is just a vector that defines the order of the genes %ex [10 7 5 3...] temp = genes(order,:);

upper_interval = zeros(1,length(temp)); for i=1:length(temp) upper_interval(i)= sum(cell2mat(temp(1:i,3))); end lower_interval = [0 upper_interval(1:length(upper_interval)-1)];

min_cost = transpose(ceil(cell2mat(temp(:,3))/500));

dna = strjoin(transpose(temp(:,2)),''); dna_length = length(dna); %get first 500 bp and put it in the first chunk block = {}; piece = dna(1:500); block = vertcat(block, piece); genes_in_block = 1 < upper_interval & lower_interval < 500; cost = sum(genes_in_block,1);

for i = 481:480:dna_length if dna_length -i < 500 chunk_size = dna_length -i; else chunk_size = 499; end piece = dna(i:i+chunk_size); block = vertcat(block, piece); genes_in_block = i < upper_interval & lower_interval < i+chunk_size; cost = cost + sum(genes_in_block,1); end cost = sum((cost-min_cost).^4); %the power is used to give a high penalty for taking up more too many more blocks that necessary

Related Content

MATLAB and Simulink resources for Arduino, LEGO, and Raspberry Pi test

Learn moreOpportunities for recent engineering grads.

Apply Today
## 2 Comments

## Jonathan Epperl (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/76932#comment_150912

"Shuffling" a vector can be done with

permsorrandperm.I don't think I completely understand what you are trying to do though, could you have another go at explaining your problem, maybe without referring to genetics at all?

## Matt J (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/76932#comment_150936

I know how to write a costing function so that given an order for these proteins I can determine how good that orderPlease write that for us so that we can see it, too.