Help optimizing inefficient code

I'm relatively new to MATLAB, and I have code that merges two datasets based on a common attribute (in this case, names). However, it is very inefficient, so I'd be grateful for any suggestions to speed it up.
The gist is this: I have two datasets with complementary data, and they share certain attributes that I'd like to use to combine them. I'm reading Dataset 1 as a cell array. It contains promoter names (which I'll just call promoters) and a value associated with each.
I've been reading Dataset 2 as a table and then converting it into a cell array. Each row represents one gene; however, each gene can have zero, one, or multiple promoters (which I'll just call 'Names' from now on) from Dataset 1. I'd like to find a way to append each value in Dataset 1 to its associated identifier and gene in Dataset 2. In Dataset 2, if a gene has multiple promoters, they are originally stored in a single string cell, separated by ' // '.
Essentially:
Dataset 1 (10,000 x 2): 'Name' 'Value', i.e. dataset1 = {'Name1', 2.32; 'Name2', 3.42}
Dataset 2 (5000 x 2): 'Gene' 'Name,Name', i.e. dataset2 = {'Gene1', []; 'Gene2', 'Name1'; 'Gene3', 'Name2 // Name3'}
My first step was to split these strings at the separator:
for j=[2 3 4 5 6] %dataset2 actually has multiple columns of significance that need splitting, but not important now
    for i=1:length(a)
        if isempty(a{i,j}) == 1
            continue
        end
        b = char(a{i,j});
        c = strsplit(b,' // ');
        a{i,j} = [c];
    end
end
This converted each cell with multiple promoter names into a cell array of promoter names (i.e. {'Gene', 1x3}, where the 1x3 cell is {'Name1', 'Name2', 'Name3'}).
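On the toy entry for Gene3 above, for example, the split does the following (a minimal sketch; b and c are the same temporary variables as in my loop):

```matlab
% Toy entry from Dataset 2 holding two promoter names in one string
b = 'Name2 // Name3';
c = strsplit(b, ' // ');   % c is a 1x2 cell array: {'Name2', 'Name3'}
```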
My solution for merging the data was to use a for loop that checks the size of that 1x3 cell (which could also hold just a single name), searches dataset1 for each name, and appends the associated value to dataset 2 in a manner such as:
Dataset 2: {'Gene1', 2x3} where the 2x3 = {'Name1', 'Name2', 'Name3'; 'Value1', 'Value2', 'Value3'}
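To make the target layout concrete, here is a small sketch of what one finished row should look like. The variable names are only for illustration, and the value 4.10 for Name3 is a made-up placeholder, since Name3 does not appear in the toy dataset1 above:

```matlab
% Desired layout for a single gene, built by hand from toy values.
names  = {'Name1', 'Name2', 'Name3'};
values = {2.32, 3.42, 4.10};          % 4.10 is a placeholder, not real data
merged = [names; values];             % 2x3 cell: names in row 1, values in row 2
row    = {'Gene1', merged};           % one row of the merged Dataset 2
```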
Here is my code. I tried to annotate it to make it easier to follow:
for i=1:length(dataset2)
    if isempty(dataset2{i,5}) == 1 % 0 'Names' associated w/ gene
        continue
    end
    s = size(a{i,5});
    if s(1,2) == 1 % One 'Name' associated w/ gene
        x = a{i,5};
        for j=1:length(dataset1)
            y = dataset1{j,1};
            if strcmp(x,y) == 1 % Using strcmp to find matching 'Names'
                a{i,7} = dataset1{j,3};
                a{i,8} = dataset1{j,4};
            end
        end
    end
    if s(1,2) > 1 % Multiple 'Names' associated with gene
        r = a{i,5};
        p = 1;
        for rr=1:length(s(1,2))
            x = r(1,rr);
            for m=1:length(dataset1)
                y = dataset1{m,1};
                if strcmp(x,y) == 1
                    a{i,7}(2,rr) = dataset1{j,3};
                    a{i,8}(3,rr) = dataset1{j,4};
                end
            end
        end
    end
end
I'm sure this is a very convoluted script, so any insight would be appreciated.

1 Comment

This part is not clear:
Dataset 1 (10,000 x 2): 'Name' 'Value' i.e. dataset1= {'Name1', 2.32; 'Name2', 3.42}
Dataset 2 (5000 x 2): 'Gene' 'Name,Name' i.e. dataset2 = {'Gene1', []; 'Gene2', 'Name1'; 'Gene3', 'Name2 // Name3'}
What does this mean? Please provide the input data in a clear format. What is the variable a in the first code snippet? I guess this is a simplification of that snippet:
for j = 2:6
    for i = 1:size(a, 1)  % Safer than: length(a)
        if ~isempty(a{i,j})
            a{i,j} = strsplit(a{i,j}, ' // ');
        end
    end
end
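
For the merge step, the nested strcmp loops compare every name against every row of dataset1. One possible alternative, sketched here on the toy data from the question, is to let ismember look up all names of a gene at once. This is only a sketch under assumptions: the names are taken to sit in column 1 and numeric scalar values in column 2 of dataset1 (the posted loop reads columns 3 and 4 instead), the variable merged is a hypothetical output container, and names missing from dataset1 get NaN:

```matlab
% Toy inputs copied from the question; Name3 has no entry in dataset1.
dataset1 = {'Name1', 2.32; 'Name2', 3.42};
dataset2 = {'Gene1', []; 'Gene2', 'Name1'; 'Gene3', 'Name2 // Name3'};

% Assumption: column 1 holds the names, column 2 numeric scalar values.
allNames  = dataset1(:, 1);
allValues = cell2mat(dataset1(:, 2));

merged = cell(size(dataset2, 1), 1);     % hypothetical output container
for i = 1:size(dataset2, 1)
    entry = dataset2{i, 2};
    if isempty(entry)
        continue                          % gene without promoters
    end
    names = strsplit(entry, ' // ');      % 1xN cell of char vectors
    [found, idx] = ismember(names, allNames);
    values = nan(1, numel(names));        % NaN marks names not in dataset1
    values(found) = allValues(idx(found));
    merged{i} = [names; num2cell(values)];  % 2xN: names on top, values below
end
```

Two details in the posted merge loop also look unintended: length(s(1,2)) is the length of a scalar and therefore always 1, so only the first name of each gene is visited, and the inner loop over m indexes dataset1 with j instead of m.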

Answers (0)

Asked:

on 18 Feb 2019

Commented:

Jan
on 18 Feb 2019
