Discover MakerZone

MATLAB and Simulink resources for Arduino, LEGO, and Raspberry Pi

Learn more

Discover what MATLAB® can do for your career.

Opportunities for recent engineering grads.

Apply Today

Thread Subject:
“Row comparison speed challenge”

Subject: “Row comparison speed challenge”

From: Josh

Date: 1 Oct, 2010 17:41:09

Message: 1 of 9

Hi Everyone,

     I have two large arrays that I would like to compare. The arrays are of different lengths, and are filled with data rounded to the forth decimal place. The first Array (Called A) is the base array containing X Y coordinate data in the first and second columns respectively. The third column is used as a counter to keep track of the number of times that this particular row (X Y coordinate data of first two columns) is found in the B array. The B array also has X Y coordinate data in the first two columns, and the third column is filled with ones (Third column really isn’t needed in B matrix, as I am only comparing the X Y coordinates).

Please download the test arrays that I have mentioned by going to http://my.huddle.net . The username and password are both: userstorage . Once in, you should be able to click on A.txt, B.txt, and A_results.txt and download the files. (If this doesn't work, please e-mail me and I can send you the files)

   So I want to compare each row in B, to see if it exists in A. If it does exist, then I want to add one to the third column of that row in A. So far this is my code:

load A.txt
load B.txt

I=zeros(size(A,1),1);
J=zeros(size(A,1),1);
X=zeros(size(A,1),1);
 
tic
for a=1:size(B,1)
    I=A(:,1)==B(a,1);
    J=A(:,2)==B(a,2);
    X=(I.*J);
    Q=find(X);
    A(Q,3)=A(Q,3)+1;
end
toc

save('A_result.txt','A','-ascii', '-tabs')

Unfortunately, this takes way to long, (Elapsed time is 273.352197 seconds.) especially since I will continue to compare the base Array (A) in a loop, with hundreds of different B Arrays.

If anyone can help me to make this row by row comparison a lot quicker with the downloaded data, I would greatly appreciate it.

Also, some things to keep in Mind:
1.) Each Array is filled with unique X Y data, so you do not need to worry about duplicate points within an Array. (This also means that only one (or no) duplicate will be found during each loop of the code as I have written it.

2.) There is nothing special about the order of the data provided that each row remains intact, so sorting etc… is fine if it helps speed up the process (perhaps the size compared can be cut down in the for loop by recognizing that the next row in B will only be found below the previous solution row in A etc..)

3.) There is a good chance that some of the B arrays (data not included here), will not have any points in common, so if you have a way to perform a quick check to see if any points are the same before going into the loop, that would really help me out as well.

4.) Please test any code with the download files that I specified above to get the sample A array, B array, and solution array (A_result), as it is a realistic set of data that will help weed out “I think” (unfortunately) some of the quick nxn matrix type approaches that might otherwise be used for smaller sets of data.


Thanks a million for any help,

     Josh

Subject: “Row comparison speed challenge”

From: Sean

Date: 1 Oct, 2010 18:25:19

Message: 2 of 9

"Josh " <joshinbox@hotmail.com> wrote in message <i856fl$ov$1@fred.mathworks.com>...
> Hi Everyone,
>
> I have two large arrays that I would like to compare. The arrays are of different lengths, and are filled with data rounded to the forth decimal place. The first Array (Called A) is the base array containing X Y coordinate data in the first and second columns respectively. The third column is used as a counter to keep track of the number of times that this particular row (X Y coordinate data of first two columns) is found in the B array. The B array also has X Y coordinate data in the first two columns, and the third column is filled with ones (Third column really isn’t needed in B matrix, as I am only comparing the X Y coordinates).
>
> Please download the test arrays that I have mentioned by going to http://my.huddle.net . The username and password are both: userstorage . Once in, you should be able to click on A.txt, B.txt, and A_results.txt and download the files. (If this doesn't work, please e-mail me and I can send you the files)
>
> So I want to compare each row in B, to see if it exists in A. If it does exist, then I want to add one to the third column of that row in A. So far this is my code:
>
> load A.txt
> load B.txt
>
> I=zeros(size(A,1),1);
> J=zeros(size(A,1),1);
> X=zeros(size(A,1),1);
>
> tic
> for a=1:size(B,1)
> I=A(:,1)==B(a,1);
> J=A(:,2)==B(a,2);
> X=(I.*J);
> Q=find(X);
> A(Q,3)=A(Q,3)+1;
> end
> toc
>
> save('A_result.txt','A','-ascii', '-tabs')
>
> Unfortunately, this takes way to long, (Elapsed time is 273.352197 seconds.) especially since I will continue to compare the base Array (A) in a loop, with hundreds of different B Arrays.
>
> If anyone can help me to make this row by row comparison a lot quicker with the downloaded data, I would greatly appreciate it.
>
> Also, some things to keep in Mind:
> 1.) Each Array is filled with unique X Y data, so you do not need to worry about duplicate points within an Array. (This also means that only one (or no) duplicate will be found during each loop of the code as I have written it.
>
> 2.) There is nothing special about the order of the data provided that each row remains intact, so sorting etc… is fine if it helps speed up the process (perhaps the size compared can be cut down in the for loop by recognizing that the next row in B will only be found below the previous solution row in A etc..)
>
> 3.) There is a good chance that some of the B arrays (data not included here), will not have any points in common, so if you have a way to perform a quick check to see if any points are the same before going into the loop, that would really help me out as well.
>
> 4.) Please test any code with the download files that I specified above to get the sample A array, B array, and solution array (A_result), as it is a realistic set of data that will help weed out “I think” (unfortunately) some of the quick nxn matrix type approaches that might otherwise be used for smaller sets of data.
>
>
> Thanks a million for any help,
>
> Josh

I think this might win:

%%%
[~,idx,~] = intersect(A(:,[1 2]),B(:,[1 2]),'rows'); %Get intersecting rows of A
idx = sub2ind(size(A),idx,repmat(3,size(idx))); %Make list of indices into 3rd column
A(idx) = A(idx)+1; %Increment
%%%
%SCd

Subject: &#8220;Row comparison speed challenge&#8221;

From: Josh

Date: 2 Oct, 2010 06:59:04

Message: 3 of 9

"Sean " <sean.dewolski@nospamplease.umit.maine.edu> wrote in message
> I think this might win:
>
> %%%
> [~,idx,~] = intersect(A(:,[1 2]),B(:,[1 2]),'rows'); %Get intersecting rows of A
> idx = sub2ind(size(A),idx,repmat(3,size(idx))); %Make list of indices into 3rd column
> A(idx) = A(idx)+1; %Increment
> %%%
> %SCd
---------------------------------------------------------------------------------------------------------------------------
Thanks a lot Sean.
Brilliant! Wow, I can't believe how much faster this works. Running on the same computer I now get: (Elapsed time is 0.615979 seconds.). Definitely a winner!

  If I could pick your brain just one more time, I have another slight variation to this that I would like to try. The premise is the same, I have X Y coordinates in the base matrix (this time called A2, with counter in third column and I want to increment the counter based on comparison with second array now called B2.

 Now here is the twist, instead of having B2 filled with every X Y data point, B2 now contains sets of endpoints (points representing horizontal boundary markers). So now the code should look at the first two sets of points in B2, and see which points in A2 lie between the two points. Every point between the two should increment the counter.

Since B2 contains point pairs that scan horizontally, the Y value will always be the same on each boundary pair. So I went ahead and sorted the data in A2 and B2 using sortrows("Array",[2,1]). This way first scan the Y value, and then capture any A2 rows based on X values that lie between the B2 X values within the rows containing that Y value.

Again you can download large sample arrays at http://my.huddle.net with username and password both: userstorage. This time A2 and B2 are in a .zip file called Test2 (in the "What's new" tab)

But just as an example:

A2 =

 [0.0010 1.3333 0
  0.0020 1.3333 0
 -0.0010 1.3333 0
  0.0000 0.6667 0
 -0.0010 0.6667 0
  0.0000 1.3333 0
  0.0030 1.3333 0]

B2 =

  [ 0.0000 1.3333
    0.0020 1.3333]

         Solution should yield:

A2=

"Sean " <sean.dewolski@nospamplease.umit.maine.edu> wrote in message
> I think this might win:
>
> %%%
> [~,idx,~] = intersect(A(:,[1 2]),B(:,[1 2]),'rows'); %Get intersecting rows of A
> idx = sub2ind(size(A),idx,repmat(3,size(idx))); %Make list of indices into 3rd column
> A(idx) = A(idx)+1; %Increment
> %%%
> %SCd
---------------------------------------------------------------------------------------------------------------------------
Thanks a lot Sean.
Brilliant! Wow, I can't believe how much faster this works. Running on the same computer I now get: (Elapsed time is 0.615979 seconds.). Definitely a winner!

  If I could pick your brain just one more time, I have another slight variation to this that I would like to try. The premise is the same, I have X Y coordinates in the base matrix (this time called A2, with counter in third column and I want to increment the counter based on comparison with second array now called B2.

 Now here is the twist, instead of having B2 filled with every X Y data point, B2 now contains sets of endpoints (points representing horizontal boundary markers). So now the code should look at sets of points in B2, and see which points in A2 lie between the two points (inclusive). Every point in A2 that exist between the two points (or that is one of the points) in B2 should increment the counter.

Since B2 contains point pairs that scan horizontally, the Y value will always be the same on each boundary pair. So I went ahead and sorted the data in A2 and B2 using sortrows("Array",[2,1]). This way first scan the Y value, and then capture any A2 rows based on X values that lie between the B2 X values within the rows containing that Y value.

Again you can download large sample arrays at http://my.huddle.net with username and password both: userstorage. This time A2 and B2 are in a .zip file called Test2 (in the "What's new" tab)

But just as an example:

A2 =

   [-0.0010 0.6667 0
     0.0000 0.6667 0
    -0.0010 1.3333 0
     0.0000 1.3333 0
     0.0010 1.3333 0
     0.0020 1.3333 0
     0.0030 1.3333 0]

B2 =

  [-0.0040 0.6667
    0.0000 0.6667
    0.0000 1.3333
    0.0020 1.3333]

Solution:

A2 =

   [-0.0010 0.6667 1
     0.0000 0.6667 1
    -0.0010 1.3333 0
     0.0000 1.3333 1
     0.0010 1.3333 1
     0.0020 1.3333 1
     0.0030 1.3333 0]

Thanks again,
      Josh

Subject: &#8220;Row comparison speed challenge&#8221;

From: Sean

Date: 2 Oct, 2010 11:05:06

Message: 4 of 9


> Thanks a lot Sean.
> Brilliant! Wow, I can't believe how much faster this works. Running on the same computer I now get: (Elapsed time is 0.615979 seconds.). Definitely a winner!
>
> If I could pick your brain just one more time, I have another slight variation to this that I would like to try. The premise is the same, I have X Y coordinates in the base matrix (this time called A2, with counter in third column and I want to increment the counter based on comparison with second array now called B2.
>
> Now here is the twist, instead of having B2 filled with every X Y data point, B2 now contains sets of endpoints (points representing horizontal boundary markers). So now the code should look at the first two sets of points in B2, and see which points in A2 lie between the two points. Every point between the two should increment the counter.
>
> Since B2 contains point pairs that scan horizontally, the Y value will always be the same on each boundary pair. So I went ahead and sorted the data in A2 and B2 using sortrows("Array",[2,1]). This way first scan the Y value, and then capture any A2 rows based on X values that lie between the B2 X values within the rows containing that Y value.
>
> Again you can download large sample arrays at http://my.huddle.net with username and password both: userstorage. This time A2 and B2 are in a .zip file called Test2 (in the "What's new" tab)
>
> But just as an example:
>
> A2 =
>
> [0.0010 1.3333 0
> 0.0020 1.3333 0
> -0.0010 1.3333 0
> 0.0000 0.6667 0
> -0.0010 0.6667 0
> 0.0000 1.3333 0
> 0.0030 1.3333 0]
>
> B2 =
>
> [ 0.0000 1.3333
> 0.0020 1.3333]
>
> Solution should yield:
>

It's the weekend so I'm on the wagon as far as abstaining from MATLAB. Here is logic for one idea, it'll need testing:
C = concatenate A2,B2
C = sortrows(C,[2 1]);

[~,idC,~] = intersect(C,B2); %find the indices in A2 that contain B2
points_between = diff(idB)-1; %Difference between row indices - 1 will be the number of points between. You're going to have to decide what to do with points that lie on the boundary. Are they part of the first or second basin or are they part of both basins or neither basin? From here just play with that difference on a small test matrix and see how to make it meet your needs.

Good luck!

Subject: &#8220;Row comparison speed challenge&#8221;

From: Sean

Date: 4 Oct, 2010 13:48:05

Message: 5 of 9

Actually:
doc histc

Subject: &#8220;Row comparison speed challenge&#8221;

From: Josh

Date: 7 Oct, 2010 20:28:04

Message: 6 of 9

--------------------------
Hi Sean,

     Thanks for the response. You asked whether or not to include the points that lie on the boundary, the answer is yes, I want to include these points as well if they exist within A2. I think I am following the first few steps, but am not sure how to map the points back to A2 based on indices taken from C. So far this is how I understand it:

1.) First create array C, which contains A2 and B2 (Reason for doing so, is that there may be points in B2 that lie outside of A2, and we need to capture those).

2.) C is sorted by Y (column 2) and then X (Column 1), so that points can easily be determined by first matching Y and then taking points between and including those matched in X.

3.) The intersection points between C and B2 are located with index idC

  From this point on, I am a bit lost. Since everything thus far has been indexed according to the new array C, how do I then correlate these points back to the original A2 array? (In your code, the last line (points_between) references idB, so I'm guessing that you already have an idea, perhaps something like [~,idB,~] = intersect(A2,B2), but I don't see how the two indices idC and idB can then be used to capture the data on A2 that is between and including the points in B2.).

   Sorry if I'm a bit slow, I'm a bit shaky on indexing logic.

    Thanks,
         Josh

Subject: &#8220;Row comparison speed challenge&#8221;

From: Sean

Date: 7 Oct, 2010 21:47:03

Message: 7 of 9


> Sorry if I'm a bit slow, I'm a bit shaky on indexing logic.

The only way to learn is to practice!

Okay, screw that whole idea from before, here's another way to do it. I didn't try it on your real data, but it works on the test case provided in this thread.

%%%
[~,~,ida] = unique(A2(:,2)); %unique 2nd columns
bnds = reshape(B2(:,1),[],2); %bounds where each row is the corresponds to that unique edge
bnds = mat2cell(bnds,ones(size(bnds,1),1),2); %make it a cell

AccA = accumarray(ida,A2(:,1),[],@(x){x}); %group all As
A2(:,3) = A2(:,3)+cell2mat(cellfun(@(x,y)(x>=y(1)&x<=y(2)),AccA,bnds,'uni',false));
%Find ones that are in between the edges, set them to true and add them in.

NOTE: This makes some assumptions about your data and will need to be modified if these criteria aren't true.
-Both A2 and B2 HAVE to be sorted the way they have been
-Both A2, B2 both HAVE to have the same edges. I.e. A2 can't have for example [ .1 1.75; 1.2 1.75; 1.6 1.75] and B2 not have 1.75. B2 HAS to have two edges for each edge in A2. If this isn't the case, you can either add two edges into B2 with first column that are both below or both above so that none of the ones in A2 are between. -Or we can look at code modification, which is surely doable.

Let me know if these are issues.
Good Luck!

Subject: &#8220;Row comparison speed challenge&#8221;

From: Josh

Date: 8 Nov, 2010 06:04:03

Message: 8 of 9

Hi Sean,

    I really appreciate the help. I've been extremely busy lately, and haven't had a chance to respond to this last message until now.

    Unfortunately the current code will not work because it does not fit the assumption you mentioned, as the edges of B2 will often exist outside those of A2.
 
  As a result, it may be easiest to first check which (if any) of the Y-coordinates in B2 (2nd column), also exist in A2 (2nd column). If they do exist, then a comparison can be made of the X-coordinates (1st column) to see which (if any) of the X-coordinates in A2 are the same as or between the X-coordinates in B2.

Also, keep in mind, that I am in control of how the data of each array is displayed and organized, so if you can think of an easier way to organize and compare the data (perhaps separating all coordinates with the same Y for example into separate arrays within a struct... then that is doable).

  Josh

Subject: &#8220;Row comparison speed challenge&#8221;

From: Josh

Date: 11 Nov, 2010 00:21:03

Message: 9 of 9

Okay, so I went ahead and programmed a solution for my question. However, I was wondering if someone could vectorize the last loop for me to make this go much faster:

Again, the A2 and B2 data files can be downloaded as a .zip file called Test2 (in the "What's new" tab) by going to: http://my.huddle.net The username and password are both: userstorage

Here is my code:

%---------------------------------------------------------------------
% This first part is just preprocessing, as I move the geometry into a
% new position, and delete some of the data, to ensure that all cases
% are taken into consideration
%---------------------------------------------------------------------

clear
clc
load A2.txt
load B2.txt

% Delete some of the data
X=A2(:,1)>.01 & A2(:,1)<.02;
Y=A2(:,2)>-.01 & A2(:,2)<.01;
Zip=X.*Y;
ZIP_i=find(Zip);
A2(ZIP_i,:)=[];

plot(A2(:,1),A2(:,2),'.','color','k')
hold all
B2(:,2)=B2(:,2)-.0075;
B2(:,1)=B2(:,1)+.0157;
plot(B2(:,1),B2(:,2),'.','color','b')

%---------------------------------------------------------------------
% This part of the code re-assembles the 'B2' array into a struct
% containing two bounding points in each cell, since the points
% are horizontal (same Y) Each cell has [X1, X2, Y], where X1 is the
% lower X bound, X2 is the upper X bound, and Y is the Y coordinate
%---------------------------------------------------------------------

current=1;
for i=1:length(B2)/2
    B2_bounds{i}(1,1)=round(B2(current,1)*10000)/10000;
    B2_bounds{i}(1,3)=round(B2(current,2)*10000)/10000;
    current=current+1;
    B2_bounds{i}(1,2)=round(B2(current,1)*10000)/10000;
    current=current+1;
end

% This ensures that all decimals are rounded to .0001"
A2(:,1)=round(A2(:,1)*10000)/10000;
A2(:,2)=round(A2(:,2)*10000)/10000;

%---------------------------------------------------------------------
% Now here is the code that I want vectorized
%---------------------------------------------------------------------

tic
for b=1:length(B2_bounds) % This cycle will not be needed in actual code, as the check will be made after each line segment is made
    [row,~]=find(A2(:,2)==B2_bounds{b}(1,3)); %First find rows with same Y value
    if isempty(row)~=1 % If there are no Y values, go to the next Y value
        plot(B2_bounds{b}(1,1),B2_bounds{b}(1,3),'s','color','y')
        plot(B2_bounds{b}(1,2),B2_bounds{b}(1,3),'s','color','y')
        Filter=(A2(row,1)>=B2_bounds{b}(1,1)).*(A2(row,1)<=B2_bounds{b}(1,2)); % Now see which X values in A2 are between the X values in B2_bounds
        rowFilter=Filter.*row; % rowFilter is now a subset of row and Filter (containing both the correct Y and X values)
        rowFilter = rowFilter(rowFilter ~= 0); % Remove all zeros from rowFilter
        A2(rowFilter,3)=A2(rowFilter,3)+1; % Add 1 to the Z axis (3rd column of A2), as these are the points within A2 that are bound by B2_bounds
        plot(A2(rowFilter,1),A2(rowFilter,2),'.','color','m')
    end
end
toc

Tags for this Thread

What are tags?

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

Contact us