How to find the percentage of similarity between two arrays

Question

aditya sahu on 10 Mar 2017

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/329104-how-to-find-percentage-of-similarity-between-two-arrays

Commented: Emu on 29 Oct 2022

Suppose x = [1 0 1 0] and y = [1 1 1 0]. If I compare the elements of x with y, always starting from the beginning of x, the longest match is [1 0], found at the 3rd and 4th elements of y. Two of the four elements of x match, so the percentage of matching is 50%. How can I write MATLAB code for this?

Important: the point is what percentage of x, starting from its 1st element and taken serially, matches y.
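Read this way, the task is to find the longest leading part (prefix) of x that occurs as a contiguous run somewhere in y. A minimal brute-force sketch of that reading, assuming row vectors; the names bestLen, bestPos and pct are made up here for illustration and are not from the thread:

```matlab
x  = [1 0 1 0];
y  = [1 1 1 0];
nx = numel(x);
ny = numel(y);
bestLen = 0;   % length of the longest prefix of x found so far
bestPos = 0;   % start index of that prefix in y (0 means nothing found)
for pos = 1:ny
    len = 0;
    % Extend the match while we stay inside both vectors and elements agree
    while len < nx && pos + len <= ny && x(len + 1) == y(pos + len)
        len = len + 1;
    end
    if len > bestLen
        bestLen = len;
        bestPos = pos;
    end
end
pct = 100 * bestLen / nx;   % percentage of x matched from its beginning
```

For the data above this gives bestLen = 2 at bestPos = 3, i.e. 50%, which agrees with the expected result stated in the question.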

Answer 1

Jan on 14 Mar 2017

Edited: Jan on 14 Mar 2017

I want to share my guess also, perhaps it matches your needs:

% Method 1:
x  = [1 1 0 0]; 
y  = [1 1 1 0 1 1 0 1];
nx = length(x);
ny = length(y);
yy = [y, nan(1, nx - 1)];  % Append nans to compare the last values
p  = 100 * ones(1, ny);    % Pre-allocation: 100 means "no mismatch found"
for iy = 1:ny              % Loop over substrings of y
  mismatch = find(yy(iy:iy + nx - 1) ~= x, 1);  % Index of the first mismatch
  if ~isempty(mismatch)    % A mismatch is found
    p(iy) = 100 * (mismatch - 1) / nx;
  end
end
[maxP, maxPos] = max(p)    % Highest in % value and index

If this solves your problem, it is time to search for optimizations.

[EDITED2] Simplified:

% Method 2:
nx = length(x);
ny = length(y);
yy = [y, nan(1, nx - 1)];  % Append nans to compare the last values
p  = zeros(1, ny);         % Pre-allocation (here zeros() is fine)
for iy = 1:ny              % Loop over substrings of y
  p(iy) = find([(yy(iy:iy + nx - 1) ~= x), true], 1) - 1;
end
[maxP, maxPos] = max(100 * p / nx)  % Highest value in % and index
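Method 2 relies on the x and y defined in Method 1. As a quick check, here is the same loop run on that data with the per-position match lengths kept for inspection. This is only a sketch; matchLen and allPos are names introduced here:

```matlab
x  = [1 1 0 0];
y  = [1 1 1 0 1 1 0 1];
nx = numel(x);
ny = numel(y);
yy = [y, nan(1, nx - 1)];          % NaN never equals anything, so padding counts as mismatch
matchLen = zeros(1, ny);           % number of leading elements of x matched at each start
for iy = 1:ny
    matchLen(iy) = find([yy(iy:iy + nx - 1) ~= x, true], 1) - 1;
end
[maxP, maxPos] = max(100 * matchLen / nx);    % best percentage and its first position
allPos = find(matchLen == max(matchLen));     % every position reaching the best length
```

Here matchLen is [2 3 1 0 3 1 0 1], so maxP is 75 at maxPos = 2, and allPos shows that the same length is also reached at position 5. Note that max returns only the first of several equal maxima.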

If y is long, it might be cheaper to iterate over the leading substrings of x instead. Note that strfind operates on double vectors directly, as long as they are row vectors:

% Method 3:
p  = zeros(size(y));
nx = length(x);
for ix = 1:nx                       % Loop over substrings of x
  p(strfind(y, x(1:ix))) = ix;      % STRFIND accepts double vectors
end
[maxP, maxPos] = max(100 * p / nx);

Because only the longest match is wanted, it is cheaper to start with the complete x and stop the loop when the first match is found:

% Method 4:
maxP = 0;
for ix = length(x):-1:1           % Start with complete x
  maxPos = strfind(y, x(1:ix));   % Search in y
  if any(maxPos)                  % Success 
    maxP = 100 * ix / length(x);
    break;
  end
end

Now maxPos contains all indices of the occurrences of the longest substring.
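Applied to the x and y from Method 1, this search stops at the first three elements of x. A minimal check, assuming (as noted in this answer) that strfind accepts double row vectors; it uses ~isempty instead of any, which behaves the same here because the indices returned by strfind are never zero:

```matlab
x = [1 1 0 0];
y = [1 1 1 0 1 1 0 1];
maxP   = 0;
maxPos = [];
for ix = numel(x):-1:1
    pos = strfind(y, x(1:ix));      % all start indices of x(1:ix) in y
    if ~isempty(pos)                % longest prefix found, stop searching
        maxP   = 100 * ix / numel(x);
        maxPos = pos;
        break;
    end
end
```

The full x is not found, but x(1:3) = [1 1 0] occurs at positions 2 and 5, so maxP is 75 and maxPos is [2 5].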

Sorry for posting multiple versions. I thought seeing the steps of development might be interesting.

Jan on 15 Mar 2017

Edited: Jan on 16 Mar 2017

@aditya sahu: What should happen if the match is found multiple times?

Let's start with a modified Method 4 packed into a function:

function [maxP, maxPos, maxLen] = FindSubMatch(x, y)
maxLen = 0;
for ix = length(x):-1:1           % Start with complete x
  maxPos = strfind(y, x(1:ix));   % Search in y
  if any(maxPos)                  % Success 
    maxLen = ix;
    break;
  end
end
maxP = 100 * maxLen / length(x);
end

Now this can be called from another function:

function [maxP, maxPos] = FindSubMatchRecursive(x, y)
nx     = length(x);
maxP   = zeros(1, nx);
maxPos = zeros(1, nx);
index  = 0;
while nx > 0
  [maxPi, maxPosi, maxLen] = FindSubMatch(x, y);
  if maxLen == 0                 % Nothing found: stop to avoid an endless loop
    break;
  end
  % Store results:
  index         = index + 1;
  maxP(index)   = maxPi;
  maxPos(index) = maxPosi(1);    % The first match only?
  % Prepare data for next iteration:
  x  = x(maxLen + 1:nx);         % Crop the found part
  nx = length(x);
  y(1:maxPosi(1) - 1) = NaN;     % Mask leading values
end
maxP   = maxP(1:index);          % Crop unused elements [EDITED]
maxPos = maxPos(1:index);
end

I overwrite the leading part of y in each iteration to take "its position where it is matched" into account, but I'm not sure whether this is exactly the wanted procedure. Just try it and modify it to suit your needs.

Jan on 16 Mar 2017

And the next version:

function [P, Pos] = FindSubMatchRecursive(x, y)
nx    = length(x);
nx0   = nx;
P     = zeros(1, nx);
Pos   = zeros(1, nx);
index = 0;
found = 0;
while nx > 0
  [Leni, Posi] = FindSubMatch(x, y);
  if Leni == 0                 % Nothing found: stop to avoid an endless loop
    break;
  end
  % Store results:
  index      = index + 1;
  found      = found + Leni;   % Number of found elements
  P(index)   = 100 * found / nx0;
  Pos(index) = Posi(1);        % The first match only?
  % Prepare data for next iteration:
  x  = x(Leni + 1:nx);         % Crop the found part
  nx = length(x);
end
P   = P(1:index);              % Crop unused elements
Pos = Pos(1:index);
end

function [Len, Pos] = FindSubMatch(x, y)
Len = 0;
for ix = length(x):-1:1         % Start with complete x
   Pos = strfind(y, x(1:ix));   % Search in y
   if any(Pos)                  % Success
      Len = ix;
      break;
   end
end
end

I've cleaned it up a bit, e.g. removed the many "max" before the names. Now:

x = [1 0 1 0 1 1 1];
y = [1 1 0 0 0 0 0 0 0 ];
[P, Pos] = FindSubMatchRecursive(x, y)
>> P = 28.571       57.143       85.714          100
>> Pos =  2     2     1     1

Kind regards, Jan

Image Analyst on 13 Oct 2022

@Emu, I don't have your data but if it's a column vector (vertical), try

yy = [whoSpeakCoder; nan(nx - 1, 1)];  % Append nans to compare the last values
% or else you could make yy a row vector instead if you want:
yy = [reshape(whoSpeakCoder, 1, []), nan(1, nx - 1)];  % Append nans to compare the last values

Emu on 29 Oct 2022

thank you! yes worked :)

Answer 2

John BG on 14 Mar 2017

Edited: John BG on 16 Mar 2017

Hi Adithya

This is John BG (jgb21012@sky.com)

ok, got it

1.

The percentages are implicit in the mask applied to x.

Let

x2=[1 1 0 0];             % pattern to find
y2=[1 1 1 0 1 1 0 1]      % signal

Then:

if the sought string is [1 1 0 0], the percentage is 100%

if the sought string is [1 1 0], the percentage is 75%

if the sought string is [1 0 0], the percentage is 75%

if the sought string is [1 1], the percentage is 50%

if the sought string is [1 0], the percentage is 50%

if the sought string is [0 0], the percentage is 50%

It is agreed that no single bits are sought, right?

So my suggestion is to address the percentage first.

the basic processing is:

clear all
x2=[1 1 0 0];             % pattern
y2=[1 1 1 0 1 1 0 1]      % signal     
maskx=[1:3]                % assign percentage here
                           % but only continuous bits, ok?
x=num2str(x2(maskx));
y=num2str(y2);
x(x==' ')=[];y(y==' ')=[];
n=strfind(y,x)

2.

sweeping all percentages

x=[1 1 0 0]
y=randi([0 1],1,20)
N=numel(x);
maskx=1:N;
v={};                                  % cell array of index ranges into x
for q=1:1:N-1
     for s=1:q
            v{end+1}=maskx(s:N-q+s);   % contiguous window of length N-q+1
     end
end 
stry=num2str(y);
stry(stry==' ')=[]                     % signal as a string, displayed once
for k=1:1:numel(v)
    Lx=v{k};
    pc=numel(Lx)/numel(x)*100;
    strx=num2str(x(Lx));
    strx(strx==' ')=[];
    n=strfind(stry,strx);
    strdisp=['sample ' strx ' with percentage ' num2str(pc)  '%% has '  num2str(numel(n))  ' match(es). location in y: ' num2str(n) ];
    sprintf([strdisp '\n'])
end

3.

example

stry =
01000101100010101001
ans =
sample 1100 with percentage 100% has 1 match(es). location in y: 3
ans =
sample 110 with percentage 75% has 1 match(es). location in y: 3
ans =
sample 100 with percentage 75% has 2 match(es). location in y: 4  14
ans =
sample 11 with percentage 50% has 2 match(es). location in y: 3  19
ans =
sample 10 with percentage 50% has 6 match(es). location in y: 4   8  10  12  14  17
ans =
sample 00 with percentage 50% has 4 match(es). location in y: 1   5   6  15

If you find this answer useful, would you please be so kind as to mark it as the Accepted Answer?

To any other reader: if you find this answer useful, please click on the thumbs-up vote link.

Thanks in advance

regards

John BG

jgb2012@sky.com

Jan on 16 Mar 2017

The conversion to a string can be omitted:

x2 = [1 1 0 0];
y2 = [1 1 1 0 1 1 0 1];
strfind(y2, x2(1:3))
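To see that both routes agree, the numeric search can be compared with a search on character versions of the same data. This is a sketch only: it assumes strfind accepts double row vectors (as this thread relies on), and the names posNumeric, posChar, xs and ys are introduced here. Adding '0' and calling char is used as a compact stand-in for the num2str conversion with the spaces removed:

```matlab
x2 = [1 1 0 0];
y2 = [1 1 1 0 1 1 0 1];
% Numeric search, no conversion needed
posNumeric = strfind(y2, x2(1:3));
% Same search on character vectors: 0 -> '0', 1 -> '1'
xs = char(x2(1:3) + '0');          % '110'
ys = char(y2 + '0');               % '11101101'
posChar = strfind(ys, xs);
sameResult = isequal(posNumeric, posChar);
```

Both searches return the positions 2 and 5. If the numeric form is not available in your environment, the character form is a fallback for vectors of single digits.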

John BG on 16 Mar 2017

Edited: John BG on 16 Mar 2017

Hi again, this is John BG (jgb2012@sky.com)

Applying my updated script to

x = [1 0 1 0 1 1 1];
y = [1 1 0 0 0 0 0 0 0 ];

gives

1  1  0  0  0  0  0  0  0
ans =
sample 1010111 with percentage 100% has 0 match(es). location in y: 
ans =
sample 101011 with percentage 85.7143% has 0 match(es). location in y: 
ans =
sample 010111 with percentage 85.7143% has 0 match(es). location in y: 
ans =
sample 10101 with percentage 71.4286% has 0 match(es). location in y: 
ans =
sample 01011 with percentage 71.4286% has 0 match(es). location in y: 
ans =
sample 10111 with percentage 71.4286% has 0 match(es). location in y: 
ans =
sample 1010 with percentage 57.1429% has 0 match(es). location in y: 
ans =
sample 0101 with percentage 57.1429% has 0 match(es). location in y: 
ans =
sample 1011 with percentage 57.1429% has 0 match(es). location in y: 
ans =
sample 0111 with percentage 57.1429% has 0 match(es). location in y: 
ans =
sample 101 with percentage 42.8571% has 0 match(es). location in y: 
ans =
sample 010 with percentage 42.8571% has 0 match(es). location in y: 
ans =
sample 101 with percentage 42.8571% has 0 match(es). location in y: 
ans =
sample 011 with percentage 42.8571% has 0 match(es). location in y: 
ans =
sample 111 with percentage 42.8571% has 0 match(es). location in y: 
ans =
sample 10 with percentage 28.5714% has 1 match(es). location in y: 2
ans =
sample 01 with percentage 28.5714% has 0 match(es). location in y: 
ans =
sample 10 with percentage 28.5714% has 1 match(es). location in y: 2
ans =
sample 01 with percentage 28.5714% has 0 match(es). location in y: 
ans =
sample 11 with percentage 28.5714% has 1 match(es). location in y: 1
ans =
sample 11 with percentage 28.5714% has 1 match(es). location in y: 1

Result:

Only 2 subsequences, 10 and 11, are found, each with a single match (10 at location 2 and 11 at location 1), so there is only one match percentage:

28.57%

John BG

jgb2012@sky.com

Answer 3

KSSV on 10 Mar 2017

   x=[1 0 1 0] ;
   y=[1 1 1 0]  ;
   idx = x==y ;
   % percentage 
   p = nnz(idx)/numel(x)*100

aditya sahu on 10 Mar 2017

But suppose x = [1 0 1 1] and y = [0 1 1 0 1 1]. Here the result should be 100%, because all 4 elements of x match y serially, starting from the 3rd element of y.

KSSV on 10 Mar 2017

Try that with your code... Your code gives 25 as the answer if the first element matches, or exits the loop.

Answer 4

John BG on 12 Mar 2017

Hi Aditya

This is John BG ( jgb2012@sky.com )

The following solves this question and your other question,

https://uk.mathworks.com/matlabcentral/answers/329142-how-to-find-longest-common-substring-in-case-of-binary-arrays

which has been closed (not by me) as a duplicate.

clear all;clc;close all;
x=[1 1 1];
y=randi([0 1],1,20)
% y= [ 1 1 1 0 1 1 0 1]
maskx=[1:3]
r=conv(x(maskx),y)
n=find(r==max(r))
if max(r)==sum(x(maskx))                    % only sync if peak
     for k=1:1:numel(n)                                            
          sync_position(k)=n(k)                 % sync_position in correlation
          sync_start(k)=sync_position(k)-numel(maskx)+1;     % sync_start in y
          percentage_match(k)=numel(x(maskx))/numel(x)*100
          sync_start=nonzeros(sort(sync_start))
     end
     else
          disp('no match')
end
y

Please let me know if this answer satisfies your question, script_10.m sent by email

regards

John BG

jgb2012@sky.com

aditya sahu on 14 Mar 2017

Thank you very much, Dear JOHN.

It is a very nice experience to talk with you, and even better to share things about MATLAB with you.

I really appreciate the time and effort you have given to solve my problem. I feel lucky.

I have tried the attached code (test_10.m). It is an improvement over the previous one, but I am sorry to inform you that it also works only for some values of the x and y arrays.

For example, if I take x=[1 1 0 0]; y=[1 1 1 0 1 1 0 1]; and maskx=[1:4]; the code gives percentage_match = 100%, whereas it should be only 75%, because the longest substring of x that matches the y array is 1 1 0, i.e. 3 digits match out of 4 in total.

Once again, from the bottom of my heart, I am really grateful for your help.
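The false 100% can be reproduced in a few lines. The sketch below is an illustration, not the script under discussion; convSaysMatch and exactMatch are names made up here. It applies the same peak criterion (max of the convolution equals sum of the pattern) and compares it with an exact window-by-window test:

```matlab
x = [1 1 0 0];
y = [1 1 1 0 1 1 0 1];
r = conv(x, y);                        % sums products, so zeros in x add nothing
convSaysMatch = (max(r) == sum(x));    % peak criterion: true here
% Exact test: compare x against every window of y with the same length
nx = numel(x);
exactMatch = false;
for k = 1:numel(y) - nx + 1
    if isequal(y(k:k + nx - 1), x)
        exactMatch = true;
    end
end
```

Here r is [1 2 2 1 1 2 1 1 1 0 0], so the peak of 2 equals sum(x) and the criterion reports a match, although [1 1 0 0] never occurs in y. The convolution only counts ones in y that line up with the ones of the pattern; it cannot tell whether the zeros of the pattern line up with zeros in y.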

John BG on 14 Mar 2017

Aditya

you are right, with a for loop some chains go missing.

It's even easier than using a for loop:

clear all
x2=[1 1 0 0];             % pattern
y2=[1 1 1 0 1 1 0 1]      % signal     
% y=randi([0 1],1,20)
maskx=[1:3]                % assign percentage here
                           % but only continuous bits, ok?
x=num2str(x2(maskx));
y=num2str(y2);
x(x==' ')=[];y(y==' ')=[];
n=strfind(y,x)             % search for the pattern x in the signal y

Please confirm

Regards

John BG

jgb2012@sky.com

Answer 5

Image Analyst on 19 Mar 2017

I'm not sure what the question is because, after reading most of the replies, it seems that aditya's been changing it (specifically the sample data and size of the sample data), but one measure of similarity is the Sørensen–Dice coefficient.

https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

When it's applied to binary images, I believe it requires the images to be of the same size.
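For two binary vectors of the same length, the coefficient is twice the number of positions where both are 1, divided by the total number of ones in both vectors. A minimal sketch computed straight from that definition, using the x and y from the original question; the names a, b, commonOnes, diceCoeff and dicePct are introduced here:

```matlab
a = logical([1 0 1 0]);    % x from the question
b = logical([1 1 1 0]);    % y from the question
% Dice = 2 * |A and B| / (|A| + |B|), counting the nonzero elements
commonOnes = nnz(a & b);
diceCoeff  = 2 * commonOnes / (nnz(a) + nnz(b));
dicePct    = 100 * diceCoeff;
```

For this data the result is 0.8, i.e. 80%. Note that this measures the overlap of the ones at identical positions, which is a different notion of similarity from the serial prefix matching asked for above, and that it is undefined (0/0) when both vectors contain no ones.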
