MATLAB Answers

Ignore Deletions with Edit Distances (String Editing)

Marcel Dorer on 22 Apr 2016
Edited: Arnab Sen on 27 Apr 2016
Hi, I'm trying to compare two strings with a function based on Miguel Castro's EditDist.m function. The function works pretty well, but in my case I need to ignore some of the deletions, namely all those at the beginning and the end of the string.
For example, when I compare the two strings 'XXXXMatlabXXXX' and 'YYMatlabYY', the first two 'X' and the last two 'X', which would be deletions, shouldn't count towards the edit distance (which should be 4 in this case). Basically, one of the two strings has a random number of random surrounding characters that should be ignored. Deletions after the first insertion, replacement, or match should be counted normally, at least until only a tail of deletions is left.
Help would be really appreciated!
Here is the relevant part of the function I'm using:
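for i = 1:n1
    D(i+1,1) = D(i,1) + DelCost;    % delete s1(1:i) to reach the empty string
end
for j = 1:n2
    D(1,j+1) = D(1,j) + InsCost;    % insert s2(1:j) starting from the empty string
end
for i = 1:n1
    for j = 1:n2
        if s1(i) == s2(j)
            Repl = 0;
        else
            Repl = ReplCost;
        end
        % Moving right in D inserts s2(j); moving down deletes s1(i).
        D(i+1,j+1) = min([D(i,j)+Repl, D(i+1,j)+InsCost, D(i,j+1)+DelCost]);
    end
end
d = D(n1+1,n2+1);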

Answers (1)

Arnab Sen on 26 Apr 2016
Edited: Arnab Sen on 27 Apr 2016
Hello Marcel,
I am assuming that, of the two strings s1 and s2, s1 is known to be the one wrapped in redundant characters.
First, let's look at what D(i,j) means in the script. Because the table has an extra first row and column for the empty prefix, D(i+1,j+1) is the cost of converting s1(1:i) into s2(1:j) (and, when DelCost equals InsCost, vice versa). Now assume that every character of s1 after index k is redundant. Then
D(n1+1,n2+1) = D(k+1,n2+1) + (n1-k)*DelCost.
So the task reduces to finding k. The following code snippet should do that:
i = n1;
while i >= 1 && D(i+1,n2+1) - D(i,n2+1) == DelCost
    i = i - 1;
end
k = i;
So the last (n1-k) characters of s1 are redundant.
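As a quick check of this step, here is a minimal sketch on a hand-written table. The strings 'abXX' and 'ab' and the unit costs are assumptions chosen for illustration; they are not from the question. Row 1 and column 1 of D stand for the empty prefix, so the table indices are shifted by one relative to the string indices.

```matlab
% Cost table for s1 = 'abXX' and s2 = 'ab' with all costs equal to 1,
% written out by hand (rows: prefixes of s1, columns: prefixes of s2).
D = [0 1 2;
     1 0 1;
     2 1 0;
     3 2 1;
     4 3 2];
DelCost = 1;
n1 = size(D,1) - 1;
n2 = size(D,2) - 1;

% Walk up the last column while each step is explained by one deletion.
% The k >= 1 test keeps the index inside the table.
k = n1;
while k >= 1 && D(k+1,n2+1) - D(k,n2+1) == DelCost
    k = k - 1;
end
tailLen = n1 - k;   % trailing characters of s1 treated as redundant

% The relation from the answer, in shifted table indices.
assert(D(n1+1,n2+1) == D(k+1,n2+1) + tailLen*DelCost);
```

Here the loop stops at k = 2, so the two trailing 'X' characters are the redundant tail.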
Next we need to find the redundant characters at the front of s1. For this we could build a second table (say X), where
X(i,j) = the cost of converting s1(i:n1) into s2(j:n2), and apply a similar approach.
A simpler approach is to reverse both strings (say s1r and s2r), call the edit distance again, and repeat the same workflow. The redundant characters at the end of s1r are the redundant characters at the front of the original s1.
Finally, subtract DelCost*(total number of redundant characters in s1) from the original output.
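The whole workflow described above could be sketched roughly as follows. The function names trimmedEditDist and distAndTail are hypothetical, the strings are assumed to be char vectors, and the clamp at zero is an added safeguard that is not part of the answer.

```matlab
function d = trimmedEditDist(s1, s2, DelCost, InsCost, ReplCost)
% Sketch: full edit distance minus the cost of the deletion runs at both
% ends of s1. All function names here are hypothetical.
[dFull, tailDel] = distAndTail(s1, s2, DelCost, InsCost, ReplCost);
% With both strings reversed, the head run of s1 becomes a tail run.
[~, headDel] = distAndTail(s1(end:-1:1), s2(end:-1:1), DelCost, InsCost, ReplCost);
% Added safeguard (assumption, not in the answer): the two runs come from
% separate alignments and can overlap, so the result is clamped at zero.
d = max(dFull - DelCost*(headDel + tailDel), 0);
end

function [d, tailDel] = distAndTail(s1, s2, DelCost, InsCost, ReplCost)
% Standard edit-distance table plus the length of the trailing deletion run.
n1 = length(s1);
n2 = length(s2);
D = zeros(n1+1, n2+1);
D(:,1) = (0:n1)' * DelCost;   % delete every character of s1(1:i)
D(1,:) = (0:n2) * InsCost;    % insert every character of s2(1:j)
for i = 1:n1
    for j = 1:n2
        Repl = ReplCost * (s1(i) ~= s2(j));
        D(i+1,j+1) = min([D(i,j)+Repl, D(i+1,j)+InsCost, D(i,j+1)+DelCost]);
    end
end
d = D(n1+1, n2+1);
% Walk up the last column while each step is explained by one deletion.
k = n1;
while k >= 1 && D(k+1,n2+1) - D(k,n2+1) == DelCost
    k = k - 1;
end
tailDel = n1 - k;
end
```

With unit costs, trimmedEditDist('XXXXMatlabXXXX', 'YYMatlabYY', 1, 1, 1) returns 4: the full distance is 8, and two deletions are discounted at each end. Because the head and tail runs are found on two separate passes, they are not guaranteed to belong to the same alignment, so the result is best treated as an approximation when s1 is short compared with its padding.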
  2 Comments
Arnab Sen on 26 Apr 2016
Hi,
You are correct. MATLAB does not recognize i--, which is common in languages like C, C++, and Java. Please read the expression as
i = i - 1;
I have edited the original answer as well accordingly. Thanks for pointing this out.
Please accept the answer if this helps.
