File Exchange

image thumbnail

String subsequence tools

version 1.1.0.0 (4.97 KB) by John D'Errico
Identify common substrings of a pair of strings

3 Downloads

Updated 01 Jun 2010

View License

Zhiping XU's submission on the FEX intrigued me. I knew it had to be doable more efficiently. Long strings will be common, so it makes much sense to have an efficient code. You might find these tools interesting for inspecting strings of DNA bases, or for checking a student's homework submission for plagiarized content. Surely there are other uses too.

The commonsubstring.m function does this search fairly efficiently (though I am sure it too can be enhanced.)

Generate a pair of long random letter sequences, then determine the longest common substring between them. In the following example, the original strings each had 10^5 random elements in them.

bases = 'acgt';
str1 = bases(ceil(rand(1,100000)*4));
str2 = bases(ceil(rand(1,100000)*4));

tic,[substr,ind1,ind2] = commonsubstring(str1,str2);toc
Elapsed time is 16.650532 seconds.

There were two substrings found of the maximum length (16) characters. These substrings started at locations 22189 and 74425 in str1, and at locations 64948 and 32833 in str2.

substr =
gctttagggcgtacgc
cttcggataccttgtt

ind1 =
[22189]
[74425]

ind2 =
[64948]
[32833]

For a second example, commonsubstring can find all common substrings of a given fixed length.

str1 = char('a' + round(rand(1,100)*1.5))
str1 =
bbbabbbbbabbbbbbabbbbbabbababbbabbabbbbbbabbbbbbbbbbbbbabbaaabbbbbaabbbbbbbbbbabbbbbaaabbabbaabbbbbb

str2 = 'aaabbabbb';

substr,ind1,ind2] = commonsubstring(str1,str2,3)
substr =
aaa
aab
abb
bab
bba
bbb

ind1 =
[1x2 double]
[1x4 double]
[1x15 double]
[1x11 double]
[1x15 double]
[1x19 double]

ind2 =
[ 1]
[ 2]
[1x2 double]
[ 5]
[ 4]
[ 7]

In addition, I've also included the function substrings.m. It returns a list of all substrings of a given string. In the example below, it returns all distinct substrings of length 3 from str1 above.

substrings(str1,3,1)
ans =
aaa
aab
aba
abb
baa
bab
bba
bbb

Cite As

John D'Errico (2021). String subsequence tools (https://www.mathworks.com/matlabcentral/fileexchange/27460-string-subsequence-tools), MATLAB Central File Exchange. Retrieved .

Comments and Ratings (10)

Stephen Cobeldick

@John D'Errico: these functions always worked well for me (many thanks!), but I just noticed that |commonsubstring| does not handle regular expression special characters. As far as I can tell, this can be resolved by replacing every instance of
regexp(..., cellsubstr)
with
regexp(..., regexptranslate('escape',cellsubstr))

See example: https://www.mathworks.com/matlabcentral/answers/511237-how-to-get-common-sub-string-in-cell-array-string

John D'Errico

Since this tool works on strings...

x=[1 0 1 0];
y= [0 1 1 0 1 1];

[substr,loc1,loc2] = commonsubstring(char(x+'0'),char(y+'0'))
substr =
101
loc1 =
cell
[1]
loc2 =
cell
[3]

aditya sahu

Hello, dear all hope you all are doing well.
I need your assistance for solving a problem. I want to find longest common substring in case of two arrays with binary numbers.
Let, x=[1 0 1 0], y= [0 1 1 0 1 1]
then 1 0 1 of x are same in 1 0 1 of y in 3rd,4th,5th, positoin. So i need the perecentage of x are matching so here it is,75%,and also the index of y where it started matching ,in this case the starting index is 3.

Maher HAMDI

John D'Errico

The new version has been installed.

John D'Errico

It seems to work properly from this test:

str1='a a';
str2='abb a abb ag';

commonsubstring(str1,str2)
ans =
a a

I would recommend downloading the new version. If necessary, send me an e-mail, and I can send it to you if you need it immediately. I don't see that they have installed the new version on the page as yet.

zyxuwv

Dear John D'Errico
The old version does not work well using str1="a a" and str2="abb a abb ag". The space seems to be the problem. Is that bug you have corrected in the new version?
zyxuwv

zyxuwv

Dear John D'Errico
The loop you suggested applies very well to my propose.
Thank you.
zyxuvw

John D'Errico

(By the way, I found a bug in the code, and just submitted a new version. That new version will be accessible in a couple of hours.)

I'm not sure what is your goal here. If your goal is to find the longest substring that str1 has in common with each of the cells in turn, then just use cellfun.

For example, define str2 as the lines of this W.H. Auden poem.

str2 = lower({'Perfection, of a kind, was what he was after,'
'And the poetry he invented was easy to understand;'
'He knew human folly like the back of his hand,'
'And was greatly interested in armies and fleets;'
'When he laughed, respectable senators burst with laughter,'
'And when he cried the little children died in the streets.'});

I've used lower here, since my tests want to be insensitive to case. Now, I'll choose an unrelated test string as a comparator. Since my poetic instincts are far less capable than those of W.H. Auden, I'll try this imitation of a chimp at a keyboard:

str1 = 'xfgsvswwqewfgfjylittkjbhkbkjdfbdklklvmlandkl';

A loop will help us, or cellfun if you want to do it in an automatic fashion. In practice, I would use cellfun to generate an implicit loop. Here, for the sake of simplicity and readability, I'll expand the loop:

>> commonsubstring(str1,str2{1})
ans =
nd

>> commonsubstring(str1,str2{2})
ans =
and

>> commonsubstring(str1,str2{3})
ans =
and

>> commonsubstring(str1,str2{4})
ans =
and

>> commonsubstring(str1,str2{5})
ans =
la
it

>> commonsubstring(str1,str2{6})
ans =
litt

We can see that the longest common substring over any of the lines was 'litt'. We could find that substring by concatenating the strings together into ONE string, with some separator character between the pieces. Choose a separator that never appears in str1. Again, I could do this in an efficient programmatic way, but for now I'll do it the lazy way.

>> commonsubstring(str1,[str2{1},'$',str2{2},'$', ...
str2{3},'$',str2{4},'$',str2{5},'$',str2{6}])

ans =
litt

If your goal is to find the longest common string that appears in every one of the strings of str2 at SOME various location, then this tool is not written to serve that purpose. You could force it to serve that purpose with some effort however. For example, first, see that str1 has common substrings of length 2 with str2{1} and str2{5}, but that there are no intersections between them. Therefore, the longest common substring overall that appears in every cell of str2 must be of length no longer than 1. Since common substring can be used to find common substrings of a given fixed length, we could now do this, and then find the intersection of those results, if any exists.

zyxuwv

Almost what i needed! I am wondering if there is a easy way to allow str2 to be a string cell ?
Thanks.

MATLAB Release Compatibility
Created with R2010a
Compatible with any release
Platform Compatibility
Windows macOS Linux
Tags Add Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!