version 1.1.0.0 (4.97 KB) by
John D'Errico

Identify common substrings of a pair of strings

Zhiping XU's submission on the FEX intrigued me. I knew it had to be doable more efficiently. Long strings will be common, so it makes much sense to have an efficient code. You might find these tools interesting for inspecting strings of DNA bases, or for checking a student's homework submission for plagiarized content. Surely there are other uses too.

The commonsubstring.m function does this search fairly efficiently (though I am sure it too can be enhanced.)

Generate a pair of long random letter sequences, then determine the longest common substring between them. In the following example, the original strings each had 10^5 random elements in them.

bases = 'acgt';

str1 = bases(ceil(rand(1,100000)*4));

str2 = bases(ceil(rand(1,100000)*4));

tic,[substr,ind1,ind2] = commonsubstring(str1,str2);toc

Elapsed time is 16.650532 seconds.

There were two substrings found of the maximum length (16) characters. These substrings started at locations 22189 and 74425 in str1, and at locations 64948 and 32833 in str2.

substr =

gctttagggcgtacgc

cttcggataccttgtt

ind1 =

[22189]

[74425]

ind2 =

[64948]

[32833]

For a second example, commonsubstring can find all common substrings of a given fixed length.

str1 = char('a' + round(rand(1,100)*1.5))

str1 =

bbbabbbbbabbbbbbabbbbbabbababbbabbabbbbbbabbbbbbbbbbbbbabbaaabbbbbaabbbbbbbbbbabbbbbaaabbabbaabbbbbb

str2 = 'aaabbabbb';

substr,ind1,ind2] = commonsubstring(str1,str2,3)

substr =

aaa

aab

abb

bab

bba

bbb

ind1 =

[1x2 double]

[1x4 double]

[1x15 double]

[1x11 double]

[1x15 double]

[1x19 double]

ind2 =

[ 1]

[ 2]

[1x2 double]

[ 5]

[ 4]

[ 7]

In addition, I've also included the function substrings.m. It returns a list of all substrings of a given string. In the example below, it returns all distinct substrings of length 3 from str1 above.

substrings(str1,3,1)

ans =

aaa

aab

aba

abb

baa

bab

bba

bbb

John D'Errico (2021). String subsequence tools (https://www.mathworks.com/matlabcentral/fileexchange/27460-string-subsequence-tools), MATLAB Central File Exchange. Retrieved .

Created with
R2010a

Compatible with any release

**Inspired by:**
LCSSubStr-a function that return two string's largest common part

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!Create scripts with code, output, and formatted text in a single executable document.

Stephen Cobeldick@John D'Errico: these functions always worked well for me (many thanks!), but I just noticed that |commonsubstring| does not handle regular expression special characters. As far as I can tell, this can be resolved by replacing every instance of

regexp(..., cellsubstr)

with

regexp(..., regexptranslate('escape',cellsubstr))

See example: https://www.mathworks.com/matlabcentral/answers/511237-how-to-get-common-sub-string-in-cell-array-string

John D'ErricoSince this tool works on strings...

x=[1 0 1 0];

y= [0 1 1 0 1 1];

[substr,loc1,loc2] = commonsubstring(char(x+'0'),char(y+'0'))

substr =

101

loc1 =

cell

[1]

loc2 =

cell

[3]

aditya sahuHello, dear all hope you all are doing well.

I need your assistance for solving a problem. I want to find longest common substring in case of two arrays with binary numbers.

Let, x=[1 0 1 0], y= [0 1 1 0 1 1]

then 1 0 1 of x are same in 1 0 1 of y in 3rd,4th,5th, positoin. So i need the perecentage of x are matching so here it is,75%,and also the index of y where it started matching ,in this case the starting index is 3.

Maher HAMDIJohn D'ErricoThe new version has been installed.

John D'ErricoIt seems to work properly from this test:

str1='a a';

str2='abb a abb ag';

commonsubstring(str1,str2)

ans =

a a

I would recommend downloading the new version. If necessary, send me an e-mail, and I can send it to you if you need it immediately. I don't see that they have installed the new version on the page as yet.

zyxuwvDear John D'Errico

The old version does not work well using str1="a a" and str2="abb a abb ag". The space seems to be the problem. Is that bug you have corrected in the new version?

zyxuwv

zyxuwvDear John D'Errico

The loop you suggested applies very well to my propose.

Thank you.

zyxuvw

John D'Errico(By the way, I found a bug in the code, and just submitted a new version. That new version will be accessible in a couple of hours.)

I'm not sure what is your goal here. If your goal is to find the longest substring that str1 has in common with each of the cells in turn, then just use cellfun.

For example, define str2 as the lines of this W.H. Auden poem.

str2 = lower({'Perfection, of a kind, was what he was after,'

'And the poetry he invented was easy to understand;'

'He knew human folly like the back of his hand,'

'And was greatly interested in armies and fleets;'

'When he laughed, respectable senators burst with laughter,'

'And when he cried the little children died in the streets.'});

I've used lower here, since my tests want to be insensitive to case. Now, I'll choose an unrelated test string as a comparator. Since my poetic instincts are far less capable than those of W.H. Auden, I'll try this imitation of a chimp at a keyboard:

str1 = 'xfgsvswwqewfgfjylittkjbhkbkjdfbdklklvmlandkl';

A loop will help us, or cellfun if you want to do it in an automatic fashion. In practice, I would use cellfun to generate an implicit loop. Here, for the sake of simplicity and readability, I'll expand the loop:

>> commonsubstring(str1,str2{1})

ans =

nd

>> commonsubstring(str1,str2{2})

ans =

and

>> commonsubstring(str1,str2{3})

ans =

and

>> commonsubstring(str1,str2{4})

ans =

and

>> commonsubstring(str1,str2{5})

ans =

la

it

>> commonsubstring(str1,str2{6})

ans =

litt

We can see that the longest common substring over any of the lines was 'litt'. We could find that substring by concatenating the strings together into ONE string, with some separator character between the pieces. Choose a separator that never appears in str1. Again, I could do this in an efficient programmatic way, but for now I'll do it the lazy way.

>> commonsubstring(str1,[str2{1},'$',str2{2},'$', ...

str2{3},'$',str2{4},'$',str2{5},'$',str2{6}])

ans =

litt

If your goal is to find the longest common string that appears in every one of the strings of str2 at SOME various location, then this tool is not written to serve that purpose. You could force it to serve that purpose with some effort however. For example, first, see that str1 has common substrings of length 2 with str2{1} and str2{5}, but that there are no intersections between them. Therefore, the longest common substring overall that appears in every cell of str2 must be of length no longer than 1. Since common substring can be used to find common substrings of a given fixed length, we could now do this, and then find the intersection of those results, if any exists.

zyxuwvAlmost what i needed! I am wondering if there is a easy way to allow str2 to be a string cell ?

Thanks.