How to search for very similar strings?

Asked by pietro
on 25 Jan 2019
Latest activity Commented on by pietro
on 26 Jan 2019
Hi all,
I am doing a bibliometric analysis and, in particular, I have to search for article titles in the references of the citing papers. Here is my code:
for iMS=1:length(MS)
The code works pretty well; however, the data that I can export from Scopus is not perfect. Article titles are not written consistently, so an exact match does not always work. Here are two examples:
Case 1:
Real article name: 'Biomethane production from different crop systems of cereals in Northern Italy'
Article name in the reference: 'Biomethane production from different crop systems of cereals in Nothern Italy'
Case 2:
Real article name: 'Methodology for the realisation of accelerated structural tests on tractors'
Article name in the reference: 'Methodology for the realization of accelerated structural tests on tractors'
As you can see, in each case the two titles differ by a single character. Since I have more than 20000 papers and fixing them by hand would be time-consuming, is there any way to programmatically search for very similar strings? Note that the strings might also differ in length.
1 Answer

Answer by John D'Errico
on 25 Jan 2019
Edited by John D'Errico
on 25 Jan 2019
 Accepted Answer


As the question indicates the wish for case-insensitive matching, here is the Wagner–Fischer algorithm with strcmpi:
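A minimal sketch of such a function, assuming plain char-vector inputs (the name editDistCI is hypothetical and this is not necessarily the code from the original answer):

```matlab
function d = editDistCI(s, t)
% editDistCI  Case-insensitive Levenshtein distance (Wagner-Fischer).
%   Hypothetical helper name; s and t are assumed to be char row vectors.
    m = numel(s);
    n = numel(t);
    % D(i+1, j+1) holds the distance between s(1:i) and t(1:j).
    D = zeros(m + 1, n + 1);
    D(:, 1) = (0:m)';   % delete all i characters of s
    D(1, :) = 0:n;      % insert all j characters of t
    for i = 1:m
        for j = 1:n
            % strcmpi makes the character comparison case-insensitive.
            cost = ~strcmpi(s(i), t(j));
            D(i + 1, j + 1) = min([D(i, j + 1) + 1, ...   % deletion
                                   D(i + 1, j) + 1, ...   % insertion
                                   D(i, j) + cost]);      % substitution
        end
    end
    d = D(m + 1, n + 1);
end
```

Saved as editDistCI.m, a call such as editDistCI('Northern Italy', 'nothern italy') returns 1, and so does editDistCI('realisation', 'realization'), so a small threshold on d would accept both example titles from the question. The table has (m+1)*(n+1) entries, so comparing a 100-character title against a full 10,000-character reference field this way is expensive.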
Hi all,
thanks for your valuable feedback. MSCit is a struct array of 21,000 records and MS is a struct array of 2,000 records. Each 'References' field of MSCit contains about 10,000 characters, while the 'Title' field of MS contains about 100 characters. So I decided to use a fuzzy search approach, which works, but I have to use a nested for loop (as in the code below), so the computation time is very long.
for iCit = 1:length(MSCit)
    % d is the distance of the best fuzzy match of the title inside the references
    [d, A] = fzsearch(lower(MSCit(iCit).References), lower(MS(iMS).Title));
    if d < 3
        ProvaCit = [ProvaCit, iCit];
    end
end
I thought of doing the following:
[d A] = fzsearch({lower(MSCit(iCit).References)},lower(MS(iMS).Title));
but there was no real change. How could I speed up the code? I thought of using a more stable parameter to limit the number of calls to fzsearch. So I tried to search for articles with similar authorship in the references with contains, and then use fzsearch only on those articles. However, the author names are not consistent either. For example, I have found both 'González' and 'Gonzalez'. Is there an easy and fast way to deal with this type of situation?

