Finding Likely Duplicate Strings

Jason on 12 Mar 2014
I have an existing database of contact information for various contacts at specified offices across the country (a "lead" list if you will). This database contains information such as first name, last name, etc. In an effort to refresh the database with current information, I have done some manual research and data logging and have compiled a new, separate data set of current contact information for contacts at the same specified offices.
When updating the existing database with the new data, I've noticed that I'm creating "duplicate" contact records quite a bit. The updating algorithm simply looks for an exact match between the contact's name in the new data set and the contact's name in the existing database. The algorithm concludes that "Gregory Smith" is not in the database because there is no exact match, but on closer inspection "Gregory" IS already in the database as "Greg Smith".
Instead of manually looking through the database as I update the data and "de-duping" things myself, I was wondering whether there is a MATLAB function that can compare two strings and return a score for how likely it is that they're the same. For example, the computer would flag "Gregory Smith" when the database already has "Greg Smith" in it. Having the computer do this type of preprocessing would save a lot of time. Any help would be greatly appreciated. Thanks.
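To make the kind of score being asked for concrete, here is a minimal sketch of one common approach: the Levenshtein edit distance, normalized by the length of the longer name. This is an illustration, not something taken from the thread. The function name nameSimilarity is hypothetical (it is not a built-in), and lower-casing and trimming the names before comparing is an assumption about how they should be normalized.

```matlab
function s = nameSimilarity(a, b)
% nameSimilarity  Similarity score in [0, 1] between two character vectors.
%   Hypothetical helper, not a built-in MATLAB function.
%   1 means identical after normalization; values near 0 mean very different.

% Assumption: case and surrounding whitespace should not count as differences.
a = lower(strtrim(a));
b = lower(strtrim(b));
m = numel(a);
n = numel(b);

if max(m, n) == 0
    s = 1;   % two empty names are treated as identical
    return
end

% D(i+1, j+1) holds the edit distance between a(1:i) and b(1:j).
D = zeros(m + 1, n + 1);
D(:, 1) = (0:m)';   % turning a(1:i) into '' takes i deletions
D(1, :) = 0:n;      % turning '' into b(1:j) takes j insertions

for i = 1:m
    for j = 1:n
        cost = double(a(i) ~= b(j));   % 0 if the characters match, else 1
        D(i + 1, j + 1) = min([D(i, j + 1) + 1, ...   deletion
                               D(i + 1, j) + 1, ...   insertion
                               D(i, j) + cost]);    % substitution or match
    end
end

% Normalize by the longer length so the score lies in [0, 1].
s = 1 - D(m + 1, n + 1) / max(m, n);
end
```

With this sketch, nameSimilarity('Gregory Smith', 'Greg Smith') gives 1 - 3/13, about 0.77, because three deletions turn one name into the other. A record could then be flagged for manual review whenever its best score against the existing names exceeds some cutoff; the cutoff itself would have to be tuned on the real data, and nicknames that share few letters with the full name (for example "Bill" and "William") would still slip through.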

Answers (1)

Jan on 12 Mar 2014
