Such a thing as relationship sorting?

4 views (last 30 days)
I have a random array of 10000 7-letter strings. I have a simple algorithm that assesses how similar each element is to all others. For example, SESTINA and SESTINE would have a high similarity, but ZYZZYVA and NEROLIS would have low similarity. The result is a 10000x10000 symetric matrix.
I want to group/sort the strings based on similarity, where proximity of similar strings is important and order is irrelevant. The higher the similarity of two strings, the more important it is for them to be together. I've imagined some sort of least squares technique?
Anyone have a clue for me?
EDIT: The end goal is a 1d array, and not looking for a graphic solution.
  8 Comments
Matt J
Matt J on 18 Nov 2012
Edited: Matt J on 18 Nov 2012
Matt J, your sample array gives very high degree of similarity, making it difficult to calculate manually, but since ive put anagrams as a higher similarity than substitutions, it would hazard: AA AB BA CA CC AC BC CB BB
I would have expected AA BA BB AB AC CA CB BC CC. If I assume anagrams have zero distance between them and substitutions have a distance given by their alphabetic separation, the above sequence has a length of 6 whereas yours has a length of 9.
curran
curran on 18 Nov 2012
Okay... the way I see it, valuing anagrams twice as much as substitutions, mine has an error of:
1*1+1*2+1*4 + 1*3+1*5+1*6 + 1*4+1*5 + 2*1+1*3 + 1*1+1*2 + 1*1 = 39
Yours has an error of:
2+3+4 + 2+3+5 + 3+4 + 2 + 2+3 + 2 + 1 = 36
Which makes you smarter than me :p
An interesting problem... its like taking a bunch of marbles that are randomly connected by elastics of different lengths and elasticities and trying to push them into a line. How do you decrease the stresses as much as possible...

Sign in to comment.

Accepted Answer

Matt J
Matt J on 18 Nov 2012
Edited: Matt J on 18 Nov 2012
Your problem looks like a variation of the Traveling Salesman Problem where the strings play the role of the "cities" with distances between them defined by your 10000x10000 similarity matrix. The TSP is NP-hard, so I don't know how much hope you have of solving it rapidly for 10000 cities. However, there are a number of different TSP solvers offered on the FEX which you could look at. The following one looks more relevant to you than some since it doesn't require a return to the starting point
There are probably also some hard-core optimized non-MATLAB TSP sovers out there somewhere.
  1 Comment
curran
curran on 18 Nov 2012
That indeed appears to be directly applicable to my problem! Thanks for all your help.

Sign in to comment.

More Answers (1)

Matt J
Matt J on 18 Nov 2012
Edited: Matt J on 18 Nov 2012
Another thing that might help is to try to optimize the way you initialize your current algorithm. If anagrams have lower separation distances than substitutions, for example, I would expect that doing a dictionary sort, but one which groups anagrams together, would provide a pretty good initial guess, e.g.,
X=['AA';'AB';'AC';'BA';'BB';'BC';'CA';'CB';'CC'];
[~,idx]=sortrows(sort(X,2));
Xinitial=X(idx,:)
Xinitial =
AA
AB
BA
AC
CA
BB
BC
CB
CC

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!