Asked by curran
on 16 Nov 2012

I have a random array of 10000 7-letter strings. I have a simple algorithm that assesses how similar each element is to all others. For example, SESTINA and SESTINE would have a high similarity, but ZYZZYVA and NEROLIS would have low similarity. The result is a 10000x10000 symetric matrix.

I want to group/sort the strings based on similarity, where proximity of similar strings is important and order is irrelevant. The higher the similarity of two strings, the more important it is for them to be together. I've imagined some sort of least squares technique?

Anyone have a clue for me?

EDIT: The end goal is a 1d array, and not looking for a graphic solution.

*No products are associated with this question.*

Answer by Matt J
on 18 Nov 2012

Edited by Matt J
on 18 Nov 2012

Accepted answer

Your problem looks like a variation of the Traveling Salesman Problem where the strings play the role of the "cities" with distances between them defined by your 10000x10000 similarity matrix. The TSP is NP-hard, so I don't know how much hope you have of solving it rapidly for 10000 cities. However, there are a number of different TSP solvers offered on the FEX which you could look at. The following one looks more relevant to you than some since it doesn't require a return to the starting point

There are probably also some hard-core optimized non-MATLAB TSP sovers out there somewhere.

Answer by Matt J
on 18 Nov 2012

Edited by Matt J
on 18 Nov 2012

Another thing that might help is to try to optimize the way you initialize your current algorithm. If anagrams have lower separation distances than substitutions, for example, I would expect that doing a dictionary sort, but one which groups anagrams together, would provide a pretty good initial guess, e.g.,

X=['AA';'AB';'AC';'BA';'BB';'BC';'CA';'CB';'CC']; [~,idx]=sortrows(sort(X,2));

Xinitial=X(idx,:)

Xinitial =

AA AB BA AC CA BB BC CB CC

Opportunities for recent engineering grads.

## 8 Comments

## Matt Fig (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111581

How do you quantify similarity? In order to proceed we have to know what we are working with.

## curran (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111612

Its based on anagrams, transpositions etc. I can do that and I can populate the "similarity matrix", its what comes after that is the issue.

## curran (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111614

for example, the array:

TOFF BIFF RIFF RIFT SIFT FIST BOFF CRAP FITS

Would be arranged

FIST FITS SIFT RIFT RIFF BIFF BOFF TOFF CRAP

This arrangement would minimize the sumi (every word) of sumj (every other word for i) of similarity*distance. I was imagining a computational nightmare that would see two subarrays switched and check if this value is less. Sloppy but it would get the job done. Nothing better?

## Matt J (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111636

How would you order this

AA AB AC BA BB BC CA CB CC

## Matt Fig (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111648

curran, you still have not given us any clue as to what the similarity matrix looks like. You showed the initial list of strings and the final list, but the crucial thing we need in order to help you is to see what the similarity matrix looks like.

So, given your initial and final strings posted above, what did the similarity matrix look like?

## curran (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111739

I implemented something that took 12 hours to compute on a 1000 element array, and my method was so sloppy it put similar stuff together somewhat... closer?

Matt J, your sample array gives very high degree of similarity, making it difficult to calculate manually, but since ive put anagrams as a higher similarity than substitutions, it would hazard: AA AB BA CA CC AC BC CB BB

Matt Fig, the similarity matrix for my previous array example (minding the symmetric nature) would be: http://s16.postimage.org/aqxicb1qd/simmat.png

My current method (which would take 50 days for my full array) makes a change to the order of the array (switches two adjacent elements). If this switch converges (has a lower dissimilarity coefficient), the move is kept, and the similarity matrix is reshuffled according to the new order. Once no adjacent switches converge in an entire pass of the array, it is deamed the array is fully converged. Its closer, but still not quite. I'm thinking this could be 'sped up' by swapping strings 100 elements apart then decreaasing the spacing. I don't know how the would affect convergence.

## Matt J (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111782

Matt J, your sample array gives very high degree of similarity, making it difficult to calculate manually, but since ive put anagrams as a higher similarity than substitutions, it would hazard: AA AB BA CA CC AC BC CB BBI would have expected AA BA BB AB AC CA CB BC CC. If I assume anagrams have zero distance between them and substitutions have a distance given by their alphabetic separation, the above sequence has a length of 6 whereas yours has a length of 9.

## curran (view profile)

Direct link to this comment:http://www.mathworks.com/matlabcentral/answers/53951#comment_111808

Okay... the way I see it, valuing anagrams twice as much as substitutions, mine has an error of:

1*1+1*2+1*4 + 1*3+1*5+1*6 + 1*4+1*5 + 2*1+1*3 + 1*1+1*2 + 1*1 = 39

Yours has an error of:

2+3+4 + 2+3+5 + 3+4 + 2 + 2+3 + 2 + 1 = 36

Which makes you smarter than me :p

An interesting problem... its like taking a bunch of marbles that are randomly connected by elastics of different lengths and elasticities and trying to push them into a line. How do you decrease the stresses as much as possible...