Recode categorical vectors for best match?

Bob Fredricks
Bob Fredricks on 24 Nov 2015
Edited: Bob Fredricks on 24 Nov 2015
I have several vectors that categorize observations. The category names are arbitrary within each vector, meaning that it doesn't matter that Group 1 is called Group 1 and Group 2 is called Group 2, just that there are two groups. The problem is that I need to relabel the categories within each vector so that each observation gets the same category label across vectors. So it matters that there are two possible groups in each vector, and it matters that observations placed in Group 1 in vector 1 are also placed in Group 1 in vector 2.
Example: Suppose I have the following vectors that each place four observations into one of three categories.
V1 = [1 2 1 3]; V2 = [3 1 3 2]
In this example I can fix the categories of V1 (since they are arbitrary), and then there are six possible relabelings (3! = 6) for V2.
Relabeling 1: 1->1, 2->2, 3->3
Relabeling 2: 1->1, 2->3, 3->2
Relabeling 3: 1->2, 2->1, 3->3
Relabeling 4: 1->2, 2->3, 3->1
Relabeling 5: 1->3, 2->2, 3->1
Relabeling 6: 1->3, 2->1, 3->2
After testing each relabeling I would find that "Relabeling 4" is the best match, since it produces
V2' = [3->1, 1->2, 3->1, 2->3] = [1 2 1 3] = V1
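The exhaustive test described above can be sketched in a few lines. This is only a minimal illustration for small K: the variable names (P, score, bestMap, V2new) are made up for the example, and the row order of perms does not correspond to the numbering of the relabelings listed above.

```matlab
V1 = [1 2 1 3];
V2 = [3 1 3 2];
K  = 3;                          % number of categories

P = perms(1:K);                  % each row p is one relabeling: old label k -> p(k)
score = zeros(size(P, 1), 1);
for r = 1:size(P, 1)
    p = P(r, :);
    score(r) = mean(p(V2) == V1);    % fraction of observations that agree
end

[bestScore, bestRow] = max(score);
bestMap = P(bestRow, :);         % bestMap(k) is the new label for old label k
V2new   = bestMap(V2);           % relabeled copy of V2
```

For these two vectors the search returns bestMap = [2 3 1], i.e. 1->2, 2->3, 3->1, which turns V2 into [1 2 1 3]. The cost grows as K!, so this is only practical for a handful of categories.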
In my actual data, the vectors won't match perfectly, but I want to find the relabeling/permutation that maximizes the percentage of the time the vectors match, e.g.
Criteria = [#observations where V1=V2]/[#total observations]
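For two vectors of equal length, this ratio can be computed directly from an element-wise comparison. A minimal sketch on made-up data (the vectors below are purely illustrative):

```matlab
V1 = [1 2 1 3 2 2];
V2 = [1 2 1 3 2 3];           % made-up data: differs from V1 in the last position only

% V1 == V2 is a logical vector; its mean is the fraction of matching observations.
Criteria = mean(V1 == V2);    % 5 matches out of 6 observations
```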
The problem is that evaluating every possible relabeling quickly becomes infeasible as the number of categories and vectors increases, since K categories give K! possible relabelings per vector. I'm guessing there's no way to find the true best, but I'm wondering if there's an algorithm that can find an approximately good relabeling. In my data, with the proper relabeling, the vectors should match over 90 percent of the time, which means it should be possible to throw out a large number of possible relabelings without actually testing them.
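One way to avoid testing relabelings one by one, at least for a single pair of vectors, is to work from the K-by-K table of co-occurrence counts. The number of matches under a relabeling is the sum of the table entries it selects (one per row and one per column), so choosing the relabeling is a linear assignment problem. The sketch below uses a simple greedy pass over that table rather than an exact assignment solver. That is a heuristic I am assuming is adequate when agreement is high, not a guaranteed optimum. The data and the names C, W and map are made up for illustration.

```matlab
% Made-up data: V2 uses different label names and disagrees on one observation.
V1 = [1 1 1 2 2 2 3 3 3 3];
V2 = [2 2 2 3 3 1 1 1 1 1];
K  = 3;

% C(i, j) = number of observations labeled i in V2 and j in V1.
C = accumarray([V2(:), V1(:)], 1, [K K]);

map = zeros(1, K);               % map(i) will be the new label for old label i
W = C;                           % working copy; used rows/columns are set to -1
for step = 1:K
    [~, idx] = max(W(:));        % largest remaining co-occurrence count
    [i, j]   = ind2sub([K K], idx);
    map(i)   = j;
    W(i, :)  = -1;               % old label i is now assigned
    W(:, j)  = -1;               % new label j is now taken
end

V2new    = map(V2);              % relabeled copy of V2
Criteria = mean(V2new == V1);    % fraction of matching observations
```

On this data the greedy pass gives map = [3 1 2] and a match rate of 0.9. If an exact answer is needed, the same table could be handed to an assignment solver (the Hungarian algorithm is the classic choice, and newer MATLAB releases appear to offer matchpairs for this kind of problem, which I have not used here).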
I've tried a few things with moderate success. The best so far is an iterative approach of pairwise matching to modes:
Step 1: Compute Vmode = mode(V1,...,VN)
Step 2: Relabel each vector V1,...,VN one at a time to find best match with Vmode
Step 3: Repeat steps 1-2 until no improvement in Criteria
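The three steps above can be sketched as follows. This is a rough illustration rather than my exact code. It assumes the vectors are stored as rows of a matrix V with labels 1..K, that the overall criterion is the fraction of entries equal to the per-observation mode, and that K is small enough to enumerate relabelings with perms. The data and the helper name agree are made up.

```matlab
% Made-up data: one vector per row, labels 1..K, arbitrary names per row.
V = [1 1 1 2 2 2 3 3 3 3;
     2 2 2 3 3 1 1 1 1 1;
     3 3 3 1 1 1 2 2 2 1];
K = 3;
P = perms(1:K);                          % all K! relabelings (small K only)

% Assumed overall criterion: fraction of entries that equal the column-wise mode.
agree = @(V) mean(mean(V == repmat(mode(V, 1), size(V, 1), 1)));

prev = -1;
crit = agree(V);
while crit > prev                        % Step 3: repeat until no improvement
    prev  = crit;
    Vmode = mode(V, 1);                  % Step 1: modal label per observation
    for n = 1:size(V, 1)                 % Step 2: relabel each vector in turn
        best    = mean(V(n, :) == Vmode);    % start from the current labels
        bestMap = 1:K;
        for r = 1:size(P, 1)
            s = mean(P(r, V(n, :)) == Vmode);
            if s > best, best = s; bestMap = P(r, :); end
        end
        V(n, :) = bestMap(V(n, :));
    end
    crit = agree(V);
end
```

Because each vector keeps its current labels unless a relabeling is strictly better, the criterion never decreases from one pass to the next, so the loop always stops. It can still settle on a poor labeling, for example when the initial mode is dominated by ties.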
However, it's far from perfect. Does anybody have any suggestions or insight on how to approach this problem?

Answers (0)
