Recode categorical vectors for best match?

Bob Fredricks on 24 Nov 2015

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/257319-recode-categorical-vectors-for-best-match

Edited: Bob Fredricks on 24 Nov 2015

I have several vectors that categorize observations, where the category names are arbitrary within each vector: it doesn't matter that Group 1 is called Group 1 and Group 2 is called Group 2, only that there are two groups. The problem is that I need to relabel the categories within each vector so that each observation is placed in the same category across vectors. So it matters that there are two possible groups in each vector, and it matters that observations placed in Group 1 in vector 1 are also placed in Group 1 in vector 2.

Example: Suppose I have the following vectors that each place four observations into one of three categories.

V1 = [1 2 1 3]; V2 = [3 1 3 2]

In this example I can fix the categories of V1 (since they are arbitrary), and then there are six possible relabelings for V2.

Relabeling 1: 1->1, 2->2, 3->3
Relabeling 2: 1->1, 2->3, 3->2
Relabeling 3: 1->2, 2->1, 3->3
Relabeling 4: 1->2, 2->3, 3->1
Relabeling 5: 1->3, 2->2, 3->1
Relabeling 6: 1->3, 2->1, 3->2

After testing each relabeling I would find that "Relabeling 4" is the best match, since it produces

V2' = [3->1, 1->2, 3->1, 2->3] = [1,2,1,3] = V1

In my actual data, the vectors won't match perfectly, but I want to find the relabeling/permutation that maximizes the percentage of the time the vectors match, e.g.

Criteria = [#observations where V1=V2]/[#total observations]
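The exhaustive search described above, scored with this criterion, can be sketched in a few lines. This is a minimal illustration that assumes the labels are the integers 1..K; the variable names (`P`, `score`, `bestMap`, `V2new`) are made up for the example. It is only practical for small K, since `perms` generates all K! relabelings.

```matlab
% Exhaustive search over all relabelings of V2 (assumes labels are 1..K).
V1 = [1 2 1 3];
V2 = [3 1 3 2];
K  = 3;                                  % number of categories

P = perms(1:K);                          % row r is a relabeling: old label k -> P(r,k)
score = zeros(size(P,1), 1);
for r = 1:size(P,1)
    relabeled = P(r, V2);                % apply relabeling r to every element of V2
    score(r)  = mean(relabeled == V1);   % Criteria: fraction of matching observations
end

[bestScore, bestRow] = max(score);
bestMap = P(bestRow, :);                 % bestMap(k) is the new label for old label k
V2new   = bestMap(V2);                   % relabeled copy of V2
```

For the example vectors, `bestMap` encodes 1->2, 2->3, 3->1, and `V2new` equals `V1`.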

The problem is that evaluating every possible relabeling quickly becomes infeasible as the numbers of categories and vectors increase. I'm guessing there's no way to find the true best, but I'm wondering if there's an algorithm that can find an approximately good relabeling. In my data, with the proper relabeling, the vectors should match over 90 percent of the time, which means it should be possible to throw out a large number of possible relabelings without actually testing them.
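For a single pair of vectors, the number of matches under a relabeling equals the sum of the selected entries of the K-by-K contingency table, so the pairwise problem can be treated as a linear assignment problem rather than a search over K! permutations. The sketch below assumes labels 1..K and uses `matchpairs` (available in MATLAB R2019a and later). The names `C`, `bigCost`, and `map` are illustrative, and this addresses only the two-vector case, not the joint problem across many vectors.

```matlab
% Pairwise relabeling as an assignment problem (assumes labels are 1..K).
V1 = [1 2 1 3];
V2 = [3 1 3 2];
K  = 3;

% C(i,j) = number of observations with V2 == i and V1 == j
C = accumarray([V2(:) V1(:)], 1, [K K]);

% matchpairs minimizes total cost, so negate the counts to maximize matches.
% A large cost for leaving a label unmatched forces a complete assignment.
bigCost = numel(V1) + 1;
M = matchpairs(-C, bigCost);             % rows of M are [oldLabel newLabel]

map = zeros(1, K);
map(M(:,1)) = M(:,2);                    % map(k) is the new label for old label k
V2new    = map(V2);
criteria = mean(V2new == V1);            % fraction of matching observations
```

Because the cost matrix is only K-by-K, the work grows with the number of categories rather than with K!, which makes this a plausible building block for larger problems.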

I've tried a few things with moderate success. The best success I've had so far is an iterative approach of pairwise matching to modes:

Step 1: Compute Vmode = mode(V1,...,VN)
Step 2: Relabel each vector V1,...,VN one at a time to find best match with Vmode
Step 3: Repeat steps 1-2 until no improvement in Criteria
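
The three steps above can be sketched as follows. This is one possible reading of the procedure, not the poster's actual code: it assumes the vectors are stacked as rows of a matrix `V` with labels 1..K, performs the per-vector relabeling in Step 2 with `matchpairs` on a contingency table (R2019a or later), and stops when the overall agreement with the mode no longer improves. The example data and the iteration cap of 100 are made up for illustration.

```matlab
% Iterative matching to the mode (assumes labels 1..K, one vector per row).
V = [1 2 1 3 2 3;
     1 2 1 3 2 2;
     3 1 3 2 1 2];                       % row 3 uses permuted labels
K = 3;
bigCost   = size(V,2) + 1;               % forces every label to be assigned
prevScore = -Inf;

for iter = 1:100                         % hypothetical cap on iterations
    Vmode = mode(V, 1);                  % Step 1: column-wise mode across vectors
    for i = 1:size(V,1)                  % Step 2: relabel each vector against Vmode
        C = accumarray([V(i,:)' Vmode'], 1, [K K]);
        M = matchpairs(-C, bigCost);     % best one-to-one relabeling for row i
        map = zeros(1, K);
        map(M(:,1)) = M(:,2);
        V(i,:) = map(V(i,:));
    end
    score = mean(mean(V == mode(V, 1))); % agreement with the updated mode
    if score <= prevScore                % Step 3: stop when no improvement
        break;
    end
    prevScore = score;
end
```

Like any alternating scheme of this kind, the result can depend on the starting labels, particularly when the initial column-wise mode is dominated by ties.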

However, it's far from perfect. Does anybody have any suggestions or insight on how to approach this problem?

Answers (0)
