My problem involves calibrating a numerical model that predicts whether some event happens in each year. It could be economic events, coral bleaching, or many other things. I want to compare the similarity of results from different model versions, or compare model results with real-world historical data.
The models are expected to miss quite often, so looking for exact matches won't do. The size of the error matters, so the Wilcoxon rank-sum test won't do either. The lists will often differ in length, and they could be quite a bit longer than my examples below.
Examples of what is subjectively "good" and "bad":
A = [1968 1972 1991 1993 2001 2010]
B = [1968 1972 1993 2001 2010]
C = [1969 1973 1991 1995 2001 2011]
D = [1950 1960 1991 1993 2001 2050]
E = [1968 1972 1991 1993 2001 2010 2050]
Consider A to be "correct".
B is missing one year entirely, but this is not disastrous.
C has only two matching values, but the others are close; I'd call this better than B.
D has three exact matches, but the others are way off. I'd consider this the worst.
E has six exact matches and one really bad point. Again, not disastrous.
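To make the difficulty concrete, here is a quick Python check (using only the example lists above) of why a bare exact-match count fails: it ranks D ahead of C, the reverse of my subjective ordering.

```python
# Count exact matches against A. A bare overlap count puts D (3 hits,
# wild misses) above C (2 hits, near misses), the opposite of the
# subjective ranking described above.
A = [1968, 1972, 1991, 1993, 2001, 2010]
B = [1968, 1972, 1993, 2001, 2010]
C = [1969, 1973, 1991, 1995, 2001, 2011]
D = [1950, 1960, 1991, 1993, 2001, 2050]
E = [1968, 1972, 1991, 1993, 2001, 2010, 2050]

for name, years in [("B", B), ("C", C), ("D", D), ("E", E)]:
    print(name, len(set(A) & set(years)))
# -> B 5, C 2, D 3, E 6
```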
Of course I don't expect an algorithm to match my subjective evaluation all the time. I just want it to take the things I have mentioned into account.
If I were to make up an algorithm off the cuff, I'd probably look for points with near neighbors in the other list and score their distances root-mean-square style, with some maximum value counted against any point left with no neighbor. This is really crude, and there must be a better way.
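For what it's worth, a minimal sketch of that crude idea in Python, assuming a symmetric pass (each list's points matched to their nearest neighbor in the other list) and an arbitrary 5-year cap that doubles as the no-neighbor penalty; the cap value and the symmetric matching are my placeholder choices, not requirements:

```python
import math

def score(a, b, cap=5.0):
    """Clipped symmetric nearest-neighbour RMS distance between two
    event-year lists; lower is better. Each point is matched to its
    nearest neighbour in the other list and the distance is clipped
    at `cap`, so one wild miss (or an unmatched point) costs at most
    the cap."""
    def one_way(xs, ys):
        return [min(cap, min(abs(x - y) for y in ys)) for x in xs]
    ds = one_way(a, b) + one_way(b, a)
    return math.sqrt(sum(d * d for d in ds) / len(ds))

A = [1968, 1972, 1991, 1993, 2001, 2010]
for name, ys in [("B", [1968, 1972, 1993, 2001, 2010]),
                 ("C", [1969, 1973, 1991, 1995, 2001, 2011]),
                 ("D", [1950, 1960, 1991, 1993, 2001, 2050]),
                 ("E", [1968, 1972, 1991, 1993, 2001, 2010, 2050])]:
    print(name, round(score(A, ys), 2))
# -> B 0.6, C 1.08, D 3.54, E 1.39
```

On the examples this gives B 0.6, C 1.08, D 3.54, E 1.39 (lower is better): D lands clearly last, as I'd want, though B edges out C, so the cap is the obvious knob to tune. This amounts to a clipped, symmetric Chamfer-style distance, which makes me suspect the literature has more principled versions.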