This paper contains some observations gleaned from the Peg Solitaire contest result dataset. It was inspired by the now-finished Data Visualization contest. To be clear, this is not an entry in that contest, since it is too late for that.
Most of the entrants concentrated on analyzing data closely associated with authors. Here I consider some other data to demonstrate a few analysis and visualization methods.
You can reproduce this paper by running the companion script file vizcon.m with the data file contest_data.mat in the same directory. I set Include Code in the Publish Configuration to false so that the flow is not interrupted. You can set it to true so that the code will be interspersed with the markup.
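If you prefer to set this from the command line rather than the Publish Configuration dialog, the showCode option of publish controls the same behavior. A minimal sketch:

```matlab
% Publish the companion script with the code shown inline.
% Setting 'showCode' to true intersperses the code with the markup;
% false suppresses it, as in this paper.
publish('vizcon.m', 'showCode', true);
```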
Extract some variables of obvious interest (score, result, cpu_time) from the structure.
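A sketch of the extraction, assuming the data file holds a structure array (the variable name `results` and the exact field names are assumptions; adjust them to match the contents of contest_data.mat):

```matlab
% Load the contest data and pull out the fields of interest.
% "results" and the field names below are assumptions about the
% layout of contest_data.mat.
S = load('contest_data.mat');
results  = S.results;
score    = [results.score]';     % final contest score (lower is better)
result   = [results.result]';    % solver result value
cpu_time = [results.cpu_time]';  % run time in seconds
```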
The Workspace Browser shows NaN values for cpu_time and result as well as Inf values for cpu_time.
According to the contest rules, run times greater than 180 seconds resulted in rejection, so these nonfinite values probably mark rejected entries.
We can see this in the distribution of times.
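One way to produce such a plot, assuming the `cpu_time` vector extracted above:

```matlab
% Histogram of run times, with the 180-second rejection limit marked.
finite_t = cpu_time(isfinite(cpu_time));
hist(finite_t, 50)
xlabel('cpu\_time (s)')
ylabel('number of entries')
title('Distribution of run times')
line([180 180], ylim, 'Color', 'r')   % rejection threshold
```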
Interestingly, the rejection rate (the fraction of entries with run time greater than 180 seconds) is approximately constant over the contest duration. The rejected entries would need special treatment for further analysis; for now, discard them and proceed with the valid entries.
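The filtering step might look like this, keeping only entries with finite values and run times within the limit (variable names follow the sketch above):

```matlab
% Keep only valid entries: finite values and run time within the limit.
valid = isfinite(score) & isfinite(result) & ...
        isfinite(cpu_time) & cpu_time <= 180;
score    = score(valid);
result   = result(valid);
cpu_time = cpu_time(valid);
fprintf('Retained %d of %d entries.\n', nnz(valid), numel(valid))
```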
Most entries had almost the same score. Note that the lowest score is the winner!
The form of the scoring equation is provided in the contest rules. Let's look at how the computed score depends on these terms for the actual data. We see that the score depends strongly on result, somewhat on cpu_time, and very little on complexity.
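A rough, purely linear check of these dependencies can be made with corrcoef (this assumes `complexity` was extracted from the structure like the other fields; correlation only captures linear dependence, so it is a sanity check rather than a substitute for the plots):

```matlab
% Rough check of how strongly score tracks each term.
% complexity is assumed to have been extracted like score,
% result, and cpu_time.
c1 = corrcoef(score, result);
c2 = corrcoef(score, cpu_time);
c3 = corrcoef(score, complexity);
fprintf('corr(score, result)     = %.2f\n', c1(1,2))
fprintf('corr(score, cpu_time)   = %.2f\n', c2(1,2))
fprintf('corr(score, complexity) = %.2f\n', c3(1,2))
```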
Using brushing and linking we can see that the high scores with low results are caused by their large CPU times. The usual way to brush is by hand. This plot could be produced programmatically using brush and linkdata. I'll spare you the details.
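For the curious, a minimal programmatic sketch might look like this; the variable names follow those used above, and the brushed subset would still be selected interactively rather than by a recorded gesture:

```matlab
% Link the plotted data to the workspace variables, then enable
% brushing programmatically instead of from the toolbar.
figure
plot(result, score, '.')
xlabel('result'), ylabel('score')
linkdata on                        % keep the plot in sync with the data
h = brush(gcf);                    % brush mode object for this figure
set(h, 'Color', 'r', 'Enable', 'on')
```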
We can look into points very near the best score to see how score depends on the combination of result and cpu_time. This plot only shows data for scores less than 3950.
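A sketch of that zoomed-in view, using color to carry the score (variable names as above; the 3950 cutoff is from this paper):

```matlab
% Zoom in on the near-winning entries (score < 3950).
near = score < 3950;
scatter(result(near), cpu_time(near), 12, score(near))
xlabel('result'), ylabel('cpu\_time (s)')
colorbar                           % color encodes the score
title('Entries with score < 3950')
```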
Many of the entries had similar CPU times and very similar result numbers. A modest change to the result and cpu_time coefficients in the scoring equation could have changed the finish order.
It is interesting to see whether other parameters in the dataset are associated with low-scoring entries.
Most entries had a smaller number of characters than the winner. It looks like terseness may not be such a good idea.
Many of the entries had a smaller number of messages than the winner. It appears that some M-Lint warnings are not so urgent.
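Both comparisons can be drawn side by side; the field names `characters` and `mlint_messages` here are hypothetical stand-ins for whatever the dataset actually calls these quantities:

```matlab
% Code size and M-Lint message count versus score.
% "characters" and "mlint_messages" are assumed field names;
% rename to match the dataset.
subplot(1,2,1)
plot(characters, score, '.')
xlabel('characters'), ylabel('score')
subplot(1,2,2)
plot(mlint_messages, score, '.')
xlabel('M-Lint messages'), ylabel('score')
```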
We already saw that score does not depend very much on complexity. The result value also does not depend on complexity, and cpu_time depends weakly, if at all, on complexity. Aside from these performance-related observations, code complexity can have a big impact on readability, which is quite important in real applications.
Daytime entries typically have better scores than night entries. Of course, it's always daytime somewhere.
Median score by ambient light category:

  Darkness: 5.2
  Twilight: 4.6
  Daylight: 4.0
We can see this more completely by looking at the distributions of entries in a quantile plot. The scores of the darkness entries are always the worst of the three. Twilight scores are usually worse than daytime scores.
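An empirical quantile plot of this kind can be built by sorting each group's scores and plotting them against their plotting positions. This sketch assumes a cell array `light_label` holding the ambient-light category string for each entry (the name and category labels are assumptions):

```matlab
% Empirical quantile plot of score for each ambient-light group.
% light_label is assumed to hold 'Darkness'/'Twilight'/'Daylight'
% strings, one per entry; adjust to the actual dataset.
groups = {'Darkness', 'Twilight', 'Daylight'};
figure, hold on
for k = 1:numel(groups)
    s = sort(score(strcmp(light_label, groups{k})));
    q = ((1:numel(s)) - 0.5) / numel(s);   % plotting positions
    plot(q, s)
end
hold off
legend(groups), xlabel('quantile'), ylabel('score')
```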
I have included 2 figures that are not generated by this m-file. This is against the contest rules, but then I am not entering the contest.
This file does not have a traditional H1 line worded for use with lookfor. Publishing calls for a title, so get used to using lookfor -all. Similarly initial comments that publish well may look different from the help block that we are used to. They should still contain the required information and be placed on contiguous % lines, but the arrangement may be different.