No BSD License  

Code overlap

The aim here is to visualize the degree of overlap between all pairs of entries in the contest.

Data visualization contest: Code overlap

The aim here is to try and visualize the degree of overlap between all pairs of entries.

load contest_data
n = length([d.id]);
t=[d.timestamp];
nlines=length(allLineList);
% cleanup author names
cleannames = regexprep(lower({d.author}), '_|\.|&| ', '')';
% author labels
lbls = strcat(cleannames,textwrap({sprintf(':%.5d',[d.id])}, 6));
% % contest phases
% [v,ind]=sort(find([d.twilight]));
% twilight=v(ind([1;end]));
% [v,ind]=sort(find([d.daylight]));
% daylight=v(ind([1;end]));

% unique authors
cn=unique(cleannames);
% number of unique authors
nc=length(cn);
% index of each author's entries
nl=zeros(n,1);
for ii=1:nc
    % strcmp gives an exact match, replacing the deprecated strmatch
    nl(strcmp(cn{ii},cleannames))=ii;
end
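
The name cleanup and author indexing above can be tried on a toy list. This is a minimal sketch with made-up author names (not contest data); it also suggests that the third output of `unique` yields the same per-entry author index as the loop.

```matlab
% Hypothetical author names, for illustration only
authors = {'Yi Cao', 'yi_cao', 'A.B & C'};

% Same cleanup pattern as above: drop underscores, dots, ampersands, spaces
cleannames = regexprep(lower(authors), '_|\.|&| ', '')';

% cn holds the sorted unique names; nl maps each entry to its row in cn
[cn, ~, nl] = unique(cleannames);

disp(cn')   % unique cleaned names
disp(nl')   % author index of each entry
```

Here the first two spellings collapse to the same cleaned name, so they share one author index.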

Quantify code overlap

The overlap coefficient between a pair of entries is the number of shared lines divided by the minimum of the two entry lengths. This yields a symmetric measure that ranges from 0 to 1. The pairwise overlap matrix is simple to compute but slow, so I've included a precomputed low-precision version.
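
As a small worked example of the coefficient, consider two toy entries. The cell arrays of strings below are hypothetical; the contest data may well store lines in another form, such as indices into a master line list.

```matlab
% Two hypothetical entries, each a list of code lines
a = {'x = 1;', 'y = 2;', 'z = x + y;', 'disp(z)'};
b = {'x = 1;', 'y = 2;', 'disp(x)'};

% Lines present in both entries (intersect returns each shared line once)
shared = intersect(a, b);

% Shared lines divided by the length of the shorter entry
ovc = numel(shared) / min(numel(a), numel(b));
fprintf('overlap coefficient = %.3f\n', ovc);
```

Two lines are shared and the shorter entry has three lines, so the coefficient is 2/3. Because `intersect` returns unique values, a line repeated within an entry counts only once in the numerator.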

if ~exist('pwoverlap.mat', 'file')
    % pw overlap coef. is symmetric and diag is 1, so save tril
    ov = zeros(n*(n-1)/2, 1, 'single');
    ctr=0;
    for ii=1:n-1
        d1 = d(ii).lines;
        nd1=length(d1);
        for jj=ii+1:n
            ctr=ctr+1;
            d2 = d(jj).lines;
            % divide by the length of the shorter entry (nd1, not d1 itself)
            ov(ctr) = length(intersect(d1,d2)) / min(nd1, length(d2));
        end
    end
    ov(isnan(ov))=0;
    save ('pwoverlap.mat','ov')
else
    load pwoverlap.mat
    % included file is a low res version saved as a 2 byte char
    if isequal(class(ov),'char')
        % scale back to float
        ov = single(ov/100);
    end
end
ovmat=squareform(ov);
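
The two storage tricks used above can be illustrated on a three-entry toy case. This is a sketch under assumptions: the values are invented, and the `char(round(ov*100))` packing is only a guess at how the included low-resolution file might have been produced. `squareform` requires the Statistics Toolbox.

```matlab
% Condensed pairwise vector for n = 3 entries, in the same order as the
% loop above: pairs (1,2), (1,3), (2,3). Values are made up.
ov = [0.25 0.333 0.75];

% Expand to a full symmetric matrix; squareform puts zeros on the diagonal
m = squareform(ov);

% Assumed packing: scale to 0..100 and store as char (2 bytes per value)
packed = char(round(ov * 100));

% Unpacking as in the else-branch above: back to floating point
restored = single(packed) / 100;

disp(m)
disp(restored)
```

The round trip keeps roughly two decimal places, so 0.333 comes back as 0.33. Note that the diagonal of the expanded matrix is 0 rather than 1.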

Pairwise overlap heatmap

A heatmap of the pairwise overlap coefficients, ordered by timestamp, reveals the extent to which the entries were similar. There is minimal overlap during the darkness and twilight phases. During daylight, the overlap is usually restricted to a single day. The entries for the 1000-character challenge (5/14 to 5/15) are quite distinct from the rest.

figure
imagesc(squareform(ov));
% set axis ticks for 12 noon of each day
dv = datevec(t(1));
ds = cell(8,1);
dind = zeros(8,1);
day0 = dv(3)-1;
for ii=1:8;
    dv(3) = day0+ii;
    dind(ii) = find(t<=datenum(dv), 1, 'last' );
    ds{ii} = datestr(dv,'mm/dd');
end
axis xy
set(gca,'tickdir','out','xtick',dind, 'xticklabel', ds, 'ytick',dind, 'yticklabel', ds, 'fontweight', 'bold')
xlabel('submission time');
ylabel('submission time');
title('Code overlap', 'fontsize', 12, 'fontweight', 'bold')

Visualize a single author's results

Display the overlap between entries for a single author.

authname ='yicao';
% exact name match; a prefix match could return more than one author
idx = find(nl==find(strcmp(authname, cn)));
nidx=length(idx);
% sub-matrix of overlaps among this author's entries
x=ovmat(idx,idx);
figure
imagesc(x);
title(sprintf ('Code overlap for %s',authname), 'fontsize',12, 'fontweight', 'bold')
ds=datestr(t(idx(get(gca,'xtick'))),'mm/dd');
axis xy
set(gca, 'tickdir', 'out', 'xticklabel',ds, 'yticklabel', ds, 'fontweight','bold')
xlabel('submission time')
ylabel('submission time')
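
Selecting one author's block amounts to indexing the overlap matrix with the same index vector on both rows and columns. The toy check below, using a hypothetical 4-by-4 matrix and index vector, suggests that explicit linear indices and plain subscript indexing pick out the same elements.

```matlab
% Hypothetical stand-ins for the overlap matrix and an author's entries
n = 4;
ovmat = magic(n);
idx = [2; 4];
nidx = numel(idx);

% Linear indices of every (row, column) pair drawn from idx
col_ord = repmat(idx, 1, nidx) + repmat((idx - 1)', nidx, 1) * n;
x1 = ovmat(col_ord);

% Subscript indexing selects the same block directly
x2 = ovmat(idx, idx);

disp(isequal(x1, x2))
```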

Network analysis

Use the Bioinformatics Toolbox to graph the highly overlapping entries. Visualizing large networks is one area where I find MATLAB lacking, and I need to resort to third-party tools. Perhaps we will see better network visualization tools in the future?

% value at which edges are considered significant
ov_cutoff=0.7;

if ~isempty(ver('bioinfo'))
    xcm=x.*(x>ov_cutoff);
    bg=biograph(xcm,lbls(idx),'LayoutType','radial','showarrows','off');
    view(bg)
else
    error('Requires the Bioinformatics Toolbox')
end
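
For readers without the toolbox, the thresholding step can also be sketched with the `graph` object in base MATLAB (R2015b or later). This is not the method used above, only a possible alternative; the matrix and labels are invented for illustration.

```matlab
% Hypothetical overlap block for three entries, with made-up labels
x = [0   0.9 0.2;
     0.9 0   0.8;
     0.2 0.8 0  ];
names = {'a:00001', 'a:00002', 'a:00003'};

% Keep only edges above the cutoff, as in the biograph call above
ov_cutoff = 0.7;
adj = x .* (x > ov_cutoff);

% Undirected weighted graph built from the symmetric adjacency matrix
G = graph(adj, names);

figure
plot(G, 'Layout', 'circle');
title(sprintf('Entries with overlap > %.1f', ov_cutoff))
```

With this toy data, only the pairs with overlap 0.9 and 0.8 survive the cutoff, so the graph has two edges.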
