Data visualization contest: Code overlap
The aim is to visualize the degree of overlap between all pairs of entries.
load contest_data
n = length([d.id]);
t = [d.timestamp];
nlines = length(allLineList);

% cleanup author names
cleannames = regexprep(lower({d.author}), '_|\.|&| ', '')';

% author labels
lbls = strcat(cleannames, textwrap({sprintf(':%.5d',[d.id])}, 6));

% % contest phases
% [v,ind]=sort(find([d.twilight]));
% twilight=v(ind([1;end]));
% [v,ind]=sort(find([d.daylight]));
% daylight=v(ind([1;end]));

% unique authors
cn = unique(cleannames);
% number of unique authors
nc = length(cn);
% index of each author's entries
nl = zeros(n,1);
for ii = 1:nc
    % strcmp gives an exact match (strmatch is no longer recommended)
    nl(strcmp(cn{ii}, cleannames)) = ii;
end
Quantify code overlap
The overlap coefficient between a pair of entries is the number of shared lines divided by the length of the shorter entry. This yields a symmetric measure that ranges from 0 to 1. The pairwise overlap matrix is simple to compute but slow, because every pair of entries must be compared, so I've included a precomputed low-precision version.
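As a quick illustration of the definition above, here is a minimal sketch on two made-up entries. The vectors `a` and `b` are hypothetical line indices, not contest data.

```matlab
% Two toy "entries", each a list of line indices (made-up values)
a = [1 2 3 4 5 6];
b = [4 5 6 7];

% shared lines divided by the length of the shorter entry
shared = numel(intersect(a, b));            % lines 4, 5 and 6 are shared
oc = shared / min(numel(a), numel(b));      % 3/4 = 0.75

% the measure is symmetric in its arguments
oc_rev = numel(intersect(b, a)) / min(numel(b), numel(a));
```

An entry that is fully contained in a longer one scores 1, which is why the shorter length is used as the denominator.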
if ~exist('pwoverlap.mat', 'file')
    % pw overlap coef. is symmetric and diag is 1, so save tril
    ov = zeros(n*(n-1)/2, 1, 'single');
    ctr = 0;
    for ii = 1:n-1
        d1 = d(ii).lines;
        nd1 = length(d1);
        for jj = ii+1:n
            ctr = ctr+1;
            d2 = d(jj).lines;
            % shared lines divided by the shorter entry's length
            ov(ctr) = length(intersect(d1,d2)) / min(nd1, length(d2));
        end
    end
    % empty entries give 0/0; treat them as no overlap
    ov(isnan(ov)) = 0;
    save('pwoverlap.mat','ov')
else
    load pwoverlap.mat
    % included file is a low res version saved as a 2 byte char
    if isequal(class(ov),'char')
        % scale back to float
        ov = single(ov/100);
    end
end
% expand to a full n-by-n matrix (squareform needs the Statistics and
% Machine Learning Toolbox)
ovmat = squareform(ov);
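If `squareform` is not available, the condensed vector can be expanded by hand. The following is a sketch on toy values, assuming the same pair ordering as the loop above: (1,2), (1,3), ..., (1,n), (2,3), and so on. The values in `ov_toy` are made up.

```matlab
% Toy condensed vector for n = 4 entries, ordered (1,2),(1,3),(1,4),(2,3),(2,4),(3,4)
n_toy = 4;
ov_toy = single([0.1 0.2 0.3 0.4 0.5 0.6]);

% Fill the strict lower triangle column by column; this visits the pairs
% in the same order as the row-by-row upper triangle used by the loop
m_toy = zeros(n_toy, 'single');
m_toy(tril(true(n_toy), -1)) = ov_toy;

% Mirror to get the full symmetric matrix (the diagonal stays 0)
m_toy = m_toy + m_toy';
```

As with `squareform`, the diagonal is left at 0 rather than 1, which keeps self-overlap out of the heatmap's color range.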
A pairwise overlap heatmap
A heatmap of the pairwise overlap coefficients, ordered by timestamp, shows how similar the entries were. There is minimal overlap during the darkness and twilight phases. During daylight, overlap is mostly confined to entries from the same day. The entries from the 1000-character challenge (5/14 to 5/15) are quite distinct from the rest.
figure
imagesc(squareform(ov));
% set axis ticks for 12 noon of each day
dv = datevec(t(1));
ds = cell(8,1);
dind = zeros(8,1);
day0 = dv(3)-1;
for ii = 1:8
    dv(3) = day0+ii;
    dind(ii) = find(t<=datenum(dv), 1, 'last');
    ds{ii} = datestr(dv,'mm/dd');
end
axis xy
set(gca, 'tickdir','out', 'xtick',dind, 'xticklabel',ds, ...
    'ytick',dind, 'yticklabel',ds, 'fontweight','bold')
xlabel('submission time');
ylabel('submission time');
title('Code overlap', 'fontsize', 12, 'fontweight', 'bold')
Visualize a single author's results
Display the overlap between entries for a single author.
authname = 'yicao';
% entries by this author (exact name match)
idx = find(nl == find(strcmp(authname, cn)));
nidx = length(idx);
% linear indices of this author's rows and columns in ovmat
col_ord = repmat(idx,1,nidx) + repmat((idx-1)',nidx,1)*n;
x = ovmat(col_ord);
figure
imagesc(x);
title(sprintf('Code overlap for %s',authname), 'fontsize',12, 'fontweight','bold')
ds = datestr(t(idx(get(gca,'xtick'))),'mm/dd');
axis xy
set(gca, 'tickdir','out', 'xticklabel',ds, 'yticklabel',ds, 'fontweight','bold')
xlabel('submission time')
ylabel('submission time')
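The `col_ord` construction builds linear indices of the form row + (column-1)*n for every pair of the author's entries. A small sketch on a toy matrix suggests that this selects the same block as plain subscript indexing. The matrix `M` and the index vector `idx_toy` are made up for illustration.

```matlab
% Toy stand-ins for ovmat and idx (hypothetical values)
n_toy = 5;
M = magic(n_toy);
idx_toy = [2; 4; 5];          % column vector, like the output of find on nl
k = numel(idx_toy);

% linear-index form, as in the script
col_ord_toy = repmat(idx_toy,1,k) + repmat((idx_toy-1)',k,1)*n_toy;
x_lin = M(col_ord_toy);

% subscript form: rows and columns picked directly
x_sub = M(idx_toy, idx_toy);

same = isequal(x_lin, x_sub);
```

Either form works; the subscript form is shorter and does not depend on `idx` being a column vector.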
Network analysis
Use the Bioinformatics Toolbox to graph the highly overlapping entries. Visualizing large networks is one area where I find MATLAB lacking and need to resort to third-party tools. Perhaps we will see better network visualization tools in the future?
% value at which edges are considered significant
ov_cutoff = 0.7;
if ~isempty(ver('bioinfo'))
    % keep only the edges above the cutoff
    xcm = x.*(x>ov_cutoff);
    bg = biograph(xcm, lbls(idx), 'LayoutType','radial', 'showarrows','off');
    view(bg)
else
    error('Requires the Bioinformatics Toolbox')
end
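For readers without the Bioinformatics Toolbox, a rough alternative is the `graph` object in base MATLAB (R2015b and later). This is a sketch under that assumption, not part of the original analysis. The matrix `x_toy` and the labels in `names_toy` are made up and stand in for `x` and `lbls(idx)`.

```matlab
% Toy overlap block for three entries (hypothetical values)
x_toy = [0   0.9 0.2;
         0.9 0   0.8;
         0.2 0.8 0  ];
names_toy = {'author:00001', 'author:00002', 'author:00003'};
cutoff_toy = 0.7;

% zero out weak edges, then build an undirected weighted graph
xcm_toy = x_toy .* (x_toy > cutoff_toy);
G = graph(xcm_toy, names_toy);

% draw it; edge labels show the overlap coefficient
figure
plot(G, 'Layout', 'circle', 'EdgeLabel', G.Edges.Weight);
```

`graph` expects a symmetric adjacency matrix and unique node names, both of which hold for the overlap block and the author:id labels used here.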