No BSD License  

Highlights from
Finding the Similar Entries: A Quantitative Approach based on CPU Runtime Behavior

image thumbnail

Finding the Similar Entries: A Quantitative Approach based on CPU Runtime Behavior

by

 

08 Apr 2009 (Updated )

Entry to Matlab contest Spring 2009

Finding the Similar Entries: A Quantitative Approach based on CPU Runtime Behavior

Finding the Similar Entries: A Quantitative Approach based on CPU Runtime Behavior

Chunwei Jethro Lam, jethrolam@gmail.com, April 8 2009

Contents

Introduction

In this work, we are interested at the following questions:

1. How do we measure the similarity between two codes? (existence of similarity)

2. How do we identify entries that are similar to each other? (similarity with others)

3. How do the entries by one author evolve over time? (similarity with self)

In order to define 'similarity', one must first define a measure for 'difference'. Some intuitive methods suggest comparing the number of characters, comparing the number of nodes, or observing the function or variable names. Apparently, these methods can be beaten by some simple code obfuscation.

In this work, we introduce a measure of code similarity that is relatively immune to code obfuscation. The proposed approach is based on the algorithmic performance of the code. When a code is written, it consists of many operational statements(a=b+c), branching statements(if then else), memory allocation statements(zeros(100,1)), etc, that appear in a unique order characterized by the coding style of the author. When the code is executed, each statement takes up a certain amount of CPU runtime. If we measure and record the variation of CPU runtime across the lines of statements in the code, we can obtain a signature of the code that is unique to each author given that the code is sufficiently complicated. By correlating the signatures, we can provide a quantitative measurement of the similarity of the codes.

Per-line CPU Runtime Signature

In this section, we demonstrate how a per-line CPU runtime signature is measured. In our program, we call this signature clam_sig (I confess that I run out of idea. 'clam' is my first_init+last name). We take Yi Cao's winning entry as an example. First, we extract the source code and display the first 10 lines.

clear
load contest_data;
winner		= d(3909); % Author: Yi Cao
code_cell	= allLineList(winner.lines);
num_lines	= size(code_cell, 2);
fprintf(sprintf('%s\n', code_cell{1:10}));
function [moves,score] = submit3(board)
    rand( 'state', 0 );
    rand( 57, 1 );
    [m,n] = size( board );
    pegCount = sum( board( : )>0 );
        rows = m + 4;
    rv = 5:rows;
    cols = n + 4;
    cv = 5:cols;
    fill = (pegCount - nnz( board<0 ))/(m*n);

Next, we examine the source code line by line, and insert stopwatch timers inbetween lines that contain at least one operational statement. These stopwatch timers allow us to measure and record the CPU elapse time of all the statements inbetween. We write the modified source code in the current directory. The first 10 lines of the new code are displayed below. (NOTE: I would like to know if there is a way to edit a function directly in the function workspace. Any help is appreciated.)

% Construct stopwatch timer commands to be appended to source code
append_cell = cell(1, num_lines);
for line_index = 1:num_lines
  line_string = sprintf('%s', code_cell{line_index});	% Convert into string
  % Check if it is an operational statement that ends with ';' before linefeed
  if (line_string(end)==';')
    cmd_string = sprintf('line_runtime=toc; runtime_info(%0.1d, :) = runtime_info(%0.1d, :) + [0 line_runtime 1]; tic;', line_index, line_index);
    append_cell(line_index) = {cmd_string};
  end
  % Check if it is a function declaration statement that starts with keyword 'function' followed by a space ' '
  if ~isempty(strfind(line_string, 'function '))
    append_cell(line_index) = {sprintf('\neval(''if exist(''''runtime_info'''')==0; global runtime_info; end; tic;'')', line_index)};
  end
end
test_code_cell = cell(1, num_lines); % Merge append_cell with code_cell to form the test_code_cell
for line_index = 1:num_lines
  test_code_cell{line_index} = [code_cell{line_index}, append_cell{line_index}];
end
% Save test code in current directory
test_code_filename	= sprintf('test_code.m');
test_fid		= fopen(test_code_filename, 'w');
fprintf(test_fid, '%s', sprintf('%s\n', test_code_cell{:}));
fclose(test_fid);
% Display
fprintf(sprintf('%s\n', test_code_cell{1:10}));
function [moves,score] = submit3(board)
eval('if exist(''runtime_info'')==0; global runtime_info; end; tic;')
    rand( 'state', 0 );line_runtime=toc; runtime_info(2, :) = runtime_info(2, :) + [0 line_runtime 1]; tic;
    rand( 57, 1 );line_runtime=toc; runtime_info(3, :) = runtime_info(3, :) + [0 line_runtime 1]; tic;
    [m,n] = size( board );line_runtime=toc; runtime_info(4, :) = runtime_info(4, :) + [0 line_runtime 1]; tic;
    pegCount = sum( board( : )>0 );line_runtime=toc; runtime_info(5, :) = runtime_info(5, :) + [0 line_runtime 1]; tic;
        rows = m + 4;line_runtime=toc; runtime_info(6, :) = runtime_info(6, :) + [0 line_runtime 1]; tic;
    rv = 5:rows;line_runtime=toc; runtime_info(7, :) = runtime_info(7, :) + [0 line_runtime 1]; tic;
    cols = n + 4;line_runtime=toc; runtime_info(8, :) = runtime_info(8, :) + [0 line_runtime 1]; tic;
    cv = 5:cols;line_runtime=toc; runtime_info(9, :) = runtime_info(9, :) + [0 line_runtime 1]; tic;
    fill = (pegCount - nnz( board<0 ))/(m*n);line_runtime=toc; runtime_info(10, :) = runtime_info(10, :) + [0 line_runtime 1]; tic;

The next step is to initialize the global variable called runtime_info and execute the test code. We choose game board #32 as the input to the test code.

global runtime_info;
runtime_info		= zeros(num_lines, 3);  % runtime_info is the global variable that holds the CPU elapse time information
runtime_info(:,1)	= (1:num_lines).'; 	% first column: line index
runtime_info(:,2)	= 0;			% second column: total elapse time of this line
runtime_info(:,3)	= 0;			% third column: number of times this line is executed
load testsuite_sample;				% load game board
test_code(testsuite(32).board);			% run test code
clear test_code;				% IMPORTANT: Clear test code from function workspace so it does not interfere next call

Once the runtime information becomes available, we can compute and plot the averaged per-line CPU runtime.

per_line_cpu_runtime	= zeros(num_lines,1);
active_line_index	= find(runtime_info(:,3)~=0);
per_line_cpu_runtime(active_line_index) = runtime_info(active_line_index,2)./runtime_info(active_line_index,3);
clf;
plot(1:num_lines, per_line_cpu_runtime);
xlabel('Line Number');
ylabel('Averaged CPU Elapse Time (sec)');

From the plot, we notice that the per-line CPU runtime is a highly sparse function, i.e., most CPU time is spent only on a few lines. Referring to the source code, those lines are mainly memory allocation statements. The signature, clam_sig, is defined to be a vector of 1's and 0's, where the 1's indicate the lines with above-average CPU runtime, and 0's indicate the line with below-average CPU runtime. Since the per-line CPU runtime is sparse, information about the peak location are properly preserved in the signature.

clam_sig = single(per_line_cpu_runtime > mean(per_line_cpu_runtime));
fprintf('clam_sig = %s...\n',num2str(clam_sig(1:25)));
clam_sig = 0100000001111000000000000...

Before we move on to another topic, we compute the per-line CPU runtime for two additional entries that were submitted immediately after Yi Cao's.

[per_line_cpu_runtime_1 d(3909).clam_sig] = compute_clam_sig(d, allLineList, 3909);	% Author: Yi Cao
[per_line_cpu_runtime_2 d(3910).clam_sig] = compute_clam_sig(d, allLineList, 3910);	% Author: Ken Eaton
[per_line_cpu_runtime_3 d(3911).clam_sig] = compute_clam_sig(d, allLineList, 3911);	% Author: DrSeuss
clf;
subplot(3,1,1); plot(per_line_cpu_runtime_1); title(sprintf('Entry ID %0.1d, Author: %s', d(3909).id, d(3909).author));
subplot(3,1,2); plot(per_line_cpu_runtime_2); title(sprintf('Entry ID %0.1d, Author: %s', d(3910).id, d(3910).author));
subplot(3,1,3); plot(per_line_cpu_runtime_3); title(sprintf('Entry ID %0.1d, Author: %s', d(3911).id, d(3911).author));
From these plots, we observe that DrSeuss's code is similar to Yi Cao's in terms of CPU runtime behavior.  We can verify it by comparing the first 10 lines of the two entries
code_cell = allLineList(d(3909).lines); % Yi Cao's source code
fprintf(sprintf('%s\n', code_cell{1:10}));
code_cell = allLineList(d(3911).lines); % DrSeuss's source code
fprintf(sprintf('%s\n', code_cell{1:10}));
function [moves,score] = submit3(board)
    rand( 'state', 0 );
    rand( 57, 1 );
    [m,n] = size( board );
    pegCount = sum( board( : )>0 );
        rows = m + 4;
    rv = 5:rows;
    cols = n + 4;
    cv = 5:cols;
    fill = (pegCount - nnz( board<0 ))/(m*n);
function [AA,AB] = solver(AD)
    rand( 'state', 323 );
    rand( 57, 1 );
    [AL,n] = size( AD );
    CK = sum( AD( : )>0 );
    rows = AL + 4;
    CP = 5:rows;
    cols = n + 4;
    cv = 5:cols;
    fill = (CK - nnz( AD<0 ))/(AL*n);

Correlation Matrix

After computing the signatures of the codes, we can measure their difference quantitatively by finding the correlation between their signatures. The pair-wise correlation coefficient is stored in a matrix. Using the previous three entries as an example, the 3x3 clam_corr_matrix has the form

clam_corr_matrix = zeros(3,3);
for m = 1:3
  for n = 1:3
    clam_corr_matrix(m,n) = max(conv(d(3909-1+m).clam_sig, flipud(d(3909-1+n).clam_sig)));
  end
end
clam_corr_matrix
clam_corr_matrix =

    29     4    20
     4     7     4
    20     4    28

The (i,j)th entry of the correlation matrix shows the level of similarity between the ith code and the jth code. At (1,3) and (3,1), the coefficients are high. This means there is a high similarity between Yi Cao's work and DrSeuss's work. On the other hand, the (2,1)th coefficient in the matrix is low. This means Ken Eaton's code is very different from the other two. (NOTE: when the correlation is low, it may also mean that the signature does not have enough 1's to work with, i.e., the signature does not carry enough information about the code. These outliers can be detected by the diagonal value in the correlation matrix: a small diagonal value means that the code is hard to correlate even to itself.)

Similarity among 50 Most Recent Entries

We now possess all the necessary tools to measure the similarity of the entries. We consider the 50 most recently submitted entries in the contest and compute the correlation matrix. The matrix is displayed as an 6-bit image. The coefficients are normalized so that the maximum correlation is 63 (displayed as dark red) and minimum correlation is 0 (displayed as dark blue). We sort the entries according to the sum of their correlation coefficients with other entries. Entries with high correlation with others are located at the bottom left corner of the graph.

clear
load contest_data;
[s,p,leaders]	= prepareData;
num_entries	= 50;
d		= p(end-num_entries+1:end);

% Compute correlation matrix
clam_corr_matrix= compute_correlation(d, allLineList);
[tmp sort_id]	= sort(sum(clam_corr_matrix));  % sort the entries

% Plot correlation matrix as image
clf;
image(clam_corr_matrix(sort_id, sort_id)*64/max(max(clam_corr_matrix)));
set(gca, 'XTick', [], 'YTick', [], 'XTickLabel', [], 'YTickLabel', [])
text(-0.1*ones(num_entries,1),1:num_entries, {d(sort_id).author},'HorizontalAlignment','right','rotation',0, 'fontsize', 8);
text(1:num_entries,(1+num_entries)*ones(num_entries,1),{d(sort_id).author},'HorizontalAlignment','right','rotation',90, 'fontsize',8);
title('Correlation Matrix for 50 Most Recent Entries');
colorbar
set(gcf, 'Position', [100 100 [900 768]/(1-2*0.01)]);

From the graph we notice several interesting patterns:

1. The entries of four authors, namely tev, DrSeuss, MikeR and PU, are extremely similar to each other. In particular, DrSeuss correlates to self and to each other equally.

2. The entries of Ken Eaton and ms are different from the rest of the group. This mean that they followed some totally different algorithmic approaches.

3. Other than the two lone rangers mentioned above, all authors seem to be somewhat related. A possible reason is that they all share some common pieces of code. (Since I was not involved in that contest, I would appreciate if someone can verify whether my guess is correct.)

Similarity among 50 Most Active Authors

We now measure the similarity of the codes by 50 most active authors. For each author, we look at the most recently submitted entry. We sort the entries according to score. The high scoring entries are located at the top left corner in the plot. (NOTE: Each correlation matrix can take a few minutes to generate.)

clear
load contest_data;
[s,p,leaders]				= prepareData;
[uniqueAuthors,u_index,u_index2]	= unique({p.author});
authorCount				= hist(u_index2,min(u_index2):max(u_index2));
[sortedAuthorCount,iSortedAuthorCount] 	= sort(authorCount);
sortedAuthors				= uniqueAuthors(iSortedAuthorCount);
num_entries 				= 25;
d					= p(u_index(iSortedAuthorCount(end-num_entries+1:end)));

% Compute correlation matrix
clam_corr_matrix = compute_correlation(d, allLineList);
[tmp sort_id]	 = sort([d.score]);  % sort the entries

% Plot correlation matrix as image
clf;
image(clam_corr_matrix(sort_id, sort_id)*64/max(max(clam_corr_matrix)));
set(gca, 'XTick', [], 'YTick', [], 'XTickLabel', [], 'YTickLabel', [])
text(-0.1*ones(num_entries,1),1:num_entries, {d(sort_id).author},'HorizontalAlignment','right','rotation',0, 'fontsize', 8);
text(1:num_entries,(1+num_entries)*ones(num_entries,1),{d(sort_id).author},'HorizontalAlignment','right','rotation',90, 'fontsize',8);
title('Correlation Matrix for 50 Most Active Authors; Sorted by Score');
colorbar;
set(gcf, 'Position', [100 100 [900 768]/(1-2*0.01)]);

The following patterns are observed:

1. Out of 25 unique authors, 8 of them are different from the majority of the group. These are the lone rangers who did not collaborate with others and did not follow the mainstream approach (or they are hard to correlate with). Among them, Jan Langer performs the best.

2. There appears to be two groups of collaborators with quite similar codes. They are {JohanH, the cyclist, Alan Chalker, Bao Lei} and {BirdBrain, MikeR, David Jones, APhys}. (Since I was not involved in that contest, I would appreciate if someone can verify whether my guess is correct.)

Similarity among Entries by One Author

We now examine how the entries submitted by one author evolve over the order of submission. We consider the 38 entries created by the winning author Yi Cao. We construct the correlation matrix of all entries created by the author.

clear
load contest_data;
[s,p,leaders]	= prepareData;
d		= p(filterByAuthor(p, 'Yi Cao'));
num_entries	= min(40, size(d,1));
d		= d(end-num_entries+1:end);

% Compute correlation matrix
clam_corr_matrix = compute_correlation(d, allLineList);
[tmp sort_id]	 = sort([d.date]);  % sort the entries

% Plot correlation matrix as image
clf;
image(clam_corr_matrix(sort_id, sort_id)*64/max(max(clam_corr_matrix)));
title(sprintf('Correlation Matrix for the %0.1d Entries by %s; Sorted by Order of Submission', num_entries, d(1).author));
colorbar
set(gcf, 'Position', [100 100 [900 768]/(1-2*0.01)]);

The major breakthrough appears to come from the 18th submission. We can see that all the succeeding entries are similar to this entry. At the first 17 entries, the author was still searching for the best algorithm and therefore they are the trial-and-error stage of his entries. At the 18th submission, the author decided to focus on one approach and spend the rest of the entries fine tuning it.

Here we compute the correlation matrix of the entries by a few more authors.

clear
load contest_data;
[s,p,leaders]	= prepareData;
d		= p(filterByAuthor(p, 'Alan Chalker'));
num_entries	= min(40, size(d,1));
d		= d(end-num_entries+1:end);
% Compute correlation matrix
clam_corr_matrix = compute_correlation(d, allLineList);
[tmp sort_id]	 = sort([d.date]);  % sort the entries
% Plot correlation matrix as image
clf;
image(clam_corr_matrix(sort_id, sort_id)*64/max(max(clam_corr_matrix)));
title(sprintf('Correlation Matrix for the %0.1d Entries by %s; Sorted by Order of Submission', num_entries, d(1).author));
colorbar
set(gcf, 'Position', [100 100 [900 768]/(1-2*0.01)]);
clear
load contest_data;
[s,p,leaders]	= prepareData;
d		= p(filterByAuthor(p, 'Jan Langer'));
num_entries	= min(40, size(d,1));
d		= d(end-num_entries+1:end);
% Compute correlation matrix
clam_corr_matrix = compute_correlation(d, allLineList);
[tmp sort_id]	 = sort([d.date]);  % sort the entries
% Plot correlation matrix as image
clf;
image(clam_corr_matrix(sort_id, sort_id)*64/max(max(clam_corr_matrix)));
title(sprintf('Correlation Matrix for the %0.1d Entries by %s; Sorted by Order of Submission', num_entries, d(1).author));
colorbar
set(gcf, 'Position', [100 100 [900 768]/(1-2*0.01)]);
clear
load contest_data;
[s,p,leaders]	= prepareData;
d		= p(filterByAuthor(p, 'JohanH'));
num_entries	= min(40, size(d,1));
d		= d(end-num_entries+1:end);
% Compute correlation matrix
clam_corr_matrix = compute_correlation(d, allLineList);
[tmp sort_id]	 = sort([d.date]);  % sort the entries
% Plot correlation matrix as image
clf;
image(clam_corr_matrix(sort_id, sort_id)*64/max(max(clam_corr_matrix)));
title(sprintf('Correlation Matrix for the %0.1d Entries by %s; Sorted by Order of Submission', num_entries, d(1).author));
colorbar
set(gcf, 'Position', [100 100 [900 768]/(1-2*0.01)]);

Contact us