3.8 | 5 ratings | 27 downloads (last 30 days) | File Size: 296.69 KB | File ID: #23594

Finding the Similar Entries: A Quantitative Approach based on CPU Runtime Behavior

by C Jethro Lam

 

08 Apr 2009 (Updated 08 Apr 2009)

No BSD License  

Entry to Matlab contest Spring 2009


File Information
Description

 In this work, we are interested in the following questions:

 1. How do we measure the similarity between two codes? (existence of similarity)

 2. How do we identify entries that are similar to each other? (similarity with others)

 3. How do the entries by one author evolve over time? (similarity with self)

 In order to define 'similarity', one must first define a measure of 'difference'. Some intuitive methods compare the number of characters, compare the number of nodes, or inspect the function and variable names. However, these methods can be defeated by simple code obfuscation.

 In this work, we introduce a measure of code similarity that is relatively immune to code obfuscation. The proposed approach is based on the algorithmic performance of the code. A piece of code consists of many operational statements (a=b+c), branching statements (if/then/else), memory allocation statements (zeros(100,1)), etc., which appear in an order characteristic of the author's coding style. When the code is executed, each statement takes up a certain amount of CPU runtime. If we measure and record the variation of CPU runtime across the lines of the code, we obtain a signature that is unique to each author, provided that the code is sufficiently complicated. By correlating the signatures, we obtain a quantitative measure of the similarity of the codes.
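 As a rough illustration of the correlation step only, the sketch below compares three made-up per-line runtime signatures. It is not the code shipped in this submission: the timing values, the variable names (sigA, sigB, sigC, nPts, toCommonLength), and the choice to resample every signature onto a common length with linear interpolation before correlating are all assumptions made for the example. In practice the per-line timings would have to be measured, for instance with the MATLAB profiler, rather than typed in.

```matlab
% Hypothetical per-line CPU times in seconds (made-up numbers, not contest data).
sigA = [0.01 0.20 0.05 0.90 0.02 0.30];                 % entry A, 6 lines
sigB = [0.02 0.22 0.04 0.85 0.03 0.28];                 % entry B, close to A
sigC = [0.50 0.01 0.40 0.02 0.60 0.03 0.45 0.02 0.55];  % entry C, 9 lines

% Assumption: signatures of different lengths are compared by resampling
% each one onto nPts evenly spaced points over normalized line position.
nPts = 50;
toCommonLength = @(s) interp1(linspace(0, 1, numel(s)), s, ...
                              linspace(0, 1, nPts));

% One resampled signature per row.
S = [toCommonLength(sigA); toCommonLength(sigB); toCommonLength(sigC)];

% corrcoef treats columns as variables, so transpose to put entries in columns.
% R(i,j) is the correlation between the signatures of entries i and j.
R = corrcoef(S');
disp(R)
```

 With these inputs, the A-B entry of R is much larger than the A-C entry, which is the kind of contrast a similarity measure of this sort would rely on.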

Acknowledgements

The author wishes to acknowledge the following in the creation of this submission:
bsxfun, MATLAB Contest - Data Visualization, MATLAB Contest Statistics

MATLAB release: MATLAB 7.8 (R2009a)
Other requirements: Please download contest_data.mat from ID 23509 "MATLAB Contest - Data Visualization" starter kit
Zip File Content
Published M Files: Finding the Similar Entries: A Quantitative Approach based on CPU Runtime Behavior
Other Files:
bsxarg.c,
bsxarg.m,
bsxarg.mexw32,
bsxfun.m,
code.m,
compute_clam_sig.m,
compute_correlation.m,
filterByAuthor.m,
html/main_publish.png,
html/main_publish_01.png,
html/main_publish_02.png,
html/main_publish_03.png,
html/main_publish_04.png,
html/main_publish_05.png,
html/main_publish_06.png,
html/main_publish_07.png,
html/main_publish_08.png,
html/main_publish_09.png,
html/Thumbs.db,
main_publish.m,
mostActive.m,
prepareData.m,
test_code.m,
testsuite_sample.mat
Comments and Ratings (7)
08 Apr 2009 Alan Chalker

This is a VERY cool analysis and definitely the best entry so far in the spirit of the competition, in that it visualizes some new nuggets of information data mined out of the data set.

08 Apr 2009 Kenneth Eaton

I can verify that your findings about my submissions were correct: my submissions for that contest were totally out of left field and very different from the others. I was off in my own little world trying out different codes without looking at anything anyone else was doing. =)

08 Apr 2009 Yi Cao

Beautiful analysis. The similarity measure is novel.

08 Apr 2009 us

contains bsxfun
(clashing with ML's stock function of the same name:
Warning: Function ...\bsxfun.m has the same name as a
MATLAB builtin. We suggest you rename the function to avoid a potential
name conflict.
) created by james tursa - but he is not being acknowledged anywhere in this submission for his nice contribution...

a collection of incomprehensible functions, which yield this when run one by one
help code
No help found for code.m.
code
??? Input argument "board" is undefined.
help compute_clam_sig
  Filename: compute_clam_sig.m
  Author: C Jethro Lam, jethrolam@gmail.com
  Date: 4/4/2009
  Purpose: Compute the clam signature of an entry
 compute_clam_sig
??? Input argument "entry_id" is undefined.
help compute_correlation
  Filename: compute_and_plot_correlation.m
  Author: C Jethro Lam, jethrolam@gmail.com
  Date: 4/4/2009
  Purpose: Compute and plot the correlation matrix
compute_correlation
??? Input argument "d" is undefined.
help filterByAuthor
No help found for filterByAuthor.m.
filterByAuthor
??? Input argument "d" is undefined.
help main_publish
  Finding the Similar Entries: A Quantitative Approach based on CPU Runtime Behavior
  Chunwei Jethro Lam, jethrolam@gmail.com, April 8 2009
    Published output in the Help browser
       showdemo main_publish
main_publish
??? Error using ==> load
Unable to read file contest_data: No such file or directory.
help mostActive
No help found for mostActive.m.
mostActive
??? Input argument "s" is undefined.
help prepareData
No help found for prepareData.m.
prepareData
??? Error using ==> load
Unable to read file contest_data.mat: No such file or directory.
help test_code
  Entry ID: 42204. Author: JohanH
test_code
??? Index exceeds matrix dimensions.

what - does anyone think - does the average ML user gain from this...

us

08 Apr 2009 C Jethro Lam

Thanks for your comments!

I want to acknowledge Matthew Simoneau for his work "MATLAB Contest Statistics" (ID 23510), and James Tursa, who wrote bsxfun. I didn't really use bsxfun in my own code, but one of the entries I am testing does. You can delete bsxfun if you have R2009a.

us:
You have to get "contest_data.mat" first and then run "main_publish.m". In the format of this contest, all documentation is included in the published M-file.

09 Apr 2009 Rajiv Narayan

Really like this approach to comparing code.

09 Apr 2009 Doug Hull

To answer the question of using data not normally on the MATLAB path, I offer the following modification.

% Warn and open the download page if the contest data is not on the path.
if ~exist('contest_data.mat','file')
    warning('This was an entry to the MATLAB programming contest (http://www.mathworks.com/contest/datavis/home.html). Please download the contest data and unzip it to place contest_data.mat on your MATLAB path.')
    web('http://www.mathworks.com/matlabcentral/fileexchange/23509?controller=file_infos&download=true')
end

Updates
08 Apr 2009

I did not change the M-files that I submitted. I only added acknowledgements to the front info page.

Tag Activity for this File
Tag | Applied By | Date/Time
vis2009 | C Jethro Lam | 08 Apr 2009 12:16:33
vis2009 | Alan Chalker | 08 Apr 2009 12:40:26
vis2009 | Yi Cao | 08 Apr 2009 14:38:13
vis2009 | Steve Hoelzer | 08 Apr 2009 16:32:06
vis2009 | Nathan | 08 Apr 2009 17:29:07
vis2009 | Andreas Bonelli | 09 Apr 2009 04:23:15
vis2009 | Rafal Kasztelanic | 09 Apr 2009 04:25:18
vis2009 | Rajiv Narayan | 09 Apr 2009 05:14:06
vis2009 | Matthew Simoneau | 09 Apr 2009 12:25:43
 
