From: <HIDDEN>
Newsgroups: comp.soft-sys.matlab
Subject: Script will take far too long..
Date: Tue, 21 Jul 2009 19:28:02 +0000 (UTC)
Organization: The MathWorks, Inc.


Hey,

I have a very large data set (1.9 GB), so I can't load it into MATLAB in one go for modification (this computer only has 1 GB of RAM). I need to do a couple of simple regexprep calls (e.g. A_A -> 1), which is easy enough, but each row of data is about 1.5 MB and it takes bloody forever for regexprep to go through it (there are a lot of A_A's). By "forever" I mean ~41 minutes for 1 of the 624 rows. At that rate the whole file would take roughly 18 days.

Here is the code I have; you can laugh all you like. I've only just started using MATLAB for my new job, so I'm just glad the code works.

function readlines()
% Read the input CSV one row at a time, recode the genotype strings
% (A_A -> 1, A_B -> 2, B_B -> 3) and write each row to the output file.
fid  = fopen('C:\Documents and Settings\xxx\xxx\Assignment 03\BreastCancerDataset_SharedData.csv', 'r');
fid2 = fopen('C:\Documents and Settings\xxx\xxx\SNPDataConv.csv', 'w');
line = fgets(fid); % Get headers first.  Yes, cheap hack (the header row is read and discarded).
for i = 1:624
    tic;
    line = fgets(fid); % fgets keeps the trailing newline, so fwrite below writes it too
    disp('Debug: Read line')
    % These 3 parts take about 15 minutes EACH.
    line = regexprep(line, 'A_A', '1');
    disp('Debug: Replaced A_A')
    line = regexprep(line, 'A_B', '2');
    disp('Debug: Replaced A_B')
    line = regexprep(line, 'B_B', '3');
    disp('Debug: Replaced B_B')
    fwrite(fid2, line);
    toc;
    disp(line) % echoes the whole converted row to the command window
end
fclose(fid);
fclose(fid2);
end
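
To make the replacement step concrete, here is roughly what the three regexprep calls do to a single row. The sample string below is made up for illustration (the real rows are about 1.5 MB each, and I'm assuming comma-separated fields since the file is a .csv):

```matlab
% Hypothetical sample row, NOT taken from the real data set.
sample = 'id1,A_A,A_B,B_B,A_A';

% Same three replacements as in the loop above, applied in the same order.
out = regexprep(sample, 'A_A', '1');
out = regexprep(out,    'A_B', '2');
out = regexprep(out,    'B_B', '3');

disp(out)   % prints id1,1,2,3,1
```

On a string this small the calls return instantly; the slowdown only shows up on the full-size rows.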

So, in theory, there is nothing wrong with my code, since it does what I want (everything is successfully replaced and written to the output file), but 40 minutes for one line is ridiculous. Any help optimizing this would be greatly appreciated.