Path: news.mathworks.com!not-for-mail
From: "Alan B" <monguin61REM@OVEyahoo.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Script will take far too long..
Date: Tue, 21 Jul 2009 20:30:19 +0000 (UTC)
Organization: UT
Lines: 44
Message-ID: <h458gr$fe0$1@fred.mathworks.com>
References: <h454s2$k0u$1@fred.mathworks.com> <h455ke$ab2$1@fred.mathworks.com> <h457ia$edm$1@fred.mathworks.com>
Reply-To: "Alan B" <monguin61REM@OVEyahoo.com>
NNTP-Posting-Host: webapp-03-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1248208219 15808 172.30.248.38 (21 Jul 2009 20:30:19 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Tue, 21 Jul 2009 20:30:19 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1446885
Xref: news.mathworks.com comp.soft-sys.matlab:557303


"David Kunik" <kunik@ualberta.ca> wrote in message <h457ia$edm$1@fred.mathworks.com>...
> "Shanmugam Kannappan" <shanmugambe@gmail.com> wrote in message <h455ke$ab2$1@fred.mathworks.com>...
> > "David Kunik" <kunik@ualberta.ca> wrote in message <h454s2$k0u$1@fred.mathworks.com>...
> > > Hey,
> > > 
> > > I have a very large data set (1.9 GB), so I am unable to load it into MATLAB for modification that way (this computer only has 1 GB of RAM).  I need to do a couple of simple regexprep calls (A_A -> 1), which is easy enough, but each row of data is about 1.5 MB and it takes bloody forever for regexprep to go through (there are a lot of A_A's..).  By "forever" I mean ~41 minutes.  For 1 of 624 rows.
> > > 
> > > I can show you some of the code I have and you can laugh all you like.  I've just started MATLAB for my new job, so I'm just glad the code works.
> > > 
> > > function readlines()
> > > fid = fopen('C:\Documents and Settings\xxx\xxx\Assignment 03\BreastCancerDataset_SharedData.csv','r');
> > > fid2 = fopen('C:\Documents and Settings\xxx\xxx\SNPDataConv.csv', 'w');
> > > line = fgets(fid); % Get headers first.  Yes, cheap hack.
> > >     for i = 1:624
> > >         tic;
> > >         line = fgets(fid);
> > >                                  disp('Debug: Read line')
> > >         line = regexprep(line, 'A_A', '1');
> > >                                  disp('Debug: Replaced A_A') % These 3 parts take about 15 minutes EACH.
> > >         line = regexprep(line, 'A_B', '2');
> > >                                  disp('Debug: Replaced A_B')
> > >         line = regexprep(line, 'B_B', '3');
> > >                                  disp('Debug: Replaced B_B')
> > >         fwrite(fid2, line);
> > >         toc;
> > >         disp(line)
> > >     end
> > > fclose(fid);
> > > fclose(fid2);
> > > end
> > > 
> > > So, theoretically, there is nothing wrong with my code, as it works the way I want it to (everything is successfully replaced and written out), but 40 minutes for 1 line is ridiculous.  Any help optimizing this would be greatly appreciated.
> > 
> > Hi!
> > 
> > I am not entirely clear on your explanation, but
> > from the code it looks like you are replacing something & writing it to another file.
> > Why don't you try fread instead of fgets?
> > fread will read the whole file into a single variable, and you can then run regexprep on that.....
> > 
> > Shan....
> Good idea, but reading 2 GB of data with fread does not help me much, as this computer doesn't have enough memory.  Obviously it's time for a computer upgrade, but there must be a way to make this faster?  Maybe I just need to accept that this is how long it takes to modify 1.5 MB text strings on the fly..

strrep might be faster than regexprep here, since your patterns are fixed strings rather than real regular expressions, unless regexprep already checks for trivial cases like that. I'm not sure how much it would help.
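
Something like the following is what I have in mind. It is only a sketch: the sample row and the size of the synthetic timing string are made up, and I have not timed it on your data, so treat the speed difference as something to measure rather than a promise.

```matlab
% Hypothetical sample row; the real rows are about 1.5 MB each.
row = 'sample1,A_A,A_B,B_B,A_A';

% strrep does literal substring replacement, so no regex engine is involved.
converted = strrep(row, 'A_A', '1');
converted = strrep(converted, 'A_B', '2');
converted = strrep(converted, 'B_B', '3');

% Sanity check: the result should match the original regexprep version.
expected = regexprep(row, 'A_A', '1');
expected = regexprep(expected, 'A_B', '2');
expected = regexprep(expected, 'B_B', '3');
assert(isequal(converted, expected));

% Rough timing on a larger synthetic row (size chosen arbitrarily).
big = repmat('A_A,A_B,B_B,', 1, 100000);
tic; outStrrep = strrep(big, 'A_A', '1');    tStrrep = toc;
tic; outRegexp = regexprep(big, 'A_A', '1'); tRegexp = toc;
fprintf('strrep: %.3f s, regexprep: %.3f s\n', tStrrep, tRegexp);
```

If strrep does come out ahead, the three regexprep calls inside your loop can be swapped for the three strrep calls with no other changes.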