Path: news.mathworks.com!not-for-mail
From: "Shanmugam Kannappan" <shanmugambe@gmail.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Script will take far too long..
Date: Tue, 21 Jul 2009 19:41:02 +0000 (UTC)
Organization: Tata Elxsi
Lines: 39
Message-ID: <h455ke$ab2$1@fred.mathworks.com>
References: <h454s2$k0u$1@fred.mathworks.com>
Reply-To: "Shanmugam Kannappan" <shanmugambe@gmail.com>
NNTP-Posting-Host: webapp-02-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1248205262 10594 172.30.248.37 (21 Jul 2009 19:41:02 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Tue, 21 Jul 2009 19:41:02 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1441885
Xref: news.mathworks.com comp.soft-sys.matlab:557283


"David Kunik" <kunik@ualberta.ca> wrote in message <h454s2$k0u$1@fred.mathworks.com>...
> Hey,
> 
> So I have a very large data set (1.9 GB), so I am unable to load it into MATLAB for modification (this computer only has 1 GB of RAM).  I need to do a couple of simple regexprep calls (A_A -> 1), which is easy enough, but each row of data is about 1.5 MB and it takes bloody forever for regexprep to go through it (there are a lot of A_A's).  By "forever" I mean ~41 minutes.  For 1 of 624 rows.
> 
> I can show you some of the code I have and you can laugh all you like.  I've just started MATLAB for my new job, so I'm just glad the code works.
> 
> function readlines()
> fid = fopen('C:\Documents and Settings\xxx\xxx\Assignment 03\BreastCancerDataset_SharedData.csv','r');
> fid2 = fopen('C:\Documents and Settings\xxx\xxx\SNPDataConv.csv', 'w');
> line = fgets(fid); % Get headers first.  Yes, cheap hack.
>     for i = 1:624
>         tic;
>         line = fgets(fid);
>                                  disp('Debug: Read line')
>         line = regexprep(line, 'A_A', '1');
>                                  disp('Debug: Replaced A_A') % These 3 parts take about 15 minutes EACH.
>         line = regexprep(line, 'A_B', '2');
>                                  disp('Debug: Replaced A_B')
>         line = regexprep(line, 'B_B', '3');
>                                  disp('Debug: Replaced B_B')
>         fwrite(fid2, line);
>         toc;
>         disp(line)
>     end
> fclose(fid);
> fclose(fid2);
> end
> 
> So, theoretically, there is nothing wrong with my code, as it works the way I want it to (everything is successfully replaced and written out), but 40 minutes for 1 line is ridiculous.  Any help optimizing this would be greatly appreciated.

Hi!

I'm not entirely clear on your explanation, but from the code
it looks like you are replacing some strings and writing the result to another file.
Why don't you try fread instead of fgets?
fread can read a large block of the file into a single variable (pass a size argument, since the whole 1.9 GB file will not fit in 1 GB of RAM), and you can then run regexprep on that block.
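
Something along these lines might work. This is only a rough sketch: the function name convert_in_chunks, its arguments and the chunk size are made up for illustration, and I am assuming the fields are separated by commas or newlines so that a chunk can always be cut right after a delimiter without splitting a code like A_A in half.

```matlab
function convert_in_chunks(inFile, outFile, chunkSize)
% Hypothetical helper (not from the original post): replace the genotype
% codes block by block so the whole file never has to fit in memory.
fin  = fopen(inFile, 'r');
fout = fopen(outFile, 'w');
carry = '';   % tail of the previous block that has not been processed yet
while true
    % Read up to chunkSize bytes and return them as a row of characters.
    chunk = fread(fin, chunkSize, 'uint8=>char')';
    if isempty(chunk)
        break   % end of file
    end
    buf = [carry, chunk];
    % Assumption: fields end with ',' or a newline (char(10)), so cutting
    % right after the last delimiter never splits A_A, A_B or B_B.
    cut = find(buf == ',' | buf == char(10), 1, 'last');
    if isempty(cut)
        carry = buf;   % no delimiter in this block yet, keep accumulating
        continue
    end
    carry = buf(cut+1:end);
    fwrite(fout, replace_codes(buf(1:cut)));
end
% Whatever is left after the last delimiter still needs converting.
if ~isempty(carry)
    fwrite(fout, replace_codes(carry));
end
fclose(fin);
fclose(fout);
end

function s = replace_codes(s)
% Same three replacements as in the original loop.
s = regexprep(s, 'A_A', '1');
s = regexprep(s, 'A_B', '2');
s = regexprep(s, 'B_B', '3');
end
```

You would call it as convert_in_chunks(inFile, outFile, 50e6), with your own two paths and a block size that fits comfortably in memory. Note that, unlike your loop, this does not skip the header line; it is passed through the same replacements and written out.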

Shan....