From: "Ashish Uthama" <first.last@mathworks.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Script will take far too long..
Date: Tue, 21 Jul 2009 16:30:01 -0400
Organization: TMW


On Tue, 21 Jul 2009 15:28:02 -0400, David Kunik <kunik@ualberta.ca> wrote:

> Hey,
>
> So I have a very large data set (1.9 GB), so I am unable to load it into
> MATLAB for modification that way (this computer only has 1 GB of
> RAM). I need to do a couple of simple regexprep calls (A_A -> 1), which is
> easy enough, but each row of data is about 1.5 MB and it takes bloody
> forever for regexprep to go through (there are a lot of A_A's). By
> "forever" I mean ~41 minutes. For 1 of 624 rows.
>
> I can show you some of the code I have and you can laugh all you like.
> I've just started MATLAB for my new job, so I'm just glad the code works.
>
> function readlines()
> fid = fopen('C:\Documents and Settings\xxx\xxx\Assignment 03\BreastCancerDataset_SharedData.csv','r');
> fid2 = fopen('C:\Documents and Settings\xxx\xxx\SNPDataConv.csv', 'w');
> line = fgets(fid); % Get headers first.  Yes, cheap hack.
>     for i = 1:624
>         tic;
>         line = fgets(fid);
>         disp('Debug: Read line')
>         line = regexprep(line, 'A_A', '1');
>         disp('Debug: Replaced A_A') % These 3 parts take about 15 minutes EACH.
>         line = regexprep(line, 'A_B', '2');
>         disp('Debug: Replaced A_B')
>         line = regexprep(line, 'B_B', '3');
>         disp('Debug: Replaced B_B')
>         fwrite(fid2, line);
>         toc;
>         disp(line)
>     end
> fclose(fid);
> fclose(fid2);
> end
>
> So, theoretically, there is nothing wrong with my code, as it works as I
> want it to (everything is successfully replaced/output/written), but
> 40 minutes for 1 line is ridiculous. Any help optimizing this would be
> greatly appreciated.

For one, are you sure you don't need the 'rt' mode in FOPEN?
See the help on FGETS:

     FGETS is intended for use with files that contain newline characters.
     Given a file with no newline characters, FGETS may take a long time to
     execute.

So if you open with 'r' (binary mode), FGETS might not recognize the line
endings and could read in the full file anyway (in which case you are
processing the full file in *each* loop iteration). An easy way to check
this would be to single-step (debug) your code and check the size of 'line'.
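
If you would rather not single-step, the same check can be scripted. This is only a rough sketch: it builds a tiny stand-in file (the file name and its contents are made up for illustration, not your real data) so that it runs anywhere, then compares what FGETS returns against the total file size. To try it on the real CSV, point fname at that file and drop the lines that create and delete the stand-in.

```matlab
% Build a tiny stand-in CSV (hypothetical data) so the check runs anywhere.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'id,snp1,snp2\n');
fprintf(fid, '1,A_A,A_B\n');
fprintf(fid, '2,B_B,A_A\n');
fclose(fid);

info = dir(fname);                 % info.bytes holds the total file size

fid = fopen(fname, 'rt');          % 'rt' = read, text mode
if fid == -1
    error('Could not open %s', fname);
end
header = fgets(fid);               % header row, newline included
row = fgets(fid);                  % first data row, newline included
fclose(fid);

fprintf('File size: %d bytes\n', info.bytes);
fprintf('Header: %d chars, first row: %d chars\n', numel(header), numel(row));
% If numel(row) is close to info.bytes, FGETS is not splitting on line breaks.

delete(fname);                     % remove the stand-in file
```

On your data, a first row of roughly 1.5 MB would mean FGETS is splitting lines as expected; something close to the full 1.9 GB would point to the line-ending problem described above.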