Path: news.mathworks.com!not-for-mail
From: "Ashish Uthama" <first.last@mathworks.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Script will take far too long..
Date: Tue, 21 Jul 2009 16:47:23 -0400
Organization: TMW
Lines: 73
Message-ID: <op.uxfqo9sha5ziv5@uthamaa.dhcp.mathworks.com>
References: <h454s2$k0u$1@fred.mathworks.com>
NNTP-Posting-Host: uthamaa.dhcp.mathworks.com
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-15
Content-Transfer-Encoding: 7bit
X-Trace: fred.mathworks.com 1248209243 20566 172.31.57.126 (21 Jul 2009 20:47:23 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Tue, 21 Jul 2009 20:47:23 +0000 (UTC)
User-Agent: Opera Mail/9.63 (Win32)
Xref: news.mathworks.com comp.soft-sys.matlab:557310


On Tue, 21 Jul 2009 15:28:02 -0400, David Kunik <kunik@ualberta.ca> wrote:

> Hey,
>
> So I have a very large data set (1.9 GB), so I am unable to load it into
> MATLAB for modification that way (this computer only has 1 GB of
> RAM).  I need to do a couple of simple regexprep calls (A_A -> 1), which is
> easy enough, but each row of data is about 1.5 MB and it takes bloody
> forever for regexprep to go through (there are a lot of A_A's).  By
> "forever" I mean ~41 minutes.  For 1 of 624 rows.
>
> I can show you some of the code I have and you can laugh all you like.
> I've just started MATLAB for my new job, so I'm just glad the code works.
>
> function readlines()
> fid = fopen('C:\Documents and Settings\xxx\xxx\Assignment 03\BreastCancerDataset_SharedData.csv','r');
> fid2 = fopen('C:\Documents and Settings\xxx\xxx\SNPDataConv.csv', 'w');
> line = fgets(fid); % Get headers first.  Yes, cheap hack.
>     for i = 1:624
>         tic;
>         line = fgets(fid);
>                                  disp('Debug: Read line')
>         line = regexprep(line, 'A_A', '1');
>                                  disp('Debug: Replaced A_A') % These 3 parts take about 15 minutes EACH.
>         line = regexprep(line, 'A_B', '2');
>                                  disp('Debug: Replaced A_B')
>         line = regexprep(line, 'B_B', '3');
>                                  disp('Debug: Replaced B_B')
>         fwrite(fid2, line);
>         toc;
>         disp(line)
>     end
> fclose(fid);
> fclose(fid2);
> end
>
> So, theoretically, there is nothing wrong with my code, as it works as I
> want it to (everything is successfully replaced/outputted/written), but
> 40 minutes for 1 line is ridiculous.  Any help optimizing this would be
> greatly appreciated.


Just for kicks, give this a try. I would be curious to know its  
performance.

Copy the text below into a text file called 'replace.pl'.
In MATLAB, ensure that replace.pl is on the MATLAB path, and invoke it as
shown in the Syntax comment at the top of the script:

--perl code below--

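# Syntax (from MATLAB): perl('replace.pl','input.csv','output.csv');

# Given an input and an output file,
# replace A_A with 1, A_B with 2 and B_B with 3.

use strict;
use warnings;

my $inFile  = shift @ARGV;
my $outFile = shift @ARGV;
die "Usage: replace.pl input.csv output.csv\n"
    unless defined $inFile && defined $outFile;

# Three-argument open with lexical handles; $! reports why an open failed.
open(my $in,  '<', $inFile)  or die "Could not open input file '$inFile': $!";
open(my $out, '>', $outFile) or die "Could not open output file '$outFile': $!";

# Process one line at a time so the whole file never has to fit in memory.
while (my $line = <$in>) {
    $line =~ s/A_A/1/g;
    $line =~ s/A_B/2/g;
    $line =~ s/B_B/3/g;
    print {$out} $line;
}

close($in);
close($out) or die "Could not close output file '$outFile': $!";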
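
To get a timing out of it, something along these lines might do on the MATLAB side. This is only a sketch: 'input.csv' and 'output.csv' are placeholder names (substitute the real paths), and it assumes replace.pl is already on the MATLAB path as described above.

```matlab
% Placeholder file names (assumptions); replace with the real paths.
inFile  = 'input.csv';
outFile = 'output.csv';

tic;
% perl() returns the script's text output and its exit status.
[result, status] = perl('replace.pl', inFile, outFile);
elapsed = toc;

% A nonzero status means the script died, e.g. a file could not be opened.
if status ~= 0
    error('replace.pl failed (status %d): %s', status, result);
end

fprintf('Conversion took %.1f seconds.\n', elapsed);
```

The elapsed time covers the whole file, so it can be compared directly against the per-line tic/toc numbers from the regexprep loop.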