Thread Subject: Script will take far too long..

Subject: Script will take far too long..

From: David Kunik

Date: 21 Jul, 2009 19:28:02

Message: 1 of 12

Hey,

So I have a very large data set (1.9 GB), and I am unable to load it into MATLAB for modification in one go (this computer only has 1 GB of RAM). I need to do a couple of simple regexprep calls (A_A -> 1), which is easy enough, but each row of data is about 1.5 MB and it takes bloody forever for regexprep to go through it (there are a lot of A_A's). By "forever" I mean ~41 minutes, for 1 of 624 rows.

I can show you some of the code I have and you can laugh all you like. I've just started MATLAB for my new job, so I'm just glad the code works.

function readlines()
fid = fopen('C:\Documents and Settings\xxx\xxx\Assignment 03\BreastCancerDataset_SharedData.csv','r');
fid2 = fopen('C:\Documents and Settings\xxx\xxx\SNPDataConv.csv', 'w');
line = fgets(fid); % Get headers first. Yes, cheap hack.
for i = 1:624
    tic;
    line = fgets(fid);
    disp('Debug: Read line')
    line = regexprep(line, 'A_A', '1');
    disp('Debug: Replaced A_A') % These 3 parts take about 15 minutes EACH.
    line = regexprep(line, 'A_B', '2');
    disp('Debug: Replaced A_B')
    line = regexprep(line, 'B_B', '3');
    disp('Debug: Replaced B_B')
    fwrite(fid2, line);
    toc;
    disp(line)
end
fclose(fid);
fclose(fid2);
end

So, theoretically, there is nothing wrong with my code, as it works the way I want it to (everything is successfully replaced and written out), but 40 minutes for 1 line is ridiculous. Any help optimizing this would be greatly appreciated.

Subject: Script will take far too long..

From: Shanmugam Kannappan

Date: 21 Jul, 2009 19:41:02

Message: 2 of 12

"David Kunik" <kunik@ualberta.ca> wrote in message <h454s2$k0u$1@fred.mathworks.com>...

Hi!

I am not really clear on your explanation, but
from the code it seems you are replacing something and writing it to another file.
Why don't you try fread instead of fgets?
fread will read the whole file into a single variable, which you can then pass to regexprep.

Shan....

Subject: Script will take far too long..

From: David Kunik

Date: 21 Jul, 2009 20:14:02

Message: 3 of 12

"Shanmugam Kannappan" <shanmugambe@gmail.com> wrote in message <h455ke$ab2$1@fred.mathworks.com>...
> Hi!
>
> I am not really clear with your explanation but
> from the code it seems like replacing something & writing it to other file,
> why dont you try fread instead of fgets.
> fread will read all the strings from the file to a single variable & replace using regexprep.....
>
> Shan....

Good idea, however loading 2 GB of data with fread does not help me much, as this computer doesn't have enough memory. Obviously it's time for a computer upgrade, but there must be a way to make this faster? Maybe I just need to accept that this is how long it takes to modify 1.5 MB text strings on the fly.
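
For what it's worth, fread does not have to read the whole file at once. Below is a minimal sketch of reading fixed-size chunks so that only one chunk is in memory at a time. It assumes the fields are separated by commas or newlines, so each chunk can be cut right after its last delimiter and the unfinished field carried over to the next chunk. The file names, the sample data and the variable names (chunkSize, carry) are made up for illustration, and strrep is used for the literal replacements.

```matlab
% Sketch: process a text file in fixed-size chunks read with fread.
% Assumption: fields are delimited by commas or newlines.
inFile  = [tempname '.csv'];           % hypothetical input file
outFile = [tempname '.csv'];           % hypothetical output file

fid = fopen(inFile, 'w');              % small sample input
fprintf(fid, 's1,A_A,A_B,B_B\ns2,B_B,A_A,A_A\n');
fclose(fid);

chunkSize = 8;                         % tiny on purpose; use e.g. 2^20 for real data
fin   = fopen(inFile, 'r');
fout  = fopen(outFile, 'w');
carry = '';                            % unfinished field from the previous chunk

while true
    chunk = fread(fin, chunkSize, '*char')';   % row vector of characters
    if isempty(chunk)
        break                          % end of file
    end
    buf = [carry chunk];
    cut = find(buf == ',' | buf == char(10), 1, 'last');
    if isempty(cut)
        carry = buf;                   % no delimiter yet, keep accumulating
        continue
    end
    piece = buf(1:cut);                % complete fields only
    carry = buf(cut+1:end);            % possibly split field, handled next round
    piece = strrep(piece, 'A_A', '1');
    piece = strrep(piece, 'A_B', '2');
    piece = strrep(piece, 'B_B', '3');
    fwrite(fout, piece);
end

% Flush whatever remains after the final delimiter.
carry = strrep(carry, 'A_A', '1');
carry = strrep(carry, 'A_B', '2');
carry = strrep(carry, 'B_B', '3');
fwrite(fout, carry);

fclose(fin);
fclose(fout);
result = fileread(outFile);
```

Cutting at a delimiter is what keeps a pattern such as A_A from being split across two chunks and missed.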

Subject: Script will take far too long..

From: Ashish Uthama

Date: 21 Jul, 2009 20:30:01

Message: 4 of 12

For one, are you sure you don't need the 'rt' mode in FOPEN?
See the help on FGETS:

     FGETS is intended for use with files that contain newline characters.
     Given a file with no newline characters, FGETS may take a long time to
     execute.

So if you open in 'r' (binary mode), it might not recognize the newline and
read in the full file anyway (in which case you are processing the full file
in *each* loop iteration). An easy way to check this would be to single-step
(debug) your code and check the size of 'line'.
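
A quick way to run that check without the debugger, as a rough sketch: read one line and compare its length with the size of the file. The sample file and the variable names here are placeholders; with the real data, point fopen at the actual CSV instead.

```matlab
% Sketch: see how many characters fgets returns for the first line.
sampleFile = [tempname '.csv'];        % hypothetical file name
fid = fopen(sampleFile, 'w');          % create a tiny two-line sample
fprintf(fid, 'h1,h2\nA_A,B_B\n');
fclose(fid);

fid = fopen(sampleFile, 'rt');         % 'rt' = read, text mode
firstLine = fgets(fid);                % includes the line terminator
fclose(fid);

info = dir(sampleFile);                % info.bytes is the file size
fprintf('fgets returned %d characters; the file has %d bytes\n', ...
        numel(firstLine), info.bytes);
```

If the first number is close to the second on the real file, fgets is not finding the line breaks.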

Subject: Script will take far too long..

From: Alan B

Date: 21 Jul, 2009 20:30:19

Message: 5 of 12

"David Kunik" <kunik@ualberta.ca> wrote in message <h457ia$edm$1@fred.mathworks.com>...
> Good idea, however loading 2GB of data in to fread does not help me that much as this computer doesn't have enough memory. Obviously it's time for a computer upgrade, but there must be a way to make this faster? Maybe i just need to accept that this is how long it takes to modify 1.5mb text strings on the fly..

strrep might be faster than regexprep, unless regexprep is doing a check for trivial cases. I'm not sure how much that would help.
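
A minimal sketch of what that swap could look like, assuming the three patterns are plain literal strings (which they are here, so no regular-expression features are lost). The file names and sample data are placeholders. Two things differ from the original function: the loop runs until end of file instead of a fixed 624 rows, and the header line is copied through instead of being dropped; remove the first fwrite to match the original behaviour.

```matlab
% Sketch: line-by-line replacement with strrep instead of regexprep.
inFile  = [tempname '.csv'];           % hypothetical input file
outFile = [tempname '.csv'];           % hypothetical output file

fid = fopen(inFile, 'w');              % small sample: header plus two rows
fprintf(fid, 'id,snp1,snp2,snp3\n');
fprintf(fid, 's1,A_A,A_B,B_B\n');
fprintf(fid, 's2,B_B,A_A,A_A\n');
fclose(fid);

fin  = fopen(inFile, 'r');
fout = fopen(outFile, 'w');

header = fgets(fin);                   % header line, copied unchanged
fwrite(fout, header);

txt = fgets(fin);
while ischar(txt)                      % fgets returns -1 at end of file
    txt = strrep(txt, 'A_A', '1');
    txt = strrep(txt, 'A_B', '2');
    txt = strrep(txt, 'B_B', '3');
    fwrite(fout, txt);
    txt = fgets(fin);
end

fclose(fin);
fclose(fout);
result = fileread(outFile);
```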

Subject: Script will take far too long..

From: Ashish Uthama

Date: 21 Jul, 2009 20:47:23

Message: 6 of 12

Just for kicks, give this a try. I would be curious to know its
performance.

Copy the text below into a text file called 'replace.pl'.
In MATLAB, ensure that replace.pl is on the MATLAB path, and invoke it as
shown:

--perl code below--

# Syntax (from MATLAB): perl('replace.pl','input.csv','output.csv');

# Given an input and an output file,
# replace A_A with 1, A_B with 2 and B_B with 3.

use strict;
use warnings;

my $inFile  = shift @ARGV;
my $outFile = shift @ARGV;

open(my $in,  '<', $inFile)  or die "Could not open input file: $!";
open(my $out, '>', $outFile) or die "Could not open output file: $!";

# Process one line at a time so the whole file is never in memory.
while (<$in>) {
     s/A_A/1/g;
     s/A_B/2/g;
     s/B_B/3/g;
     print $out $_;
}

close($in);
close($out);

Subject: Script will take far too long..

From: Jan Simon

Date: 21 Jul, 2009 23:08:02

Message: 7 of 12

Dear David Kunik!

As Alan already wrote, STRREP is much faster than REGEXPREP.
In some tests with 1.5 MB strings and 6000 occurrences of 'A_A', STRREP takes less than 0.1 sec, while REGEXPREP needs 36 sec.

I'm interested in time measurements of the perl method also!

Good luck, Jan

Subject: Script will take far too long..

From: Rune Allnor

Date: 21 Jul, 2009 23:56:19

Message: 8 of 12

On 21 Jul, 21:28, "David Kunik" <ku...@ualberta.ca> wrote:
> Hey,
>
> So i have a very large data set (1.9GB) so i am unable to load this in to Matlab for modification that way (this computer only has 1gb of ram).  I need to do a couple simple regexprep calls (A_A -> 1) which is easy enough but each row of data is like 1.5MB and it takes bloody forever for regexprep to go through (there is allot of A_A's..).  By "forever" i mean ~41 minutes.  For 1/624 rows.

There is something wrong. A 1GB computer should have no problems
whatsoever with handling lines of 1.5 MB.

Most likely, the file was generated by a different type of
computer than yours (presumably a PC). If so, the lines are
ended by different characters than FGETS or FGETL look for.

I don't know how to configure FGETS or FGETL to change
End-of-Line characters, so the second best is if you know
how many characters there are in the line.

Or use some other computer and re-format the file.
If this is a text file, open it in MSWordPad and
then store it as a .txt file. That way, End-of-Line
characters are changed to what matlab can recognize.

Rune

Subject: Script will take far too long..

From: Jan Simon

Date: 22 Jul, 2009 00:24:01

Message: 9 of 12

Dear Rune Allnor!

> There is something wrong. A 1GB computer should have no problems
> whatsoever with handling lines of 1.5 MB.

Waiting 40 min for regexprep is not a "problem", but just slow.
Matlab 6.5 takes 10 min for replacing 125.000 'A_A' with '1' in a 1.5MB string on my 1500MHz PentiumM -- without file access!

> Most likely, the file was generated by a different type of
> computer than your - presumably - PC. If so, the lines are
> ended by different characters than FGETS or FGETL look for.

FGETL calls FGETS, and the latter can handle all PC/MacOS9/Unix line breaks without problems. Even FOPEN(RB) versus (RT) does not matter, because FGETL cuts off the line break (of any style) that FGETS has found.
Therefore the text file does not need a conversion.

Example:
fid = fopen('test.txt', 'wb');
fwrite(fid, ['Line1', 10, 'Line2', 13, 10, 'Line3', 13, 'END'], 'uchar');
fclose(fid);
fid = fopen('test.txt', 'rb');
fgetl(fid), fgetl(fid), fgetl(fid)
fclose(fid);
fid = fopen('test.txt', 'rt');
fgetl(fid), fgetl(fid), fgetl(fid)
fclose(fid);

So REGEXPREP -> STRREP or the nice perl trick should give enough speed.

Good night, Jan

Subject: Script will take far too long..

From: David Kunik

Date: 22 Jul, 2009 15:04:02

Message: 10 of 12

Thank you all for your help. My script is scooting along as we speak, using strrep. I have not been able to test the perl script on my data as this computer does not have perl installed and it is not mine.

Thanks again,
Dave.

Subject: Script will take far too long..

From: Rune Allnor

Date: 22 Jul, 2009 16:04:37

Message: 11 of 12

On 22 Jul, 02:24, "Jan Simon" <matlab.THIS_Y...@nMINUSsimon.de> wrote:
> Dear Rune Allnor!
>
> > There is something wrong. A 1GB computer should have no problems
> > whatsoever with handling lines of 1.5 MB.
>
> Waiting 40 min for regexprep is not a "problem", but just slow.

The time is a problem for several reasons:

1) It takes several orders of magnitude longer
   than it needs to (see below)
2) The time is the difference between the job
   getting done at all, or not.

> Matlab 6.5 takes 10 min for replacing 125.000 'A_A' with '1' in a 1.5MB string on my 1500MHz PentiumM -- without file access!

On R2006a:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 1500000;
s = char('B'*ones(1,N));
Naa = 200000;

for n=1:5:5*Naa
s(n:n+2)='A_A';
end

rexp = 'A_A';
tic
regexprep(s,rexp,'1');
toc

tic
strrep(s,'A_A','1');
toc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Elapsed time is 65.978407 seconds.
Elapsed time is 0.018985 seconds.

So the OP should be able to do the whole job in a
couple of seconds. With his present computer.

Rune
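
Before swapping one function for the other, it may be worth confirming that both give the same output. The following is a small sketch of such a check on a scaled-down version of the test string above; the variable names viaRegexp, viaStrrep and sameOutput are made up here. It only covers a literal pattern whose occurrences do not overlap; regexprep and strrep may treat overlapping occurrences differently, so the check should be repeated on a sample of the real data.

```matlab
% Sketch: compare regexprep and strrep output for a literal pattern.
N   = 15000;                           % scaled-down string length
s   = repmat('B', 1, N);
Naa = 2000;                            % number of 'A_A' occurrences

for n = 1:5:5*Naa
    s(n:n+2) = 'A_A';                  % occurrences are 5 apart, so no overlap
end

viaRegexp = regexprep(s, 'A_A', '1');
viaStrrep = strrep(s, 'A_A', '1');

sameOutput = isequal(viaRegexp, viaStrrep);
disp(sameOutput)
```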

Subject: Script will take far too long..

From: Ashish Uthama

Date: 22 Jul, 2009 16:37:18

Message: 12 of 12

On Wed, 22 Jul 2009 11:04:02 -0400, David Kunik <kunik@ualberta.ca> wrote:

> Thank you all for your help. My script is scooting along as we speak,
> using strrep. I have not been able to test the perl script on my data
> as this computer does not have perl installed and it is not mine.
>
> Thanks again,
> Dave.


STRREP should help.

Side note: You don't need to install Perl; MATLAB comes with it!
