Thread Subject: Fastest way to get the number of lines

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 25 Aug, 2008 20:41:03

Message: 1 of 11

I have a gigantic .csv file (about 7-9GB), which contains
about 6.5 million lines of numbers. Each row contains about
15,000 comma-delimited values.

Currently I am using TEXTSCAN to extract only the first
column to determine the number of lines in the file. It
took 4-5 hours on a 3GHz Pentium IV. Is there a better
way to just get the number of lines? Thanks a lot.

I already skip the other columns when reading:
col = textscan( fid, ['%f' repmat('%*f',1,14999)], -1, ...
'delimiter', ',');
numLines = length(col{1});

Thanks a lot in advance.


Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 25 Aug, 2008 20:49:02

Message: 2 of 11

Sorry - the file size is actually 37.0 GB!!

Thanks a lot in advance

Subject: Fastest way to get the number of lines

From: Walter Roberson

Date: 25 Aug, 2008 22:14:17

Message: 3 of 11

Pete sherer wrote:
> I have a gigantic .csv file (about 7-9GB), which contains
> about 6.5 million lines of numbers. Each row contains about
> 15,000 data in comma delimiter format.
>
> Currently I am using TEXTSCAN to extract only the first
> column to determine the number of lines in the file. It
> took 4-5 hours on 3GHz pentium IV. Are there any better
> solution to just get the number of lines?

If you are using Windows OS, use Matlab's perl interface: the perl
code is trivial and should be fairly fast:

For example, store the below two lines in countlines.pl

while (<>) {};
print $.,"\n";

Then to make a matlab call to count the lines for file XYZ.csv

numlines = str2num( perl('countlines.pl', 'XYZ.csv') );


If you are using any other OS, then you can skip the perl and use

[status, result] = system( ['wc -l < ', 'XYZ.csv'] );
numlines = str2num( result );


Note: I don't promise that wc -l will work properly if there are more
than 2^31-1 lines in the file... not something I ever looked into.
The perl version should be able to handle up to 2^52-1 lines per file;
if you have more than that, then it becomes more difficult to get
an accurate line count though (but it should be possible with
some trickery.)

--
Q = quotation(rand);
if isempty(Q); error('Quotation server filesystem problems')
else sprintf('%s',Q), end

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 26 Aug, 2008 05:10:03

Message: 4 of 11

Thanks so much for the suggestion.

The perl code took about 3 hrs, while textscan TOOK 25 times
longer - 3+ DAYS!!!

Thanks so much.


Subject: Fastest way to get the number of lines

From: us

Date: 26 Aug, 2008 08:21:01

Message: 5 of 11

"Pete sherer":
<SNIP strange...

> Sorry - the file size is actually 37.0 GB...

how can that be?

% using your numbers from the first post
     37*2^30/6500000
% ans = 6112.1 % <- bytes/line
% but you also tell CSSM that

Each row contains about
15,000 data in comma delimiter format.

???
your file should be MUCH bigger than 37g - even if a line
looked like this

     n,n,n,n,...

can you be more specific about what a 15k comma
delimited data line looks like?
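
A rough sketch of the size argument above, assuming the smallest possible row (15,000 single-digit values, the commas between them, and one newline). The variable names are made up for illustration; only the 37 GB, 6.5 million and 15,000 figures come from the thread.

```matlab
% Lower bound on file size if every value were a single character
nvals  = 15000;      % values per row (from the first post)
nlines = 6.5e6;      % rows in the file (from the first post)

% digits + commas between them + one newline
min_bytes_per_line = nvals + (nvals - 1) + 1;          % 30000 bytes
min_file_GiB = min_bytes_per_line * nlines / 2^30;     % about 181.6

% what a 37 GB file actually leaves per row
actual_bytes_per_line = 37 * 2^30 / nlines;            % about 6112.1

fprintf('minimum %.1f GiB, actual %.1f bytes/line\n', ...
        min_file_GiB, actual_bytes_per_line);
```

So even one-character values would need roughly five times the reported file size, which is why the numbers do not add up.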

us

Subject: Fastest way to get the number of lines

From: pisz_na.mirek@dionizos.zind.ikem.pwr.wroc.pl

Date: 26 Aug, 2008 09:55:03

Message: 6 of 11

Pete sherer <tsh@abg.com> wrote:
> Thanks so much for the suggestion.
>
> The perl code took about 3 hrs, while textscan TOOK 25 times
> longer - 3+ DAYS!!!
>
> Thanks so much.

The wc command on Linux, on my old Athlon 1600XP, scans such a file
at about 20 seconds/GB with about 20% CPU usage.

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 16 Sep, 2008 21:55:03

Message: 7 of 11

Similarly, instead of counting the number of lines in the file, can the Perl code be modified to find the first row whose first column contains a specified value?

For example, if the file looks like
2,45,56,7767,76,565.5,...
23,454,556,74767,476,5465.5,...
56,15,16,1767,176,1565.5,...
678,45,5,67,6,0.5,...
845,11,22,45,32,2.5,...
...

For example, if I want the line number where the first column contains 678, the answer should be 4.

Thanks so much in advance.
Pete

Subject: Fastest way to get the number of lines

From: Walter Roberson

Date: 16 Sep, 2008 22:17:05

Message: 8 of 11

Pete sherer wrote:
> Similarly instead of counting the number of lines in the file, can the PERL
> code be modified to find the first row (of the first column only) that
> contains a specified value?

> For example, if the file looks like
> 2,45,56,7767,76,565.5,...
> 23,454,556,74767,476,5465.5,...
> 56,15,16,1767,176,1565.5,...
> 678,45,5,67,6,0.5,...
> 845,11,22,45,32,2.5,...
> ...

> For example, I want to know the line number that the first column contains 678,
> then the line number should be 4.


Yes. For example, store the below two lines in findvalline.pl

$targetval = shift @ARGV;
while (<>) { if (/^\Q$targetval\E,/) { print $.,"\n"; last } }

Then to make a matlab call to find the value N in file XYZ.csv

linenum = str2num( perl('findvalline.pl', num2str(N), 'XYZ.csv') );
if isempty(linenum); error('no match'); end


But be careful if your target is not an integer: num2str() will
not necessarily round or truncate the same way as is in the file.
You may wish to use a more sophisticated way of determining
the matching string than using num2str().
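
One possible way to do that, as a sketch only: build the string with sprintf() and an explicit format instead of num2str(). The '%.10g' format is an assumption about how the file was written (at most 10 significant digits, no trailing zeros); adjust it to whatever wrote your file.

```matlab
N = 565.5;                          % example target value from the sample rows

% num2str() picks its own precision, e.g. num2str(pi) gives '3.1416'
% sprintf() lets you state the precision explicitly
targetstr = sprintf('%.10g', N);    % assumed format, see note above
disp(targetstr)

% then pass the string instead of num2str(N), e.g.
% linenum = str2num( perl('findvalline.pl', targetstr, 'XYZ.csv') );
```

If the file was written with a fixed number of decimals (say '%.2f'), use that same format here so the text matches character for character.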

--
Q = quotation(rand);
if isempty(Q); error('Quotation server filesystem problems')
else sprintf('%s',Q), end

Subject: Fastest way to get the number of lines

From: E

Date: 17 Sep, 2008 11:28:02

Message: 9 of 11

Here's my take :

fh = fopen(filename, 'r');
chunksize = 1e6; % read chunks of 1MB at a time
numlines = 0;
while ~feof(fh)
    ch = fread(fh, chunksize, '*uchar');
    if isempty(ch)
        break
    end
    numlines = numlines + sum(ch == sprintf('\n'));
end
fclose(fh);

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 17 Sep, 2008 15:50:18

Message: 10 of 11


Thank you so much Walter for your help to answer my request.

I am sorry for not putting the request all at once. Would it be possible to match 2 or more numbers? For example, I would like to find the rows whose first column matches either 2 or 678.

I can simply call the program twice, but for a huge file size, I think it's probably faster to do it inside the perl.

Thanks a lot in advance.
Pete

Subject: Fastest way to get the number of lines

From: Walter Roberson

Date: 18 Sep, 2008 19:54:08

Message: 11 of 11

Pete sherer wrote:

> I am sorry for not putting the request all at once. Would it be possible
> to match 2 or more numbers? Like I would like to find the rows with the
> first column matching 2 and 678 numbers.

> I can simply call the program twice, but for a huge file size, I think it's
> probably faster to do it inside the perl.

In findvallines.pl put these two lines:

$fn = pop @ARGV; $" = '|'; $targetpat = qr/^(?:@ARGV),/o; @ARGV = ($fn);
while (<>) { /$targetpat/ && do { print $.,"\n" } }


Example invocation:

>> linenum = str2num( perl('findvallines.pl', '23', '678', 'XYZ.csv') )

linenum =

     2
     4


If you wanted niceties such as printing out which of the lines matched what, you
should have specified.

Note that this code could be improved, because it no longer stops when it finds
a match. It doesn't even stop when it has found as many matches as there were
original numbers. (You didn't promise that all of the lines started with unique
values.) If the first column is unique, then stopping upon the last match would
be reasonably fast; if the first column is not unique but you only want to
report the first matching line for each pattern, then the code would have
to be more complicated and would slow down.
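
For what it is worth, the "first matching line for each value, then stop" idea could be sketched in plain MATLAB roughly as below. This is not the perl script above and will likely be slower on a file this size; it only illustrates the stopping logic. The temporary file, the variable names, and the use of the sample rows from earlier in the thread are all just for illustration.

```matlab
% Write the five sample rows from the thread to a temporary file
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', '2,45,56,7767,76,565.5', ...
                     '23,454,556,74767,476,5465.5', ...
                     '56,15,16,1767,176,1565.5', ...
                     '678,45,5,67,6,0.5', ...
                     '845,11,22,45,32,2.5');
fclose(fid);

targets  = {'23', '678'};        % strings exactly as they appear in column 1
linenums = nan(size(targets));   % NaN means "not found yet"

fid = fopen(filename, 'r');
lineno = 0;
while any(isnan(linenums))       % stop once every target has been seen
    line = fgetl(fid);
    if ~ischar(line), break, end % end of file
    lineno = lineno + 1;
    first = strtok(line, ',');   % text before the first comma
    hit = strcmp(first, targets) & isnan(linenums);
    linenums(hit) = lineno;      % record only the first occurrence
end
fclose(fid);
delete(filename);

disp(linenums)                   % 2 4 for the sample rows
```

Targets that never occur stay NaN, so the caller can tell "not found" apart from a real line number.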
