Thread Subject:
Fastest way to get the number of lines

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 25 Aug, 2008 20:41:03

Message: 1 of 13

I have a gigantic .csv file (about 7-9GB), which contains
about 6.5 million lines of numbers. Each row contains about
15,000 comma-delimited values.

Currently I am using TEXTSCAN to extract only the first
column to determine the number of lines in the file. It
took 4-5 hours on a 3GHz Pentium IV. Is there a better
way to just get the number of lines? Thanks a lot.

I already skip the other columns when reading:
col = textscan( fid, ['%f' repmat('%*f',1,14999)], -1, ...
'delimiter', ',');
numLines = length(col{1});

Thanks a lot in advance.

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 25 Aug, 2008 20:49:02

Message: 2 of 13

Sorry - the file size is actually 37.0 GB!!

Thanks a lot in advance

Subject: Fastest way to get the number of lines

From: Walter Roberson

Date: 25 Aug, 2008 22:14:17

Message: 3 of 13

Pete sherer wrote:
> I have a gigantic .csv file (about 7-9GB), which contains
> about 6.5 million lines of numbers. Each row contains about
> 15,000 data in comma delimiter format.
>
> Currently I am using TEXTSCAN to extract only the first
> column to determine the number of lines in the file. It
> took 4-5 hours on 3GHz pentium IV. Are there any better
> solution to just get the number of lines?

If you are using Windows OS, use Matlab's perl interface: the perl
code is trivial and should be fairly fast:

For example, store the below two lines in countlines.pl

while (<>) {};
print $.,"\n";

Then to make a matlab call to count the lines for file XYZ.csv

numlines = str2num( perl('countlines.pl', 'XYZ.csv') );


If you are using any other OS, then you can skip the perl and use

[status, result] = system( ['wc -l < ', 'XYZ.csv'] );  % redirect so wc prints only the count
numlines = str2num( result );


Note: I don't promise that wc -l will work properly if there are more
than 2^31-1 lines in the file... not something I ever looked into.
The perl version should be able to handle up to 2^52-1 lines per file;
if you have more than that, then it becomes more difficult to get
an accurate line count though (but it should be possible with
some trickery.)

--
Q = quotation(rand);
if isempty(Q); error('Quotation server filesystem problems')
else sprintf('%s',Q), end

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 26 Aug, 2008 05:10:03

Message: 4 of 13

Thanks so much for the suggestion.

The perl code took about 3 hrs, while textscan TOOK 25 times
longer - 3+ DAYS!!!

Thanks so much.

Subject: Fastest way to get the number of lines

From: us

Date: 26 Aug, 2008 08:21:01

Message: 5 of 13

"Pete sherer":
<SNIP strange...

> Sorry - the file size is actually 37.0 GB...

how can that be?

% using your numbers from the first post
     37*2^30/6500000
% ans = 6112.1 % <- bytes/line
% but you also tell CSSM that

Each row contains about
15,000 data in comma delimiter format.

???
your file should be MUCH bigger than 37g - even if a line
looked like this

     n,n,n,n,...

can you be more specific on what you mean by a 15k comma
delimited data line looks like?

us
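
The size estimate above can be checked with a short calculation. This is only a sketch: it assumes the smallest possible row, with every one of the 15,000 values a single character, and the variable names are illustrative rather than taken from the thread.

```matlab
% Lower bound on file size, assuming every value is a single character.
numLines      = 6.5e6;    % line count from the first post
valuesPerLine = 15000;    % values per row from the first post

% One character per value, a comma between values, one newline per line.
minBytesPerLine = valuesPerLine + (valuesPerLine - 1) + 1;
minFileBytes    = numLines * minBytesPerLine;

fprintf('Minimum bytes per line: %d\n', minBytesPerLine);
fprintf('Minimum file size: about %.0f GiB\n', minFileBytes / 2^30);

% Bytes per line implied by a 37 GB file, for comparison.
fprintf('Implied bytes per line at 37 GB: about %.0f\n', 37 * 2^30 / numLines);
```

Even this minimal row needs 30,000 bytes, which puts the file at roughly 182 GiB, so one of the three quoted figures (file size, line count, values per row) must be off.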

Subject: Fastest way to get the number of lines

From: pisz_na.mirek@dionizos.zind.ikem.pwr.wroc.pl

Date: 26 Aug, 2008 09:55:03

Message: 6 of 13

Pete sherer <tsh@abg.com> wrote:
> Thanks so much for the suggestion.
>
> The perl code took about 3 hrs, while textscan TOOK 25 times
> longer - 3+ DAYS!!!
>
> Thanks so much.

The wc command on Linux on my old Athlon 1600XP scans such a file
at about 20 seconds/GB with about 20% CPU usage.

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 16 Sep, 2008 21:55:03

Message: 7 of 13

Similarly, instead of counting the number of lines in the file, can the Perl code be modified to find the first row whose first column contains a specified value?

For example, if the file looks like
2,45,56,7767,76,565.5,...
23,454,556,74767,476,5465.5,...
56,15,16,1767,176,1565.5,...
678,45,5,67,6,0.5,...
845,11,22,45,32,2.5,...
...

If I want the line number of the row whose first column contains 678, the answer should be 4.

Thanks so much in advance.
Pete

Subject: Fastest way to get the number of lines

From: Walter Roberson

Date: 16 Sep, 2008 22:17:05

Message: 8 of 13

Pete sherer wrote:
> Similarly instead of counting the number of lines in the file, can the PERL
> code be modified to find the first row (of the first column only) that
> contains a specified value?

> For example, if the file looks like
> 2,45,56,7767,76,565.5,...
> 23,454,556,74767,476,5465.5,...
> 56,15,16,1767,176,1565.5,...
> 678,45,5,67,6,0.5,...
> 845,11,22,45,32,2.5,...
> ...

> For example, I want to know the line number that the first column contains 678,
> then the line number should be 4.


Yes. For example, store the below two lines in findvalline.pl

$targetval = shift @ARGV;
while (<>) { if (/^$targetval,/) { print $.,"\n"; last } }

Then to make a matlab call to find the value N in file XYZ.csv

linenum = str2num( perl('findvalline.pl', num2str(N), 'XYZ.csv') );
if isempty(linenum); error('no match'); end


But be careful if your target is not an integer: num2str() will
not necessarily round or truncate the same way as the value was
written in the file. You may wish to use a more sophisticated way
of determining the matching string than using num2str().
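
One possible way to build the matching string more carefully is sketched below. It assumes the file writes the value with exactly one decimal place, as in Pete's sample rows; the variable names key and pattern are illustrative. Because the script interpolates the string into a regular expression, the decimal point is also escaped so that it cannot match an arbitrary character.

```matlab
% Build the search string with an explicit format instead of num2str().
N = 565.5;

% Assumption: the file stores this column with exactly one decimal place.
key = sprintf('%.1f', N);

% Escape regex metacharacters such as '.' before handing the string to perl.
pattern = regexptranslate('escape', key);

disp(key);
disp(pattern);

% The call from the post above would then become:
% linenum = str2num( perl('findvalline.pl', pattern, 'XYZ.csv') );
```

If the file uses a different number format, only the format string passed to sprintf needs to change.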

--
Q = quotation(rand);
if isempty(Q); error('Quotation server filesystem problems')
else sprintf('%s',Q), end

Subject: Fastest way to get the number of lines

From: E

Date: 17 Sep, 2008 11:28:02

Message: 9 of 13

Here's my take :

fh = fopen(filename, 'r');
chunksize = 1e6; % read chunks of 1 MB at a time
numlines = 0;    % running count of newline characters
while ~feof(fh)
    ch = fread(fh, chunksize, '*uchar');
    if isempty(ch)
        break
    end
    numlines = numlines + sum(ch == sprintf('\n'));
end
fclose(fh);
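
A variant of the same chunked-read idea, wrapped in a function, is sketched here. The name countLinesChunked and the optional chunkSize argument are hypothetical, not from the thread. The one behavioural difference is that a final line with no trailing newline is also counted.

```matlab
function n = countLinesChunked(filename, chunkSize)
% Count lines by reading raw bytes in chunks and counting newline bytes.
% A last line that lacks a trailing newline is counted as well.
if nargin < 2
    chunkSize = 1e6;            % bytes per read, same as the post above
end
fid = fopen(filename, 'r');
if fid < 0
    error('countLinesChunked:open', 'Cannot open %s', filename);
end
cleanup = onCleanup(@() fclose(fid));   % close the file even on error

n = 0;
lastByte = uint8(10);           % treat an empty file as having no partial line
while true
    chunk = fread(fid, chunkSize, '*uint8');
    if isempty(chunk)
        break
    end
    n = n + sum(chunk == 10);   % 10 is the byte value of '\n'
    lastByte = chunk(end);
end
if lastByte ~= 10
    n = n + 1;                  % unterminated final line
end
end
```

For example, countLinesChunked('XYZ.csv') would use the default 1 MB chunk; a larger chunk may reduce the number of reads at the cost of memory.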

Subject: Fastest way to get the number of lines

From: Pete sherer

Date: 17 Sep, 2008 15:50:18

Message: 10 of 13


Thank you so much Walter for your help to answer my request.

I am sorry for not putting the request all at once. Would it be possible to match 2 or more numbers? For example, I would like to find the rows whose first column matches the numbers 2 and 678.

I can simply call the program twice, but for a huge file size, I think it's probably faster to do it inside the perl.

Thanks a lot in advance.
Pete

Subject: Fastest way to get the number of lines

From: Walter Roberson

Date: 18 Sep, 2008 19:54:08

Message: 11 of 13

Pete sherer wrote:

> I am sorry for not putting the request all at once. Would it be possible
> to match 2 or more numbers? Like I would like to find the rows with the
> first column matching 2 and 678 numbers.

> I can simply call the program twice, but for a huge file size, I think it's
> probably faster to do it inside the perl.

In findvallines.pl put these two lines:

$fn = pop @ARGV; $" = '|'; $targetpat = qr/^(?:@ARGV),/o; @ARGV = ($fn);
while (<>) { /$targetpat/ && do { print $.,"\n" } }


Example invocation:

>> linenum = str2num( perl('findvallines.pl', '23', '678', 'XYZ.csv') )

linenum =

     2
     4


If you wanted niceties such as printing out which of the lines matched what, you
should have specified.

Note that this code could be improved, because it no longer stops when it finds
a match. It doesn't even stop when it has found as many matches as there were
original numbers. (You didn't promise that all of the lines started with unique
values.) If the first column is unique, then stopping upon the last match would
be reasonably fast; if the first column is not unique but you only want to
report the first matching line for each pattern, then the code would have
to be more complicated and would slow down.
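
The "first matching line for each value, then stop" variant described above could look roughly like this in plain MATLAB. This is a sketch only: findFirstColumnLines is a hypothetical name, the targets are assumed to be passed as a cell array of strings, and reading line by line with fgetl is likely to be slower than the Perl script on a file this large.

```matlab
function lineNums = findFirstColumnLines(filename, targets)
% For each string in the cell array TARGETS, return the number of the first
% line whose first comma-separated field equals it exactly (NaN if absent).
% Reading stops as soon as every target has been found.
fid = fopen(filename, 'r');
if fid < 0
    error('findFirstColumnLines:open', 'Cannot open %s', filename);
end
cleanup = onCleanup(@() fclose(fid));   % close the file even on error

lineNums = nan(size(targets));
lineNo = 0;
while any(isnan(lineNums))
    line = fgetl(fid);
    if ~ischar(line)            % fgetl returns -1 at end of file
        break
    end
    lineNo = lineNo + 1;
    idx = find(line == ',', 1); % position of the first comma, if any
    if isempty(idx)
        firstField = line;      % single-column line
    else
        firstField = line(1:idx-1);
    end
    hit = strcmp(firstField, targets) & isnan(lineNums);
    lineNums(hit) = lineNo;     % record only the first match per target
end
end
```

With the sample file from earlier in the thread, findFirstColumnLines('XYZ.csv', {'23', '678'}) would return the line numbers 2 and 4 and stop reading after line 4.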

Subject: Fastest way to get the number of lines

From: James McCloskey

Date: 13 Apr, 2011 10:45:07

Message: 12 of 13

Hey Walter,

The countlines script here seems to work great for me, but I'm trying to use the findvalue script and keep getting NaN as a result using your code (I think perl is returning an empty field).

I'm looking for a line in my datafile where the first column (the only column in this particular line) contains the string "CommentsData". Should this work with your script?

I'm trying to export data from a text file up until the point where the comments are appended, and this would be a fast way to find the line number and initialize my variables (and dimensions etc.) before beginning to export data.

Thanks

Jim


Subject: Fastest way to get the number of lines

From: James McCloskey

Date: 13 Apr, 2011 14:45:10

Message: 13 of 13

Just figured it out... the regular expression /^$targetval,/ was appending a comma to the end of the string...

Works now

Thanks!
