Thread Subject: really big data files

From: Jon Shultz

Date: 8 Nov, 2009 19:24:02

Message: 1 of 5

I'm trying to read in a datafile that's really big (>2GB) in sections that are a couple hundred thousand lines long each. I need to know how many lines are in the parent file first.

I have a routine now that does it like this:
totlines=0;
while ~feof(fid)
    line=fgetl(fid);
    totlines=totlines+1;
end

This does well with the memory part, but takes forever. There has got to be a more efficient way to do this, but I'm stuck.

Thanks!

From: Rune Allnor

Date: 8 Nov, 2009 19:31:35

Message: 2 of 5

Read the file in larger batches than a single line.

Rune
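One way to read the suggestion above is to pull the file in as raw bytes, a fixed-size chunk at a time, and count line feed characters in each chunk. The following is only a minimal sketch under that assumption: the function name countLines, the chunkBytes parameter, and the 16 MB default are illustrative choices, not anything given in the thread. It assumes lines end with LF (or CR LF) and would be saved as countLines.m.

```matlab
function n = countLines(filename, chunkBytes)
% countLines  Count lines in a file by scanning raw bytes in fixed-size chunks.
% Hypothetical helper: the name and the chunkBytes parameter are illustrative.
if nargin < 2
    chunkBytes = 16 * 1024 * 1024;   % 16 MB per read; tune to available memory
end
fid = fopen(filename, 'r');
if fid == -1
    error('countLines:open', 'Could not open %s', filename);
end
n = 0;
lastByte = uint8(10);                % so an empty file reports zero lines
while true
    chunk = fread(fid, chunkBytes, 'uint8=>uint8');   % raw bytes, no text parsing
    if isempty(chunk)
        break;
    end
    n = n + sum(chunk == 10);        % 10 is the line feed (LF) byte
    lastByte = chunk(end);
end
fclose(fid);
if lastByte ~= 10
    n = n + 1;                       % last line has no trailing newline
end
end
```

Only one chunk is held in memory at a time, so memory use stays bounded by chunkBytes regardless of file size. A call would look like totlines = countLines(filename).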

From: Jon Shultz

Date: 9 Nov, 2009 00:00:19

Message: 3 of 5

Thank you. I am using textscan to get the data blocks in the code which follows what I have written above. Let me restate my question. Is there a way to determine the number of lines in a large file without reading in the data (which will crash Matlab)?

I want to use the total number of lines to determine the best way to segment the files.

Jon

From: TideMan

Date: 9 Nov, 2009 01:31:37

Message: 4 of 5

Copy and paste these lines into a new file called CountLines.pl in
Matlab's path:
while (<>) {};
print $.,"\n";

Now, run it in Matlab like this:
perl('CountLines.pl',filename)
where filename is your file name.
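
Note that perl returns the script's output as text, so the count needs converting to a number before it can be used to plan the segments. A rough sketch, assuming CountLines.pl is saved as described above; the function name planBlocks and the blockLines parameter are made up for illustration:

```matlab
function [totlines, nBlocks] = planBlocks(filename, blockLines)
% planBlocks  Get the line count from CountLines.pl and work out how many
% sections of blockLines lines are needed. Hypothetical helper for illustration.
[out, status] = perl('CountLines.pl', filename);   % out is a character array
if status ~= 0
    error('planBlocks:perl', 'Perl script failed: %s', out);
end
totlines = str2double(strtrim(out));    % drop the trailing newline, then convert
nBlocks  = ceil(totlines / blockLines); % the last block may be shorter
end
```

For example, [totlines, nBlocks] = planBlocks(filename, 200000) gives the number of sections when each is 200,000 lines long.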

From: Jon Shultz

Date: 17 Nov, 2009 15:21:04

Message: 5 of 5

Excerpt from above:

> > Let me restate my question. Is there a way to determine the number of lines in a large file without reading in the data (which will crash Matlab)?
> >
> > I want to use the total number of lines to determine the best way to segment the files.
> >
> > Jon
>
> Copy and paste these lines into a new file called CountLines.pl in
> Matlab's path:
> while (<>) {};
> print $.,"\n";
>
> Now, run it in Matlab like this:
> perl('CountLines.pl',filename)
> where filename is your file name.

TideMan, that worked great (and is about 100 times faster than the code I had above)... until I tried to access a file at a network location (\\abc-def-45\data...). Is there a Perl command that will allow UNC file paths to be recognized?

Jon
