Thread Subject: really big data files

Subject: really big data files

From: Jon Shultz

Date: 8 Nov, 2009 19:24:02

Message: 1 of 5

I'm trying to read in a datafile that's really big (>2GB) in sections that are a couple hundred thousand lines long each. I need to know how many lines are in the parent file first.

I have a routine now that does it like this:
totlines=0;
while ~feof(fid)
    line=fgetl(fid);
    totlines=totlines+1;
end

This does well with the memory part, but takes forever. There has got to be a more efficient way to do this, but I'm stuck.

Thanks!

Subject: really big data files

From: Rune Allnor

Date: 8 Nov, 2009 19:31:35

Message: 2 of 5

Read the file in larger batches than a single line.

Rune
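
One way to act on the batch-reading suggestion, as a minimal sketch rather than anything posted in the thread: read the file as raw bytes in fixed-size chunks with fread and count the newline bytes in each chunk. The function name countLines and the chunkBytes parameter are hypothetical, and the sketch assumes lines end in LF or CRLF (a CR-only file would be miscounted).

```matlab
function n = countLines(filename, chunkBytes)
% Count lines by reading the file in fixed-size binary chunks.
% countLines and chunkBytes are illustrative names, not from the thread.
if nargin < 2
    chunkBytes = 16 * 2^20;             % 16 MB per read; tune to available memory
end
fid = fopen(filename, 'r');
if fid == -1
    error('countLines:openFailed', 'Could not open %s', filename);
end
cleanup = onCleanup(@() fclose(fid));   % close the file even if an error occurs
n = 0;
lastByte = 10;                          % so that an empty file counts as zero lines
while true
    buf = fread(fid, chunkBytes, '*uint8');
    if isempty(buf)
        break                           % end of file reached
    end
    n = n + sum(buf == 10);             % 10 is the byte value of a line feed
    lastByte = buf(end);
end
if lastByte ~= 10
    n = n + 1;                          % last line has no trailing newline
end
end
```

Only one chunk is held in memory at a time, so memory use stays bounded by chunkBytes regardless of file size. A call would look like totlines = countLines('bigfile.txt'), where the file name is a placeholder.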

Subject: really big data files

From: Jon Shultz

Date: 9 Nov, 2009 00:00:19

Message: 3 of 5

Thank you. I am using textscan to get the data blocks in the code which follows what I have written above. Let me restate my question. Is there a way to determine the number of lines in a large file without reading in the data (which will crash Matlab)?

I want to use the total number of lines to determine the best way to segment the files.

Jon

Subject: really big data files

From: TideMan

Date: 9 Nov, 2009 01:31:37

Message: 4 of 5

Copy and paste these lines into a new file called CountLines.pl in
Matlab's path:
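# Read every line of the input file, discarding the content
while (<>) {}
# $. holds the number of the last line read, i.e. the line count
print $., "\n";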

Now, run it in Matlab like this:
perl('CountLines.pl',filename)
where filename is your file name.
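
The perl function returns the script's output as text, so the count has to be converted to a number before it can be used to plan the segments. The following is a rough sketch of that step, assuming CountLines.pl is on the path as described above; the function name perlLineCount and the 200000-line block size in the usage note are illustrative choices, not something from the thread.

```matlab
function totlines = perlLineCount(filename)
% Run CountLines.pl on the given file and return the line count as a number.
% perlLineCount is a hypothetical wrapper name.
[result, status] = perl('CountLines.pl', filename);
if status ~= 0
    error('perlLineCount:perlFailed', 'Perl call failed: %s', result);
end
totlines = str2double(strtrim(result));   % strip the trailing newline, then convert
end
```

With the count in hand, the number of blocks for a chosen block size follows directly, for example nBlocks = ceil(perlLineCount(filename) / 200000).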

Subject: really big data files

From: Jon Shultz

Date: 17 Nov, 2009 15:21:04

Message: 5 of 5

Excerpt from above:

> > Let me restate my question. Is there a way to determine the number of lines in a large file without reading in the data (which will crash Matlab)?
> >
> > I want to use the total number of lines to determine the best way to segment the files.
> >
> > Jon
>
> Copy and paste these lines into a new file called CountLines.pl in
> Matlab's path:
> while (<>) {};
> print $.,"\n";
>
> Now, run it in Matlab like this:
> perl('CountLines.pl',filename)
> where filename is your file name.

TideMan, that worked great (and is about 100 times faster than the code I had above) until I tried to access a file at a network location (\\abc-def-45\data...). Is there a Perl command that will allow UNC file locations to be recognized?

Jon
