Thread Subject: processing extremely long data file sequentially?

Subject: processing extremely long data file sequentially?

From: huhua

Date: 1 Mar, 2008 01:54:05

Message: 1 of 4

Hi all,

Let's say a CSV file has tens of millions of lines and each line has many
columns.

I actually want to go through it line by line (except the first line,
which is the header),

and I need to cut most of the lines and columns out, and only use a few
lines and columns.

I am estimating that out of these tens of millions of lines, I only need to
retain tens of thousands of lines.

But I need to process them and cut the non-useful lines out.

Even Excel 2007 refused to load the file. MATLAB crashed several times when
I tried to load it.

What do I do?

Is there a function like "textread", "textscan", or "csvread" that can read
it line by line, sequentially?

I think it is important for the program to keep a relative pointer in the
CSV file so that after each line is read and processed, we can move to the
next line.

And I just need to sequentially write out another output file to take the
filtered lines.

Of course the benefit of "textread", "textscan", and "csvread" is that they
can parse formatted strings, including both text and numbers... that's
important...

Any ideas?

Thanks







Subject: processing extremely long data file sequentially?

From: Paul

Date: 1 Mar, 2008 03:30:20

Message: 2 of 4

help fgetl
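
To make the fgetl suggestion concrete, here is a minimal sketch of a line-by-line filter. The function name, its arguments, and the row test (keep a row when the number in column testCol exceeds threshold) are assumptions made up for illustration; they are not from the thread. The comma split is naive and will not cope with quoted fields that contain commas.

```matlab
function nKept = filter_csv(inFile, outFile, keepCols, testCol, threshold)
% FILTER_CSV  Stream a CSV line by line, keeping selected rows and columns.
%   Hypothetical helper: all names and the row test are illustrative only.
fin = fopen(inFile, 'r');
if fin == -1, error('Cannot open input file %s', inFile); end
fout = fopen(outFile, 'w');
if fout == -1, fclose(fin); error('Cannot open output file %s', outFile); end

nKept  = 0;
needed = max([keepCols(:); testCol]);      % highest column index we touch

tline = fgetl(fin);                        % first line is the header
if ischar(tline)
    fields = regexp(tline, ',', 'split');  % naive split: no quoted commas
    writeFields(fout, fields(keepCols));   % assumes the header has enough columns
    tline = fgetl(fin);
end

while ischar(tline)                        % fgetl returns -1 at end of file
    fields = regexp(tline, ',', 'split');
    % Non-numeric test fields give NaN from str2double and are skipped.
    if numel(fields) >= needed && str2double(fields{testCol}) > threshold
        writeFields(fout, fields(keepCols));
        nKept = nKept + 1;
    end
    tline = fgetl(fin);                    % the file position advances by itself
end
fclose(fin);
fclose(fout);
end

function writeFields(fid, c)
% Join the cell array of strings c with commas and write it as one line.
s = sprintf('%s,', c{:});
fprintf(fid, '%s\n', s(1:end-1));          % drop the trailing comma
end
```

A call such as filter_csv('in.csv', 'out.csv', [1 3], 2, 100) would keep columns 1 and 3 of every row whose second column is greater than 100. Only one line is held in memory at a time, so the file size should not matter much, though it will be slow for tens of millions of lines.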

Subject: processing extremely long data file sequentially?

From: Andres Toennesmann

Date: 9 Mar, 2008 15:11:03

Message: 3 of 4

"huhua" <lunamoonmoon@gmail.com> wrote in message
<fqacv8$et8$1@news.Stanford.EDU>...
> Hi all,
>
> Let's say a CSV file has tens of millions lines and each line has many
> columns.
>
> I actually wanted to browse through it line by line (except the first line,
> which is the headline),
>
> and I need to cut most of the lines and columns out, and only use a few
> lines and columns.
>
> []

If the CSV contains mainly numeric data below the header
line, you could try txt2mat from the File Exchange with its
'RowRange' and 'FilePos' arguments (see its help, esp. Example
5). This should be vastly quicker than fgetl.
HTH
Andres
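
The exact txt2mat call is not shown in this thread, so the sketch below does not use it. It swaps in the built-in textscan, which can do something in the same spirit: given a repeat count, it reads at most that many rows and the next call resumes from the current file position. The function name and all arguments are hypothetical, and it assumes one header line followed by well-formed, purely numeric columns.

```matlab
function kept = read_in_blocks(inFile, nCols, keepCols, testCol, threshold, blockSize)
% READ_IN_BLOCKS  Read a numeric CSV block by block and keep selected rows.
%   Hypothetical helper; assumes one header line and nCols numeric columns.
fid = fopen(inFile, 'r');
if fid == -1, error('Cannot open input file %s', inFile); end
fgetl(fid);                                   % skip the header line
fmt  = repmat('%f', 1, nCols);                % one %f per column
kept = zeros(0, numel(keepCols));
while true
    % Each call resumes at the current file position and reads at most
    % blockSize rows into a single numeric matrix.
    C = textscan(fid, fmt, blockSize, 'Delimiter', ',', 'CollectOutput', true);
    block = C{1};
    if isempty(block), break; end             % nothing left to read
    rows = block(:, testCol) > threshold;     % illustrative row test
    kept = [kept; block(rows, keepCols)];     %#ok<AGROW> only a few rows survive
end
fclose(fid);
end
```

For example, read_in_blocks('in.csv', 3, [1 3], 2, 100, 100000) would process 100000 rows per call. Growing the result inside the loop is tolerable here only because few rows are expected to pass the filter.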


Subject: processing extremely long data file sequentially?

From: NZTideMan

Date: 10 Mar, 2008 04:43:40

Message: 4 of 4

I'd use Fortran, not MATLAB, for this job.
Fortran was developed back in the days of Hollerith cards, when
you loaded one card of data at a time, so it can handle such a problem
easily and very fast.
