Thread Subject: Reading text from txt file, fixed characters

From: Nuno Martins

Date: 6 Dec, 2011 14:58:08

Message: 1 of 6

Hello everybody,

I have several data files in the following format:
...
  -0.14 -1.82 -0.46 13.56
  -0.13 -1.89 -0.44 13.54
  -0.01 -2.04 -0.65 13.56
   0.01 -2.04 -0.63 13.56
   0.10 -2.14 -0.69 13.56
   0.06 -2.08 -0.67 13.58
...

where each file has roughly 6000 lines. Each line has precisely 30 characters, including all whitespace before and between values. The first, second and third columns are always 7 characters wide, and the last one is 8 characters wide, regardless of the value.
My main objective is to import this data into Matlab in order to process it, but occasionally the lines contain anomalies, such as the following:
...
   0.67 -2.02 0.10 14.32
   0.51 -2.07 0.13 14.32
   0.64 -2.02 0.1 -2.34 -0.09 14.36
   0.81 -2.40 -0.16 14.33
   0.78 -2.35 -0.26 14.36
...
or
...
   0.92 -2.34 0.13 14.24
   1.15 -2.34 0.12 14.23
   1.36 -2.31 14.16
   1.45 -2.19 -0.05 14.16
   1.25 -2.41 0.13 14.16
...

So my main goal, as a first approach, is to identify the lines where this anomaly occurs and try to correct them by interpolating from the values in the adjacent lines.

How is it possible to import the data and identify the bad strings? (When I read the text with textscan, the whitespace before the first value is dropped, so I cannot count the number of characters in each line.)

Can anyone recommend me a way of repairing each string in an automatic way?

Thank you all in advance

From: dpb

Date: 6 Dec, 2011 15:31:21

Message: 2 of 6

On 12/6/2011 8:58 AM, Nuno Martins wrote:
...

> How is it possible to import the data and identify the bad strings?

fgetl() to read and then various string handling functions to query the
content (length, values, etc., etc., etc., ...)

...

> Can anyone recommend me a way of repairing each string in an automatic way?
...

As a first easy step you could simply read each line and discard those
that don't have the proper length.
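
A minimal sketch of that first step, assuming the column widths quoted in the question (7, 7, 7 and 8). Those widths sum to 29, so the stated 30 characters may include the line terminator; that is a guess, and expectedLen should be adjusted to whatever the good lines in the real files measure. The file name and sample values are made up so the sketch is self-contained.

```matlab
% Hypothetical file name and contents, written here only so the sketch
% runs on its own; point fname at a real data file instead.
fname = fullfile(tempdir, 'sample_fixed_width.txt');
fid = fopen(fname, 'w');
fprintf(fid, '%7.2f%7.2f%7.2f%8.2f\n', 0.67, -2.02, 0.10, 14.32);
fprintf(fid, '   0.64  -2.02   0.1  -2.34  -0.09   14.36\n');  % too long
fprintf(fid, '   1.36  -2.31   14.16\n');                       % too short
fprintf(fid, '%7.2f%7.2f%7.2f%8.2f\n', 0.81, -2.40, -0.16, 14.33);
fclose(fid);

widths = [7 7 7 8];          % column widths as described in the question
expectedLen = sum(widths);   % assumption: line terminator is not counted

% Read line by line; fgetl keeps leading blanks and drops the newline.
fid = fopen(fname, 'r');
lines = {};
tline = fgetl(fid);
while ischar(tline)          % fgetl returns -1 (not char) at end of file
    lines{end+1} = tline;    %#ok<AGROW>
    tline = fgetl(fid);
end
fclose(fid);

lens = cellfun(@length, lines);
isBad = (lens ~= expectedLen);     % one logical flag per line
badLineNumbers = find(isBad);      % positions of the suspect lines
```

Keeping the flags rather than discarding the lines straight away leaves the positions of the bad records available for whatever repair is attempted later.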

Beyond that, you're into quite a lot of logic to figure out which
value(s) are the wrong ones and where they might go and all...

For example, in the first case you give, it looks like the fourth column
of the long line and the first of the next line are missing, and the
fixup would be to insert the missing values and the newline (as well as
perhaps to fix up the (apparently) missing last digit of the third column
if you want to try that level of detail).

It would be relatively trivial to simply delete the line and interpolate
all values; recognizing automagically, by comparing values, which ones
are missing would be more complex by quite a lot (particularly when you
consider all the possible ways the values could be munged). The above
case is really pretty easy in recognizing the problem; many others could
be more difficult by far.
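
The simple delete-and-interpolate route can be sketched as follows, assuming the file has already been parsed into a numeric matrix in which every rejected line is a row of NaN. The values are made up for illustration. Note that a long line like the one in the first example may stand for two records, so the number of NaN rows has to match the number of samples actually lost; this sketch does not work that out.

```matlab
% Made-up values for illustration; the NaN row stands for a deleted bad line.
data = [0.67 -2.02  0.10 14.32
        0.51 -2.07  0.13 14.32
         NaN   NaN   NaN   NaN
        0.81 -2.40 -0.16 14.33
        0.78 -2.35 -0.26 14.36];

idx  = (1:size(data, 1))';          % sample index of every row
good = ~any(isnan(data), 2);        % rows that parsed cleanly

filled = data;
% interp1 interpolates each column separately when given a matrix of values.
filled(~good, :) = interp1(idx(good), data(good, :), idx(~good), 'linear');
```

With 'linear' and no extrapolation, bad rows at the very start or end of the file stay NaN and need separate handling.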

As an aside, I had a problem much like this many, many years ago when our
only way of retrieving plant process control data was punched paper tape,
which was notoriously bad for such dropout. The final program for
reading such tapes and making such corrections ended up being very
complex indeed. It could in the end do quite a lot but it was a long
time in getting to that point.

--

From: Nuno Martins

Date: 6 Dec, 2011 15:43:08

Message: 3 of 6

The biggest problem with the repair of the strings is that I am interested in the spectral content of the data, so the elimination of the bad string is not an option, as I would lose spectral information.

Thank you for your answer.

From: dpb

Date: 6 Dec, 2011 17:47:42

Message: 4 of 6

On 12/6/2011 9:43 AM, Nuno Martins wrote:
...

> the bad string is not an option, as I would lose spectral information.
...

You've already lost the information; nothing will recreate the actual
unique information content that was lost, only replace the entity, perhaps.

--

From: Branko

Date: 7 Dec, 2011 18:35:08

Message: 5 of 6

doc regexp
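
As one possible reading of that pointer, a regular expression can check that a line holds exactly four numbers with two decimals each. The pattern below is an assumption built from the sample lines in the question, not something given in the thread; it checks the field count and number format, not the column positions. The sample lines are made up.

```matlab
% Made-up sample lines: one good, one too long, one too short.
lines = {'   0.67  -2.02   0.10   14.32', ...
         '   0.64  -2.02   0.1  -2.34  -0.09   14.36', ...
         '   1.36  -2.31   14.16'};

% Assumed pattern: optional leading blanks, then exactly four numbers of
% the form [-]d.dd separated by whitespace, and nothing else on the line.
pattern = '^\s*(-?\d+\.\d{2}\s+){3}-?\d+\.\d{2}\s*$';

% regexp(..., 'once') returns [] when there is no match.
isGood = cellfun(@(s) ~isempty(regexp(s, pattern, 'once')), lines);

% Convert a line that passed the check into a 1-by-4 row of numbers.
vals = sscanf(lines{1}, '%f')';
```

A length check and a pattern check catch different faults, so using both on every line is a reasonable belt-and-braces approach.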

Branko

From: Rune Allnor

Date: 7 Dec, 2011 18:43:49

Message: 6 of 6

On 6 Dec, 16:43, "Nuno Martins" <nuno...@gmail.com> wrote:

> The biggest problem with the repair of the strings is that I am interested in the spectral content of the data, so the elimination of the bad string is not an option, as I would lose spectral information.

The data are already broken, as either the
data acquisition system or the logging system
was unable to keep up with the data stream.

That's an issue to discuss with whoever
specified the system. A lot of people seem
to think - erroneously! - that bad data can
be corrected in post-processing. If that's
the case here, you'd better find a different
project, and fast.

The question now is not how to 'repair' your
data, but how to minimize the damage.

It depends on how many non-compliant lines
you have: If there are only a handful, say,
a dozen or so, per file, then skip the bad
lines and use the rest of the data as is.

If there are a lot of broken lines, the
question becomes whether whatever analysis
you are up to is worthwhile at all.

Sometimes no answer whatsoever is better
than a bad answer.

Rune
