Thread Subject:
Reading file: 1 column of text in middle of 35 columns of numbers

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: Eli

Date: 28 Jun, 2012 16:19:07

Message: 1 of 11

I am trying to load a file with a mixture of text and data. There are 88 header lines (all need to be skipped) and then on line 89, the data begins separated by commas. The data consists of 35 columns and 976 rows. However, column 34 is text and contains the word "dtest_122.spe".

As an example, the first 4 lines (after the header) of data are shown below:

1, 34.8400, -15.8200, 34.8401, -15.8200, 49, 66, 70, 45, 125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 185.889, 1.0001, 20.0005, 19.8005, 40741, 39022, 0, 30558, 16651, 22097, 21779, 4452, dtest_122.spe, 0

1, 34.8500, -15.8200, 34.8500, -15.8200, 72, 97, 75, 46, 126, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 185.833, 1.00434, 20.0005, 19.7996, 40674, 38962, 0, 30516, 16653, 22003, 21711, 4458, dtest_122.spe, 4479

1, 34.8600, -15.8200, 34.8600, -15.8200, 117, 75, 71, 48, 123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 185.775, 0.983823, 20.0005, 19.8037, 40453, 38797, 0, 30332, 16528, 21903, 21753, 4475, dtest_122.spe, 8945

1, 34.8700, -15.8200, 34.8700, -15.8200, 83, 62, 83, 38, 113, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 185.716, 0.988099, 20.0005, 19.8029, 40513, 38799, 0, 30427, 16611, 22084, 21687, 4467, dtest_122.spe, 13416


PROBLEM:
I need to extract the numeric data and put it in a matrix (976 X 34).
--------------------------X----------------------------

My attempts:
Attempt 1. Using textread:
The file name is 6LT.001.dat, so I used this:
aaaa=textread('6LT.001.dat','%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s %f',-1,'delimiter',',','headerlines',88);

Error using dataread
Number of outputs must match the number of unskipped input fields.

Error in textread (line 176)
[varargout{1:nlhs}]=dataread('file',varargin{:});

Attempt 2. Using textread:
I manually opened the file (6LT.001.dat) and deleted column number 34 (with the word "dtest_122.spe" in it). I saved the file as 6LT.0011.dat and tried this:

aa=textread('6LT.0011.dat','',-1,'delimiter',',','headerlines',88);

This textread command worked and returned aa with a size of 976 X 34. This is what I want, but I can't go in manually and delete that column of text for every file. I have many such files.

Attempt 3. Using textscan (with the original file, 6LT.001.dat, that contains the column with the word "dtest_122.spe"):

fid=fopen('6LT.001.dat','rt');
val=textscan(fid,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s %f ','delimiter',',','headerlines',88,'CollectOutput'true);
 val=textscan(fid,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s %f ','delimiter',',','headerlines',88,'CollectOutput'true);
                                                                                                                                                                             |
Error: Unexpected MATLAB expression.
 
fclose(fid);
data=cat(1,val{:});
Undefined variable "val" or class "val".

--------------X-------------

Is there some way that I can modify the commands in Attempt 1 or Attempt 3 to get it to read in the data from the original file (6LT.001.dat)?

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: dpb

Date: 28 Jun, 2012 17:32:56

Message: 2 of 11

On 6/28/2012 11:19 AM, Eli wrote:
> I am trying to load a file with a mixture of text and data. There are 88
> header lines (all need to be skipped) and then on line 89, the data
> begins separated by commas. The data consists of 35 columns and 976
> rows. However, column 34 is text and contains the word "dtest_122.spe".
>
> As an example, the first 4 lines (after the header) of data are shown
> below:
>
> 1, 34.8400, -15.8200, 34.8401, -15.8200, 49, 66, 70, 45, 125, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 185.889, 1.0001, 20.0005, 19.8005, 40741, 39022,
> 0, 30558, 16651, 22097, 21779, 4452, dtest_122.spe, 0
...

> I need to extract the numeric data and put it in a matrix (976 X 34).
> --------------------------X----------------------------
>
> My attempts:
...

> Attempt 3. Using textscan (with the original file, 6LT.001.dat, that
> contains the column with the word "dtest_122.spe"):

fmt=[repmat('%f ',1,34) 'dtest_122.spe' '%f'];
fid=fopen('6LT.001.dat','rt');
val=textscan(fid, fmt, ...
              'delimiter',',', ...
              'headerlines',88, ...
              'CollectOutput',true);

should work.

I don't have a version that includes textscan() but I think there's
still a problem in handling '%*s' in that it doesn't "know" about
fields; it just goes. If this is not so, then you should be able to
replace the literal string matching but I don't think it'll work.

I'm presuming you know the specific string a priori and so can build it
into the construction of the format string; if you must learn it on the
fly then you'll have to read the first line w/ fgetl() and parse it to
find the appropriate string to use for any given case.

As a sample of a format that I can test,

 >> l=['0, 30558, 16651, 22097, 21779, 4452, dtest_122.spe, 0'];
 >> fmt=[repmat('%f, ',1,6) 'dtest_122.spe, ' '%f'];
 >> sscanf(l,fmt)
ans =
            0
        30558
        16651
        22097
        21779
         4452
            0
 >>
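
For the case where the text field is not known in advance, here is a minimal sketch of the fgetl() idea mentioned above: take one data line, cut the label out from between the last two commas, and build the format around it. The variable names (commas, label, nNum) are made up for illustration, and a label containing '%' or '\' would need escaping before it goes into a format string.

```matlab
% One data line, e.g. as returned by fgetl(fid) once the header is skipped
l = '0, 30558, 16651, 22097, 21779, 4452, dtest_122.spe, 0';

% The text field sits between the last two commas
commas = strfind(l, ',');
label  = strtrim(l(commas(end-1)+1 : commas(end)-1));

% Build the format around the label that was found, then convert the line
nNum = numel(commas) - 1;                       % numeric fields before the label
fmt  = [repmat('%f, ', 1, nNum) label ', %f'];
v    = sscanf(l, fmt);                          % column vector of the numbers
```

The same label could then be spliced into the longer format used for the whole file.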

Oh, I just had a brainstorm...the ',' delimiter may help; I don't know
how it'll work using the 'delimiter' option but...

 >> fmt=[repmat('%f, ',1,6) '%*s, ' '%f'];
 >> fmt
fmt =
%f, %f, %f, %f, %f, %f, %*s, %f
 >> sscanf(l,fmt)
ans =
            0
        30558
        16651
        22097
        21779
         4452
 >>

The search for the matching ',' explicitly _did_ successfully truncate
the string field skip...maybe textscan() will as well???

--

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: Eli

Date: 28 Jun, 2012 18:39:07

Message: 3 of 11

Here is what I am getting:

>> fmt=[repmat('%f ',1,34) 'dtest_122.spe' '%f'];
>> fid=fopen('6LT.001.dat','rt');
>> val=textscan(fid, fmt,'delimiter',',','headerlines',88,'CollectOutput',true);

>> val

val =

    [1x35 double]


I need to see the values of val. Also it should be 976 X 34. I think I might be doing something wrong?
-----------------------X--------------------------

For the *s part, here is what I get:

>> clear
>> clc
>> l=['0, 30558, 16651, 22097, 21779, 4452, dtest_122.spe, 0'];
>> fmt=[repmat('%f, ',1,6) '%*s, ' '%f'];
>> sscanf(l,fmt)

ans =

           0
       30558
       16651
       22097
       21779
        4452


>> textscan(l,fmt)

ans =

  Columns 1 through 6

    [0] [30558] [16651] [22097] [21779] [4452]

  Column 7

    [0x1 double]

It recognizes the string but, as above, it isn't in the format required (976 X 34). Again though, am I doing something wrong?

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: dpb

Date: 28 Jun, 2012 19:22:06

Message: 4 of 11

On 6/28/2012 1:39 PM, Eli wrote:
> Here is what I am getting:
>
>>> fmt=[repmat('%f ',1,34) 'dtest_122.spe' '%f'];
>>> fid=fopen('6LT.001.dat','rt');
>>> val=textscan(fid,
>>> fmt,'delimiter',',','headerlines',88,'CollectOutput',true);
>
>>> val
>
> val =
> [1x35 double]
>
>
> I need to see the values of val. Also it should be 976 X 34. I think I
> might be doing something wrong?
> -----------------------X--------------------------

Well, for one thing you're not checking up on me... :)

fmt=[repmat('%f ',1,33) 'dtest_122.spe' '%f'];

I had that there were 35 resulting values instead of 34 in my head so
there's an extra conversion in the format string. READ what is written
and THINK about it; don't just cut 'n paste blindly...I don't try to do
these on purpose, but it is your dataset after all... :)

The values returned from textscan are collected in cell arrays--to see
the contents of the cell you have to dereference the cell w/ the curly
brackets {}. The ordinary parentheses simply give the size/type of the
cell array itself.

Fix the format and try again and then look at the dereferenced cell
contents...
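
A minimal sketch of the corrected read, run against a small stand-in file rather than the real 6LT.001.dat: two header lines and two made-up data rows, each with 33 numbers, the text field, and a final number. It uses '%s' for the text column (which Eli later reports works with textscan) and then dereferences the cells with {}. The file contents and the names fname, data and labels are assumptions for illustration; for the real file the header count would be 88.

```matlab
% Build a small stand-in file (hypothetical contents, same layout as the thread)
fname = [tempname '.dat'];
fid = fopen(fname, 'wt');
fprintf(fid, 'header line 1\nheader line 2\n');
fprintf(fid, '%s\n', [sprintf('%g, ', 1:33)    'dtest_122.spe, 0']);
fprintf(fid, '%s\n', [sprintf('%g, ', 101:133) 'dtest_122.spe, 4479']);
fclose(fid);

% 33 numeric fields, one text field, one numeric field
fmt = [repmat('%f', 1, 33) '%s' '%f'];
fid = fopen(fname, 'rt');
val = textscan(fid, fmt, 'Delimiter', ',', 'HeaderLines', 2, ...
               'CollectOutput', true);
fclose(fid);
delete(fname);

% val is a cell array; {} pulls out the contents of each cell
data   = [val{1} val{3}];   % rows x 34 numeric matrix
labels = val{2};            % rows x 1 cell array holding the text column
```

Typing val on its own only shows the size and type of each cell; val{1} shows the numbers themselves.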

> For the *s part, here is what I get:
>
>>> l=['0, 30558, 16651, 22097, 21779, 4452, dtest_122.spe, 0'];
>>> fmt=[repmat('%f, ',1,6) '%*s, ' '%f'];
>>> sscanf(l,fmt)
>
> ans =
>
> 0
> 30558
> 16651
> 22097
> 21779
> 4452
>
>
>>> textscan(l,fmt)
>
> ans =
> Columns 1 through 6
>
> [0] [30558] [16651] [22097] [21779] [4452]
>
> Column 7
>
> [0x1 double]
>
> It recognizes the string but, as above, it isn't in the format required
> (976 X 34). Again though, am I doing something wrong?

Well, it doesn't--I misspoke as I failed to notice it did _NOT_
successfully convert the trailing '0' after the skipped string--it
illustrates the problem of the '%*s' in trying to skip a string field.
It's another terrible flaw in the way C i/o scanning works but afaict
there's nothing to be done about it other than the kludgy workarounds of
having to recognize a specific character-matching string or parsing a
line for tokens and converting them independently.
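
As a rough sketch of that last workaround (parse a line into tokens and convert them independently), assuming a release that has regexp with the 'split' option; the names tokens, vals and nums are only illustrative. Anything that does not convert to a number comes back as NaN and is dropped, so this would also drop a genuine NaN in the data.

```matlab
l = '0, 30558, 16651, 22097, 21779, 4452, dtest_122.spe, 0';

% Split on commas (with optional surrounding blanks), convert each token
tokens = regexp(strtrim(l), '\s*,\s*', 'split');
vals   = str2double(tokens);      % non-numeric tokens become NaN

% Keep only the fields that converted
nums = vals(~isnan(vals));
```

Applied line by line with fgetl(), this builds the numeric matrix one row at a time, at the cost of being slower than a single textscan call.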

--

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: Eli

Date: 2 Jul, 2012 18:47:08

Message: 5 of 11

^^^^^Just wanted to update this, and post a related question after that:
It worked exactly as you said. I mistakenly assumed that the 34 was after all the extraction was complete. It wasn't though. And it should have been 33, as you indicated.
Also, the %s part worked as well - i.e. there was no need, with textscan(), to manually type in the string. The string was automatically detected. Thanks for this help.
--------------------------------X---------------------------
I am trying to distinguish between the following 2 filenames:

6LT.001.dat
6LT.001_spectra1.dat

For the second filename, as you can tell, it is a file that I have manually generated. The original (input) file was 6LT.001_spectra. I processed it and produced 6LT.001_spectra1 (i.e. I appended the 1 at the end of it).

>> name
name =
6LT.001


>> name2
name2 =
6LT.001_spectra1

Working on name2, I have tried regexp with no luck:
aaa=regexp(name2,'*.\d*_spectra1','tokens');

1. Is there a way to extract 6LT001 (that's 6LT.001, without the period) from this variable name2?

2. Also, is there a way for me to distinguish between these two filenames?

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: dpb

Date: 2 Jul, 2012 19:34:41

Message: 6 of 11

On 7/2/2012 1:47 PM, Eli wrote:
...

> Also, the %s part worked as well - i.e. there was no need, with
> textscan(), to manually type in the string. The string was automatically
> detected. Thanks for this help.

That's good to know that TMW has finally managed to get that to work in
at least one function...I've been complaining about it for 15 years or
so... :)

> --------------------------------X---------------------------
> I am trying to distinguish between the following 2 filenames:
>
> 6LT.001.dat
> 6LT.001_spectra1.dat
>
...

> processed it and produced 6LT.001_spectra1 (i.e. I appended the 1 at the
> end of it).
>
>>> name
> name =
> 6LT.001
>
>
>>> name2
> name2 =
> 6LT.001_spectra1
>
> Working on name2, I have tried regexp with no luck:
> aaa=regexp(name2,'*.\d*_spectra1','tokens');
>
> 1. Is there a way to extract 6LT001 (that's 6LT.001, without the
> period) from this variable name2?

I don't have regexp in my version of ML so I don't even try to get the
nuances of its syntax vis a vis that with which I am familiar...you're
on your own or post a separate thread on regexp if you want help on it,
specifically.
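
For the regexp route asked about, one possible sketch, assuming a release that has regexp with the 'tokens' and 'once' options. The pattern is only a guess at the general form of the names (anything, then '_spectra', then optional digits); tok and id are illustrative names.

```matlab
name2 = '6LT.001_spectra1';

% Capture everything in front of '_spectra' plus any trailing digits
tok = regexp(name2, '^(.*)_spectra\d*$', 'tokens', 'once');

if isempty(tok)
    id = '';                        % name does not follow the assumed pattern
else
    id = strrep(tok{1}, '.', '');   % drop the period from the captured part
end
```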

If this is a fixed-length name it's pretty simple...

strrep(nameX(1:7),'.','')

or, really is even if it isn't...

strrep(nameX(1:7),'.','')

strrep(n(1:findstr(n,'_')-1),'.','')

> 2. Also, is there a way for me to distinguish between these two filenames?

"Distinguish", how, specifically? That they're not the same is pretty clear:

strcmp(name,name2) % returns false unless the two are the same length and match

If you mean determine which one contains the added on stuff, like the
word 'spectra', then look for it

findstr(name2,'spectra')>0 % true if 'spectra' is embedded; empty (tests as false in an if) when it is not

If you mean the numeric field added beyond '6LT.001_spectra' to make
'6LT.001_spectra1', then

length(nameX)>findstr(nameY,'spectra')+length('spectra')

will return T if there is a trailing character after the substring
spectra in the nameY (and the preceding lengths up to that are the
same). That's sufficient if you're always working on a match up to that
point; if it could be any other file that may also be of same length
name, then you need to do the strcmp() test on that N characters first
-- see strncmp() and friends for that...
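
A small worked sketch of that test, using strfind in place of findstr and two made-up names that differ only in the appended '1'. One detail worth checking when adapting it: strfind(...)+length('spectra') is the index of the first character after the substring, so for names without an extension the comparison needs >= to catch a single trailing character; with a strict > it only fires when at least two characters follow.

```matlab
nameA = '6LT.001_spectra';     % no trailing character
nameB = '6LT.001_spectra1';    % one trailing character
key   = 'spectra';

% Index of the first character after the substring in each name
nextA = strfind(nameA, key) + length(key);
nextB = strfind(nameB, key) + length(key);

hasTrailA = length(nameA) >= nextA;    % false: nothing follows 'spectra'
hasTrailB = length(nameB) >= nextB;    % true: the '1' follows
strictB   = length(nameB) >  nextB;    % false: > needs two trailing characters

% Confirm the two names agree over the length of the shorter one
sameStem = strncmp(nameA, nameB, length(nameA));
```

With the '.dat' extension still attached, every name containing 'spectra' has trailing characters, so either comparison then only separates names with 'spectra' from names without it.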

Again, I'd suggest under Windows to not use extra dots but underscores
as the OS doesn't really know anything about additional fields in a
filename other than name.ext and I'd also again recommend that if you
are going to use a series of numerals to identify a series of files that
you create a field of some specific width and use leading zeros so that
directory name sorting will provide a logical alphabetic sort order. If
the number of revisions can _NEVER_ exceed 0 thru 9, then you can get by
as is, but it would seem only prudent to at least assign two digits 'cuz
as soon as you think you'll never need it, then's when the case will
immediately arise and you'll rue the day you decided not to... :)
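
To illustrate the leading-zero point, a short sketch that builds names with a two-digit revision field via sprintf. The stem '6LT_001_spectra' (underscores instead of the extra dot) is a hypothetical example, not one of the actual file names.

```matlab
base = '6LT_001_spectra';     % hypothetical stem

padded   = cell(1, 12);
unpadded = cell(1, 12);
for k = 1:12
    padded{k}   = sprintf('%s%02d.dat', base, k);   % ..._spectra01.dat
    unpadded{k} = sprintf('%s%d.dat',   base, k);   % ..._spectra1.dat
end

% Alphabetic sort keeps the padded names in numeric order,
% but puts ..._spectra10.dat right after ..._spectra1.dat
sortedPadded   = sort(padded);
sortedUnpadded = sort(unpadded);
```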

--

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: Eli

Date: 2 Jul, 2012 22:47:07

Message: 7 of 11

> If this is a fixed-length name it's pretty simple...
>
> strrep(nameX(1:7),'.','')
>
> or, really is even if it isn't...
>
> strrep(nameX(1:7),'.','')
>
> strrep(n(1:findstr(n,'_')-1),'.','')

The first command is working fine. The next 2 are of importance to me though - it isn't a fixed-length name. With this in mind, do I need the second of the 3 commands? Also, what is n......for the example 6LTTTT.001, I guess n would be 10. How will it know this....is there some way to specify the name of the string?


> length(nameX)>findstr(nameY,'spectra')+length('spectra')
>
> will return T if there is a trailing character after the substring
> spectra in the nameY (and the preceding lengths up to that are the
> same).

I tried:
length(nameX)>findstr(nameX,'spectra')+length('spectra')

and it worked. i.e. I replaced your nameY with nameX and it worked. If this was just luck, then what is the difference between nameX and nameY?

PS: It gave a message that findstr will be replaced by strfind. So I made this change when I tested both of the length(nameX) commands.

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: dpb

Date: 3 Jul, 2012 00:31:02

Message: 8 of 11

On 7/2/2012 5:47 PM, Eli wrote:
>> If this is a fixed-length name it's pretty simple...
>>
>> strrep(nameX(1:7),'.','')
>>
>> or, really is even if it isn't...
...
> The first command is working fine. The next 2 are of importance to me
> though - it isn't a fixed-length name. With this in mind, do I need the
> second of the 3 commands? Also, what is n......for the example
> 6LTTTT.001, I guess n would be 10. How will it know this....is there
> some way to specify the name of the string?

Oh, I cut 'n pasted some sample code from my command window--there I was
using 'n' as a filename. I forgot to edit in the reply window. The
variable length assuming there's an underscore version was supposed to be

strrep(nameX(1:findstr(nameX,'_')-1),'.','')
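
Spelled out with strfind (the replacement for findstr in newer releases) and a guard for names that have no underscore, a sketch might look like the following. The longer example name and the variables us, stem and id are just for illustration; taking us(1) covers names with more than one underscore.

```matlab
nameX = '6LTTTT.001_spectra1';

us = strfind(nameX, '_');           % positions of all underscores
if isempty(us)
    stem = nameX;                   % no underscore: keep the whole name
else
    stem = nameX(1:us(1)-1);        % everything before the first underscore
end

id = strrep(stem, '.', '');         % remove the period
```

Nothing here depends on the length of the name, which is why the 1:7 version is not needed for variable-length names.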

>
>> length(nameX)>findstr(nameY,'spectra')+length('spectra')
>>
>> will return T if there is a trailing character after the substring
>> spectra in the nameY (and the preceding lengths up to that are the same).
>
> I tried:
> length(nameX)>findstr(nameX,'spectra')+length('spectra')
>
> and it worked. i.e. I replaced your nameY with nameX and it worked. If
> this was just luck, then what is the difference between nameX and nameY?

It was simply a choice of two file names which you wanted to compare.
You'll have to "salt to suit" to fit the actual kind of test you're
trying to make--it wasn't clear to me what that was, exactly, so I tried
to just give an example "go-by" that you could use as a starting point.

> PS: It gave a message that findstr will be replaced by strfind. So I
> made this change when I tested both of the length(nameX) commands.

Yeah, that postdates my Matlab release so I can only use findstr here...

--

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: Eli

Date: 3 Jul, 2012 19:36:07

Message: 9 of 11

> Oh, I cut 'n pasted some sample code from my command window--there I was
> using 'n' as a filename. I forgot to edit in the reply window. The
> variable length assuming there's an underscore version was supposed to be
>
> strrep(nameX(1:findstr(nameX,'_')-1),'.','')
>

In order to extract the 6LT001 from 6LT.001_spectra.dat, I used this:
strrep(nameX(1:findstr(nameX,'_')-1),'.','')

In order to extract the 6LT001 from 6LT.001.dat, I used this:
strrep(nameX(1:findstr(nameX,'.dat')-1),'.','')

Both worked. Is the second command functioning as intended? I mean, is the '.dat' the way to go or is there some other indicator where I can stop (with the first command, it stopped at the underscore)?

> >> length(nameX)>findstr(nameY,'spectra')+length('spectra')
> >>
> >> will return T if there is a trailing character after the substring
> >> spectra in the nameY (and the preceding lengths up to that are the same).
> >
> > I tried:
> > length(nameX)>findstr(nameX,'spectra')+length('spectra')
> >
> > and it worked. i.e. I replaced your nameY with nameX and it worked. If
> > this was just luck, then what is the difference between nameX and nameY?
>
> It was simply a choice of two file names which you wanted to compare.
> You'll have to "salt to suit" to fit the actual kind of test you're
> trying to make--it wasn't clear to me what that was, exactly, so I tried
> to just give an example "go-by" that you could use as a starting point.

I wanted to send all filenames ending with spectra1 to a cell array M and those without this ending to another array N. So, that's how I am hoping to distinguish between the 2 files:
6LT.001.dat
6LT.001_spectra1.dat

I did this with this command:
if length(nameX)>findstr(nameX,'spectra')+length('spectra')
M{} = ......
end

As I mentioned, I replaced nameY with nameX. And it seems to be working. My explanation is this:
findstr() is searching through the filenames for the string 'spectra'. It finds it at position 9 (in the name 6LT.001_spectra.dat). Then, if the length of the filename (which includes spectra) > 9 + 7 (7 is the length of the string 'spectra'), it assigns the filename to the array M; otherwise it assigns it to the array N.

Is this the correct way to do it? Or is this not what was intended?

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: dpb

Date: 3 Jul, 2012 19:49:12

Message: 10 of 11

On 7/3/2012 2:36 PM, Eli wrote:
...

> In order to extract the 6LT001 from 6LT.001.dat, I used this:
> strrep(nameX(1:findstr(nameX,'.dat')-1),'.','')
>
...
> the '.dat' the way to go or is there some other indicator where I can
> stop (with the first command, it stopped at the underscore)?

Well, it all depends on the general form of the name you're looking for.
  For the specific example form, that's about the only thing to search
on that is unique if the length up to the .ext is variable.

There are other ways to find/accomplish the same thing of course from
sscanf() to fileparts() and/or searching backwards instead of forwards,
etc., etc, etc. The point is I can't know more about the actual
application than what you've shown as a specific example--I provided a
solution that works for it. Whether it will work in general for you
I've no way to know...

"Best" is whatever version is robust-enough to solve your general problem.
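
As one example of the fileparts() route, a sketch that handles both forms of the name with the same code: fileparts strips the extension (it treats the last dot as the start of the extension), strtok keeps whatever precedes the first underscore, and strrep removes the remaining period. It assumes a release that accepts ~ for an unused output; files, base, stem and ids are illustrative names.

```matlab
files = {'6LT.001.dat', '6LT.001_spectra1.dat'};
ids   = cell(size(files));

for k = 1:numel(files)
    [~, base] = fileparts(files{k});   % '6LT.001' or '6LT.001_spectra1'
    stem      = strtok(base, '_');     % part before the first underscore
    ids{k}    = strrep(stem, '.', ''); % drop the period
end
```

Both entries of ids come out as '6LT001', so no separate search for '.dat' or '_' is needed.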

>
...

> I wanted to send all filenames ending with spectra1 to a cell array M
> and those without this ending to another array N. So, that's how I am
> hoping to distinguish between the 2 files:
...

> Is this the correct way to do it? Or is this not what was intended?

All that was intended was an example of one way to distinguish between
two more-or-less generically similar filenames w/ few assumptions on
what the names were other than that the one had an appended '1' on it.
I didn't know if they were intended to be the same or what; it was just
a starting point.

If the decision is as simple as the above statement then you might as
well just do a findstr() on the substring 'spectra1' in the file name
and forget anything more exotic at all.
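
A minimal sketch of that simpler test, using strfind on a cell array of names and sorting them into M and N as described. The list of names is made up for illustration.

```matlab
names = {'6LT.001.dat', '6LT.001_spectra1.dat', ...
         '6LT.002.dat', '6LT.002_spectra1.dat'};

% strfind on a cell array returns one cell per name; empty means no match
hits   = strfind(names, 'spectra1');
isSpec = ~cellfun('isempty', hits);

M = names(isSpec);     % names containing 'spectra1'
N = names(~isSpec);    % everything else
```

This checks for 'spectra1' anywhere in the name; if it must sit directly in front of the extension, the test would need tightening.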

Again, it's a case of use the general ideas and mold them to fit the
actual situation--don't rely on my examples as being the holy grail...

--

Subject: Reading file: 1 column of text in middle of 35 columns of numbers

From: Eli

Date: 5 Jul, 2012 17:15:12

Message: 11 of 11

Thanks. I practiced with a few strings and I think I get the basic idea.

Thanks for all the help. I believe that all my questions in this thread have been answered.
