Thread Subject: reading an annoying ascii text file

Subject: reading an annoying ascii text file

From: Derik

Date: 25 Oct, 2009 21:45:03

Message: 1 of 3

Dear Sunday readers,
I am trying to read the file format below. I tried textscan but I must be missing something: I get either errors or an empty cell (I am running version 7.5.0, R2007b).
I have several difficulties as a beginner:
* the double quotes around every field do not seem to be handled properly
* unfortunately the comma delimiter is also the thousands separator
* I would like to use the first line as the variable names of the columns
* I would like to convert the date strings "MM/DD/YYYY" to MATLAB dates
* the file is around 7000 lines and 70 variables

extract of the file:
"Fund_ID","Fund","Firm","Structure","Minimum_Investment","Additional_Investment","Inception","Reporting"
"10003","Enterprise Fund Ltd. (Class E) - Emerging Markets","Advantage Management Limited","Corporation","10,000","","06/01/2003","Monthly"

Thank you very much in advance
derik

From: Doug Schwarz

Date: 26 Oct, 2009 03:43:59

Message: 2 of 3

In article <hc2gsv$375$1@fred.mathworks.com>,
 "Derik " <d.nospam.schupbach@lombardodier.please.com> wrote:

> Dear Sunday readers,
> I am trying to read the file format below. I tried textscan but I must be
> missing something: I get either errors or an empty cell (I am running
> version 7.5.0, R2007b).
> I have several difficulties as a beginner:
> * all these doublequotes seem not to be well understood

Use the %q format with textscan.
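To illustrate the %q suggestion, here is a minimal sketch run on a shortened version of the sample row rather than on the file. It assumes textscan is given a string as its first argument; the variable names are made up for the example.

```matlab
% Minimal sketch: %q reads a double-quoted field and drops the quotes,
% so a comma inside the quotes is not treated as a field delimiter.
sample = '"10003","10,000","","06/01/2003"';   % shortened sample row (assumed)
C = textscan(sample, '%q%q%q%q', 'Delimiter', ',');

fund_id    = C{1}{1};   % first field, without the surrounding quotes
min_invest = C{2}{1};   % should still contain its thousands comma
additional = C{3}{1};   % the empty "" field comes back as an empty string
inception  = C{4}{1};   % date still stored as a string at this point
```

Each output cell of C holds one column, which is why the first element of each column is picked out with C{k}{1}.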


> * Unfortunately the comma delimiter is also the thousand delimiter
> * I would like to have the first line transformed as the variable names of
> the columns

Don't do this; it's more trouble than it's worth. Instead, use the
column headers as field names for a structure array.


> * I would like to change the date string "MM/DD/YYYY" to matlab dates
> * the file is around 7000 lines and 70 variables
>
> extract of the file:
> "Fund_ID","Fund","Firm","Structure","Minimum_Investment","Additional_Investment","Inception","Reporting"
> "10003","Enterprise Fund Ltd. (Class E) - Emerging Markets","Advantage Management Limited","Corporation","10,000","","06/01/2003","Monthly"
>
> Thank you very much in advance
> derik

Here's what I would do (assume your data is in a file called derik.dat):

% Read in entire file.
fid = fopen('derik.dat');
header = textscan(fid,'%q%q%q%q%q%q%q%q',1,'Delimiter',',');
raw = textscan(fid,'%q%q%q%q%q%q%q%q','Delimiter',',');
fclose(fid);
 
% Store data in a structure array, data.
fields = [header{:}];
raw_array = [raw{:}];
data = cell2struct(raw_array,fields,2);
 
% Convert column 5 (Minimum_Investment) from string to numeric.
% str2double accepts the comma as a thousands separator; empty strings become NaN.
min_invest_str = {data.(fields{5})};
min_invest = str2double(min_invest_str);
min_invest_cell = num2cell(min_invest);
[data.(fields{5})] = min_invest_cell{:};
 
% Convert column 7 (Inception) into date numbers.
date_str = {data.(fields{7})};
date_num = datenum(date_str,'mm/dd/yyyy');
date_num_cell = num2cell(date_num);
[data.(fields{7})] = date_num_cell{:};
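
Since the real file has about 70 columns rather than the 8 in the extract, the format string does not have to be typed by hand. The following is only a sketch under the assumption that every field in the header line is double-quoted, as in the extract; the file name is the same assumed 'derik.dat'.

```matlab
% Sketch: build the '%q%q...' format from the number of header fields.
fid = fopen('derik.dat');                 % assumed file name, as above
first_line = fgetl(fid);                  % raw text of the header line
frewind(fid);                             % go back to the start of the file

% Count the quoted fields in the header line (assumes all are quoted).
n_cols = numel(regexp(first_line, '"[^"]*"', 'match'));
fmt = repmat('%q', 1, n_cols);            % e.g. '%q%q%q%q%q%q%q%q' for 8 columns

header = textscan(fid, fmt, 1, 'Delimiter', ',');
raw    = textscan(fid, fmt, 'Delimiter', ',');
fclose(fid);

% Same structure-array construction as above.
fields = [header{:}];
data   = cell2struct([raw{:}], fields, 2);
```

After the conversions shown earlier, a whole column can be collected with, for example, [data.Minimum_Investment], and a single date displayed with datestr(data(1).Inception).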

--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.

From: Branko

Date: 27 Oct, 2009 08:13:04

Message: 3 of 3

"Derik " <d.nospam.schupbach@lombardodier.please.com> wrote in message <hc2gsv$375$1@fred.mathworks.com>...
> Dear Sunday readers,
> I am trying to read the file format below. I tried textscan but I must be missing something: I get either errors or an empty cell (I am running version 7.5.0, R2007b).
> I have several difficulties as a beginner:
> * all these doublequotes seem not to be well understood
> * Unfortunately the comma delimiter is also the thousand delimiter
> * I would like to have the first line transformed as the variable names of the columns
> * I would like to change the date string "MM/DD/YYYY" to matlab dates
> * the file is around 7000 lines and 70 variables
>
> extract of the file:
> "Fund_ID","Fund","Firm","Structure","Minimum_Investment","Additional_Investment","Inception","Reporting"
> "10003","Enterprise Fund Ltd. (Class E) - Emerging Markets","Advantage Management Limited","Corporation","10,000","","06/01/2003","Monthly"
>
> Thank you very much in advance
> derik

Another approach using regexp:

fid = fopen(filename,'rt');
val = textscan(fid,'%s','Delimiter','\n'); % Read each line of the file as one string
fclose(fid);

Header = regexp(val{1}{1},'\w+','match'); % Extract the column names from the first line
as = regexprep(val{1}{2},'\d{1,3}(,\d{3})+','${strrep($&,'','','''')}'); % Replace 10,000 with 10000
as = regexprep(as,'\d{2}/\d{2}/\d{4}','${num2str(datenum($&,''mm/dd/yyyy''))}'); % Convert date strings to MATLAB serial date numbers
as = regexprep(as,'"',''); % Remove double quotes
Data = regexp(as,',','split'); % Split into fields (assumes no commas are left inside a field)
Data{5} = str2double(Data{5}); % Convert string to numeric
Data{7} = str2double(Data{7}); % Convert string to numeric
DATA = cell2struct(Data,Header,2); % 1x1 struct holding the first data line only
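
The code above processes only the first data line. A possible extension to all lines is sketched below; it is an assumption-laden variant, not the method posted above: the file name is assumed, every row is assumed to have the same number of quoted fields as the header, and the line is split on the quote-comma-quote sequence so that commas inside a text field do not create extra columns.

```matlab
% Sketch: apply the regexp approach to every data line of the file.
filename = 'derik.dat';                          % assumed file name
fid = fopen(filename, 'rt');
val = textscan(fid, '%s', 'Delimiter', '\n');    % one string per line
fclose(fid);

rows   = val{1};
Header = regexp(rows{1}, '\w+', 'match');        % column names from line 1
nRows  = numel(rows) - 1;
Data   = cell(nRows, numel(Header));

for k = 1:nRows
    % Drop thousands separators, e.g. 10,000 becomes 10000.
    s = regexprep(rows{k+1}, '\d{1,3}(,\d{3})+', '${strrep($&,'','','''')}');
    s = regexprep(s, '^"|"$', '');               % strip the outermost quotes
    Data(k,:) = regexp(s, '","', 'split');       % split on quote-comma-quote
end

% Convert columns 5 and 7 for all rows at once.
Data(:,5) = num2cell(str2double(Data(:,5)));
Data(:,7) = num2cell(datenum(Data(:,7), 'mm/dd/yyyy'));

DATA = cell2struct(Data, Header, 2);             % nRows-by-1 structure array
```

Note that the datenum call will fail if any row has an empty or malformed Inception field, so real data may need those rows filtered out first.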

Branko
