Path: news.mathworks.com!not-for-mail
From: "Branko " <bogunovic@mbss.org>
Newsgroups: comp.soft-sys.matlab
Subject: reading alphanumeric data
Date: Thu, 29 Oct 2009 07:26:01 +0000 (UTC)
Organization: National Institute of Biology
Lines: 82
Message-ID: <hcbg29$3td$1@fred.mathworks.com>
References: <hb6otr$oq4$1@fred.mathworks.com> <hb9c01$32e$1@fred.mathworks.com> <hbn2eo$p1$1@fred.mathworks.com> <hbp3cn$n2i$1@fred.mathworks.com> <hbs8m5$86$1@fred.mathworks.com> <hbsa1m$rbd$1@fred.mathworks.com> <hbsbcr$omg$1@fred.mathworks.com> <hc3k9d$6ha$1@fred.mathworks.com> <hc9kdi$a3m$1@fred.mathworks.com>
Reply-To: "Branko " <bogunovic@mbss.org>
NNTP-Posting-Host: webapp-03-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1256801161 4013 172.30.248.38 (29 Oct 2009 07:26:01 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Thu, 29 Oct 2009 07:26:01 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 237386
Xref: news.mathworks.com comp.soft-sys.matlab:580855


"burcu " <burcu102@hotmail.com> wrote in message <hc9kdi$a3m$1@fred.mathworks.com>...
> Hi Branko,
> Thank you very much for this code. I have been reading the help files of cat and regexp and tried the code you've provided to me with my dataset. 
> 
> My dataset includes many rows like this:
> 
> 0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00
> 
> So I applied your code like this and got an error:
> 
> >> fid=fopen('ked.txt');
> >> data=textscan(fid, '%s', 'delimiter',',');
> >> fclose(fid);
> >> data =cat(1,data{:});
> >> string_data=regexp(data,'([A-Z a-z]+)','match');
> >> numeric_data=regexp(data, '([0.00-9.99 0-100000]+)','match');
> >> numeric_data=cat(1,numeric_data{:});
> >> string_data=cat(1,string_data{:});
> ??? Error using ==> cat
> CAT arguments dimensions are not consistent.
> 
> Do you have any idea or advice on this? Also, is [0.00-9.99 0-100000] correct for my data type? I have numerical variables like 10027 etc., so I also want to give a range like this along with your 0.00-9.99.
> 
> One more thing: I am not sure why you set the format to just %s in your code. Is that how it has to be used with the regexp function? I used to define my data like:
> %u8, %s, %s, %s, %u16, %u16, %u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%u8,%f,%f,%f,%f,%f,%f,%f, %u16, %u16,%f,%f,%f,%f,%f,%f,%f,%f
> 
> Thanks!
> Burcu
> ----------------------------
> > 
> > Here is one approach to solve your problem. As I mentioned previously, you should use the regexp function, which is useful in cases like this.
> >  
> > fid = fopen(filename,'rt');
> > data=textscan(fid,'%s','delimiter','','headerlines', 0);
> > fclose(fid);
> > 
> > % Engine - use regexp!
> > data=cat(1,data{:});
> > String_data=regexp(data,'([A-Z a-z]+)','match'); % Remove all numeric
> > Numeric_data=regexp(data,'([0.00-9.99]+)','match'); % Remove all letters 
> > 
> > Numeric_data=cat(1,Numeric_data{:});
> > String_data=cat(1,String_data{:});
> > DATA=[String_data(:,1:3) Numeric_data(:,1:end-1) String_data(:,end)];
> > 
> > Branko


Burcu,

The example above was written for the data you provided (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). Here is an example.

data={'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.'
'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.'
'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.'};

% Engine - use regexp!
String_data=regexp(data,'([A-Z a-z]+)','match'); % Keep only the letter runs (text fields)
Numeric_data=regexp(data,'([0.00-9.99]+)','match'); % Keep only the runs of digits and dots (numeric fields)

Numeric_data=cat(1,Numeric_data{:});
String_data=cat(1,String_data{:});
% Numeric_data(:,end) is the '.' matched from 'normal.', so 1:end-1 drops it
DATA=[String_data(:,1:end-1) Numeric_data(:,1:end-1) String_data(:,end)];
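
A note on the numeric pattern, as a small sketch: inside square brackets the expression is a character class, so '[0.00-9.99]' should behave like '[0-9.]' (any digit or a dot) rather than a numeric range. The row below is a hypothetical shortened version of the rows above.

```matlab
% Hypothetical shortened row, same layout as the example data
row = '0,tcp,http,SF,181,0.00,normal.';

a = regexp(row, '([0.00-9.99]+)', 'match');  % pattern used above
b = regexp(row, '[0-9.]+', 'match');         % simpler equivalent class

same = isequal(a, b);   % both give {'0','181','0.00','.'}
last = a{end};          % the '.' comes from the label 'normal.'
```

This is why values such as 10027 are already matched without adding a second range, and why the last numeric column is dropped with 1:end-1 when the rows end in 'normal.'.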

I used %s to read all the data as strings because regexp operates only on strings, not on numeric values (%f).
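
If you need numbers again after the regexp step, one possible follow-up (a sketch, not part of the code above) is to convert the matched strings with str2double. The shortened row and the variable names tokens and values are only for illustration.

```matlab
% Hypothetical shortened row; the real rows have many more fields
row = {'0,tcp,http,SF,181,5450,0.00,1.00,normal.'};

tokens = regexp(row, '([0.00-9.99]+)', 'match');  % strings, not numbers
tokens = cat(1, tokens{:});                       % 1-by-6 cell of strings

% Drop the trailing '.' (from 'normal.') and convert the rest to double
values = str2double(tokens(:, 1:end-1));
```

str2double accepts a cell array of strings and returns a double array of the same size, with NaN for anything it cannot convert.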

The same applies to the rows from your post (in this case you have 40 columns):
data={'0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00'
'0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00'
'0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00'};

% Engine - use regexp!
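String_data=regexp(data,'([A-Z a-z]+)','match'); % Keep only the letter runs (text fields)
Numeric_data=regexp(data,'([0.00-9.99]+)','match'); % Keep only the runs of digits and dots (numeric fields)

Numeric_data=cat(1,Numeric_data{:});
String_data=cat(1,String_data{:});
% These rows have no trailing label: 1:end-1 drops the last numeric field and 'SF' ends up in the last column
DATA=[String_data(:,1:end-1) Numeric_data(:,1:end-1) String_data(:,end)];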
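
For rows without a trailing label, an alternative sketch (not the regexp approach above) is to split each row on the commas, which keeps every field in its original position. The row is a hypothetical shortened one, and the names fields, values, is_text, text_fields and num_fields are only for illustration.

```matlab
% Hypothetical shortened row without a label
row = '0,udp,private,SF,105,146,0.00,1.00';

fields = regexp(row, ',', 'split');   % 1-by-8 cell, original order kept
values = str2double(fields);          % NaN where the field is text

is_text     = isnan(values);
text_fields = fields(is_text);        % {'udp','private','SF'}
num_fields  = values(~is_text);       % [0 105 146 0 1]
% Caveat: str2double also parses strings such as 'inf', 'nan' or 'i',
% so check that no text field in the real data looks like a number.
```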

Try copying the data above into an example file and running the code on it (reading with %s); it should work for you. It works in my MATLAB.
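
As a sketch of that test, the code below writes two shortened rows to a temporary file and reads each line back as one string. The file name and the shortened rows are assumptions for illustration. With 'delimiter',',' every field becomes its own cell, so regexp returns a different number of matches per cell; that is one plausible cause of the CAT error, although I have not checked it against your file.

```matlab
% Write two hypothetical shortened rows to a temporary file
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', '0,tcp,http,SF,181,5450,0.00,normal.');
fprintf(fid, '%s\n', '0,tcp,http,SF,239,486,0.00,normal.');
fclose(fid);

% Read each line as a single string; the comma is NOT used as a delimiter
fid = fopen(fname, 'rt');
data = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
delete(fname);

data = cat(1, data{:});                  % 2-by-1 cell, one string per row
String_data  = regexp(data, '([A-Z a-z]+)', 'match');
Numeric_data = regexp(data, '([0.00-9.99]+)', 'match');
String_data  = cat(1, String_data{:});   % 2-by-4: tcp, http, SF, normal
Numeric_data = cat(1, Numeric_data{:});  % 2-by-5: last column is '.'
```

Because every row yields the same number of matches, both cat calls get consistent dimensions.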

Branko