Extract values from a .txt file with a weird layout

laury on 7 Mar 2014
Edited: dpb on 10 Mar 2014
Hi, so here is my problem: I'd like to extract values (doubles) from a file where different columns are sometimes not separated by any kind of delimiter.
The text file can look like this:
1994010103
12.05 54.60 38.00 0.28
12.10 54.60 43.00 0.30
13.10 54.60 99.00 0.33
13.15 54.60100.00 0.34
13.20 54.60 0.00 0.00
13.25 54.60128.00 0.16
I'm interested in the values in the third column. The first row is a date/time and I need to get rid of it.
My solution to this problem is:
fid = fopen('file');
T = textscan(fid,'%s','delimiter',{'\n'});
fclose(fid);
ngx = 39;
ngy = 34;
n = ngx*ngy;   % number of data rows in each block
t = 5839;      % number of blocks, each preceded by one date/time row
for i = 1:t
    T{1}((i-1)*n+1) = [];   % get rid of the date/time row that precedes each block of n rows
end
interest = zeros(length(T{1}),1);
for i = 1:length(T{1})
    interest(i) = str2double(T{1}{i}(12:18));   % extract the characters of interest from each row and convert them to a double
end
This code works, but I'm dealing with millions of rows and the loops make the computation time really long.
If you have any idea how to reduce the computation time, that'd be great!
Thanks

Accepted Answer

dpb on 7 Mar 2014
Edited: dpb on 7 Mar 2014
This has been a point of contention of mine "since forever" -- there's no automatic way to read fixed-width column data in C (and hence Matlab). TMW definitely needs to add a feature, so I recommend adding your voice to the feature enhancement list for it. Maybe in 30 more years...
But, with the tools as they are--
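c=textscan(fid,'%s','delimiter','\n');  % read each line as a string
c=char(c{:});                           % convert to a char array, one line per row
c(1:nSkip:end,:)=[];                    % delete the date rows; nSkip is the spacing between date rows (n+1 in the question's code)
data=str2num(c(:,12:18));               % convert the desired columns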
If you need more columns, either duplicate the above or use the known field widths to insert a delimiter at the desired locations, then use textscan on the array.
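The delimiter-insertion idea can be sketched as below. This is only an illustration, not code from the thread: the field widths (5, 6, 6, 6), the two sample rows and every variable name are assumptions made up for the example.

```matlab
% Hypothetical fixed-width rows; assumed field widths are 5, 6, 6 and 6 characters
c = ['12.05 54.60 38.00  0.28'
     '13.15 54.60100.00  0.34'];
w = [5 6 6 6];                        % assumed field widths
last  = cumsum(w);                    % last character column of each field
first = last - w + 1;                 % first character column of each field

% Slice the char array into one block per field and interleave comma columns
fields = arrayfun(@(s,e) c(:,s:e), first, last, 'UniformOutput', false);
sep    = repmat(',', size(c,1), 1);
parts  = [fields; repmat({sep}, 1, numel(fields))];
d      = [parts{1:end-1}];            % drop the trailing comma column

% Terminate every row with a newline and flatten row by row into one string
d(:,end+1) = sprintf('\n');
txt  = reshape(d.', 1, []);
data = cell2mat(textscan(txt, '%f%f%f%f', 'Delimiter', ','));
```

Once the commas are in place, textscan no longer has to guess where a field ends, so the touching columns in the second row come out as separate numbers.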
ADDENDUM:
The best thing for Matlab, if you can, is to change the form in which the files are generated going forward to use a delimiter or at least increase the field width. But, of course, sometimes that isn't feasible.
  8 Comments
Star Strider on 9 Mar 2014
Edited: Star Strider on 9 Mar 2014
I did, and I quoted this thread as my ‘justification’. (I also suggested an I/O format descriptor for engineering notation that would behave like the E/e descriptor, since that issue arose recently.) I figure the more ‘votes’ this issue gets in the form of Service Requests, the more likely it is to appear sooner rather than later.
If you’re logged in here, you should also be logged into everything else. (I like right-clicking because it’s easier to keep track of things. I just close the extra tabs when I’m finished.)
See if this works:
  • at the top of this page, right click on MathWorks.com
  • right click on Support
  • near the end of the page, right click on My Service Requests and create a new service request
I suggest you copy the URL for this thread first, then paste it in your Service Request. You’ve pretty much discussed everything of significance here, so there’s no need to retype it there.
Also, although you can’t vote for your own answer (I added my vote) you can vote for the question. (I did.)
dpb on 9 Mar 2014
Edited: dpb on 9 Mar 2014
OK, that path did work; following the direct link for some reason didn't recognize the login info and I get tired of the barriers very quickly in my dotage. :(
So, I did it again -- if it's as much longer before TMW does anything since my first submittals, I'll be about 100...I guess it will be a_good_thing (tm) if I am still able to use Matlab at all at that point to see the results. :)
I can't even count the number of times this has come up just in the <2 yr since I started following the forum a little, as promised in return for the complimentary updated license TMW generously provided after retirement. Quite a number asked the specific question OP did, and several others had this problem as the underlying reason for the query even though the question wasn't direct: the poster was bogged down in the processing, so the question asked was fairly far removed from the root cause.
For some reason beyond my ken such a fundamental lack apparently has just never seemed important to anybody inside TMW with the clout to actually get anything done about it.
Having once done the mex interface to FORMAT, I know it has some difficulties if you try to implement a fully functional version that handles every possible feature, but a workable subset that handles probably 90-95% of real-world cases isn't too bad, and I'd think TMW should be able to do it in at most a couple of months or so if it would just dedicate some resources to it. I thought at the time my version was probably about 80% of the way to being releasable, but even with that as a starter I wasn't able to generate any interest.


More Answers (1)

Ken Atwell on 7 Mar 2014
Try replacing your second loop with something along the lines of:
T = strjoin(T{1}', '\n'); % T{1} is the cell array of lines returned by textscan
interest = textscan(T, '%*11c %6f %*[^\n]');
interest = interest{1};
strjoin is a newer function to convert your cell array of strings to a single long string, which is what textscan will expect. If strjoin is not available in your version of MATLAB, http://www.mathworks.com/matlabcentral/fileexchange/31862-strjoin may help.
The textscan formatter string has three parts:
  1. Ignore the first 11 characters (%*11c)
  2. Interpret the next six characters as a floating-point number (the data you are interested in) (%6f)
  3. Ignore the remainder of the line
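As a small, self-contained sketch of that format string in action, the snippet below applies it to three data rows copied from the question. The variable names are made up for the example, and the 'Delimiter','' option is borrowed from the comment on this answer rather than from the answer itself.

```matlab
% Sample rows copied from the question, joined into one newline-separated string
rows = {'12.05 54.60 38.00 0.28', ...
        '13.15 54.60100.00 0.34', ...
        '13.20 54.60 0.00 0.00'};
txt = sprintf('%s\n', rows{:});

% %*11c    : skip the first 11 characters
% %6f      : read a number of at most 6 characters
% %*[^\n]  : skip everything up to the end of the line
out = textscan(txt, '%*11c %6f %*[^\n]', 'Delimiter', '');
interest = out{1};                    % column vector, one value per row
```

Note that this only isolates the one column; it does not address fields that run into each other elsewhere on the line.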
  1 Comment
dpb on 8 Mar 2014
Edited: dpb on 10 Mar 2014
Iff'en you're going to do that, may as well just write --
c=cell2mat(textscan(fid,'%*11c %6f %*[^\n]','delimiter',''))
which does as you note correctly skip the right number of columns. For OP's problem, he could then loop over the above also including
'headerlines',1
and the numeric count for the number of lines per subsection in the file.
Solves the OP's specific problem since only wants the one column, but still there's the gaping hole in Matlab functionality of the general case of parsing the whole file correctly w/o machinations.
I've posted examples like this during this discussion before but I don't recall you being one of the conversants so the following clearly demonstrates what's simply broke in C--
>> cc=(textscan(fid,['%5s' repmat('%6s',1,3)],'delimiter',''))
>> [cc{1} cc{2} cc{3} cc{4}]
ans =
'12.05' '54.60 ' '38.00 ' '0.28'
'12.10' '54.60 ' '43.00 ' '0.30'
'13.10' '54.60 ' '99.00 ' '0.33'
'13.15' '54.601' '00.00 ' '0.34'
'13.20' '54.60 ' '0.00 ' '0.00'
'13.25' '54.601' '28.00 ' '0.16'
>>
NB the second and subsequent columns--they all begin with a nonwhite character instead of the blank or character that is the actual content of the initial field column if one counts position based on the format string field widths. That is, while consistent with the definition of what the field width means in C, it's simply a practically wrong-headed definition. Consequently the 2nd has the string '54.60_' or '54.601', NOT the expected/needed/desired '_54.60', where I used the underscore to emphasize the blank. And it ends up with the last column not even being full width.
C simply cannot keep its hands off the trailing location despite being explicitly told to do so. In kindergarten you get sent to the corner for timeout if you keep taking your neighbor's crayon... :)
ADDENDUM:
BTW, the above also depends upon the fact that there's always a whitespace character AFTER the 3rd column--observe what happens if I make the case a little tougher:
>> type test.dat
13.10 54.60 99.00  0.33
13.15 54.60100.00200.34
13.20 54.60  0.00300.00
13.25 54.60128.00-40.16
Now I've filled in the full 6-column field in the 4th column in some lines so the whitespace isn't there. Now the results are really screwed up and you have to go back to the actual column-counting parsing. I'd kinda' forgotten about the problem one runs into with real files concentrating too much on the specific solution to OP's particular problem/request.
>> cc=cell2mat(textscan(fid,['%5f' repmat('%6f',1,3)],'delimiter',''))
cc =
13.1000 54.6000 99.0000 0.3300
13.1500 54.6010 0.0020 0.3400
13.2000 54.6000 0.0030 0
13.2500 54.6010 28.0000 -40.1600
The correct array is
13.1000 54.6000 99.0000 0.33
13.1500 54.6000 100.0000 200.34
13.2000 54.6000 0.0000 300.00
13.2500 54.6000 128.0000 -40.16
Note also the last anomaly in behavior--owing to the '-', the parser manages to still get it right. I've worked that out before on just exactly how the rules say so, but it's convoluted enough I don't recall just otomh exactly how it does it but it has to do with what is done with whitespace.
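For what it's worth, the "actual column-counting parsing" mentioned above can be sketched as follows for the four rows of test.dat. This is an illustration under assumptions, not code from the thread: it takes the field widths to be 5, 6, 6 and 6 (the widths in the format string above), assumes every line has the full 23 characters, and uses made-up variable names.

```matlab
% The four rows of test.dat, held as a char array (assumed 23 characters each)
c = ['13.10 54.60 99.00  0.33'
     '13.15 54.60100.00200.34'
     '13.20 54.60  0.00300.00'
     '13.25 54.60128.00-40.16'];
w     = [5 6 6 6];                    % assumed field widths
last  = cumsum(w);                    % last character column of each field
first = last - w + 1;                 % first character column of each field

% Loop over the fields (not the rows) and convert each character slice at once
data = zeros(size(c,1), numel(w));
for k = 1:numel(w)
    data(:,k) = str2double(cellstr(c(:, first(k):last(k))));
end
```

Because each slice is cut strictly by character position, the third and fourth columns stay separate even where no whitespace sits between them, and the loop runs once per field rather than once per row.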
