MATLAB Answers

Extract data from a non-rectangular text file, efficiently

Asked by Paolo Binetti on 5 Mar 2017
Latest activity Commented on by dpb
on 6 Mar 2017
Accepted Answer by dpb
In spite of going through Matlab help and several attempts, I still do not get how to use "textscan" or other relevant functions to read from a simple non-rectangular file like the attached sample.
All I want is to efficiently extract two integers from the first line and two cell arrays (or char arrays), containing the strings to the left and to the right of | respectively in the subsequent lines. Here is the code I tried, but it does not work:
fid = fopen('dataset.txt');
coeffs = textscan(fid, '%s%s', 1, 'Delimiter', ' ');
sequences = textscan(fid, '%s%s', 'Delimiter', '|');
fclose(fid);
Using "fileread" and "regexp" also seem to be options, but "regexp" seems slower than "textscan".
BTW if anybody can point me to a link where I can find how to extract data from files using Matlab, with examples providing ample coverage of use cases, that would be great.

1 Answer

Answer by dpb
on 5 Mar 2017
Edited by dpb
on 5 Mar 2017
 Accepted Answer

Close, but the first record is numeric, not string...
>> fid=fopen('dataset.txt');
>> n=cell2mat(textscan(fid,'%f',2,'collectoutput',1))
n =
50
200
>> s=textscan(fid,'%s%s','delimiter','|','collectoutput',1)
s =
{5x2 cell}
>> s{:}
ans =
'GATGGAGTGCGGGGTGGTTGACTAGCATGGGCCCTAGGATCGCTGACGTG' 'TTGACCAAGAACAAACGTTGTACGTATTTTCGATATAATACAGTAAGCTA'
'ACTTTTTACTAAACATAAGTTCGATTTCCACATCTTCCCGCGACCATCAG' 'TTACAGCCTGCTAATACGTTCTGTTTAAATGCGTAATTAGTAGCGCTCAG'
'TGCATGAACGACGGTAGGTCCACCCGTTGTAATGCGATAGCCTATGTAGC' 'CACAAGTTCATTTTTCAAATCGATAACCTGTGGGAGTATTCTTCGGCATC'
'GATCTTGCAGGCCGGGCGCTGGCGATCTGCGCGCGACATGGCCTGCAGTG' 'GCGACCTGCTTTTCGGTTGTAACGGGAGTGCGCCTACGCGCGCAAGATAC'
'TGAGTTTAGTCACTGATCTATAACACCAAGTGGGCGCGGTAGCCGATTAG' 'CATCTTCCCGCGACCATCAGGTTTGCCCCAGTAACGCGCCTGTTGCCTGT'
>> fid=fclose(fid);
>>
As for the lament, textscan has a fair number of examples of different types of files to study and the forum here is replete with special cases. It's not possible to have examples that cover all possible issues; one needs must look for the general principles underlying the examples and consider how they relate to the file at hand.
For the most part, the issue is simply building a format string that matches the record structure and then applying that to the file in the proper sequence for the proper number of times. The most difficult issue in general has to do with processing undelimited strings or fixed-width files, as the C format rules on field width are based on the concept of fields separated by white space as opposed to actual firm character column counts. Hence, when one uses '%s' on a field that isn't really meant to be treated as such, havoc can often ensue...
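The two-step pattern described above (one format for the header, a second format for the remaining records) can be sketched as a self-contained script. This is only an illustration: the demo file it writes is made up, with shortened sequences, and it assumes the header holds two space-separated integers as the question suggests.

```matlab
% Write a small demo file (hypothetical content, same layout as the sample):
% one header line with two integers, then "left|right" records.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '50 200\n');
fprintf(fid, 'GATGGAGT|TTGACCAA\n');
fprintf(fid, 'ACTTTTTA|TTACAGCC\n');
fclose(fid);

fid = fopen(fname, 'r');
% Step 1: apply '%f' exactly twice to pick up the two header numbers.
n = cell2mat(textscan(fid, '%f', 2));
% Step 2: apply '%s%s' to the rest of the file, splitting each record at '|'.
s = textscan(fid, '%s%s', 'Delimiter', '|', 'CollectOutput', true);
fclose(fid);
delete(fname);

left  = s{1}(:, 1);   % strings to the left of '|'
right = s{1}(:, 2);   % strings to the right of '|'
```

Because both calls share the same file identifier, the second textscan simply continues from where the first one stopped, which is what makes the "proper sequence" idea work.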

  2 Comments

Thank you, dpb. I tried your solution. I ended up using a combination of regexp and fileread, exploiting extra info from my specific problem (not included in my question, because I was after a more general solution). Mine is not a general solution, but it works and is good enough.
As for what you interpreted as a lament, it's just that I spent a few hours trying to reuse code I had, then checking out the help, googling, and trying more solutions, with no joy. So, besides the specific question, I was just hoping to find a self-contained resource with lots of examples, although certainly not exhaustive. My hope was to find a way to code simple stuff like this in one minute rather than in hours and asking for help.
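For readers wondering what a fileread plus regexp combination might look like for this file layout, here is a rough sketch. It is not the code actually used in this thread (that was not posted); the demo file and the patterns are assumptions, in particular that the sequences consist only of word characters.

```matlab
% Write a small demo file (hypothetical content, same layout as the sample).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '50 200\nGATGGAGT|TTGACCAA\nACTTTTTA|TTACAGCC\n');
fclose(fid);

txt = fileread(fname);   % whole file as one char row vector
delete(fname);

% Header: two integers at the very start of the text.
hdr = regexp(txt, '^(\d+)\s+(\d+)', 'tokens', 'once');
n = str2double(hdr);     % 1x2 numeric row

% Records: capture the text on each side of '|'.
tok = regexp(txt, '(\w+)\|(\w+)', 'tokens');
pairs = vertcat(tok{:}); % N-by-2 cell: column 1 is left, column 2 is right
```

The header line contains no '|', so the second pattern skips it without any special handling.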
"My hope was to find a way to code simple stuff like this in one minute rather than in hours"
Sounds like you're trying to make it too complicated...it took less than a minute to write the above solution; including copying the data to make a demo test file couldn't have been more than a couple; with a couple of iterations to test and compare result with/without 'collectoutput' parameter certainly still well under five minutes, total.
The real key to proficiency in this regard is simply "time in grade"; using textscan or the other formatted input routines becomes much more manageable with practice, albeit the possible number of options and formatting may seem overpowering initially until one gains familiarity with just how C format strings work.
As an aside, I think it unfortunate that it is C that is followed/used; Fortran FORMAT forms are much simpler to write for the duplicated fields and recursion, etc., etc., and also deal with fixed-width fields much more logically than the C version.
While totally generic input routines are and have been the Holy Grail of application programming since the invention of the mechanical computer, the problem is there is simply too much variation in possible format and data content for that to be practical excepting for some special cases. TMW has built several, importdata is pretty capable but it's just not reasonable as yet to take any file as a black box and have only one routine to automagically read it into a useful form for further processing. It's trivial, of course, to simply load a cellstr textual duplicate or fread a binary image, but for most purposes that's not sufficient to do much with unless one is simply filtering the file for some content or the like.
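As a small illustration of the "trivial" cases mentioned above, the following sketch loads a file once as a cellstr of lines and once as raw bytes. The demo file is made up, and the textscan options shown are just one way to keep each line intact.

```matlab
% Write a small demo file (hypothetical content).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '50 200\nGATGGAGT|TTGACCAA\n');
fclose(fid);

% Textual duplicate: one cell per line, no whitespace stripping.
fid = fopen(fname, 'r');
c = textscan(fid, '%s', 'Delimiter', '\n', 'Whitespace', '');
fclose(fid);
lines = c{1};

% Binary image: every byte of the file as uint8.
fid = fopen(fname, 'r');
raw = fread(fid, Inf, '*uint8');
fclose(fid);
delete(fname);
```

Either result still has to be parsed further before the numbers and sequences are usable, which is the point being made.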
If you have a lot of files similar to this and there is additional particular information known of their structure and you have specific processing needs, then sure, go ahead and write a specific parser for them. If the files are from some well-known source or follow some industry/academic protocol for a given field, then it could even make sense to submit it as an enhancement request or, if only of somewhat lesser commonality, submit it to the File Exchange for some limited notoriety for yourself...and appreciation from others with the same issue. :)
