I am using textscan to extract data with an unknown number of columns.

I have a whole bunch of FTIR data in JCAMP-DX file format. It turns out this format is readable with textscan (thank god). The layout is odd:
x data point, y data point, y data point, y data point, ..... new line,
x data point, y data point, y data point, y data point, etc.
There is an unknown number of y data points per row of data; it seems to be consistently between 8 and 10 y data points per x data point. There is likely a way to read to the end of the line and jump down to the next line, but I'm not sure exactly how to do it and need a bit of help. There are also 50 header lines in each file. My file is too large to attach (even after compression). Link is here: http://www.filedropper.com/all_9
dpb on 8 Feb 2016
Just attach a dozen lines or so w/o the header lines or with just one or two of them -- it's all that would be needed to illustrate the issues.

Answers (1)

dpb on 8 Feb 2016
Presuming the file is comma-delimited as shown in the example, it's pretty much trivial. While TMW has relegated it to "red-headed stepchild" status, the old standby textread is ideal here...
I made up a dummy file per the description with variable number of entries per record and a header line...
>> type nick.csv
Header row
0.34403,0.30376,0.55368,0.76665,0.77579,0.43732
0.53366,0.089219,0.55029,0.56435,0.66139,0.61261,0.83435,0.85292
0.62776,0.52153,0.16042,0.38957,0.46966,0.73748,0.79134
0.44674,0.82529,0.11709,0.49014,0.21977,0.30328,0.65661,0.87376
0.81111,0.76393,0.39855,0.8869,0.60259
0.14483,0.94764,0.83138,0.90505,0.18426,0.83551,0.38668,0.20532
0.97483,0.33355,0.185,0.49839,0.19751,0.36931
0.83376,0.38974,0.50079,0.52923,0.86199,0.66134,0.59533
0.3401,0.15041,0.12631,0.9097,0.12565,0.89572,0.78164,0.82757
0.61624,0.3337,0.86463,0.5786,0.64558,0.2741,0.98767,0.26426
>> textread('nick.csv','','delimiter',',','headerlines',1,'emptyvalue',nan)
ans =
3.4403e-01 3.0376e-01 5.5368e-01 7.6665e-01 7.7579e-01 4.3732e-01 NaN NaN
5.3366e-01 8.9219e-02 5.5029e-01 5.6435e-01 6.6139e-01 6.1261e-01 8.3435e-01 8.5292e-01
6.2776e-01 5.2153e-01 1.6042e-01 3.8957e-01 4.6966e-01 7.3748e-01 7.9134e-01 NaN
4.4674e-01 8.2529e-01 1.1709e-01 4.9014e-01 2.1977e-01 3.0328e-01 6.5661e-01 8.7376e-01
8.1111e-01 7.6393e-01 3.9855e-01 8.8690e-01 6.0259e-01 NaN NaN NaN
1.4483e-01 9.4764e-01 8.3138e-01 9.0505e-01 1.8426e-01 8.3551e-01 3.8668e-01 2.0532e-01
9.7483e-01 3.3355e-01 1.8500e-01 4.9839e-01 1.9751e-01 3.6931e-01 NaN NaN
8.3376e-01 3.8974e-01 5.0079e-01 5.2923e-01 8.6199e-01 6.6134e-01 5.9533e-01 NaN
3.4010e-01 1.5041e-01 1.2631e-01 9.0970e-01 1.2565e-01 8.9572e-01 7.8164e-01 8.2757e-01
6.1624e-01 3.3370e-01 8.6463e-01 5.7860e-01 6.4558e-01 2.7410e-01 9.8767e-01 2.6426e-01
>>
You can, of course, get to the same place with textscan at the cost of an fopen|fclose pair plus some more rigamarole to get an array...
cell2mat(textscan(fid,'','delimiter',',','headerlines',1,'emptyvalue',nan,'collectoutput',1))
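For completeness, here is a minimal sketch of that textscan route with the fopen/fclose pair spelled out. The file name and the three made-up data rows are assumptions for illustration only; for the real files, 'HeaderLines' would be 50 per the question. With 'CollectOutput' set, textscan returns a one-element cell, so indexing with {1} does the same job as cell2mat here.

```matlab
% Write a small ragged test file (hypothetical name and data).
fname = fullfile(tempdir, 'ragged_demo.csv');
fid = fopen(fname, 'w');
fprintf(fid, 'Header row\n1,2,3\n4,5\n6,7,8,9\n');
fclose(fid);

% Read it back with textscan and the empty format string.
fid = fopen(fname, 'r');
if fid == -1
    error('Could not open %s', fname);
end
c = textscan(fid, '', 'Delimiter', ',', 'HeaderLines', 1, ...
             'EmptyValue', NaN, 'CollectOutput', true);
fclose(fid);

data = c{1};     % numeric matrix; short rows are padded with NaN
delete(fname);   % remove the temporary demo file
```

Padding with NaN rather than 0 keeps the missing positions distinguishable from real zeros in the data.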
csvread will also work, except that it can only return 0 for missing values, which is ambiguous at best if the file could contain real zeros.
All in all, textread is just simpler where it works; it's unfortunate that TMW doesn't keep it up to date with the latest alternate inputs (or make textscan accept an optional file name and add the facility to return something other than a cell array).
dpb on 8 Feb 2016
Edited: dpb on 9 Feb 2016
Had this sidebar conversation not too long ago, Walter...but off the top of my head I don't recall if it was with you or Bruno or somebody else... :)
I did some digging then and, in short, the syntax is still in the textread example in R2015 and earlier. It is in the example for textscan through at least R14, but that particular example isn't in R2012b for some reason.
As for the usage, TMW uses it themselves. dlmread evolved from a standalone loop using fgetl and parsing fields line by line in R11, to textread with the empty format string in R12 (with a fallback to the old code if that failed), and then to textscan, also with the empty format string, in R14 and through at least R2012b (the latest I have installed). I'd bet pretty heavily on it still being there and unlikely to change.
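For anyone without the empty-format feature, the line-by-line idea can be sketched roughly as below. This is not TMW's dlmread code, only an illustration of the fgetl approach under assumed inputs: the file name, the data, and the variable names (nHeader, rows, txt) are all made up.

```matlab
% Write a small ragged test file (hypothetical name and data).
fname = fullfile(tempdir, 'ragged_loop_demo.csv');
fid = fopen(fname, 'w');
fprintf(fid, 'Header row\n1,2,3\n4,5\n6,7,8,9\n');
fclose(fid);

nHeader = 1;     % number of header lines to skip (50 in the question)
rows = {};       % one numeric row vector per data line
fid = fopen(fname, 'r');
for k = 1:nHeader
    fgetl(fid);  % discard a header line
end
txt = fgetl(fid);
while ischar(txt)                             % fgetl returns -1 at end of file
    if ~isempty(strtrim(txt))
        rows{end+1} = sscanf(txt, '%f,').';   %#ok<AGROW> parse one line
    end
    txt = fgetl(fid);
end
fclose(fid);

% Pad every row out to the longest one with NaN.
nCols = max(cellfun(@numel, rows));
data = NaN(numel(rows), nCols);
for k = 1:numel(rows)
    data(k, 1:numel(rows{k})) = rows{k};
end
delete(fname);   % remove the temporary demo file
```

It is slower and wordier than the empty-format one-liner, but it makes the padding explicit.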
There may well have been a release in which it failed for skipped header lines, but that was so long ago that I think it is now a moot point.
ADDENDUM I did some further testing...it appears that R12 introduced the feature of the empty string for format with textread. R11 failed on 'nick.csv' above with an error complaining of a badly-formatted format string, whereas R12 with the identical statement as above parses it correctly. If one is using an earlier release than R12, it appears they are indeed hosed (to use the technical term :) ). END ADDENDUM
At the previous time (last November) I submitted an enhancement request to document the behavior, as it is an extremely important feature to be able to read uncounted fields (*), and if the facility were to be neutered, that would be a MAJOR setback in parsing externally generated or irregular files.
(*) More specifically, what the blank field does that is so significant vis a vis alternatives is to also trigger the automagic return of the shape of the input file. One can read the file with a single field, granted, but then you get a resultant column array without the missing locations being identified. Sometimes that may be what is desired, often it isn't.
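A rough sketch of that difference, reading the same made-up ragged file once with a single '%f' field and once with the empty format string. The file name and data are assumptions for illustration.

```matlab
% Write a small ragged test file without a header (hypothetical name and data).
fname = fullfile(tempdir, 'ragged_shape_demo.csv');
fid = fopen(fname, 'w');
fprintf(fid, '1,2,3\n4,5\n6,7,8,9\n');
fclose(fid);

fid = fopen(fname, 'r');
% Single numeric field: every value lands in one column, the row shape is lost.
flat = textscan(fid, '%f', 'Delimiter', ',');
frewind(fid);    % go back to the start of the file for the second read
% Empty format string: the row/column shape is kept, gaps become NaN.
shaped = textscan(fid, '', 'Delimiter', ',', ...
                  'EmptyValue', NaN, 'CollectOutput', true);
fclose(fid);

flat = flat{1};      % column vector of all values in file order
shaped = shaped{1};  % matrix with one row per line of the file
delete(fname);       % remove the temporary demo file
```

The column vector alone cannot tell you which values belonged to which line; the shaped matrix can.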
dpb on 10 Feb 2016
"There are hints that as of R2015a ..."
I'd be interested in what/where those hints are, Walter. In the looking I've done, I see nothing other than that the textscan example for 'emptyvalue' has been modified to also illustrate a second issue, the integer conversion of -Inf to 0; to show that, they had to use a format string instead of the empty string that is used (but not commented on specifically) in earlier releases. The example for textread uses the same numeric data but illustrates only the empty-value field being other than the default (another difference between the two is that textscan fixed the initial poor choice of zero as the default by changing it to NaN).
Probably somebody at some time complained about returning 0 on a conversion using an integer field and this is how they chose to address the complaint rather than adding yet another example.
It illustrates yet again the problem with so much of the documentation being example-based instead of there being a complete set of rules written defining behavior; the only place much of the detailed behavior is given is by example.
