error detection when reading in text

I have a long file, of which this is a typical sample:
11-Aug-2015 102.3
12-Aug-2015 103.4
14-Aug-2015 101.7
15-Aug-2015 102.1
21-Aug-2015 102.8
23-Aug-2015 102.6
What is the easiest way to read in data like this with error detection? Suppose, for example (as happened to me recently), that the full stop in the middle of one of the numbers is mistakenly replaced by a comma, or that there is some other obvious error in the format of the data file. I would like the script to tell me at least the line number on which the error occurs. Here are the relevant lines from the script that I would like to improve, in particular by replacing textread with something else, since textread is now deprecated:
[cellfiledate,fileweight]=textread('DataFile','%s %f');
dnumsFile=datenum(cellfiledate,'dd-mmm-yyyy');
Should I use textscan? Thanks for any help.

Accepted Answer

Walter Roberson on 20 Sep 2015
See attached file. It will report bad entries, the reason, their line numbers, and the line contents.
Note: the code does not report every error on every line at once; it stops as soon as it finds one kind of problem, but it does report all lines that have that kind of problem.
Numbers are first checked for invalid characters, and only if they all pass is conversion attempted and conversion errors reported. This is done because str2double() treats a comma as a valid character to be ignored, whereas you want to detect and complain about it. The check for valid characters allows signs, digits, the decimal point, and the exponent characters d, D, e and E, but at that stage it does not check whether the arrangement of characters makes sense.
I point this out to explain why, after you clean up the first batch of bad numbers reported (the ones with invalid characters), you may get a second batch of bad-number reports.
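The attached file is not reproduced here, so the following is only a minimal sketch of the two-phase number check described above, not the attached code itself. The sample strings and the variable names (numstrs, badCharLines, badValueLines) are hypothetical. Unlike the behaviour described above, which stops after the first kind of problem, this sketch runs the conversion phase on whichever strings passed the character check, so that both phases can be seen in one run.

```matlab
% Hypothetical sample of number strings, one per data line.
numstrs = {'102.3', '103,4', '101.7', '1.2.3'};

% Phase 1: flag strings containing any character outside the allowed set
% (digits, signs, decimal point, exponent letters d D e E).
firstBad = regexp(numstrs, '[^0-9dDeE+.-]', 'once');
badCharLines = find(~cellfun(@isempty, firstBad));
for k = badCharLines
    fprintf('Line %d: invalid character in "%s"\n', k, numstrs{k});
end

% Phase 2: convert the strings that passed phase 1.
% str2double returns NaN for text it cannot convert.
candidates = setdiff(1:numel(numstrs), badCharLines);
values = str2double(numstrs(candidates));
badValueLines = candidates(isnan(values));
for k = badValueLines
    fprintf('Line %d: cannot convert "%s" to a number\n', k, numstrs{k});
end
```

Here '103,4' is caught by the character check, while '1.2.3' uses only allowed characters and is caught only when the conversion returns NaN, which is why a second batch of reports can appear.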
  2 Comments
David Epstein on 21 Sep 2015
Edited: David Epstein on 21 Sep 2015
This is a very comprehensive answer. Error detection is clearly more of a nuisance than I had anticipated. Thanks for all the work.
Please explain this line:
catch ME %#ok<NASGU>
I cannot find "catch ME" in the MATLAB documentation.
Walter Roberson on 21 Sep 2015
try
    statements
catch exception
    statements
end
So if the statements you "try" fail, the "catch" section is executed, with the details of the exception assigned to the named variable, here "ME". Inside the catch section you can extract details of the error from ME, which is an MException object. I do not happen to need the details of the error, which leaves the variable ME unused. The MATLAB error analysis routines would notice that ME is not used anywhere and would warn about it, but the %#ok<NASGU> comment tells them that yes, I know what I am doing, so don't complain. If I did not put a variable there, the error analysis routines would complain that I was using the old style of try/catch/end.
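For comparison, here is a small sketch in which the caught exception is actually used, so no %#ok comment is needed. The sample dates and variable names are hypothetical, and the sketch assumes that datenum raises an error for text it cannot parse with the given format; any entry that fails stays NaN.

```matlab
% Hypothetical sample; the second entry has a misspelled month.
datestrs = {'11-Aug-2015', '12-Agu-2015', '14-Aug-2015'};
dnums = nan(size(datestrs));

for k = 1:numel(datestrs)
    try
        dnums(k) = datenum(datestrs{k}, 'dd-mmm-yyyy');
    catch ME
        % ME is an MException object; ME.message holds the error text.
        fprintf('Line %d: cannot convert "%s" (%s)\n', ...
                k, datestrs{k}, ME.message);
    end
end

% Entries that could not be converted are still NaN.
badDateLines = find(isnan(dnums));
```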
Yes, error detection is a nuisance. You need to figure out which kind of line you are looking at, which can be tricky if you have multiple varieties of line, because an error in the input could leave you uncertain about which kind you have. Once you know the kind of line, you need to figure out whether it has the right number of columns, and then for each column you need to figure out whether the content is "plausible" (which can be different for every line). Once you know that the columns are plausible, you can run a more expensive computation to convert the contents of each column to the desired representation.
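As an illustration of the column-count stage only, a rough sketch might look like the following. It assumes every line should have exactly two whitespace-separated fields, as in the sample file; the sample lines and the names nfields and wrongCount are hypothetical.

```matlab
% Hypothetical sample: line 2 is missing a field, line 3 has an extra one.
data_as_lines = {'11-Aug-2015 102.3', '12-Aug-2015', '14-Aug-2015 101.7 extra'};

nfields = zeros(size(data_as_lines));
for k = 1:numel(data_as_lines)
    % strsplit with no delimiter argument splits on whitespace.
    fields = strsplit(strtrim(data_as_lines{k}));
    nfields(k) = numel(fields);
end

% Lines that do not have exactly two fields (date and weight).
wrongCount = find(nfields ~= 2);
for k = wrongCount
    fprintf('Line %d: expected 2 fields, found %d\n', k, nfields(k));
end
```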
However, you can make the processing simpler if you don't mind just complaining that a line is "wrong" without saying much about what is wrong with it. In the case of your file, it would be possible to do a regular expression search over it:
plausible_lines = regexp(data_as_lines, '^\s*\d{1,2}-[a-zA-Z]{3}-\d{4}\s+[0-9dDeE+.-]+\s*$', 'lineanchors');
implausible_lines = find(cellfun(@isempty, plausible_lines));
Then, if implausible_lines is non-empty, you complain that "something" is wrong with those lines. Once you know that the lines are plausible, you can split them, do the numeric conversion and check for NaN to detect failures, and run the date conversion routines and check for errors. The regexp above simplifies things because it checks for the right number of fields at the same time as it checks each field for characters or character orders that cannot possibly be right. It is even possible to write a regular expression that checks whether a number is written in a valid floating-point format that might include an exponent, but it comes out as a long expression that is easy to get wrong.
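Putting those steps together, one possible sketch is shown below. It assumes data_as_lines is a cell array with one line of the file per cell; the sample lines are hypothetical. It also adds capture groups to the pattern (a change from the expression above) so that the same regexp call both tests each line and splits it into its two fields.

```matlab
% Hypothetical sample; line 2 has a comma in place of the decimal point.
data_as_lines = {'11-Aug-2015 102.3', '12-Aug-2015 103,4', '14-Aug-2015 101.7'};

% Same pattern as above, with parentheses added to capture date and number.
pattern = '^\s*(\d{1,2}-[a-zA-Z]{3}-\d{4})\s+([0-9dDeE+.-]+)\s*$';
tokens = regexp(data_as_lines, pattern, 'tokens', 'once');

% Lines that did not match have an empty token list.
ok = ~cellfun(@isempty, tokens);
implausible_lines = find(~ok);
for k = implausible_lines
    fprintf('Line %d looks wrong: %s\n', k, data_as_lines{k});
end

% Split the plausible lines into their two fields and convert the numbers.
datestrs = cellfun(@(t) t{1}, tokens(ok), 'UniformOutput', false);
numstrs  = cellfun(@(t) t{2}, tokens(ok), 'UniformOutput', false);
weights  = str2double(numstrs);

% NaN marks a number that had plausible characters but an invalid layout.
plausible_idx = find(ok);
bad_numbers = plausible_idx(isnan(weights));
```

The date strings in datestrs can then be passed to the date conversion routine, wrapped in try/catch if a failure should be reported rather than stop the script.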
