Code covered by the BSD License  

Error-tolerant parsing of newline-delimited data


Adaptive parsing of newline-separated data. Handles bad lines WITHOUT reading line-by-line.

function [parsed] = adaptive_parse(dataOrFid,parseFcnHndl,varargin)
%
% adaptive_parse.m--Adaptive function for parsing newline-separated data
% files or cell/character arrays. Adjusts amount of data being parsed in
% order to be as fast as possible while remaining within memory capacity
% and skipping over badly-formed lines. User supplies a parsing function,
% customised for his/her specific data file format, which is passed to
% adaptive_parse as a function handle.
%
%   PARSED = ADAPTIVE_PARSE(DATAORFID,PARSEFCNHNDL), where PARSEFCNHNDL is
%   a function handle, applies the corresponding parsing function to the
%   data referred to by DATAORFID. DATAORFID is either a file identifier
%   or an array (character array or cell array of characters). The PARSED
%   output argument is in whatever form the user's parsing function
%   outputs.
%
% The user's parsing function should have a declaration of the form
%   function [parsed] = myparsefcn(C,lineNumbers,options)
% where C is a cell array of strings. The lineNumbers and options arguments
% do not need to be used by the parsing function, but they must be in the
% function declaration. The parsing function does not need to handle bad
% data lines gracefully--have it throw an error when bad data lines are
% encountered, and adaptive_parse will adjust its parameters accordingly.
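%
% As an illustration only, a minimal parsing function might look like the
% sketch below. The name myparsefcn, the "time,value" line format and the
% output field names are assumptions made for this example; they are not
% part of adaptive_parse. Note that when reading from a file,
% adaptive_parse passes on textscan's output, so with the default format
% the lines are held in C{1}.
%   function [parsed] = myparsefcn(C,lineNumbers,options)
%   if ~isempty(C) && iscell(C{1})
%       C = C{1}; % textscan output: the lines are in the first cell
%   end
%   numLines = numel(C);
%   parsed.time = NaN(numLines,1);
%   parsed.value = NaN(numLines,1);
%   for iLine = 1:numLines
%       vals = sscanf(C{iLine},'%f,%f');
%       if numel(vals) ~= 2
%           % Throw an error; adaptive_parse will shrink the chunk and
%           % eventually skip the offending line.
%           error('myparsefcn:badLine','Bad data at line %d.',lineNumbers(iLine));
%       end
%       parsed.time(iLine) = vals(1);
%       parsed.value(iLine) = vals(2);
%   end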
%
% All input arguments apart from DATAORFID and PARSEFCNHNDL are optional,
% but the MAXCHUNKLENGTH input argument may be required to prevent Matlab
% from crashing under Microsoft Windows; it can also greatly speed up
% operation under Linux.
%   PARSED = ADAPTIVE_PARSE(...,'MAXCHUNKLENGTH',MAXCHUNKLENGTH,...)
% limits the number of lines to be read in at one time. If MAXCHUNKLENGTH
% is not specified, it defaults to the value of intmax. The larger
% the value of MAXCHUNKLENGTH, the more quickly adaptive_parse() will run,
% but only up to the point that Matlab exceeds its memory capacity. If this
% occurs, adaptive_parse() will adjust by trying again with a smaller
% number of lines. Under Linux, this will slow operations, but under
% Windows, Matlab may freeze up. Trial and error is the only way to
% determine the optimum value for MAXCHUNKLENGTH on your system.
% 
% Functions for initialising the output variable, appending parsed data to
% it and post-parse clean-up (e.g., removing excess NaNs from the
% initialised data) are supplied (see ap_default_initfcn,
% ap_default_appenderfcn, and ap_default_cleanupfcn in the body of the
% code). These default functions can be used as a guide for creating custom
% versions. The default functions assume that the parsing function will
% return a simple structured variable with non-nested fields--anything more
% complicated than that will require custom functions. A custom
% initialisation function is especially important, as the default behaviour
% is to dynamically append newly-parsed data to the output variable, which
% is highly inefficient.
%
% Custom functions are specified using parameter/value pairs, with the
% values being function handles:
%   PARSED = ADAPTIVE_PARSE(...,'APPENDERFCNHNDL',APPENDERFCNHNDL,...
%                               'CLEANUPFCNHNDL',CLEANUPFCNHNDL,...
%                               'INITFCNHNDL',INITFCNHNDL,...)
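%
% As a sketch only, a custom initialisation function can pre-allocate the
% output and a matching cleanup function can crop it afterwards; the
% default appender function will then fill the pre-allocated field using
% the pointers. The names myinitfcn and mycleanupfcn, the field "value"
% and the option field "numExpectedLines" are assumptions made for this
% example:
%   function [parsed,parsedPtrs] = myinitfcn(dataOrFid,initOptions)
%   parsed.value = NaN(initOptions.numExpectedLines,1); % pre-allocate
%   parsedPtrs.value = 0; % no data inserted yet
%
%   function parsed = mycleanupfcn(parsed,parsedPtrs,cleanupOptions)
%   parsed.value = parsed.value(1:parsedPtrs.value); % crop unused NaNs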
%
% Optional input arguments can be passed to the parse, init, append and
% cleanup functions by calling adaptive_parse with the 'parseOptions',
% 'initOptions', 'appendOptions' and 'cleanupOptions' parameter/value pairs
% specified:
%   PARSED = ADAPTIVE_PARSE(...,'PARSEOPTIONS',PARSEOPTIONS,...
%                               'INITOPTIONS',INITOPTIONS,...
%                               'APPENDOPTIONS',APPENDOPTIONS,...
%                               'CLEANUPOPTIONS',CLEANUPOPTIONS,...);
% The 'parseOptions', 'initOptions', 'appendOptions' and 'cleanupOptions'
% parameters are structured variables. The fields of these variables can be
% whatever is required by the custom parse, init, append and cleanup
% functions. 
%
% Efficiency can be greatly improved in some cases by pre-parsing data
% while it is being read in. This is done by supplying a format string for
% the textscan function to use. For example, supply a textscanFormat of
% '%4f%2f%2fT%2f%2f%f%s' when the data to be parsed begins with a
% yyyymmddTHHMMSS time string, and the cell array will already be split up
% into separate cells for year, month, day, hour, minute, seconds and
% non-timestamp data before it is even passed to the parsing function. The
% textscanFormat is specified as a parameter/value pair:
%   PARSED = ADAPTIVE_PARSE(...,'TEXTSCANFORMAT',TEXTSCANFORMAT,...)
% Other arguments for customising textscan's behaviour can be specified
% using the 'textscanArgs' parameter/value pair:
%   PARSED = ADAPTIVE_PARSE(...,'TEXTSCANARGS',TEXTSCANARGS,...)
% where TEXTSCANARGS is a cell array of parameter/value pairs. Pay
% particular attention to the textscan argument 'BufSize', which limits the
% length of string to read in (default 4095 bytes).
%
% The 'initOptions' parameter/value pair is used to supply additional
% information to the initialisation function. initOptions is a structured
% variable containing fields giving the number of lines expected, expected
% variable names, etc.
%   PARSED = ADAPTIVE_PARSE(...,'INITOPTIONS',INITOPTIONS,...)
%
% The 'progressInfo' structured variable is also passed as a
% parameter/value pair:
%   PARSED = ADAPTIVE_PARSE(...,'PROGRESSINFO',PROGRESSINFO,...)
% progressInfo provides information needed by adaptive_parse to display
% its progress; it has the following fields: 
%   'increment'--An integer specifying how frequently progress messages
%                will be displayed;  
%   'incrementUnits'--'percent', 'lines', or 'bytes'.
%   'endPoint'--An integer specifying where the end of the data will be
%               encountered.
%   'endPointUnits'--'lines' or 'bytes'.
%
% Verbosity of output is controlled by the 'verbosity' parameter:
%   PARSED = ADAPTIVE_PARSE(...,'VERBOSITY',VERBOSITY,...)
% Set VERBOSITY to 0 for silent operation, 1 for minimal output
% (out-of-memory warnings and little else), and 2 for normal output. A
% verbosity of 0 overrides any progressInfo settings. 
%
% Syntax: parsed = adaptive_parse(dataOrFid,parseFcnHndl,...
%                      <'appenderFcnHndl',appenderFcnHndl>,...
%                      <'cleanupFcnHndl',cleanupFcnHndl>,...
%                      <'initFcnHndl',initFcnHndl>,...
%                      <'textscanFormat',textscanFormat>,...
%                      <'textscanArgs',textscanArgs>,...
%                      <'maxChunkLength',maxChunkLength>,...
%                      <'initOptions',initOptions>,...
%                      <'parseOptions',parseOptions>,...
%                      <'appendOptions',appendOptions>,...
%                      <'cleanupOptions',cleanupOptions>,...
%                      <'verbosity',verbosity>,...
%                      <'progressInfo',progressInfo>)
%
% e.g.,   % Read in from file with minimal arguments:
%         dataOrFid = fopen('adaptive_parse_demo_data.dat');
%         parseFcnHndl = str2func('adaptive_parse_parse_demo');
%         parsed = adaptive_parse(dataOrFid,parseFcnHndl);
%         fclose(dataOrFid);
%
% e.g.,   % Read in from file with extra options specified:
%         dataOrFid = fopen('adaptive_parse_demo_data.dat');
%         parseFcnHndl = str2func('adaptive_parse_parse_demo');
%         initFcnHndl = str2func('adaptive_parse_init_demo');
%         appenderFcnHndl = str2func('adaptive_parse_appender_demo');
%         cleanupFcnHndl = str2func('adaptive_parse_cleanup_demo');
%         fseek(dataOrFid,0,'eof'); endPoint=ftell(dataOrFid); frewind(dataOrFid);
%         progressInfo = struct('increment',5,'incrementUnits','percent','endPoint',endPoint,'endPointUnits','bytes');
%         initOptions = struct('numExpectedLines',2000);
%         textscanFormat = '%4f%2f%2fT%2f%2f%fZ%s';
%         textscanArgs = {'BufSize',8000};
%         maxChunkLength = 10000;
%         parsed = adaptive_parse(dataOrFid,parseFcnHndl,...
%                      'initFcnHndl',initFcnHndl,...
%                      'appenderFcnHndl',appenderFcnHndl,...
%                      'cleanupFcnHndl',cleanupFcnHndl,...
%                      'textscanFormat',textscanFormat,...
%                      'textscanArgs',textscanArgs,...
%                      'progressInfo',progressInfo,...
%                      'initOptions',initOptions,...
%                      'maxChunkLength',maxChunkLength,...
%                      'verbosity',1);
%         fclose(dataOrFid);
%
%         % Pass data as an input argument rather than reading from file:
%         fid = fopen('adaptive_parse_demo_data.dat');
%         dataOrFid = textscan(fid,'%s','delimiter','\n');
%         dataOrFid = dataOrFid{1}; % a cell array of strings
%         fclose(fid);
%         initFcnHndl = str2func('adaptive_parse_init_demo');
%         parseFcnHndl = str2func('adaptive_parse_parse_demo');
%         appenderFcnHndl = str2func('adaptive_parse_appender_demo');
%         cleanupFcnHndl = str2func('adaptive_parse_cleanup_demo');
%         endPoint=size(dataOrFid,1); 
%         progressInfo = struct('increment',5,'incrementUnits','percent','endPoint',endPoint,'endPointUnits','lines');
%         initOptions = struct('numExpectedLines',2000);
%         maxChunkLength = 10000;
%         parsed = adaptive_parse(dataOrFid,parseFcnHndl,...
%                      'initFcnHndl', initFcnHndl,...
%                      'appenderFcnHndl',appenderFcnHndl,...
%                      'cleanupFcnHndl',cleanupFcnHndl,...
%                      'progressInfo',progressInfo,...
%                      'initOptions',initOptions,...
%                      'maxChunkLength',maxChunkLength);

% Change log:
%
% 2012-01-11, kpb--Added 'parseOptions', 'initOptions', 'appendOptions' and
% 'cleanupOptions' parameters (initOptions renamed from 'initInfo').

% Developed in Matlab 7.11.0.584 (R2010b) on GLNX86
% for the VENUS project (http://venus.uvic.ca/).
% Kevin Bartlett (kpb@uvic.ca), 2011-03-22 10:07
%-------------------------------------------------------------------------

% Possible improvements: 
% 1) Add "optimism" parameter. If optimism in file quality is low, then
% will ramp up chunk length more slowly.

% % Set the global variable DO_DBCATCH to true for debugging.
% global DO_DBCATCH;
% DO_DBCATCH = false;

% Verbosity constants:
SILENT = 0;
MINIMAL = 1;
NORMAL = 2;

% Parse input arguments.
%isaFcnHndl = @(x) isa(x, 'function_handle');
p = inputParser;
p.addRequired('dataOrFid');
p.addRequired('parseFcnHndl',@(x) isa(x, 'function_handle'));

% ...parseFcnHndl is a required argument; the other function handles are
% optional.
p.addParamValue('initFcnHndl',@ap_default_initfcn,@(x) isa(x, 'function_handle'));
p.addParamValue('appenderFcnHndl',@ap_default_appenderfcn,@(x) isa(x, 'function_handle'));
p.addParamValue('cleanupFcnHndl',@ap_default_cleanupfcn,@(x) isa(x, 'function_handle'));

% ...Default textscan format string is '%s', so default behaviour is to
% read in a single, large cell array of strings.
p.addParamValue('textscanFormat', '%s', @ischar);
p.addParamValue('textscanArgs', {}, @iscell);
p.addParamValue('maxChunkLength', intmax, @isnumeric);
p.addParamValue('progressInfo', [], @isstruct);
p.addParamValue('initOptions', [], @isstruct);
p.addParamValue('parseOptions', [], @isstruct);
p.addParamValue('appendOptions', [], @isstruct);
p.addParamValue('cleanupOptions', [], @isstruct);
p.addParamValue('verbosity', NORMAL, @isnumeric);
p.parse(dataOrFid,parseFcnHndl,varargin{:});

dataOrFid = p.Results.dataOrFid;
initFcnHndl = p.Results.initFcnHndl;
parseFcnHndl = p.Results.parseFcnHndl;
appenderFcnHndl = p.Results.appenderFcnHndl;
cleanupFcnHndl = p.Results.cleanupFcnHndl;
textscanArgs = p.Results.textscanArgs;
textscanFormat = p.Results.textscanFormat;
maxChunkLength = p.Results.maxChunkLength;
progressInfo = p.Results.progressInfo;
initOptions = p.Results.initOptions;
parseOptions = p.Results.parseOptions;
appendOptions = p.Results.appendOptions;
cleanupOptions = p.Results.cleanupOptions;
verbosity = p.Results.verbosity;

% ...Do not allow user to over-ride textscan's 'ReturnOnError' parameter.
findIndex = find(cellfun(@ischar,textscanArgs));
charArgs = textscanArgs(findIndex);
charArgs = cellfun(@lower,charArgs,'UniformOutput',false);

if ismember(lower('ReturnOnError'),charArgs)
    error([mfilename '.m--Do not over-ride textscan''s ''ReturnOnError'' parameter.']);
end % if

textscanArgs{end+1} = 'ReturnOnError';
textscanArgs{end+1} = false;

% ...Do not allow user to over-ride textscan's 'Delimiter' parameter
% (adaptive_parse is for newline-separated data only).
if ismember(lower('Delimiter'),lower(charArgs))
    error([mfilename '.m--Do not over-ride textscan''s ''Delimiter'' parameter.']);
end % if

textscanArgs{end+1} = 'Delimiter';
textscanArgs{end+1} = '\n';

% ...Determine if called with a file identifier or with a character array.
if ischar(dataOrFid) || iscell(dataOrFid)
    isFid = false;
else
    isFid = true;
end % if

% ...Ensure that fid is valid.
if isFid
    if dataOrFid < 1
        error([mfilename '.m--Invalid file identifier.']);
    end % if
end % if

% The textscan function can read multiple lines of a file under the control
% of a format string, but it cannot read multiple lines of a character or
% cell array. Test that the user is not trying to use a custom format
% string for textscan when the data is passed in as an array.
if ~isFid && ~strcmp(textscanFormat,'%s')
    error([mfilename '.m--Cannot use textscan to read in a character array; do not specify a value for textscanFormat.']);
end % if

% Set up progress indicator.
if isempty(progressInfo) || verbosity == SILENT
    doShowProgress = false;
else
    doShowProgress = true;
    
    if ~ismember(progressInfo.incrementUnits,{'percent' 'lines' 'bytes'})
        error([mfilename '.m--Unrecognised value for progress increment units.']);
    end % if
    
    if ~ismember(progressInfo.endPointUnits,{'lines' 'bytes'})
        error([mfilename '.m--Unrecognised value for progress endpoint units.']);
    end % if
    
    if strcmp(progressInfo.endPointUnits,'bytes') && ~isFid
        error([mfilename '.m--Cannot use ''bytes'' as progress endpoint units when not reading in from a file.']);
    end % if
    
    if ~strcmp(progressInfo.incrementUnits,'percent')
        % If not showing progress as percent, then the progress increment
        % and the endpoint must have the same units.
        if ~strcmp(progressInfo.incrementUnits,progressInfo.endPointUnits)
            error([mfilename '.m--Progress endpoint and increment must have same units if not showing progress in percent.']);
        end % if
    end % if
    
    % Assemble progress vector. Vector will be in same units as endPoint.
    if strcmp(progressInfo.incrementUnits,'percent')
        progressDisplayVector = [0:progressInfo.increment:100];
        %progressVector = ([0:progressInfo.increment:100]./100) .* progressInfo.endPoint;
        progressVector = (progressDisplayVector./100) .* progressInfo.endPoint;
    else
        progressDisplayVector = 0:progressInfo.increment:progressInfo.endPoint;
        progressVector = progressDisplayVector;
    end % if
    
    %prevProgressVal = -1;
    prevProgressIndex = -1;
end % if

% Initialise output data structure.
try
    %[parsed,goodLineNums,parsedPtrs] = feval(initFcnHndl,dataOrFid,initOptions);
    [parsed,parsedPtrs] = feval(initFcnHndl,dataOrFid,initOptions);
catch me
    error('Failed to initialise output structure, with message:\n   %s',me.message);
end % try

% If the input data is an empty array, or the file identifier is already
% pointing at the end of the input file, then return the empty initialised
% data structure.
if (isFid && feof(dataOrFid)) || (~isFid && isempty(dataOrFid))
    return;
end % if

notFinished = true;
numParsesSinceBadLine = Inf;
numLinesRead = 0; % includes both good and bad lines
parseChunkLength = maxChunkLength;
prevParseChunkLength = parseChunkLength;

if isFid
    prevPtr = ftell(dataOrFid);
else
    prevPtr = 0;
end % if

% startLineNum = 1;
% prevStartLineNum = 1;

if verbosity>SILENT
    disp([mfilename '.m--Starting reading with parseChunkLength set to ' num2str(parseChunkLength)])
end
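
% Worked illustration of the adaptive loop below (an assumed scenario, not
% output from a real run): maxChunkLength is 8, the default '%s' format is
% used, and line 3 is the only bad line.
%   Lines 1-8 fail to parse; parseChunkLength 8 -> 4; rewind.
%   Lines 1-4 fail to parse; parseChunkLength 4 -> 2; rewind.
%   Lines 1-2 parse and are appended.
%   Lines 3-4 fail to parse; parseChunkLength 2 -> 1; rewind.
%   Line 3 fails on its own and is skipped as bad data.
%   Lines 4 and 5 parse one at a time; after this second good parse,
%   parseChunkLength 1 -> 2.
%   Lines 6-7 parse; parseChunkLength 2 -> 4, and so on, never exceeding
%   maxChunkLength.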

while notFinished    
    thisChunkGrabFailed = false;
    thisChunkParseFailed = true;
    
    % The value of parseChunkLength may change from iteration to iteration,
    % so the construction of the textscan arguments has to be inside the
    % while loop.
    fullTextScanArgs = {textscanFormat, parseChunkLength, textscanArgs{:}};
            
    % Grab a chunk of data to parse.
    if isFid
        % Grab the chunk of data from the file.
        try
            prevPtr = ftell(dataOrFid);
            C = textscan(dataOrFid,fullTextScanArgs{:});
            
            % When the end of the file is reached, C may hold fewer lines
            % than were requested. An earlier version shrank
            % maxChunkLength here to the power of 2 just above the length
            % of C, to reduce the number of halvings needed after an
            % out-of-memory error, but that step also shrank
            % maxChunkLength whenever bad data lines had reduced
            % parseChunkLength. The value of maxChunkLength is now altered
            % only when memory problems occur.
            %maxChunkLength = 2^ceil(log2(length(C{1})));
            parseChunkLength = min(parseChunkLength,maxChunkLength);
        catch me
            clear C
            thisChunkGrabFailed = true;

            if strcmp(me.identifier,'MATLAB:nomem')
                % Problem caused by Matlab running out of memory.
                outOfMem = true;
            else
                % Problem caused by something other than Matlab running out
                % of memory. Presumably there is a badly-formed line
                % somewhere in there that does not agree with the specified
                % textscan format.
                outOfMem = false;
                numParsesSinceBadLine = 0;
            end % if
        end % try textscan
        
    else
        % Not reading from file; parsing the contents of the character
        % or cell array "dataOrFid".
        try
            endPtr = prevPtr+parseChunkLength;
            endPtr = min(endPtr,size(dataOrFid,1));
            C = dataOrFid((prevPtr+1):endPtr,:);
            
            % Should probably reset maxChunkLength and parseChunkLength the
            % way it was done for reading from a file (above), but will
            % wait until I have some data (and time) for testing it.
            
            % Convert from a character array to a cell array if necessary.
            if ~iscell(C)
                C = cellstr(C);
            end % if
        catch me
            %dbcatch;
            clear C
            thisChunkGrabFailed = true;
            
            if strcmp(me.identifier,'MATLAB:nomem')
                outOfMem = true;
            else
                outOfMem = false;
                numParsesSinceBadLine = 0;
            end % if
            
        end % try
    end % if reading from file/not reading from file
    
    if thisChunkGrabFailed
        if outOfMem
            % Grab of this chunk failed because out of memory. Divide the
            % maximum chunk length by 2. If maxChunkLength is Inf, start
            % working down from the value of intmax.
            prevMaxChunkLength = maxChunkLength;
            maxChunkLength = min([maxChunkLength intmax/2 ceil(maxChunkLength/2)]);
            if verbosity>SILENT
                disp([mfilename '.m--Memory capacity exceeded during read operation; changing maxChunkLength from ' num2str(prevMaxChunkLength) ' to ' num2str(maxChunkLength) ' lines.']);
            end 
            parseChunkLength = maxChunkLength;
        else
            % Grab of this chunk failed because textscan was unable to read
            % it using the specified format string. A "bad" line (i.e., of
            % unexpected format) was encountered. Divide the parse chunk
            % length by 2.
            prevParseChunkLength = parseChunkLength;
            newParseChunkLength = min([parseChunkLength intmax/2 ceil(parseChunkLength/2)]);
            %if newParseChunkLength==1,keyboard,end
            
            if prevParseChunkLength ~= newParseChunkLength
                parseChunkLength = newParseChunkLength;
                if verbosity>MINIMAL
                    disp([mfilename '.m--Bad line(s) encountered during read operation; changing parseChunkLength from ' num2str(prevParseChunkLength) ' to ' num2str(parseChunkLength) ' lines.'])
                end
            end % if
            
            numParsesSinceBadLine = 0;
        end % if
    
    else        
        % thisChunkGrabFailed is false. Parse the data.
        if isFid
            Clength = size(C{1},1);
        else
            Clength = size(C,1);
        end % if
        
        thisChunkLineNumbers = (numLinesRead+1:numLinesRead+Clength)';

        try
            outOfMem = false; 
            %[thisChunkParsed,thisChunkGoodLineNums] = feval(parseFcnHndl,C);
            thisChunkParsed = feval(parseFcnHndl,C,thisChunkLineNumbers,parseOptions);
            clear C;
            thisChunkParseFailed = false;
        catch me
            clear C; % added 2011-05-18
            %dbcatch;
            thisChunkParseFailed = true;
            if strcmp(me.identifier,'MATLAB:nomem')
                outOfMem = true;
            end % if
        end % try
        
        %parseSuccess = ~thisChunkParseFailed && status==0;
        %parseSuccess = ~thisChunkParseFailed && isempty(thisChunkGoodLineNums);
        %parseSuccess = ~thisChunkParseFailed;
        %if ~thisChunkParseFailed & parseChunkLength==1,keyboard,end
        
        if ~thisChunkParseFailed
            % No error thrown by parsing function.
            numParsesSinceBadLine = numParsesSinceBadLine + 1;
            
            % Bump up the number of lines parsed until they reach the
            % maximum. On assumption that bad lines will tend to clump
            % together, do not bump up until at least two good parses have
            % happened.
            if numParsesSinceBadLine>1 && 2*parseChunkLength<=maxChunkLength
                prevParseChunkLength = parseChunkLength;
                newParseChunkLength = 2*parseChunkLength;                
                parseChunkLength = newParseChunkLength;
                if verbosity>MINIMAL
                    disp([mfilename '.m--Successful parse; changing parseChunkLength from ' num2str(prevParseChunkLength) ' to ' num2str(parseChunkLength) ' lines.'])
                end 
            end % if
            
        else
            % Error thrown in parse operation.
            if outOfMem
                prevMaxChunkLength = maxChunkLength;
                maxChunkLength = min([maxChunkLength intmax/2 ceil(maxChunkLength/2)]);
                if verbosity>SILENT
                    disp([mfilename '.m--Memory capacity exceeded during parse operation; changing maxChunkLength from ' num2str(prevMaxChunkLength) ' to ' num2str(maxChunkLength) ' lines.']);
                end
                parseChunkLength = maxChunkLength;
            else
                numParsesSinceBadLine = 0;
                prevParseChunkLength = parseChunkLength;
                newParseChunkLength = min([parseChunkLength intmax/2 ceil(parseChunkLength/2)]);
                
                if parseChunkLength~=newParseChunkLength
                    %if newParseChunkLength==1,keyboard,end
                    parseChunkLength = newParseChunkLength;
                    if verbosity>MINIMAL
                        disp([mfilename '.m--Bad line(s) encountered during parse operation; changing parseChunkLength from ' num2str(prevParseChunkLength) ' to ' num2str(parseChunkLength) ' lines.']);
                    end
                end % if
            end % if
            
        end % if 
        
        if ~thisChunkParseFailed
            % The parsing was successful. Append just-parsed data to the
            % previously-parsed data.
            try
                %[parsed,parsedPtrs,goodLineNums] = feval(appenderFcnHndl,parsed,thisChunkParsed,parsedPtrs,goodLineNums,thisChunkGoodLineNums,numLinesRead);
                [parsed,parsedPtrs] = feval(appenderFcnHndl,parsed,thisChunkParsed,parsedPtrs,appendOptions);
                thisChunkAppendFailed = false;
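            catch me
                % Added handling of out of mem 2011-05-18
                if strcmp(me.identifier,'MATLAB:nomem')
                    clear thisChunkParsed;
                    thisChunkAppendFailed = true;
                    % Halve the maximum chunk length, as for out-of-memory
                    % errors in reading and parsing, before retrying.
                    prevMaxChunkLength = maxChunkLength;
                    maxChunkLength = min([maxChunkLength intmax/2 ceil(maxChunkLength/2)]);
                    if verbosity>SILENT
                        disp([mfilename '.m--Memory capacity exceeded during append operation; changing maxChunkLength from ' num2str(prevMaxChunkLength) ' to ' num2str(maxChunkLength) ' lines.']);
                    end
                    parseChunkLength = maxChunkLength;
                else
                    error('Failed to append parsed data to output structure, with message:\n   %s',me.message);
                end % if
            end % try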
        end % if
        
    end % if ~thisChunkGrabFailed
    
    % Will advance pointer if there was a successful parse OR if there
    % was an unsuccessful parse of a single line.
    doAdvance = false;
    
    if (~thisChunkParseFailed && ~thisChunkAppendFailed) || (thisChunkParseFailed && prevParseChunkLength == 1)
        doAdvance = true;
    end % if
        
    if doAdvance
        if thisChunkParseFailed && parseChunkLength == 1
            % Parsing this single line failed, so advance by 1 line.
            advanceBy = 1;
            if verbosity>MINIMAL
                disp([mfilename '.m--Bad data located at line ' num2str(numLinesRead+1) '. Skipping.']);
            end
            
            % If reading from a file, rewind to previous point in file,
            % then read in and discard entire line. 
            if isFid
                fseek(dataOrFid,prevPtr,'bof');
                fgetl(dataOrFid);
            end % if
            
        else
            % doAdvance is true and this was not an unsuccessful parse of a
            % single line, so this must have been a successful parse;
            % advance by the number of lines that were passed to the
            % parsing function.
%             if isFid
%                 advanceBy = size(C{1},1);
%             else
%                 advanceBy = size(C,1);
%             end % if
            advanceBy = Clength;
        end % if
        
        numLinesRead = numLinesRead + advanceBy;
        
        % Advance the data pointer.
        if isFid
            % Do nothing--the file pointer advanced when textscan was
            % called, and a value of prevPtr was obtained just before call
            % to textscan.
        else
            prevPtr = endPtr;
        end % if
                
        if doShowProgress            
            if strcmp(progressInfo.endPointUnits,'bytes')
                currProgress = ftell(dataOrFid);
            elseif strcmp(progressInfo.endPointUnits,'lines')
                currProgress = numLinesRead;
            end % if

            findIndex = find(progressVector<=currProgress,1,'last');            
            
            if findIndex > prevProgressIndex
                prevProgressIndex = findIndex;
                currProgressDisplay = progressDisplayVector(findIndex);
                if strcmp(progressInfo.incrementUnits,'percent')
                    fprintf('%.0f percent parsed\n',currProgressDisplay);
                elseif strcmp(progressInfo.incrementUnits,'bytes')
                    fprintf('%d of %d bytes parsed\n',currProgressDisplay,progressInfo.endPoint);
                else
                    fprintf('%d of %d lines parsed\n',currProgressDisplay,progressInfo.endPoint);
                end % if
            end % if
        end % if doShowProgress
        
    else
        % doAdvance is false; instead of advancing pointer, rewind it to
        % try parsing with a shorter chunk length. 
        if isFid
            fseek(dataOrFid,prevPtr,'bof');
        else
            % No need to rewind prevPtr if it is pointing to a
            % character array; it just keeps its old value.
        end % if
    end % if
    
    % Determine whether to break out of loop or not.
    if isFid
        if feof(dataOrFid)
            notFinished = false;
        end % if
    else
        if numLinesRead>=size(dataOrFid,1)
            notFinished = false;
        end % if
    end % if
    
end % while notFinished

% Perform cleanup on the parsed data (remove trailing NaNs from
% over-initialised vectors, etc.).
%[parsed,goodLineNums] = feval(cleanupFcnHndl,parsed,goodLineNums,parsedPtrs);
parsed = feval(cleanupFcnHndl,parsed,parsedPtrs,cleanupOptions);

%-------------------------------------------------------------------------
function [parsed,parsedPtrs] = ap_default_initfcn(dataOrFid,initOptions)

% ap_default_initfcn.m--Default initialisation function; returns empty
% variables. 
%
% Without proper initialisation, newly-parsed data can only be appended to
% existing variables, resulting in the inefficiencies associated with
% dynamic memory allocation. For this reason, it is highly recommended that
% this default function be replaced with one customised for use with your
% data. Use this default function as a guide to writing your custom
% version.
%
% Any initialisation function must take two input arguments (even though
% their actual use by the function is optional):
%
%  dataOrFid--This parameter (either a file identifier (fid) or a
%  cell/character array) can be used to determine the expected size of the
%  output data so that the "parsed" output variable can be initialised to
%  the right size.
%
%  initOptions--A structured variable containing whatever additional
%  information you need to pass to your initialisation function (it can
%  contain fields like "initOptions.numberOfLines", for example).
%
%  Output arguments are:
%
%    parsed--The initialised structured variable, created to be compatible
%    with the data appender function.
%
%    parsedPtrs--Structured variable containing pointers to each of the
%    variables to be output as fields of the "parsed" output argument.
%    Initialise to be compatible with the data appender and cleanup
%    functions (for example, parsedPtrs.temperature=0; parsedPtrs.depth=0;
%    etc.).
%
% Syntax: [parsed,parsedPtrs] = ap_default_initfcn(dataOrFid,initOptions)

% Developed in Matlab 7.11.0.584 (R2010b) on GLNX86
% for the VENUS project (http://venus.uvic.ca/).
% Kevin Bartlett (kpb@uvic.ca), 2011-03-22
%-------------------------------------------------------------------------

parsed = struct;
parsedPtrs = struct;

%-------------------------------------------------------------------------
function [parsed,parsedPtrs] = ap_default_appenderfcn(parsed,thisChunkParsed,parsedPtrs,appendOptions)
%
% ap_default_appenderfcn.m--Default appender function. Use this default
% function as a guide to writing your version, customised for your own
% data.
%
% This default appender function appends all parsed data to fields of the
% simple structured variable "parsed". A custom version must also return
% the "parsed" argument, but could make it a more complex structured
% variable, with nested fields, etc. Could also achieve efficiencies by
% appending only certain, desired fields, rather than all of the parsed
% fields. 
%
% The appender function takes "parsed" both as an input argument and an
% output argument. "parsedPtrs" is also both an input and an output
% argument. "parsedPtrs" is used by the appender function itself to insert
% data into the appropriate place in pre-initialised fields of "parsed"; if
% not pre-initialising (i.e., if dynamically appending data) then
% parsedPtrs can be just left as an empty variable and not used.
%
% Other input arguments are:
%
%   thisChunkParsed--Structured variable containing the parsed data from
%   this "chunk" of the data.
%
%   appendOptions--A structured variable containing whatever additional
%   information you need to pass to your appending function (it can
%   contain fields like "appendOptions.verbosity", for example).
%
% Syntax: [parsed,parsedPtrs] = ap_default_appenderfcn(parsed,thisChunkParsed,parsedPtrs,appendOptions)

% Developed in Matlab 7.11.0.584 (R2010b) on GLNX86
% for the VENUS project (http://venus.uvic.ca/).
% Kevin Bartlett (kpb@uvic.ca), 2011-03-22
%-------------------------------------------------------------------------

% Append the new data to the "parsed" structure.
fieldNames = fieldnames(thisChunkParsed);

for iField = 1:length(fieldNames)
    thisFieldName = fieldNames{iField};
    thisFieldLength = length(thisChunkParsed.(thisFieldName));
    
    % Use the pointer in the corresponding field of parsedPtrs to insert
    % the current data to the correct place in "parsed" (if parsed was not
    % pre-initialised, or was pre-initialised too short, this has the same
    % effect as simply appending the current data).
    if ~isfield(parsedPtrs,thisFieldName)
        % The pointer field for this field has not been initialised.
        parsedPtrs.(thisFieldName) = 0;
    end % if
    
    thisFieldStartPtr = parsedPtrs.(thisFieldName)+1;
    
    thisFieldEndPtr = thisFieldStartPtr + (thisFieldLength-1);
    parsed.(thisFieldName)(thisFieldStartPtr:thisFieldEndPtr) = thisChunkParsed.(thisFieldName);
    
    % Keep track of the end of the parsed data inside the initialised
    % "parsed" variable.
    parsedPtrs.(thisFieldName) = thisFieldEndPtr;
end % for

%-------------------------------------------------------------------------
function parsed = ap_default_cleanupfcn(parsed,parsedPtrs,cleanupOptions)
%
% ap_default_cleanupfcn.m--Default cleanup function. Does nothing at all. A
% custom version would crop the parsed variable to its actual length using
% the parsedPtrs structured variable (if parsed had been initialised to be
% too long).
%
% Syntax: parsed = ap_default_cleanupfcn(parsed,parsedPtrs,cleanupOptions)

% Developed in Matlab 7.11.0.584 (R2010b) on GLNX86
% for the VENUS project (http://venus.uvic.ca/).
% Kevin Bartlett (kpb@uvic.ca), 2011-03-22
%-------------------------------------------------------------------------

% Do nothing.
