MATLAB Answers

Reading and processing data from text file to matlab variable quickly

Paolo Binetti on 25 Feb 2017
Edited: per isakson on 3 Mar 2017
I use the following code to read data from a text file and process it into two cell arrays. It works, but can it be done faster? Although the downstream code currently needs the data as cell arrays, I am also open to other data types if they make reading from the text file quicker.
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp(adjlist, '\w*(?= )', 'match');
nodes = cell2mat(nodes);
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');

  2 Comments

dpb on 25 Feb 2017
The time is likely spent not in reading the file but in the regexp processing afterwards; regexp is pretty notorious for not being a speed demon. You're reading the file as just a cellstr array so I suspect that's not the issue.
Try breaking out the fileread from the surrounding regexp and profile the result; I'll be quite surprised if the above supposition doesn't turn out to be true.
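One way to carry out that check is to time each step separately with tic/toc. The sketch below is only illustrative: it writes its own temporary file in an assumed "node -> a,b,c" layout (the real sample_input.txt is not shown in this thread), and the names tmpfile, tRead, tSplit, tNodes and tEdges are made up for the example.

```matlab
% Build a hypothetical input file; the "node -> a,b,c" layout is an assumption.
tmpfile = [tempname, '.txt'];
fid = fopen(tmpfile, 'w');
for k = 1:3000
    fprintf(fid, '%d -> %d,%d,%d\r\n', k, k + 1, k + 2, k + 3);
end
fclose(fid);

% Time the file read on its own.
t0 = tic;
txt = fileread(tmpfile);
tRead = toc(t0);

% Time the line split and the removal of empty lines.
t0 = tic;
adjlist = regexp(txt, '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
tSplit = toc(t0);

% Time the two pattern-matching calls from the question, one at a time.
t0 = tic;
nodes = regexp(adjlist, '\w*(?= )', 'match');
tNodes = toc(t0);

t0 = tic;
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
tEdges = toc(t0);

delete(tmpfile);
fprintf('read %.4f s, split %.4f s, nodes %.4f s, edges %.4f s\n', ...
    tRead, tSplit, tNodes, tEdges);
```

Single-shot timings like these are noisy, so it may be worth repeating the run a few times before drawing conclusions.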
Paolo Binetti on 26 Feb 2017
You are right, the bottleneck is the three regexp calls. I have reworded my question slightly; I hope it is clearer. Or do you suggest recasting the problem purely in terms of regexp?


Accepted Answer

per isakson on 26 Feb 2017
Edited: per isakson on 26 Feb 2017
"Reading and processing data from text file to matlab variable quickly": the short answer is that using textscan to read the file and do most of the parsing is faster, and it gives cleaner code.
It's a bit tricky to measure the speed of reading small files, since the file will be available in the system cache after the first test. However, it's safe to claim that in this case textscan is faster.
Run this
>> [nodes,edges,cac] = cssm();
Elapsed time is 0.054037 seconds.
Elapsed time is 0.009937 seconds.
>> cac(:)
ans =
{3001x1 cell}
{3001x1 cell}
where
function [nodes,edges,cac] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
toc
end
A fairer comparison:
>> [nodes,edges,n2,e2] = cssm();
Elapsed time is 0.047859 seconds.
Elapsed time is 0.014726 seconds.
>> edges{1}
ans =
'3' '5' '9'
>> e2{1}
ans =
'3' '5' '9'
where three lines are added to produce the data in the same format
function [nodes,edges,n2,e2] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
n2 = cac{1}; % new
e2 = regexp( cac{2}, ',', 'split' ); % new
e2 = reshape( e2, 1,[] ); % new
toc
end
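To check that the two routes really produce the same data, they can be compared on a tiny file. This is a sketch under assumptions: the three-line sample and its "node -> a,b,c" layout are invented for illustration (the actual sample_input.txt is not shown here), and tmpfile, same_nodes and same_edges are hypothetical names.

```matlab
% Hypothetical three-line sample; the "node -> a,b,c" layout is an assumption.
tmpfile = [tempname, '.txt'];
fid = fopen(tmpfile, 'w');
fprintf(fid, '0 -> 3,5,9\r\n1 -> 2\r\n2 -> 0,1\r\n');
fclose(fid);

% regexp route, as in the first half of cssm
adjlist = regexp(fileread(tmpfile), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp(adjlist, '\w*(?= )', 'match');
nodes = cat(1, nodes{:});
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');

% textscan route, as in the second half of cssm
fid = fopen(tmpfile);
% %s reads the node label, %*s skips the "->" token,
% %[^\r\n] keeps the rest of the line (the comma-separated edge list)
cac = textscan(fid, '%s%*s%[^\r\n]', 'Delimiter', ' ');
fclose(fid);
n2 = cac{1};
e2 = reshape(regexp(cac{2}, ',', 'split'), 1, []);

delete(tmpfile);

% Both flags should be true if the two routes agree on this sample.
same_nodes = isequal(nodes, n2);
same_edges = isequal(edges, e2);
```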

  7 Comments

Walter Roberson on 1 Mar 2017
Cell arrays require:
  • 8 bytes per cell, whether used or not
  • plus 104 bytes per non-empty cell, which includes the size and type information for the cell
  • plus the storage for the obvious data of the cell. For character strings, that is 2 bytes per character.
For a fully occupied cell array, that is 112 bytes per cell plus the obvious data of the cell. (And you have to add to that, whatever storage is used to represent the size and type information of the variable that is the cell array header.)
If you were to use a blank-padded rectangular region, then that would be 2 bytes per character, times number of rows, times number of columns; to which you would add whatever storage is used to represent the size and type information of the variable (probably the same cost as a cell array header). You would be wasting some of those columns with the blank padding.
You have not happened to indicate anything about minimum and maximum and typical row size. If the occupancy was uniform random (unlikely), then on average half of the columns would be unused; in that situation if the fixed width were at least twice 112 bytes, which you would get with 112 characters wide, then the average waste would be the same as the cell overhead. However, uniform random is not really typical. More typical is that either there is not much variation in sizes (e.g., if the variation were just between 3 and 5 fields), or else that most of the data is relatively short but a small fraction of it is really large (power law), in which case if you allocate as if everything could be the longest then you could waste a lot.
In terms of timing, access into a rectangular array is faster, but it is not all that different for a single level of cell nesting.
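The trade-off can be inspected directly with whos. The following is a rough sketch with made-up data; the variable names are illustrative, and the per-cell overhead that whos reports may differ between MATLAB releases, so no particular figure is assumed here.

```matlab
names  = {'first'; 'second'; 'third'};   % cell array of char vectors (made-up data)
padded = char(names);                    % 3-by-6 blank-padded char matrix

info_cell = whos('names');
info_char = whos('padded');

% Payload only: 2 bytes per character actually stored in the cells.
payload_bytes = 2 * sum(cellfun('length', names));

% Whatever whos reports beyond the payload is attributed to per-cell overhead.
overhead_per_cell = (info_cell.bytes - payload_bytes) / numel(names);

fprintf('cell: %d bytes, char matrix: %d bytes, overhead per cell: %g bytes\n', ...
    info_cell.bytes, info_char.bytes, overhead_per_cell);
```

With only three short strings the padded matrix wastes very little; repeating the comparison on strings of widely varying length shows the padding cost growing with the longest string.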
dpb on 1 Mar 2017
The final line of strsplit after all the preprocessing is
% Split.
[c, matches] = regexp(str, aDelim, 'split', 'match');
so guess it stands to reason it's going to be slower... :)
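A quick way to see this on a given machine is to time both calls on the same input. This is a minimal sketch; the test string, the repetition count nRep, and the timing variable names are arbitrary choices for illustration.

```matlab
str = '3,5,9';

a = strsplit(str, ',');          % convenience wrapper
b = regexp(str, ',', 'split');   % direct regexp call
same = isequal(a, b);            % both should give the same 1-by-3 cell array

nRep = 1e4;                      % arbitrary repetition count

t0 = tic;
for k = 1:nRep
    a = strsplit(str, ',');
end
tStrsplit = toc(t0);

t0 = tic;
for k = 1:nRep
    b = regexp(str, ',', 'split');
end
tRegexp = toc(t0);

fprintf('strsplit %.4f s, regexp %.4f s for %d calls\n', tStrsplit, tRegexp, nRep);
```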
per isakson on 2 Mar 2017
"more efficient way to store strings of different lengths": I guess there is no one-size-fits-all answer.
  • "efficient" regarding memory use and computational speed may conflict.
  • The number of strings to store
  • The variation in length of the strings as Walter pointed out.
  • Which operations will be done on the set of strings.
  • Whether or not strictly "write-once-read-many"
  • Does the cost of making the program/code count?
  • And more ... .
Regarding character arrays: "'first','second','third'" should be stored as
fst
ieh
rci
sor
tnd
 d
since Matlab is column major. This is tricky to read when debugging.
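A minimal sketch of that layout, using only built-in functions (the variable names are illustrative):

```matlab
% One string per column, blank-padded to the longest string.
c = char('first', 'second', 'third').';   % 6-by-3 char matrix

% Reading one string back means taking a column and trimming the padding.
second = strtrim(c(:, 2).');

% Linear indexing follows memory order, so each string is contiguous.
linear = c(:).';
```

Displaying c itself shows the vertical text above, which is why this layout is awkward to read in the debugger.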
I recently had a problem:
  • a fraction of a million valid Matlab variable names. Most names are short, but some are long. (No, I don't use them in expressions with EVAL.)
  • searches typically return a dozen names
Solution:
  • store all names in one row separated by char(31), huge_str. char(31) is displayed as space by editors.
  • store the positions of char(31) to avoid repeated use of strfind(huge_str)
  • use STRFIND and REGEXP in searches
My resulting code is fast and memory efficient, but it did require some debugging.
Is this an undocumented use of char(31) that might not survive the next Matlab release? I don't think the use of char(31) is mentioned in the Matlab documentation.
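The scheme described above can be sketched roughly as follows. This is not the actual code from that project: the sample names, the prefix search, and every variable name other than huge_str are hypothetical, and the sketch only assumes that char(31) never occurs inside a name.

```matlab
names = {'alpha', 'beta_long_variable_name', 'gamma', 'alphabet'};  % made-up data
sep = char(31);

% One row of text, with a separator before, between, and after the names.
huge_str = [sep, strjoin(names, sep), sep];
sep_pos  = strfind(huge_str, sep);   % computed once, reused by every search

% Search: a name starting with 'alpha' is directly preceded by a separator.
hits = strfind(huge_str, [sep, 'alpha']);

% Map each hit to its name index, then cut the name out between separators.
[~, idx] = ismember(hits, sep_pos);
found = arrayfun(@(k) huge_str(sep_pos(k) + 1 : sep_pos(k + 1) - 1), ...
    idx, 'UniformOutput', false);
```

Keeping sep_pos alongside huge_str costs one double per name but turns every lookup into plain indexing.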
