Asked by Thomas
on 15 Jan 2013

Hi

I have a text file containing a text header, and rows containing numeric values, with varying numbers of values, characters and numeric formats:

    # Bundle file v0.3
    9 2532
    6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002
    9.3709731863e-001 -2.8772865743e-001 -1.9763814183e-001
    194 144 45
    5 6 1496 289.0000 199.0000 7 1235 308.0000 125.0000 5 1614 285.0000 163.0000 4 2122 173.0000 142.0000 0 911 148.5000 165.5000
    2.4321163035e+000 -9.1469082482e-001 -6.6122261943e+000
    219 194 76

I want to remove the header and store each of the numeric values in a matrix (padded out with NaNs to compensate for the dimensional differential). At present, I am using this code:

    % open file and save contents to cell array, c
    fid = fopen('C:\transform\bundle.out','r');
    c = textscan(fid,'%s','delimiter', '','whitespace','');
    fclose(fid);

    % create m x 1 cell C and remove the header
    C = c{1};
    C(1,:)=[];

    % convert C to a matrix using cell2mat / cellfun
    maxLength=max(cellfun(@(x)numel(x),C));
    out = cell2mat(cellfun(@(x)cat(2,x,zeros(1,maxLength-length(x))),C,'UniformOutput',false));

The problem with this approach is that it creates a character array where each row is a string meaning that I cannot use str2num or str2double to convert the numeric values to discrete doubles (i.e. it gives [] / NaN due to not passing the arithmetic number test). I.e. it produces:

    '9 2532 ';
    '6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002 ';
    '9.3709731863e-001 -2.8772865743e-001 -1.9763814183e-001';

rather than:

    '9' '2532';
    '6.8302313857e+002' '-1.4826175815e-001' '8.1715222947e-002';
    '9.3709731863e-001' '-2.8772865743e-001' '-1.9763814183e-001';
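A minimal sketch of why the row strings do not convert, assuming the first data row from the sample above: `str2double` expects a single number per string, whereas `sscanf` with `'%f'` reads every whitespace-separated number. The variable names are only illustrative.

```matlab
% One row of the file as a single string (taken from the sample data)
rowStr = '9 2532';

% str2double treats the whole string as one number, so this gives NaN
asOne = str2double(rowStr);

% sscanf reads all whitespace-separated numbers into a column vector
asMany = sscanf(rowStr, '%f');
```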

I can work around this by separating each row into its own row vector (e.g. out1, ..., outn) and then using:

    splitstring = textscan(out1,'%s');
    splitstring = splitstring{1};

I can then use str2double and flipdim or similar to return rows of doubles, and use vertcat with NaN padding to get the desired matrix, but this seems very unwieldy to code. Can anyone suggest a simpler way of getting the desired output? Any suggestions would be appreciated.

Thomas

Answer by Thomas
on 16 Jan 2013

Accepted answer

I have worked out the answer for those with a similar problem:

I use textscan and cellfun to split the strings, de-nest and rearrange the output using vertcat and cellfun/transpose, then convert the single strings to doubles using cellfun/str2double:

    fid = fopen('C:\transform\bundle.out','r');
    c = textscan(fid,'%s','delimiter', '','whitespace','', 'HeaderLines', 1);
    fclose(fid);
    C = c{1};
    C = cellfun(@(x) textscan(x,'%s','Delimiter', ' ')',C ,'UniformOutput',false);
    Y = vertcat(C{:});
    X = cellfun(@transpose,Y,'UniformOutput',false);
    Z = cellfun(@str2double,X,'UniformOutput',false);

The output matrix can then be built with cellfun/cell2mat, using the maximum row length (maxLength):

    maxLength=max(cellfun(@(x)numel(x),Z));
    out = cell2mat(cellfun(@(x)cat(2,x,zeros(1,maxLength-length(x))),Z,'UniformOutput',false));

Note this code pads out the values with zeros rather than NaNs.
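To pad with NaNs as the question asked, one option is to swap `zeros` for `nan` in the padding step. The sketch below uses a small hypothetical cell array in place of `Z`; the values and the helper name `padRow` are only illustrative.

```matlab
% Hypothetical ragged rows standing in for Z (one numeric row vector per cell)
Z = {[9 2532]; [6.83 -0.148 0.0817]; [194 144 45 5]};

% Length of the longest row
maxLength = max(cellfun(@numel, Z));

% Pad each row on the right with NaN up to maxLength
padRow = @(x) [x, nan(1, maxLength - numel(x))];
out = cell2mat(cellfun(padRow, Z, 'UniformOutput', false));
```

Each padded cell is 1-by-maxLength, so `cell2mat` can stack them into a single numeric matrix.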

Answer by per isakson
on 15 Jan 2013

Edited by per isakson
on 17 Jan 2013

If the file isn't huge (compared to available RAM and address space) and you have an idea of the maximum number of columns and rows, then I guess the simplest way is to loop over all rows.

    M = nan( nrow, ncol ); % allocate memory

    fid = fopen( ... );

    str = fgetl( fid ); % header line
    row = 0;
    while not( feof(fid) )
        row = row + 1;
        str = fgetl( fid );
        val = sscanf( str, '%f' ); % parse all numbers in this line
        M( row, 1:numel(val) ) = val;
    end

And trim M. Something like this.


**[Edit: 2013-01-16]**

**Working code**

Here is a comparison between three solutions. The first two, cssm and cssm1, follow my outline above. The last, OP, is the one proposed by the OP. I ran this script a few times.

    %% read ragged text file
    clc
    tic, M1 = cssm; toc
    tic, M2 = cssm1( 10000, 100 ); toc
    tic, M3 = cssm1( 100000, 1000 ); toc
    tic, M4 = OP(); toc

which returns

    Elapsed time is 0.238691 seconds.
    Elapsed time is 0.131869 seconds.
    Elapsed time is 0.960397 seconds.
    Elapsed time is 0.709025 seconds.

The output is

    >> whos
      Name      Size        Bytes   Class     Attributes

      M1        2464x21     413952  double
      M2        2464x21     413952  double
      M3        2464x21     413952  double
      M4        2464x21     413952  double


In cssm.m the required numbers of rows and columns are determined in two separate steps. Each step reads the file. Thus, the function cssm reads the file three times.

With cssm1 the number of rows and columns are guessed. In one case the "guesses" are 4x the actual size and in the other 40x.

The function, OP, is OP's code made into a function and ZEROS replaced by NAN to honor the question.

With 2500 rows, cssm is three times faster than the loop-free code (OP). cssm1 is five times faster when allocating 4x4 times more memory than needed, and a bit slower than the loop-free code when allocating 40x40 times more memory.

**Conclusions:**

- Loops are not always slow.
- Reading from the file cache is fast.
- Code with loops is often easier to write and understand (IMO).
- Don't hesitate to use the RAM if it is available.


The files involved are

    function M = cssm()

        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );

        cac = textscan( fid, '%s', 'Delimiter', '\n', 'HeaderLines', 1 );
        nrow = numel( cac{:} );
        clear cup

        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );
        [~] = fgetl( fid ); % header line

        ncol = 0;
        while not( feof( fid ) )
            ncol = max( ncol, numel( sscanf( fgetl(fid), '%f' ) ) );
        end
        clear cup
        M = cssm_( nrow, ncol );
    end
    function M = cssm_( nrow, ncol )
        M = nan( nrow, ncol ); % allocate memory
        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );
        [~] = fgetl( fid ); % header line
        row = 0;
        while not( feof( fid ) )
            row = row + 1;
            val = sscanf( fgetl(fid), '%f' );
            M( row, 1:numel(val) ) = val;
        end
    end

and

    function M = cssm1( nrow, ncol )
        M = nan( nrow, ncol ); % allocate memory
        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );
        [~] = fgetl( fid ); % header line
        row = 0;
        while not( feof( fid ) )
            row = row + 1;
            val = sscanf( fgetl(fid), '%f' );
            M( row, 1:numel(val) ) = val;
        end
        M( :, all( isnan( M ), 1 ) ) = [];
        M( all( isnan( M ), 2 ), : ) = [];
    end

The text file, cssm.txt, contains 2465 lines: repetitions of OP's data.

Thomas
on 16 Jan 2013

Thanks for your response

Unfortunately, the number of rows is unknown, as is the number of variables and characters in each row (i.e. the example in the original question). A for loop may work, though acting on the cell array might be more RAM friendly. I'll have a look at a possible solution.
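When neither the row count nor the column count is known in advance, one possible approach is to read all lines into a cell array first and take the matrix size from the parsed data. The following is only a sketch under that assumption: it writes a small hypothetical sample file to a temporary location so that it is self-contained, and the names `fname`, `rows` and `vals` are illustrative.

```matlab
% Write a small hypothetical sample file so the sketch is self-contained
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '# Bundle file v0.3\n9 2532\n6.83e+002 -1.48e-001 8.17e-002\n194 144 45\n');
fclose(fid);

% Read every line after the header into a cell array of strings
fid = fopen(fname, 'r');
c = textscan(fid, '%s', 'Delimiter', '\n', 'HeaderLines', 1);
fclose(fid);
rows = c{1};

% Parse each line into a row vector of doubles
vals = cellfun(@(s) sscanf(s, '%f').', rows, 'UniformOutput', false);

% Take the size from the data, then fill a NaN matrix row by row
ncol = max(cellfun(@numel, vals));
M = nan(numel(vals), ncol);
for k = 1:numel(vals)
    M(k, 1:numel(vals{k})) = vals{k};
end

delete(fname); % remove the temporary sample file
```

This holds the whole file in memory as strings, so it trades RAM for not having to guess the dimensions.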

per isakson
on 16 Jan 2013

I have added working code above to illustrate the approach I proposed.

Answer by Ryan Livingston
on 15 Jan 2013

Will think more about the harder question of formatting the numeric data but you could use the properties 'CommentStyle' and/or 'HeaderLines' to skip your header.

Missing numeric fields are indeed padded with NaNs by default so doing:

a = textscan(fid, '%f %f %f\n',1,'HeaderLines',1)

returns:

a =

[9] [2532] [NaN]

This is controlled by the property 'EmptyValue'. Getting the right format string and properties will do all of the padding for you.
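As a small illustration of 'EmptyValue', here is a sketch that parses a hypothetical comma-delimited string rather than the file from the question. The input string and the fill value -1 are assumptions chosen for the example.

```matlab
% Hypothetical delimited text with an empty middle field
txt = '1,,3';

% By default the empty numeric field is returned as NaN
a = textscan(txt, '%f %f %f', 'Delimiter', ',');

% With 'EmptyValue' the empty field is filled with the given value instead
b = textscan(txt, '%f %f %f', 'Delimiter', ',', 'EmptyValue', -1);
```

Here `a{2}` holds the default fill and `b{2}` holds the value passed as 'EmptyValue'.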

Could you elaborate on the desired format of the output array? Are you viewing the text file as a matrix and you would like the dimensions to be number_of_lines - by - max_number_of_values (8 -by- 16 in this example) or something else?

Thomas
on 16 Jan 2013

Hi

The desired output would be number of rows (unknown) by maximum number of values:

    [9 2532 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN;
     6.83e+002 -1.48e-001 8.17e-002 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
     9.37e-001 -2.87e-001 -1.97e-001 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
     194 144 45 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
     5 6 1496 289.0000 199.0000 7 1235 308.0000 125.0000 5 1614 285.0000]

With the maximum number of values in this case being 11 (< max row padded with NaN).
