Retaining order of columns when importing non-rectangular data

I am trying to import data from multiple files, all of which begin with several empty rows, and the number of empty rows varies column-by-column. One such file looks like this:
frame1 empty empty empty data data data
frame2 empty empty empty data data data
frame3 data data data data data data
frame4 data data data data data data
etc...
When I use dlmread to import starting at the first cell (frame1), MATLAB loads the first column correctly but shifts the columns containing data (5 through 7 above) to the left, so the data cells end up in columns 2:4 instead of 5:7. This leftward shift happens on every row that has empty cells to the left of actual data. Once all columns contain data (the rows beginning with frame3 and frame4 above), the file imports normally.
How can I load the data in a way that preserves data in the proper cells? Every file has a different arrangement of empty cells up top, so I'd prefer an automated fix rather than handling each file separately. Thanks very much in advance. - Nate

Answers (2)

Star Strider
Star Strider on 17 May 2015
Without seeing a representative example of your files, it’s difficult to suggest an exact solution.
However, I would see if textscan can import them. If they all start with a minimum number of empty rows, you can use the 'HeaderLines' name-value pair to skip them. Beyond that, if you are importing numeric data and there are blanks or strings in those positions, textscan will fill them with NaN values. You can deal with the NaN values later, but the rows and columns should all line up and import correctly otherwise.
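As a sketch of what I mean (the file name, tab delimiter, and seven-column format string are all assumptions on my part, since I haven't seen your file):

```matlab
% Read a 7-column, tab-delimited file; empty numeric cells
% come back as NaN ('EmptyValue' defaults to NaN in any event).
fid = fopen('YourFile.txt', 'r');
C = textscan(fid, '%s%f%f%f%f%f%f', ...
    'Delimiter', '\t', ...
    'EmptyValue', NaN, ...
    'HeaderLines', 0);   % raise this to skip leading empty rows
fclose(fid);
frames = C{1};       % first column as a cell array of strings
data   = [C{2:7}];   % numeric columns, NaN where cells were empty
```

The rows and columns should line up even where cells are empty, because each column gets its own conversion in the format string.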
Again, I don’t have one of your files to experiment with myself, so this is a guess on my part.
  6 Comments
Walter Roberson
Walter Roberson on 18 May 2015
TABS?? The whole question could have been answered by passing the optional delimiter to dlmread!
dlmread('YourFile.txt', '\t')
I spent more than an hour and a half yesterday and this morning, researching and writing up solutions for the case where the columns are delimited by blanks. The case where the columns are delimited by tabs is trivial!
Star Strider
Star Strider on 18 May 2015
That was a belated discovery on my part. I opened it in Notepad and looked to see how the columns were separated. They looked like tabs, so I tried that with textscan, found that it sort of worked (on the first line at least), added the 'EndOfLine' pair, and it imported the entire file!
I’m not sure dlmread would work because there’s a prodigious amount of text in that file. I don’t use dlmread often, but the documentation says it doesn’t like text. I will defer to your experience with it if I’m in error.



Walter Roberson
Walter Roberson on 17 May 2015
I am going to refer you to another answer I just wrote for some background; please scan over it.
Now, dlmread() uses the undocumented textscan() format I mentioned there: the empty string, ''. When it does, column spacing is not paid attention to; whitespace characters are skipped (except as needed to establish breaks between fields). So you get the effect the original poster noted: the fields are shifted over.
If you look at the documented import functions, the only one that talks about "column" data is readtable(). Unfortunately, as you look through the documentation for readtable(), it becomes clear that it uses textscan(). A quick test shows it has the same difficulty: an inability to deduce (or deal with) numeric columns containing empty cells.
How to deal with this file then?
One answer is to use counted character fields with textscan, such as '%7c' to tell it to read whatever 7 characters are there, and to pass the 'Whitespace', '' parameter pair to textscan. Leading blanks are not skipped if you use the empty string for 'Whitespace'; but if you give any other string for the 'Whitespace' option, then blank (' ') is automatically added to the list, causing leading whitespace to be skipped, even with '%c' formats.
Anyhow, once the file was sliced up into columns of the appropriate width, str2double() could be used on the resulting cell arrays to convert to numeric values. str2double() will treat the empty string and strings of blanks as invalid numbers and will return NaN in those positions.
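The str2double() behavior is easy to check directly: it accepts a cell array of strings and returns NaN for anything that does not parse as a number, including blanks and the empty string.

```matlab
% Blank and empty strings are invalid numbers and become NaN.
str2double({'1.5', '   ', ''})   % -> [1.5 NaN NaN]
```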
It is tempting to try to use a mix of fixed-width integer formats together with counted character formats to parse files where some columns might be known to always be present whereas other columns might be empty. This temptation is sure to lead to frustration if not grief.
Regardless of whether 'Whitespace' is the empty string:
  • If the stream is positioned at the beginning of a line and the first item is a numeric format, then arbitrary amounts of blanks will be skipped to position for reading the field; this occurs even when 'Whitespace' is the empty string.
  • If the next format item is a numeric field and the current position is not a valid number-forming character, then a mismatch is raised: all fields from there to the end of the format string are given the empty value, and use of the format string terminates (textscan returns). The other rules described here set conditions up so that if 'Whitespace' is not the empty string, it is not possible to be positioned at a blank when starting to read a numeric field. If 'Whitespace' is the empty string, then it is possible to be positioned at a blank when starting to read a numeric field, and since a blank is not a valid number-forming character, the format mismatch would be raised.
  • If the next format item is a numeric field, then no matter whether you use %d or %g, a full exponential-format number is allowed on input; the number read is converted to the appropriate datatype afterwards. For example, '1.2e3' with a '%d' format will not end at the '1': the number would be interpreted as 1200. In some older releases, if you had digits followed immediately by 'd' or 'e' followed by non-digit, non-sign characters (the sign being for a signed exponent), then the 'd' or 'e' would be silently eaten; in R2014a, '1dq' with a %d format terminates parsing at the '1', leaving 'dq' in the buffer, but '1d3' is still considered a single number. If you have two adjacent fields with no delimiter between them, and you are counting on a numeric field being terminated by the character following it, you can be in for a surprise if that character just happens to be 'd' or 'e'. Beware reading numbers without a field width unless there is a distinct delimiter after them!
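The exponent gotcha is easy to try for yourself (the exact behavior is release-dependent, as noted above):

```matlab
% %d does not stop at the '1' here; per the behavior described
% above, the exponential form is consumed and converted.
C = textscan('1.2e3', '%d');
% C{1} would be 1200 (as an integer), not 1.
```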
When Whitespace is not the empty string:
  • After any field is read, regardless of whether it is a numeric or a character format, trailing whitespace is skipped over in the buffer, leaving the position at a non-whitespace character. Notice that this is done after reading the field. It can be difficult to distinguish whether whitespace skipping is done before or after reading fields; "after" is the easiest way to explain the other behaviors.
I know I have lost track of at least one detail of why it is hard to mix numeric formats and %c formats when there are empty fields, even when 'Whitespace' is the empty string; it is very late and I will try to remember later.
If one is going to be using textscan() just to partition into fixed-width strings, then it is hardly worth using textscan(). One might as well read the entire file and process it as character strings.
The below code can be made faster if all lines are exactly the same length (including trailing blanks):
file_as_str = fileread('YourFile.txt');
as_lines = regexp(file_as_str, '^.*$', 'match', 'lineanchors', 'dotexceptnewline');
as_array = char(as_lines);
and then as_array could be accessed by column.
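For instance, putting the pieces together (the column range 1:7 is an assumed field width, purely for illustration; adjust it to your actual layout):

```matlab
file_as_str = fileread('YourFile.txt');   % file name is an assumption
as_lines = regexp(file_as_str, '^.*$', 'match', ...
    'lineanchors', 'dotexceptnewline');
as_array = char(as_lines);        % shorter lines are padded with blanks
field1 = as_array(:, 1:7);        % assumed 7-character first field
vals1  = str2double(cellstr(field1));   % all-blank fields become NaN
```

Because char() pads shorter lines with blanks, every line is safe to index by column, and the blank padding falls out as NaN after str2double().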
If all of the lines were certain to be the same length, then supposing that length is L (excluding the newline):
file_as_str = fileread('YourFile.txt');
if mod(length(file_as_str), L+1) ~= 0
    % final newline might be absent; add it in
    file_as_str(end+1) = char(10);
end
as_array = reshape(file_as_str, L+1, []).';
as_array(:,end) = []; %remove the newlines if desired
  1 Comment
Nathaniel Smith
Nathaniel Smith on 18 May 2015
Hi Walter - thanks for the thorough answer! Conceptually, I understand the importance of using the empty string '' for whitespace, in order to avoid skipping over empty cells. However, I'm not especially familiar with the subtleties of textscan and I'm at a loss as to how to apply your advice to my situation.
Your first chunk of example code successfully reads in the entire file, character by character, and separates it into lines. However, my files (like the example file above) contain cells with varying numbers of characters. How could I go about breaking these huge lines of characters into cells, as they appear in the input file? Thanks very much - Nate

