Reading multiple excel files with data store when the columns of the files change position

3 views (last 30 days)
Hi. I am trying to read hundreds of excel files with data store, but the files vary the position of the columns. Data Store reads the position of the columns in the first file and assumes that it is the same for the rest. when reviewing the data of the last excel files does not match, because its position was different. how do i solve it?
thanks.
  2 Comments
Guillaume
Guillaume on 31 Oct 2019
Unfortunately, it's not something that's supported by the SpreadsheetDatastore. In theory, you would have to write your own custom datastore to handle this but it would be a lot of work.
However, have all the files got the exact same columns just in different position? If so, you could handle it with a TransformedDatastore. If you can attach two sample files (actual content can be rubbish as long as they've got the correct format), I can show you how to do that.
Juan Carlos Pozuelos Buezo
Thanks Guillaume, unfortunately the files has diferent comunt of columns. buy, i whant to work with the same columns in all files. The point is that columns are en diferent positions.

Sign in to comment.

Accepted Answer

Guillaume
Guillaume on 31 Oct 2019
This works with the very simple test case I built (of two excel files with columns in different order and the 2nd file with extra columns).
First create your datastore as normal, e.g:
ds = datastore('C:\somewhere\*.xlsx');
ds.SelectedVariableNames = {'Something', 'SomethingElse', 'Whatever'};
Then create a TransformedDatastore with that underlying datastore:
tds = transform(ds, @reorderds, 'IncludeInfo', true);
with the following reorderds.m file:
function [tout, infoin] = reorderds(tin, infoin)
%reorder the variables of a table read by datastore.read so that they match what is actually in the file
%works with a spreadsheetdatastore only
currentvars = tin.Properties.VariableNames; %these names come from the first file in the datastore and may not match the actual names in the file
opts = detectImportOptions(infoin.Filename, 'Sheet', infoin.SheetNames{1}); %so get the actual names with detectimportoptions
actualvars = opts.VariableNames;
[found, where] = ismember(currentvars, actualvars); %find actual location of the variables in the file
assert(all(found), 'Some expected variables not found in %s', infoin.Filename);
tout = tin(:, where); %reorder columns
tout.Properties.VariableNames = tin.Properties.VariableNames; %and set the name back to what they should be
end
and use tds instead of ds. Any read or readall operation with tds will automatically reorder the columns to their true location.
There will be a performance penalty to this since in effect each file is parsed twice, once by the internal datastore read and another time in reorderds by the detectImportOptions.

More Answers (1)

Roofus Milton
Roofus Milton on 31 Oct 2019
I would use a cell array populated by readtable then simply reference the colums by name.

Categories

Find more on Large Files and Big Data in Help Center and File Exchange

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!