Merging table rows, keep all columns

I'm trying to combine data from multiple tables into one. (data files attached). Seems like a simple join(), or outerjoin(), but every path has run into issues.
Specifically what I want to do:
  1. Add rows from table 2 to table 1.
  2. Keep all rows in both tables (append rows)
  3. Where column names match, use that column
  4. Where columns are new, add column to table width
  5. Keep column names (outer join is renaming based on source table)
  6. Some table values are empty and should combine as empty values in existing and/or new columns as needed.
Tried so far:
  1. Join - Fails do to some empty values
  2. Join w/Replaced nan - fails do to some other key value error
  3. outerjoin() w/multiple configuration options - all failed.
  4. innerjoin90 - does not seem like what I want (throwing out data).
When done combining the attached tables there should be slight more columns than the first table, and rows should be the sum of rows in both tables.
This should be a common issue so assuming I am missing some simple solution...?
Using Matlab 2016b
Marc

6 Comments

Adam Danz
Adam Danz on 13 Nov 2019
Edited: Adam Danz on 13 Nov 2019
A small example of table1 and table2 with column names and an example of output table3 would really solidify the goal.
Also, attaching data is almost always helpful. But in order to reduce the time investment of volunteers, it is also helpful to provide the code you're using to read in the data or to just provide a mat file with the data already present.
Thx for the feedback. Input data tables are attached to the original post. They are wide - creating a sample output manually does not make sense, so I was explicit in the goals instead. Code is currently:
TResults = join(TResults,T);
%TResults = [TResults;thisResult];
Neither of these work, and I have tried many command line options as well (not available to paste/include here).
Adam Danz
Adam Danz on 13 Nov 2019
Edited: Adam Danz on 13 Nov 2019
Creating a sample output doesn't mean it must match your data's dimensions.
Example input-outputs are often helpful in supplying a minimal working example.
The code you supplied doesn't read in the data. The reason why it's important to supply the code you're using to read in the data is 1) if we do that it may not match what you're doing and the answer may therefore be less helpful and 2) it causes us to invest a lot more time than what is needed to get to the main problem.
Here is the code which cycles through the files:
thisPath = uigetdir(pwd);
fnames = dir(fullfile(thisPath, '*.csv'));
for n = 1:size(fnames,1)
fname = fnames(n);
sName = fname.name;
if(strfind(sName, 'RESULTS'))
disp(['Processing Result File: ',sName]);
T = ParseThisResultFile(thisPath, sName);
T = fillmissing(T,'constant',0,'DataVariables',@isnumeric);
if(~isempty(TResults))
TResults = join(TResults,T);
%TResults = [TResults;thisResult];
else
TResults = T;
end
end
The function below has been tested to verify it is returning a table with correct columns/data as expected.
function [resultSet] = ParseThisResultFile(thisPath, sName)
% Format name and get CSV file to local parameter
thisFName = fullfile(thisPath, sName);
thisFName = strrep(thisFName,'.CSV','.csv');
T = readtable(thisFName,'Delimiter',',');
resultSet = T;
end
I've read-in your tables and the column names match between both tables. Points 3 and 4 in your question (thanks for the numbering - that makes this easy to discuss) mention column names that do not match. Are there supposed to be column names that do not match?
I should add that upon reading in your table, Matlab had to modify some of the column names to conform to Matlab syntax.
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table.
The original column headers are saved in the VariableDescriptions property.
Set 'PreserveVariableNames' to true to use the original column headers as table variable names.
files = {'RESULTS_SAMP1.CSV', 'RESULTS_SAMP2.CSV'}; %Full paths are always better
T1 = readtable(files{1},'Delimiter',',');
T2 = readtable(files{2},'Delimiter',',');
% Do column names match?
all(ismember(T1.Properties.VariableNames, T2.Properties.VariableNames)) % Yes
all(ismember(T2.Properties.VariableNames, T1.Properties.VariableNames)) % Yes
Tried fixing names first with 'PreserveVariableNames', but this did not work. "No public property PreserveVariableNames exists for class matlab.io.text.DelimitedTextImportOptions." Lesser issue compared to others.
I randomly selected two files and they were giving me merging errors so I thought those had different columns. Some of my data DOES include differences; we can simulate that by deleting the 3rd column int he first table, and 5th column in the second table. (does not matter which we delete, just making them different). What join command will combine these tables keeping all rows, and adding columns as needed to match the data? In some cases there will be missing columns which should be stuffed with empty cells.

Sign in to comment.

 Accepted Answer

Adam Danz
Adam Danz on 13 Nov 2019
Edited: Adam Danz on 13 Nov 2019
% Read in the data
files = {'RESULTS_SAMP1.CSV', 'RESULTS_SAMP2.CSV'}; %Full paths are always better
T1 = readtable(files{1},'Delimiter',',');
T2 = readtable(files{2},'Delimiter',',');
% Simulate column-mismatch
T1 = removevars(T1,'SpecimenType'); % remove col 3
T2 = removevars(T2,'Test'); % remove col 5
% Vertically concatenate tables
T3 = outerjoin(T1,T2,'MergeKeys', true)

4 Comments

Closer!
The removevars did not work (I'm on 2016b). I replaced that with:
T1.SpecimenType = [];
T2.Test = [];
The code worked on the two files we are looking at, but failed on the first file when I went back to the original set of results files (the sample files were slightly modified for posting).
When I run the original files the headers are corrupted; sample below is T(1:3,1:3). If I change anything in the file, for example T(1,3) = 'Whole Blood1' and save the file, headers load OK. In other words just resaving the CSV file corrects what ever is corrupting the header import.
Any ideas?
x00001995__ A081099___Whole Blood___Not
___________ _______________ _____________
'00001994' ' A066029' 'Whole Blood'
'00001993' ' A066027' 'Whole Blood'
'00001992' ' A066022' 'Whole Blood'
And another... in the original files they use odd encoding such as shown in this series: 8.58 .868L 5.03 2.20L 5.2, or they may append an X to invalidate a value (not my schema). So some files open with columns as double, others the same column as cells... the merge breaks on these.
If there is a parameter to deal with this let me know. Otherwise, thanks for the help and you answered the original question with the code you provided.
Glad I could help out.
Just so I understand, the problem you're describing isn't with the merging of tables, it's with importing the tables. Is that correct?
Have you tried importing the tables without using the PreserveVariableNames flag?
Could you attach one of the files causing problems?
The problem is I need to sterilize the data for posting, and as soon as I make any change and save the file it works. There is something hidden in the original CSV files which is corrupting the importing. Unfortunately I cannot upload these files without modification.
I think I tried the PreserveVariableNames flag which was unknown in 2016b. Not using it now.
I'm going to close teh thread - thanks for your help!

Sign in to comment.

More Answers (0)

Products

Asked:

on 13 Nov 2019

Commented:

on 14 Nov 2019

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!