Align Two Data Sets by Date and Location Where Both Data Sets Are Missing Data Points at Random Dates and Locations?

4 views (last 30 days)
Howdy. I have two data sets which I've read into two different arrays and sorted by the date and station of sampling. Dates are in column 1, are arranged in ascending order, and are formatted as a double that looks like yyyymmdd (in other words, May 5 2005 looks like 20050505). Column 2 contains the station at which that sample was taken. There are 7 stations, and at each time point the stations are arranged in ascending order from 1-7. The problem is at certain times/stations I have data in one data set but not in the other. This has resulted in one array with 517 data points (rows) and the second with 551 data points. I should also mention that the two arrays contain different numbers of columns (variables) and that my final goal is to look at correlations between the variables in each data set. In other words both matrices have the following form, but are missing certain (and different) dates/stations: [20050505, 1, data; 20050505, 2, data; 20050505, 3, data; 20050505, 4, data; 20050505, 5, data; 20060615, 1, data; 20060615, 2, data; 20060615, 3, data; 20060615, 4, data;], etc.
I've been trying to find ways to find the indices where the first two columns of the two arrays differ and then do one of two things: either 1. insert a row of NaNs where one array is missing a data point or 2. delete the data points in one array where there is no data for that date and station in the other. I have searched extensively for ways to solve this problem and have not found anything. I have also played around quite a bit with the "union" and "ismember" functions but have had no luck.
I would greatly appreciate any advice on other functions or code which may help accomplish this task.
Thanks!!

Accepted Answer

Peter Perkins
Peter Perkins on 26 Mar 2016
Dylan, if I understand your description correctly, this is a pretty classic example of an outer join:
First cook up some fake data like (I think) yours:
>> date = [1 1 1 2 2 2 3 3 3]';
>> station = [1 2 3 1 2 3 1 2 3]';
>> x = randn(size(date));
>> y = randn(size(date));
>> t1 = table(date,station,x,y,z);
>> z = randn(size(date));
>> u = randn(size(date));
>> v = randn(size(date));
>> t2 = table(date,station,u,v);
>> t1 = t1([1:4 6:end],:)
t1 =
date station x y z
____ _______ ________ _________ ________
1 1 0.03348 -0.83137 1.7119
1 2 -1.3337 -0.97921 -0.19412
1 3 1.1275 -1.1564 -2.1384
2 1 0.35018 -0.53356 -0.83959
2 3 0.02289 0.96423 -1.0722
3 1 -0.262 0.52006 0.96095
3 2 -1.7502 -0.020028 0.12405
3 3 -0.28565 -0.034771 1.4367
>> t2 = t2([1:6 8:end],:)
t2 =
date station u v
____ _______ ________ ________
1 1 -1.1201 -2.3299
1 2 2.526 -1.4491
1 3 1.6555 0.33351
2 1 0.30754 0.39135
2 2 -1.2571 0.45168
2 3 -0.86547 -0.13028
3 2 0.79142 -0.47615
3 3 -1.332 0.86202
t1 and t2 are missing one row each, in different places. Now join on the station and time:
>> t = outerjoin(t1,t2,'Keys',{'date' 'station'},'MergeKeys',true)
t =
date station x y z u v
____ _______ ________ _________ ________ ________ ________
1 1 0.03348 -0.83137 1.7119 -1.1201 -2.3299
1 2 -1.3337 -0.97921 -0.19412 2.526 -1.4491
1 3 1.1275 -1.1564 -2.1384 1.6555 0.33351
2 1 0.35018 -0.53356 -0.83959 0.30754 0.39135
2 2 NaN NaN NaN -1.2571 0.45168
2 3 0.02289 0.96423 -1.0722 -0.86547 -0.13028
3 1 -0.262 0.52006 0.96095 NaN NaN
3 2 -1.7502 -0.020028 0.12405 0.79142 -0.47615
3 3 -0.28565 -0.034771 1.4367 -1.332 0.86202
Hope this helps.

More Answers (1)

Kelly Kearney
Kelly Kearney on 26 Mar 2016
The ismember and unique functions, with the 'rows' option in both, should help:
data1 = [...
20050505 1 1
20050505 2 2
20060615 1 3
20060615 2 4
20060615 3 5
20060615 4 6];
data2 = [...
20050505 1 7
20050505 2 8
20060615 1 9
20060615 2 10
20060615 3 11
20060616 4 12];
unq = unique([data1(:,1:2); data2(:,1:2)], 'rows');
[~, idx1] = ismember(data1(:,1:2), unq, 'rows');
[~, idx2] = ismember(data2(:,1:2), unq, 'rows');
newdata = [unq nan(size(unq,1),2)];
newdata(idx1,3) = data1(:,3);
newdata(idx2,4) = data2(:,3);
Result:
newdata =
20050505 1 1 7
20050505 2 2 8
20060615 1 3 9
20060615 2 4 10
20060615 3 5 11
20060615 4 6 NaN
20060616 4 NaN 12
  1 Comment
Dylan
Dylan on 26 Mar 2016
Ah, I see. You've helped me find my mistake, I was doing something similar, but the unique function was not working for me. Apparently it does not recognize NaN's as equivalent and so was not eliminating identical rows. Thanks for the help!

Sign in to comment.

Categories

Find more on Cell Arrays in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!