How do I create 2 separate matrices which are pair-matched to each other in the corresponding row from 2 original CSV files?

Question

Yen Yi Tan on 28 Aug 2019


Commented: Bob Thompson on 30 Aug 2019

I currently have two datasets of samples in two separate CSV files on which I want to run a pair-matched medical network analysis. Both datasets have the same medical variables but different numbers of samples. For example, my first dataset contains 22 samples of ill patients with 18 medical variables. My second dataset contains 109 samples of recovered patients with the same 18 medical variables.

I want to be able to pair-match these samples yet keep them intact in their individual datasets so I can run individual network graph analyses on them. So, ideally my end result would be: my first dataset would contain 22 samples (ill), and my second dataset would contain 22 samples (recovered). The samples in row A of dataset 1 and dataset 2 would be matched by a variable (e.g. variable X is the same in both), the samples in row B of dataset 1 and dataset 2 would be matched by that same variable, and so on.

I’ve written out the logic of code, but I’m having trouble arriving at the actual code (novice issues):

Import ill dataset (22 samples) as a matrix of 22x18, name dataset “A”.
Import recovered dataset (109 samples) as a matrix of 109x18, name dataset “B”.
The variable to be pair matched is column 6.
Select ID1 (row 1 - let’s assume each row is identified by an individual identifier tag for this logical argument) of A and compare column 6 variable against ID1-ID109 column 6 variable of B.
If A-ID1-6 = B-IDX-6 with only a single return (where IDX is the row returned with the matching column 6 variable), then place A-ID1 in row 1 of C (a new 22-row NaN matrix for pair-matched ill) and place B-IDX in row 1 of D (a new 22-row NaN matrix for pair-matched recovered), removing B-IDX from the original matrix.
If A-ID1-6 = B-IDX-6 with multiple returns, then place A-ID1 in row 1 of C, randomly select one of the returned B-IDX rows and place it in row 1 of D, removing that B-IDX from the original matrix.
If A-ID1-6 = B-IDX-6 with no returns, then select the closest B-IDX-6 within +/-0.5 of the column 6 variable, placing A-ID1 in row 1 of C and B-IDX in row 1 of D, removing B-IDX from the original matrix. If there is no +/-0.5 match for the variable in column 6, leave a row of NaN in both C and D and remove both samples from the original matrices.
Loop for A-ID2 to A-ID22 until C and D have 22 rows each which are pair-matched.
These new data tables will then be individually used to generate network maps.
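The "closest within +/-0.5" step in the logic above could look something like the following. This is only a minimal sketch on made-up data: the values in A and B, and the names tol, d, idx and matchRow, are hypothetical and not from my real files.

```matlab
% Hypothetical toy data: column 6 holds the matching variable.
A = zeros(1, 18);  A(1, 6) = 31.8;                  % one ill sample
B = zeros(4, 18);  B(:, 6) = [30; 31.6; 32.2; 35];  % four recovered samples

tol = 0.5;                       % tolerance from the logic above
d = abs(B(:, 6) - A(1, 6));      % distance of every row of B from the A value
d(d > tol) = Inf;                % rule out rows outside the tolerance
[dmin, idx] = min(d);            % closest remaining candidate (first one on ties)

if isinf(dmin)
    matchRow = NaN(1, size(B, 2));   % no candidate within tolerance: NaN row
else
    matchRow = B(idx, :);            % closest candidate within tolerance
    B(idx, :) = [];                  % remove it so it cannot be matched again
end
```

Here the second row of B (31.6) is the closest value within the tolerance, so it is returned and removed. An exact match has distance 0 and would be picked the same way, but this sketch does not randomize between several equally close rows.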

I do apologize for the lengthy explanation. It is frustrating as I can see the logic but can’t outright turn it into code that works. Please don’t hesitate to ask if there is lack of clarity in any area and thank you in advance for anyone who can help me out.

4 Comments

Yen Yi Tan on 28 Aug 2019

For multiple returns, it would be key that the selection should be random. Ideally I’m trying to create a system that will select at random instead of just selecting the first match or the first closest match to prevent bias.

However, any progress is progress and if it takes getting just the first match working initially, I’m all for it.

Yen Yi Tan on 28 Aug 2019

Why I’m removing the sample from A and B after running it is to prevent duplicate samples. For example if I pair-match a patient from B to another patient in A, I don’t want the patient in B to be appearing twice in D, which would skew my results in the network graph.

However, that being said, it’s not necessarily a removal of the sample that has already been run. If there’s any other way to exclude it that would also solve the issue.
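One way to exclude a sample without deleting it might be a logical "used" flag per row of B. A rough sketch, assuming made-up values (B here is just a stand-in for column 6 of the recovered dataset, and used, targets and picked are hypothetical names):

```matlab
B = [10; 20; 20; 30];               % hypothetical matching values (stand-in for B(:,6))
used = false(size(B, 1), 1);        % becomes true once a row of B has been paired

targets = [20; 20; 20];             % hypothetical values from A to be matched
picked = NaN(numel(targets), 1);    % row of B chosen for each target, NaN if none

for i = 1:numel(targets)
    candidates = find(B == targets(i) & ~used);    % only unused rows qualify
    if ~isempty(candidates)
        k = candidates(randi(numel(candidates)));  % random choice among them
        picked(i) = k;
        used(k) = true;                            % exclude this row from now on
    end
end
```

The first two targets get rows 2 and 3 of B in random order, and the third gets NaN because both matching rows are already flagged. B itself keeps all its rows, so the original row numbers stay valid.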


Answer 1

Bob Thompson on 28 Aug 2019


This is a first cut at how I would set up the loop and arguments.

for i = 1:size(A,1)
    tmp = B(B(:,6)==A(i,6),:);
    C(i,:) = A(i,:);
    if isempty(tmp) % No matches
        tmp = B(B(:,6)<=A(i,6)+0.5 & B(:,6)>=A(i,6)-0.5,:); % Expand range of check
    end
    if size(tmp,1)==1 & ~isempty(tmp)
        D(i,:) = tmp;
    elseif size(tmp,1)>1
        r = randi([1:size(tmp,1)]);
        D(i,:) = tmp(r,:);
    end
end

If you run your check, return no results, and don't get any results after the relaxed conditions you should end up with a row of NaNs.

I have not tested this, so there may be some slight errors. Feel free to debug as necessary.

6 Comments

Yen Yi Tan on 30 Aug 2019

Edited: Yen Yi Tan on 30 Aug 2019


Hi Bob,

I have just retrieved my laptop from repairs and I am up and functioning again. Thanks for answering all my queries, it helps my understanding of the code syntax tremendously.

I have run the code through using a sample of my data:

clc
close all 
clear all
tic 
B = importdata ('BDS_3MS_NHCC15.csv'); %ill dataset with 15 samples
A = importdata ('BDS_3MNS_NHCC15.csv'); %recovered dataset with 121 samples
toc
for i = 1:size(A,1)
    tmp = B(B(:,6)==A(i,6),:);
    C(i,:) = A(i,:);
    if isempty(tmp) % No matches
        tmp = B(B(:,6)<=A(i,6)+0.5 & B(:,6)>=A(i,6)-0.5,:); % Expand range of check
    end
    if size(tmp,1)==1 & ~isempty(tmp)
        D(i,:) = tmp;
    elseif size(tmp,1)>1
        r = randi([1:size(tmp,1)]);
        D(i,:) = tmp(r,:);
    end
end

This returns the following in the workspace:

C = 5x15 double

D = 1x15 double

i = 5

tmp = 5x15 double

Along with the following error messages:

Error using randi: First input must be a positive scalar integer value IMAX, or two integer values [IMIN IMAX] with IMIN less than or equal to IMAX.
Error in PMtest (line 17): r = randi([max(tmp(:,2)),min(1:size(tmp,1))]);

I've checked C and D, so far the sample in D(32) is matched with the sample in C(31.95), so something has definitely gone right there. The next 4 lines of C are the samples that follow after, so that's going right too. However, there are no corresponding returns to D, even when it's supposed to return a tmp that is a row of NaNs.

There are a few questions I'm thinking about as well:

If i = 5 in the workspace, and the loop defines i as i = 1:size(A,1), does that mean that the loop is only registering 5 rows, due to the error or otherwise?
I also see tmp is a 5x15 array - I understand it as whatever the second result has returned, there are currently 5 possible matches for it?
I can see what you're trying to do with randi in terms of selecting a random row in the tmp selection returned under the criteria, but with the error message must we change the syntax into a defined range of something like 2 row tmp to max row tmp? I've looked up some troubleshooting for it in terms of the code below.

rand_range=[max(A(:,nn)),min(A(:,nn))];
rand_range=[ceil(rand_range(1)) floor(rand_range(2))];
%if rand_range=[0.1 0.8]; this rounds it to [1 0], so sort to make sure the order is correct
rand_range=sort(rand_range);
randomfor(:,nn) = randi(rand_range);

Many thanks.

Bob Thompson on 30 Aug 2019

i = 5 because that is the iteration where the error crashed the code. If you place a debug marker somewhere within the code and run it until it stops at the debug marker each time through the loop it will return a progressively higher integer. The for loop doesn't define i as the range of values from 1:x, but rather as a specific value within that range for each iteration of the loop.
For A(5,6) there are five rows of B which return the same values, hence why tmp is 5x15. During the next iteration it would likely be a different size, perhaps 1x15, or [ ].
Yes, per the error you need to change the range I defined to just min and max. The current min and max values are fine, you just need to set them as two parts to a matrix.

r = randi([1,size(tmp,1)]); % Just replace : with ,
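For reference, a small sketch of the two input forms randi accepts for this purpose (n is a hypothetical stand-in for size(tmp,1)):

```matlab
n = 5;                 % hypothetical number of candidate rows, e.g. size(tmp,1)
r1 = randi(n);         % scalar form: random integer from 1 to n
r2 = randi([1, n]);    % two-element form [IMIN IMAX]: same range
% randi(1:n) would pass the 5-element vector [1 2 3 4 5] as the first input,
% which is neither a scalar nor an [IMIN IMAX] pair, hence the error message.
```

This would also explain why the original line only failed once tmp had more than two rows: 1:2 happens to be a valid [IMIN IMAX] pair.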

Glancing things over, I suspect D is only 1x15 because A(2:4,6) found no matches, even with the expanded range check. Because of this and the second condition check, no value is actually written to D for those rows. The row would be NaN if the matrix had been preinitialized, but it doesn't look like you did that, so there are no results written. I would suggest preinitializing both C and D.
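Putting the pieces together, a version with preinitialized C and D might look roughly like this. It is an untested-on-your-data sketch using made-up values for A and B, and it assumes you want unmatched rows to stay NaN in both C and D, as described in the question. Like the first cut, it picks randomly among everything inside the +/-0.5 window rather than the single closest row.

```matlab
% Hypothetical toy data; column 6 is the matching variable.
A = zeros(3, 18);  A(:, 6) = [32; 40; 50];
B = zeros(5, 18);  B(:, 6) = [32; 32; 40.3; 60; 61];

C = NaN(size(A));                    % preinitialized, so unmatched rows stay NaN
D = NaN(size(A, 1), size(B, 2));

for i = 1:size(A, 1)
    idx = find(B(:, 6) == A(i, 6));                  % exact matches first
    if isempty(idx)
        idx = find(abs(B(:, 6) - A(i, 6)) <= 0.5);   % relaxed +/-0.5 check
    end
    if ~isempty(idx)
        k = idx(randi(numel(idx)));                  % random pick among candidates
        C(i, :) = A(i, :);
        D(i, :) = B(k, :);
        B(k, :) = [];                                % prevent reuse of this sample
    end
end
```

With this toy data, row 1 gets one of the two exact matches, row 2 gets the 40.3 sample through the relaxed check, and row 3 stays NaN in both C and D because nothing in B is within 0.5 of 50.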

Yen Yi Tan on 30 Aug 2019

Hi Bob,

Ah, so the i here is picking at the specific value of column 6 with every iteration that passes.
Got it, I fully understand why and how having a temporary variable that resets with each loop works.
Ah, so 1:size(tmp,1) was the right range to begin with, it was just the syntax that was slightly incorrect for the randi input.

I've re-run the code and preinitialized C and D as you suggested. It didn't occur to me that the NaN matrix should be the default as the logic replaces a no return with whatever is in tmp, which isn't NaN by default.

You have been such an amazing help with my issue. Thank you for being so patient with me. If I may ask, are there any specific resources you can recommend for getting better at MATLAB? I'm using it in a project-focused manner, e.g. only using tools and syntax as required, but I can imagine it helps a fair bit to have some basic foundation. I still struggle with syntax and understanding how to make the code flow, even if I can see the logic clearly.

Bob Thompson on 30 Aug 2019

The two things that have helped me be better at MATLAB, besides projects, are being active on these forums, and 'Cody' here on the MathWorks website. The first has introduced me to a number of new concepts, or better ways of doing what I already know, while the second has offered me a wide variety of small challenges that I can learn how to solve without having to jump into some major project.

Other than that, reading the documentation, coupled with the forums, has helped teach me the vocabulary that the MATLAB world tends to use, which makes it much easier to look at the documentation for new functions and be able to pick them up quickly.
