Extracting data from from all rows aside from those that contain NaN?

How would I go about extracting data from one column based on the criteria of the other column?
For example, in column A, I have a few NaN values, after setting up my own criteria for this column. Column B has a its own values from randn values.
I now want to calculate the average of column B, but only rows that have a corresponding value in column A. Thus, if any value in column B corresponds to a NaN value in column A, I want to exclude that value from my average.
I'm tinkering around with the find function, but no luck yet.

 Accepted Answer

select = ~isnan( A ) ; % Vector of logicals, true at location corresponding
% to non-NaN elements of A.
values = B(select) ; % Logical indexing of B => extract elements of B
% which correspond to non-NaN elements of A.
Then you apply whatever you want to values. Note that there are functions which can make this selection for you, e.g. NANMEAN from the Stat. toolbox.
If you need these values for a one shot operation (which means if you don't have to reuse them), you can skip intermediary variables/steps. For example, fr sum these elements with a one-liner, you can do
sum( B(~isnan(A)) )

6 Comments

I tried using the NANMEAN function, but the problem I ran into was the fact that I wanted corresponding values to be taken out, as in rows that corresponded to NaN values, I didn't want those to be included in my average.
A B
2 0.1
3 0.5
NaN 0.6
5 0.2
So I want the average of B, but taking out those rows that correspond to a NaN value in column A, if that helps.
I'm confused as to the second part of your answer. I understand the first part, where we can see where in the matrix are there non -NaN values, I'm just confused as to the second part.
I ended up just piecing together something I'm sure is next to useless:
P = find(isnan(Z))
Z(isfinite(Z(:, 1)), :) = O
mean(O(:,2))
It seemed to have worked, but I'm new to programming, so I'm sure there are a plethora of easier ways.
I'm still trying to figure yours out since it seems much more convenient.
Well, you are right that NANMEAN doesn't apply to your situation, my mistake. What you wrote is far from being useless; it's actually what we all do in the beginning and it's a good thing that you test these approaches as well. What you are missing is just a type of indexing called "logical indexing", which makes everything simpler and more efficient. Here is an example:
>> A = [1, 2, NaN, 5, NaN, 4] ;
>> isnan( A )
ans =
0 0 1 0 1 0
>> class( ans )
ans =
logical
Here you can see that the output of ISNAN is a vector of logicals (class or type), whose elements true (1) or false (0) are the logical outcome of the test "is nan" applied to each element of A. The first thing that comes to people's mind when they get that is to find the position of NaNs in the vector, as follows:
>> find( isnan(A) )
ans =
3 5
which indicates that elements 3 and 5 of A are NaNs. Now as we already evaluated isnan(A) (which was the array of logicals), you see that FIND is actually not "searching for NaN values", but simply returning a list of positions where its argument is not false/0.
Now you could use these positions for indexing, e.g.
>> pos = find( isnan(A) )
pos =
3 5
>> A(pos) % Just to be sure that we spot the NaNs.
ans =
NaN NaN
Or use the same positions in another vector..
>> B = [11, 12, 13, 14, 15, 16] ;
>> B(pos)
ans =
13 15
But there is better/simpler to do than using positions. MATLAB can use directly a vector of logicals for indexing (and return elements where logical elements are true):
>> lid = isnan( A )
lid =
0 0 1 0 1 0
>> A(lid)
ans =
NaN NaN
>> B(lid)
ans =
13 15
which is the same as before, except that we spare the extra step of FIND-ing positions and to use them for indexing. Also, taking all other elements is simpler when you have a logical index than when you have positions, because you can take the negation of the vector of logicals and you don't have to compute all other positions.
>> ~lid % "not" applied to the logical elements of lid.
ans =
1 1 0 1 0 1
>> B(~lid)
ans =
11 12 14 16
(actually, if you really wanted positions, you would use find(~isnan(A)), rather than trying to compute these positions)
So this is what I am doing in the second part of my answer, except that I don't create an intermediary variable lid but I use directly the output of ~isnan(A) to index B:
values = B( ~isnan(A) ) ;
theMean = mean( values ) ;
or, directly
theMean = mean( B(~isnan(A)) ) ;
Thanks for the great detail, really appreciate it.
Since I'm using one matrix (Z) which is [a b], in your last step where you have the B and A variables, would I provide the column number within it?
theMean = mean( B(:,(~isnan(:,A)) ) ;
I compared the answer from your method to the one listed, with both being the same. Thanks for providing the simpler and more efficient way.
In that case, Z(:,1) is what is named A above and Z(:,B) is B. So you can work with Z as follows:
lid = isnan( Z(:,1) ) ;
values = Z(lid,2) ;
theMean = mean( values ) ;
or
values = Z(~isnan(Z(:,1)), 2) ;
theMean = mean( values ) ;
or
theMean = mean( Z(~isnan(Z(:,1)), 2) ) ;
------
A
-------------
lid
--------------------
values
Okay I'm just trying to figure out the final line provided, I just want to break it down piece by piece.
So isnan returns values in the matrices based in a true.false or 0.1 manner.
In the first line
lid = isnan( Z(:,1) ) ;
should it be ~isnan?
so in the first line, you're finding the variables in matrix Z, column 1, that are not equal to NaN(or the false value), and equaling this to lid
in the second line, values is your Z matrix in correspondence to the Lid row number, in the 2nd column
then since you have your appropriate rows in the 2nd column, you're just then finding the average
I think I got it, I'm just trying to figure out whats happening between the first and second line. I'm confused as to why after the first function, which recreates the array in true/false or 1/0s, how the second line then only finds the values of those that correspond to the true/1's rows of the column.
When you create variables with logical expressions (such as lid = isnan(Z(:,1)); and then when you nest that in another line of code (such as values = Z(lid,2)), that second line of code will only act on the desired logical expressions (in this case the 1's or where there are no NaN values)?
Thanks again for the response, this was tremendous help in learning this proces.
Most of what you say is correct, and you spotted a typo! :-) You are right, I should have written either
lid = isnan( Z(:,1) ) ;
values = Z(~lid,2) ;
theMean = mean( values ) ;
or
lid = ~isnan( Z(:,1) ) ;
values = Z(lid,2) ;
theMean = mean( values ) ;
As you explain, both are somehow the lengthy way to perform the same thing as the more compact
theMean = mean( Z(~isnan(Z(:,1)), 2) ) ;
MATLAB evaluates the most internal expression, and pipes the output in its container (don't know if this "pipe" analogy will help), and so on. When the large expression above is evaluated, MATLAB does the following roughly..
Evaluate: Z(:,1) (innermost)
|
| put the result in (or pipe)
v
Evaluate isnan( )
|
| put the result in (or pipe)
v
Evaluate ~( )
|
| put the result in (or pipe)
v
Evalaute Z( , 2)
|
| put the result in (or pipe)
v
Evaluate mean( ) (outermost)
Internally, MATLAB computes inner expression and creates intermediary arrays with results, that it passes to outer expressions. Now if you think that it makes the code clearer to make these intermediary steps by yourself and build your own intermediary arrays, you are free to do so. This is what I did when I built these examples with intermediary steps. The second reason for creating intermediary variables by yourself is when some intermediary variable could be reused elsewhere. Say, for example, that you have A with NaNs, and you want to address B, C, D, .., Z at locations where A is not NaN. You can do something like
meanB = mean( B(~isnan(A)) ) ;
meanC = mean( C(~isnan(A)) ) ;
meanD = mean( D(~isnan(A)) ) ;
..
but it is inefficent because ~isnan(A) is recomputed each time. In such case, it is more efficient to compute it once only and reuse it:
lid = ~isnan( A ) ;
meanB = mean( B(lid) ) ;
meanC = mean( C(lid) ) ;
meanD = mean( D(lid) ) ;
..

Sign in to comment.

More Answers (0)

Asked:

on 8 Oct 2013

Commented:

on 8 Oct 2013

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!