count the change between a row and the next, in a column

I have a series of data in a column, and I want to count how many times the data, between one row and the next, change its content (my column is a categorical column). Below, I put an example:
Here, I have 4 changes. The dimension of my timetable is 78397x5. How can I do? Thanks in advance for the help.

 Accepted Answer

For categorical array c
chararr = cellstr(c);
num = cellfun(@(x) regexp(x,'(\d)*$','match','once'), chararr,'un',0);
nchange = nnz(diff(str2double(num)));

7 Comments

Note that this method only works when there are numbers in the category name, only one number per category name, and those numbers are the only thing that changes between category names (as in the example provided in the question).
This method doesn't generalize to any type of category name.
For example, it would fail under these types of category names (showing unique categories only):
  • {g1, g2, b1, b2}
  • {g, g2, g3, g4}
  • {g1b1, g1b2, g2b1, g2b2}
  • {a, b, c, d}
Adam, that's correct. How can I provide a solution based on category names which are not known? The assumption is, yes, that the digits occur at the end of the names hence the $ anchor in the regular expression. I can only work with the data Erica provided.
It does works well with her categories as long as all of the categories follow that format. I just wanted to add that note in case others are looking for similar solutions with different category types.
Thanks for the answers. Paolo's code works well, but I realized that the column concerned (Column1streamId) contains, occasionally, different writings from 'dbuid-xx', such as 'rule-personleaving' or others. Adam's code, I can not get it to work. I attached my timetable to better understand my problem. Furthermore, if I wanted to know how many times there is a change from dbuid-20 (which is the PIR sensor of the bedroom) to dbuid-17 (which is the PIR sensor of the bathroom) and calculate the average time spent in the bathroom, as I can do?
For a string such as 'rule-personleaving', what is the required behaviour?
Let's say the sequence is
{'dbuid-20'}
{'rule-personleaving'}
{'dbuid-20'}
Do you want to ignore the middle string because it has no numerical information, or do you wish to consider it? For this example, are there zero changes (middle string is ignored) or two (middle string is considered)?
How is the average time spent defined? Does every occurance of 'dbuid-20' indicate a time period, i.e. 20 minutes spent in bathroom?
I want to see how many times my patient gets out of the bed room at night (whose PIR sensor corresponding to the bedroom is the dbuid-20) and walks around the home. Once counted the times when it comes out of the bedroom, I want to see how many times it goes to the bathroom (which is the dbuid-17) and the average time of stay. For this count, I can ignore rows that do not contain dbuid-XX. the PIR sensor generates a row of my table every time it detects a movement, every 3 seconds. To calculate the average time spent in the bathroom, I have to consider the consecutive rows in which dbuid-17 is present. You consider that I have a timetable, so the time for each line is expressed as date and HH: min: sec. I have to do this calculation every time I have successive rows of dbuid-17
Erica, what didn't work with my solution? Do you have access to grp2idx()?
I updated my solution to use your tab_night data as input rather than my fake data (the algorithm didn't change). I also added the solution to count the number of changes between _20 and _17. Note that this solution should work on any and all category types and is not dependent on the naming if the category.
For you data, there were 12468 total changes in category and there were 927 changes from _20 to _17 (which is a lot of trips to the bathroom).

Sign in to comment.

More Answers (2)

A solution is to convert your categorical data to numerical using gr2idx() if you have the statistics toolbox. Then you can differentiate the numerical array and count how many times the differentiation isn't zero (ie, count the number of times there is a change). Note that this solution will work with any category names. Using the data you attached, here's how to count the number of changes:
% Copy list of all categories
cats = tab_night{:,1};
% Convert to numerical values
n = grp2idx(cats);
% Alternatively, as @Guillaume pointed out: n = double(c);
% now differentiate and count the number of changes
count = sum(diff(n)~=0);
If you want to know the index value where the change occurs, these are the row numbers:
find(diff(n)~=0)+1
If you want to count the number of times the category changes from ..._20 to ..._17 here's how to do that:
% Count changes from dbuid-20 to dbuid-17
% First we need to find which values of 'n' correspond to the 2 categories requested
dbuid17 = n(find(cats == 'dbuid-17',1)); %=4
dbuid20 = n(find(cats == 'dbuid-20',1)); %=7
% now we need to count the number of times n changes from 7 to 4
% 4 minus 7 is -3 so if we differentate n we just need to count the number of '-3's
% along with the number of times it changed to category _17.
nDiff = diff(n);
changIdx = nDiff==(dbuid17-dbuid20) & n(2:end)==dbuid17;
nChanges = sum(changIdx);
Finally, to confirm that everything worked well, use the 'changIdx' to look at which categories were flagged as a change from _20 to _17:
cats(find(changIdx)) %Should all be dbuid-20
cats(find(changIdx)+1) %should all be dbuid-17
If you don't have access to the grp2idx function this won't work. Otherwise, it works for all category types.

4 Comments

Thank you very much Adam, now the code works great! If I wanted to calculate the average time spent in the bathroom, how can I do?
Good! I'll need to understand your data before I could help you calculate the mean. The first two entrees of your data are
13-Mar-2018 06:09:17 dbuid-20
13-Mar-2018 06:09:11 dbuid-20
Does that mean a sensor #20 was signalled at those two times? Is one of them an entry time and other an exit time (assuming the sensor is on the bathroom door which has 1 entrance). In short, you need to be able to identify if the time stamps are entry or exit for each sensor.
Would the column1type data help with that?
unique(tab_night.Column1type)
alarm/nobedreturn
alarm/personleaving
bed/in
bed/out
door/closed
door/open
motion/on
rule/active
rule/inactive
The PIR sensors detect the movement of the person and send the signal every 3 seconds: this means that every row of my table represents the moment when the PIR sensor has detected movement; it will send a new signal after 3 seconds (even if the person keeps moving in the room within 3 seconds). For the average time calculation, column1type data does not help.
Hi Erica, you've got a nice chunk of work to do on this and it's an interesting project so I hope you succeed. I'll break down the steps you need to do but there will inevitably be additional bumps along the way.
  1. You need to determine when someone is in room X. You should be able to use my code to get you started on that. For example, you can identify any time someone is in the room 'dbuid20' by the logical vector created by n==dbuid20. This will give you a string of [0 0 0 1 1 1 1 1 0 0 0 0 ....] where the 1s indicated presence in the room.
  2. Using that logical vector, you need to figure out when each string of 1s start and end. For example, [0 0 0 1 1 1 1 0 0 0 1 1 0 0] in that string the ones start at [4,11] and end at [7, 12]; Hint: if the logical vector is named 'inRoomX', you can use the output of diff(inRoomX) which will have a 1 or -1 for entering or leaving the room. However, you may need to clean up the data in cases where there is only 1 sample within the room (ie, the visit was <=3 seconds).
  3. What you ultimately want is two logical vectors: one that marks room entry and one that marks room exit. Once you have that, you can pull the corresponding time stamps from your data and calculate the amount of time in the room. Follow this example below.
% Create fake exitRoom and enterRoom logical vectors. In each of them, 1 indicates enter or exit.
exitRoom = false(size(tab_night.t));
exitRoom(1:1000:50001) = true;
enterRoom = false(size(tab_night.t));
enterRoom(400:1000:50400) = true;
% Get enter times
EnterTime = tab_night.t(enterRoom);
% Get exit times
ExitTime = tab_night.t(exitRoom);
% Subtract the two, assuming that all exit times come after enter times
% 'timeInRoom' should all be positive!! If not you have problems.
timeInRoom = ExitTime - EnterTime;
%How much total time in room?
sum(timeInRoom)

Sign in to comment.

Hum, unless I'm very mistaken, wouldn't the simplest solution be to convert the array to double, then nnz(diff()) that?
%build demo array
c = categorical({'dbuid-20'; 'dbuid-17'});
c = c(repelem([1 2 1 2 1], [7 3 5 3 1]))
%result
ndiff = nnz(diff(double(c))) %all done!
No need for the statistical toolbox and works with all categorical arrays.

1 Comment

Yes, double(c) is more simple than grp2idx(c) and produces the same results. Good call.

Sign in to comment.

Categories

Asked:

on 28 Jun 2018

Edited:

on 29 Jun 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!