excel table data analysis

Question

0 votes

Hi guys

My data is (are) in an excel table 20000 by 3. I have a list of 358 patients. There are 48 different bacteria that I am looking for. each patent is tested to see if a bacteria is found in their blood. The day each test is conducted is recorded.

1) when was the first day the patient tested positive (regradless of bacteria type) (ex:first time patient A was positive is day 2)

2) List of all the bacterias that have been found for each individual patient.( ex: Patient A shows bactesias:BAC1 BAC2 BAC3)

3) When was the first time each patient tested possitive for each type of bacteria.(ex: First time BAC1 was found is day 4)

4) In what days was each patient exsibiting each bacteria (ex: BAC 1 was forund for patient A in days 300, 4)

The table is in the following format:

  patient   day they were tested         bacteria found  
  A                   300                                    BAC1 
  A                    2                                       BAC2 
  A                    4                                       BAC1 
  A                    8                                       BAC3 
  B                   66                                      BAC5 
  B                   55                                      BAC1 
  C                  208                                     BAC2 
  C                  77                                       BAC2 
  C                  51                                       BAC9 
  D                 90                                        NAN 

I have been struggling with this for a while. I would appriciate any input.

Please let me know if clarifications are needed. Thanks so much for the help in advance!

2 Comments
Show None Hide None

dpb on 10 May 2019

Is the first time really 300, then 2, or is that a typo or are there multiple years in the dataset and the times are day of year?
How do you get 2 for 1.? Looks to me like was 300--there's something reported for first time in the dataset (BAC1).

With a little clarifcation on what the dataset parameters are, doesn't look terribly complex. Convert the Patient and Bacteria to categorical variables, put in a table and use findgroups to identify patients...the when was the first? questions are simple lookups but again what the time range and definition is may muddle the logic if there's wraparound as looks like might be will need to fix that issue first.

The "how many days?" questions are also simple lookup; with categorical data matches are exact so ismember or similar will answer those trivially.

ALDO on 10 May 2019

Thanks for your fast responce!

1)'day they were tested' is the day they took the test so, day 0 is the day the patient was admited to the hospital and day 300 is the day they took the test. the tabel order doesnt represent which test was taken first.

2)the answer is 2 because they first time the patient tested positive was on day 2. so out of day-300 day-2 day-4 day-8. the smallest would be 2. the smaller the number the sooner the test was taken.

I did the categorial variables and find groups as you said. This is what i have so far as my code i know i have overcomplicated things but This is my first time working with tables so feel free to point out my mistakes since i am not familiar with what to do:

Table=table(Pateint, Day_of_Culture, Organism,'VariableNames',{'patient' 'day' 'Organism'});

Table.Day_of_Culture=categorical(Table.Day_of_Culture);

Table.Organism=categorical(Table.Organism)

patients=unique(patient);

orgnz=unique(Organism);

G = findgroups(Table.patient);

how_many_pateints=max(G);

func=@(varargin) {table(varargin{:}, 'VariableNames', Table.Properties.VariableNames)};

output = splitapply(func, Table,G);

result= zeros(357,48);

for i=1:how_many_patients

[C,ia,ic] =unique(output{i,1}.Organism); %Unique Organisms found for each %patient

Organism_for_each_patient{i}= C;

for j=1:length(C)

u=output{i,1}.Organism((strcmpi(output{i,1}.Organism,C(j) )));%strcmpi will find the indexes of the variable

v=output{i,1}.Patient((strcmpi(output{i,1}.Organism,C(j) )));

w=output{i,1}.Day_of_Culture((strcmpi(output{i,1}.Organism,C(j) )));

table2=table(v,w,u,'VariableName',{'Patient' 'day' 'Organism'});

result(i,j)=min(w);

end

%first time the first variable name i am looking for ex:{'BAC1'} occured , %so the smallest day of cuture

end

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

dpb on 10 May 2019

Edited: dpb on 10 May 2019

Open in MATLAB Online

2 votes

It's not trivial to learn the tricks, granted...but here's a start for the first...with it and some study of the examples, should get an idea -- it's late; I've got to turn in at this point, though..sorry :)

tBAC=readtable('bacteria.dat','ReadVariableNames',1);     % read the data in
tBAC.patient=categorical(tBAC.patient);                   % fix data types
tBAC.found=categorical(tBAC.found);
[ig,Patient]=findgroups(tBAC.patient);                    % group index, group names
FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig);  % find first infection any kind
FirstInfection(cellfun(@(x) length(x)==0,FirstInfection))={inf};                  % clean up missing (no infection)
[table(Patient) cell2table(FirstInfection)]                                       % display result in a table form

results in for the first problem statement...I chose "inf' as the indicator of no infection instead of NaN -- the decimals show up because I'd been doing financial work and had format bank in effect.

ans =
  4×2 table
    Patient    FirstInfection
    _______    ______________
    A           2.00         
    B          55.00         
    C          51.00         
    D            Inf         
>>

I slightly modified your data to a file with the header line "patient, testday, found" and made a csv-file of your plain text...I presume you have a file format of your own.

Modifications to the functional for the remainder should suffice I think...altho 3 and 4 need grouping by both patient and bacteria ID.

ADDENDUM:

Added the identifier for which bacteria ID was the first. Having the standalone function means can clean up the return data there instead of afterwards--so there is some payback for the extra code. :)

NB: I got the return arguments from min in wrong order last night; the index is the optional second, not the first. This produces the amplified table:

tBAC=readtable('bacteria.dat','ReadVariableNames',1);     % read the data in
tBAC.patient=categorical(tBAC.patient);                   % fix data types
tBAC.found=categorical(tBAC.found);
% 1.  First occurrence of any in each patient
[ig,Patient]=findgroups(tBAC.patient);                    % group index, group names
%FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig);  % find first infection any kind
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig);
table(Patient,FirstInfection,Infection)
% 2.  All occurrences in each patient
[AllInfections]=splitapply(@allinfections,tBAC.found,ig);
[table(Patient) cell2table(AllInfections)]
% 3.  First occurrence of each bacterium in each patient
[ig,Patient,Bacterium]=findgroups(tBAC.patient,tBAC.found);
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig2);
table(Patient,Bacterium,FirstInfection,Infection)
% 4.  Ooccurrences of each bacterium in each patient
%  EXERCISE FOR STUDENT  :)
function [tFirst,bFirst]=firstinfected(t,r)
  % return first time, infection 
  [tFirst,iFirst]=min(t(r~='NAN'));
  if isempty(tFirst)
    tFirst=nan;
    bFirst='NAN';
  else
    bFirst=r(iFirst);
  end
end
function [b]=allinfections(r)
  % return all infections for each
  b={unique(r(r~='NAN')).'};
  if isempty(b)
    b='NAN';
  end
end
>> table(Patient,FirstInfection,Infection)
ans =
  4×3 table
    Patient    FirstInfection    Infection
    _______    ______________    _________
    A           2.00             BAC2     
    B          55.00             BAC1     
    C          51.00             BAC9     
    D            NaN             NAN      
ans =
  4×2 table
    Patient      AllInfections  
    _______    _________________
    A          [1×3 categorical]
    B          [1×2 categorical]
    C          [1×2 categorical]
    D          [1×0 categorical]
ans =
  8×4 table
    Patient    Bacterium    FirstInfection    Infection
    _______    _________    ______________    _________
    A          BAC1          4.00             BAC1     
    A          BAC2          2.00             BAC2     
    A          BAC3          8.00             BAC3     
    B          BAC1         55.00             BAC1     
    B          BAC5         66.00             BAC5     
    C          BAC2         77.00             BAC2     
    C          BAC9         51.00             BAC9     
    D          NAN            NaN             NAN      
>> 

Unfortunately, the builtin table display function won't show the actual categorical variable values for each patient since they're not the same length of each array--and a table has to be regular in number of variable sfor each row/observation so can't create multiple variables without a lot of ugly NAN values scattered around.

The really cute part is the firstinfection function works for any chosen grouping so that don't have to do anything except use the other grouping variables. You could choose to not populate the tble with the second return or not use the second ID variable since they are the same...

Now, your mission, should you choose to accept it, is last item, #4... :)

11 Comments
Show 9 older comments Hide 9 older comments

ALDO on 14 May 2019

I want to see if the result of the test was positive or negative. So I have to look at each patient and see in what day ranges the bacteria was found. And depending on wether it was within the range it gives a logic 0 1 response. For example for patientA and look at each individual bacterium that was found. Was BAC1 was found on day 1, If yes gives logic answers 1. Also assess if it was present in specific day ranges of days, day 1-28 if yes gives logic answer 1. The specific days i want to look st are: day 1 2 3 The ranges i wasnt to look at are: 1-2 1-14 1-7 1-28 29-56 57-84.

I hope I explained the question correctly please let me know if you need any clarifications.

You mentioned that I could use summary Satistics thats why i was trying to use grpstats.

dpb on 14 May 2019

OK. that makes sense -- since you are actually counting days independent of the calendar, simply arranging by those numeric days is all that is needed.

I've got to go do some other stuff at the moment, but I'll mull over the idea...the general idea I'm thinking of is that one of the statistics grpstats returns in number elements--if there is a grouping variable for each category, then the answer of logical True is equivalent to number of elements in the group.

However, it seems to me you need a grouping variable for each of the combinations; at least otomh I don't see an easy way to combine results other than just the one case of 1-2 is 1 + 2 but none of the rest are composited uniquely from others so have to group each separately.

Sign in to comment.

excel table data analysis

2 Comments
Show None Hide None

Accepted Answer

11 Comments
Show 9 older comments Hide 9 older comments

More Answers (0)

Categories

Tags

Community Treasure Hunt

excel table data analysis

2 Comments Show None Hide None

Accepted Answer

11 Comments Show 9 older comments Hide 9 older comments

More Answers (0)

Categories

Tags

See Also

Community Treasure Hunt

2 Comments
Show None Hide None

11 Comments
Show 9 older comments Hide 9 older comments