I need to extract specific data and sum them up in different variables

Question

0 votes

I have a table of data - essentially two columns and about 8000 rows. Among those rows is information I need to collate to analyse later. These were in a long data text file. It looks like this:

Column 1 Column 2

BEGIN STUDY

Block 1

Ratio Known Outcome known

Ratio 30/70

green

red

green

Decision Bean count 4

I have about 40 participants and each participant does four blocks (four different conditions - configurations of ratio known or unknown, outcome known or unknown) and each condition has four ratios. Needless to say, for each participant and across participants, and conditions/ratios, this 'decision bean count' could be different, so the number of rows between 'ratio' and 'decision bean count' is also different.

I had this crazy idea I could do a while loop, but have no idea how to phrase this ... find the beginning of each participant/block/ratio, followed by: while decision bean count = false, row x + 1 (yes, you can laugh, this is how much of a novice I am).

I need to extract the data (i.e. the outcome measure of interest which is the Decision Bean Count) for:

Each participant

Differentiating different conditions

Differentiating different ratios

At the end I need to have two matrices, with four columns each:

1 with all the Decision Bean count scores, per participant, per condition

1 with all the Decision Bean count scores, per participant, per ratio

I am a total beginner in Matlab, the only thing I have managed to do, is get Matlab to create a table with my data. Whatever I try next, it just gives me the same table in the output ...

This feels very complex to me, but if it is crystal clear for anyone out there who could help, I would be so grateful

kind regards

Laura

Answer 1

William Rose on 29 May 2022

1 vote

@Laura Lennuyeux-Comnene,

I think I and others will be able to understand your problem better if you post a spreadsheet with columns for each of the different quantities. I expect that there will be columns for:

Subject number (1 to 40); "configuration of ratios known" (true or false); "outcome known" (true or false); ratio (four possible values). Every row should include a value for all of these values, even if the value is repeated from the preceding row. In other words, there will be many rows with each Subject Number, and so on.

There will be more columns, but I am not certain what they will be, because I do not understand your experimental protocol. Maybe there is a column for color (red or green). Maybe there is a column for Bean Count, or Decision Bean Count, or both. Is the value in Decision Bean Count a number, or a True/False? Do you want to compute the value of Decision Bean Count based on the values in the preceding columns and rows? This is where I do not understand the problem.

15 Comments
Laura Lennuyeux-Comnene on 29 May 2022

jellybeanstudy.xlsx

Hi,

I didn't want to post the excel spreadsheet because it's got so many rows.

The problem is that I don't have the columns as you describe, and there is no participant 1, 2 etc....

I attach a section of it here.

The experiment:

participants were shown two jars with different ratios of jellybeans of two different colours

the computer selects one

the participant picks beans until they think they know which colour is dominant (mostly red or mostly green)

there are four different ratios

there are four conditions: one where they know the ratio and they find out if they picked the right colour at the end of every trial, one where they know the ratio but not the outcome, one where they don't know the ratio but know the outcome, one where they know neither.

on the spreadsheet you see:

begin study (start of that participant)

ip address

SONA id

Qualtrics Id (they had to fill in questionnaires before doing the task)

Block number

condition (these are counterbalanced)

Trial number

ratio (this may or may not be displayed to participants depending on condition)

For all of the above, there is nothing in column 2

then you have the beads that the participants pick

After every colour, there is a number: these are milliseconds (reaction time)

Then there is the decision they made, followed by reaction time

then there is a confidence rating (1-100)

then there is 'winnings' - participants start with 50p, lose money every time they pick a bean, get nothing if they choose the wrong colour - they get this at the end of every trial. This is displayed to the participants only on 'known outcome' trials.

then there is decision bean count: this is how many beans they picked before making up their minds. This is the outcome variable of interest.

At the end of all the blocks, there is a final line, which is

Total winnings - the sum they won across all trials

Then the next participant begins ...

Does that make any more sense?

Thanks so much for looking into this, I really appreciate it!

kind regards

Laura

William Rose on 29 May 2022

@Laura Lennuyeux-Comnene,

It is my understanding, based on your original post and your comment, that Decision Bean Count (DBC) is already in the text file. I think you are saying that you do not need to calculate it. You just need to extract it. For every DBC, there is also a participant number, a ratio, and two experimental conditions, which are in the text file, but not necessarily on the same row as the DBC.

You want to create two matrices with four columns each.

Matrix A has "all the DBC scores", i.e. it has as many rows as there are DBC scores. The columns are: subject number, condition 1, condition 2, DBC. Condition 1 is "configuration of ratios known", and has value T or F (1 or 0). Condition 2 is "outcome known", and has value T or F (1 or 0).

Matrix B has "all the DBC scores", i.e. it has as many rows as there are DBC scores. The columns are: subject number, ratio (4 possible values), and DBC. You said this matrix has four columns, but I do not understand what the fourth column is.

I don't think I will be able to assist without seeing the text file, to get a better understanding. I'm sure there are people on this site who are a lot better than me at working with text files.

Laura Lennuyeux-Comnene on 29 May 2022

Jellybeans2.txt

Hi,

I think I see what you mean, but I am not sure ... I attach the original text file here. You will see there is an extra row there: that is the distribution on that particular trial of greens to reds (I can't remember which is which, but 1 is either red or green and 2 is either red or green - I removed that row when I put it into excel)

I need to differentiate between the four different ratios (75/25; 70/30; 65/35; 60/40) - that was what I had called a matrix (but maybe I mean table) with four columns.

I also need to differentiate between the four conditions (as in, I have data for these four conditions, which I will then analyse in a 2 X 2): ratio known, outcome known (1); ratio known, outcome unknown (2); ratio unknown, outcome known (3); ratio unknown, outcome unknown (4). Those were my four columns for the conditions.

Perhaps I am just not thinking about this in a very logical way!

Thanks so much for even considering helping me ... your questions already help me sort out what I mean and what I might need, and that is already a big help.

Kind regards

Laura

William Rose on 1 Jun 2022

@Laura Lennuyeux-Comnene,

Here is a script, processTextFileLLC.m, that extracts the data from the text file. Results are written to a separate text file. The script includes many explanatory comments.

To analyze a different text file, change the input file name (variable infilespec) at the top of the script. The output file name will update automatically.
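The attached script is not reproduced in this thread. Purely as an illustration of the general approach (this is not processTextFileLLC.m), the sketch below scans the lines one at a time, remembers the most recent subject, block, condition and ratio, and writes one row of results each time it reaches a Decision Bean Count line. The line formats are assumptions modeled on the excerpt in the question, the sample lines are invented, and the real file contains further line types (IDs, reaction times, confidence, winnings) that would need their own handling.

```matlab
% Hypothetical sample lines, modeled on the excerpt in the question.
txt = { ...
    'BEGIN STUDY', ...
    'Block 1', ...
    'Ratio Known Outcome known', ...
    'Ratio 30/70', ...
    'green', 'red', 'green', ...
    'Decision Bean count 3', ...
    'Ratio 40/60', ...
    'red', 'red', ...
    'Decision Bean count 2', ...
    'BEGIN STUDY', ...
    'Block 1', ...
    'Ratio Unknown Outcome unknown', ...
    'Ratio 25/75', ...
    'green', ...
    'Decision Bean count 1'};

% Context that is carried forward from line to line.
subject = 0; block = NaN; ratioKnown = NaN; outcomeKnown = NaN; ratio = NaN;

% One row per trial: subject, block, ratioKnown, outcomeKnown, ratio, DBC.
results = zeros(0, 6);

for k = 1:numel(txt)
    s = strtrim(txt{k});

    % Try each assumed line format; an empty result means "no match".
    tokBlock = regexpi(s, '^Block\s+(\d+)$', 'tokens', 'once');
    tokCond  = regexpi(s, '^Ratio\s+(known|unknown)\s+Outcome\s+(known|unknown)$', 'tokens', 'once');
    tokRatio = regexpi(s, '^Ratio\s+(\d+)\s*/\s*(\d+)$', 'tokens', 'once');
    tokDBC   = regexpi(s, '^Decision\s+Bean\s+count\s+(\d+)$', 'tokens', 'once');

    if strcmpi(s, 'BEGIN STUDY')
        subject = subject + 1;              % a new participant starts here
    elseif ~isempty(tokBlock)
        block = str2double(tokBlock{1});
    elseif ~isempty(tokCond)
        ratioKnown   = strcmpi(tokCond{1}, 'known');
        outcomeKnown = strcmpi(tokCond{2}, 'known');
    elseif ~isempty(tokRatio)
        ratio = str2double(tokRatio{1});    % keep the first number of the ratio
    elseif ~isempty(tokDBC)
        dbc = str2double(tokDBC{1});
        results(end+1, :) = [subject, block, ratioKnown, outcomeKnown, ratio, dbc];
    end
    % Any other line (bean colours and so on) is simply skipped.
end

disp(results)
```

No while loop over a variable number of rows is needed with this design: because the context is updated whenever a header line appears, it does not matter how many bean rows sit between a ratio line and its Decision Bean Count line.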

The text file you sent me, Jellybeans2.txt, did not have a subject ID or other header lines for the first subject. The script has an error when it attempts to analyze this file, because it cannot determine the subject ID for the first subject. Therefore I added a first subject ID (17600) to the input data file. The modified file is called Jellybeans2a.txt, and is attached.

The results file (Jellybeans2aResults.txt) is tab-delimited text, with six columns. It has a header row with column labels, and as many rows of numbers as there are trials. Practice trials are excluded. You can open the results file in Excel, and sort on a column of interest, such as condition or ratio, to assist in organizing and analyzing the results. Here is a screenshot of the top of the results file, when it is opened with the Windows "Notepad" app.

I compared the first 36 rows in the results file to the 36 rows of results in tab 3 of Excel workbook Jellybeans2.xlsx, sent previously, which I obtained by carefully reading the file. This is why I did the time-consuming process of determining results for thirty-six trials by eye - so that I would have something to compare to the script once it was written. Four numbers, out of 36x6=216, did not match. This is because I made mistakes when determining the results by eye. The script is doing things correctly.

Now that you have all these results, you can decide how you want to analyze the results, i.e. what statistical tests to perform, what graphs to make, etc.
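As one possible first step, the per-trial results can be reshaped into the layout asked for in the original question: one row per participant and one column per condition. The sketch below is a minimal example, assuming the results have already been loaded into a numeric matrix with hypothetical columns for subject index, condition code (1 to 4, numbered as in the comment above) and DBC. The numbers are invented, and the column layout of the real results file may differ.

```matlab
% Hypothetical per-trial data: [subject index, condition code (1..4), DBC].
data = [ ...
    1 1 4;
    1 1 6;
    1 2 3;
    1 3 5;
    1 4 2;
    2 1 8;
    2 2 7;
    2 2 9;
    2 3 1;
    2 4 4];

subj = data(:, 1);
cond = data(:, 2);
dbc  = data(:, 3);
nSubj = max(subj);

% Rows are participants, columns are the four conditions.
% Cells with no trials are filled with NaN instead of 0.
meanDBC = accumarray([subj cond], dbc, [nSubj 4], @mean, NaN);

disp(meanDBC)
```

Replacing @mean with @sum gives totals instead of means, and using a ratio code (1 to 4) in place of the condition code gives the corresponding participant-by-ratio matrix.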

William Rose on 7 Jun 2022

@Laura Lennuyeux-Comnene,

When the script runs, it produces this output on the console. You should get a similar result.

>> processTextFileLLC
Input file: Jellybeans2a.txt, output file: Jellybeans2aResults.txt.
Number of subjects=33, blocks=132, trials=528, DBCs=528, Confs=528.
>> 

The script produces a tab-delimited text file. Here are the first 5 lines and the last 4 lines of the results file created by the script.

You should get the same. You can open this text file with Excel or Matlab or your favorite statistics program for analysis. For example, you might want to know if the mean DBC is different for different conditions. Or you might want to ask if Confidence is different for different ratios. This file has eight, not four, ratios, as I explained in an earlier post. If you want only four ratios, then we can make an adjustment in the script to make this happen. If you make this adjustment, then 40 red:60 green will be considered equivalent to 40 green:60 red, for analysis purposes.
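Folding eight ratios into four could look something like the following. This is a sketch only, not the adjustment to processTextFileLLC.m itself. It assumes the ratios are available as text such as '40/60' (a hypothetical representation) and labels each one by its larger number, so that 40/60 and 60/40 end up in the same level.

```matlab
% Hypothetical ratio labels, one per trial; eight distinct values.
ratioStr = {'30/70', '70/30', '40/60', '60/40', '25/75', '75/25', '35/65', '65/35'};

folded = zeros(size(ratioStr));
for k = 1:numel(ratioStr)
    parts = sscanf(ratioStr{k}, '%d/%d');   % the two numbers of the ratio
    folded(k) = max(parts);                 % 40/60 and 60/40 both become 60
end

% levels: the distinct folded ratios, sorted ascending.
% ratioIdx: for each trial, the index (1..4) of its folded ratio in levels.
[levels, ~, ratioIdx] = unique(folded);

disp(levels)
disp(ratioIdx(:)')
```

The index ratioIdx can then be used as a grouping column (1 to 4) in place of the original eight ratio values.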

"It may take me a while to work out what is going on, as I am rather slow at these things." Compared to people who have been programming in Matlab or some other language for many years, maybe you are slow, or maybe not. It is hard reading other people's code. They had a strategy in mind when they wrote it, which is usually not obvious. It makes sense to them, and it works, but it can be hard to decipher. And even if you are slow compared to some others, you are trying to learn a programming language, which most investigators will not do.

William Rose on 7 Jun 2022

What software do you want to use to do your statistical analysis of the data? Matlab, Excel, R, something else?

What questions do you want to ask and answer with the data you have collected? You and your advisor probably figured this out before you started collecting data. I suspect that you would like to know if DBC varies with condition, and if DBC varies with ratio. I suspect you would like to know if Confidence varies with condition and if Confidence varies with ratio. But maybe I am wrong.

Other thoughts: Again, you and your advisor probably have already figured out what follows. You have two pairs of conditions. Therefore you could analyze condition with a one-factor ANOVA, with four possible values for the one factor. Or you could do a two-factor ANOVA, with two possible values for each factor. There is also the question of whether to do standard ANOVA with an F-test, or do a non-parametric test. The Kruskal-Wallis test is a viable non-parametric alternative to the one-factor ANOVA. There is not a good non-parametric alternative to the two-factor ANOVA. (The Friedman test, which is non-parametric, is not an option, since you have repeated measures in each condition.) See this discussion and this discussion for more info on the lack of a non-parametric alternative to 2-factor ANOVA.
