**The Battle of Sodas: Who will win?**

Have you ever wondered how much caffeine the soda machine delivers on a regular work day? Would you believe that Coca-Cola is the most consumed soda, and that Diet Coke provides the largest amount of caffeine to MathWorkers?

During a recent water cooler chat this was a topic of intense debate leading us to instrumenting the soda machine to literally measure the amounts of soda dispensed from the machine.

## Contents

## Measurement Setup

The soda machines at MathWorks carry 6 drinks: Coca-Cola®, Diet Coke®, Powerade®, Sprite®, Barq's® Root Beer, and Fanta® Ginger Ale. However, we focused our efforts on the three that were refilled significantly more often: Coca-Cola, Diet Coke and Powerade. We gained a few other insights into our soda drinking behavior, and we thought a write-up of our measurement and analysis approach might encourage others to use "Internet of Things" technologies to answer their own burning questions.

Given the task at hand we opted for a simple system using "off-the-shelf sensors" for measuring the volume of soda dispensed, an Arduino microcontroller, a free third party data aggregator (ThingSpeak.com) and MATLAB for data analysis.

For the sensors, we felt that a proximity sensor was a good choice to find the drink being dispensed and the number of liters dispensed. A proximity sensor has the advantage of being non-invasive, which is important since we do not own the soda machine. An additional requirement was that the sensor could not come in contact with the drink being dispensed. However, an initial test immediately highlighted the drawbacks of using a proximity sensor. Since the soda machine is located in the kitchen, people generally tend to gather close to the soda machine to catch up on their projects (and also gossip). This led to significantly high false positive measurements for the consumption of a drink when there was none.

To overcome this, we chose to use flex sensors instead. Flex sensors are more discrete, and relatively more accurate for our application. However, a big challenge we faced with the flex sensor was to find a good way to attach it to the soda machine. The range of motion of the soda dispense handle is very limited, and therefore the flex sensor had to be carefully attached to detect a drink dispense. The space constraints between the handle and the back wall of the soda machine made it difficult to setup the sensor to satisfy the operating requirements of the sensor. Additionally, the setup had to allow the cleaning crew to easily access the soda machine for daily cleaning routines to stay within FDA guidelines. Eventually, we found that the below attachment of flex sensors to the dispense handle of the soda machine worked well:

The flex sensors were connected to the Arduino Yun using pull down resistors as seen in the below image:

The complete list of hardware we used for identifying the drink being dispensed and measuring the amount of drink dispensed is:

- 1x Arduino Yun
- Power Supply for Arduino
- 3x Flex sensors
- 3x 225 k Ohm pull-down resistors
- 1x Mini-breadboards to build the circuit
- Wires for connecting the different elements

Once the above hardware setup was complete, we needed to program the Arduino to measure the flex sensors’ voltage periodically to identify when a drink is being dispensed and for long it was dispensed. To identify when a drink is being dispensed based on the flex sensor voltage reading, we measured the flex sensors’ voltage when no drink was being dispensed and used this for baseline comparison in the Arduino code. After including the baseline measurement value, we programmed the Arduino to measure the flex sensors' voltage 10 times a second to check if a drink is being dispensed. Based on how long the flex sensor voltage deviated from the predefined value, the total volume of the drink dispensed was calculated.

For our project, we wanted to continuously monitor the soda machine and also visualize the measurement values in "real-time" to satisfy our curiosity. In addition to visualizing the data, we needed to analyze the data to answer our questions. To satisfy both of these requirements, we chose to send each measured value from the Arduino to ThingSpeak.com and then download the data to MATLAB for further analysis.

ThingSpeak is an open application platform designed to enable meaningful connections between things and people. Follow this link for further information on ThingSpeak.com. Since the Arduino Yun supports Wi-Fi connectivity, the measured data from each flex sensor was directly sent to ThingSpeak channel 7040. Click here to see the latest measurement data.

During the week of March the maximum temperature increased by almost two fold, from about to . Therefore, along with answering our initial questions we also wanted to find out if soda consumption depends on outside temperature:

[DateTime, TemperatureF] = importfile('weatherNatick.txt'); figure; plot(DateTime, TemperatureF); datetick('x','mm/dd'); ylabel('Temperature (F)'); title('Temperature from 24^{th} March to 30^{th} March');

Click here for further information on temperature recordings on the specified dates.

While ThingSpeak.com plots the soda volume consumption in liters and provides additional data reduction functionality, to find answers to our questions, we needed to perform additional custom analysis on the data. To perform custom analysis and visualization, we downloaded the data from our ThingSpeak channel to MATLAB using the `thingSpeakFetch` function available here. The `thingSpeakFetch` function enables a user to specify several options while downloading data of interest, like the number of points, number of days, and date range. For more information on MATLAB support for ThingSpeak, check out this resource page.

Note that to reproduce the analysis entailed in the article, you do not need to set up the hardware. Live data from our setup is available on channel 7040 of ThingSpeak.com. The MATLAB code below will be using data from this channel.

## Data Download

To analyze the soda consumption patterns during the week of March , we download data corresponding to the dates of interest from channel 7040 on ThingSpeak.com:

[drinkData, timeStamps] = thingSpeakFetch(7040,'DateRange',{'24-March-2014','30-March-2014'});

The `thingSpeakFetch` function returns data in the form of a two dimensional array, containing the measurement data for each drink as a separate column in `drinkData` variable and measurement time in the `timeStamps` variable. To simplify data handling during analysis we assign each column of `drinkData` array to a separate variable:

cokeData = drinkData(:,1); dietCokeData = drinkData(:,2); poweradeData = drinkData(:,3);

With the data now available in MATLAB, we could start processing it in search of answers to our questions. The first step in understanding any raw data set is visualization. Data visualization provides a unique opportunity to quickly identify macro trends and behaviors that are not evident otherwise. Since the measurement data is updated to ThingSpeak at least once a minute, this data is discrete. A stem plot will be more appropriate than a line plot to visualize discrete data. In the following plots time is displayed along the x-axis in an 'mm/dd' (mm – representing the month in two digits, and dd – representing the day in two digits) time format, and volume of soda is displayed in liters (L). Note that 100 milliliters (mL) is approximately equal to 3.4 US oz.

stem(timeStamps,cokeData,'r'); datetick('x','mm/dd'); title('Coca-Cola'); ylabel('Volume (liters)'); figure; stem(timeStamps, dietCokeData,'g'); datetick('x','mm/dd'); title('Diet Coke'); ylabel('Volume (liters)'); figure; stem(timeStamps,poweradeData,'b'); datetick('x','mm/dd'); title('Powerade'); ylabel('Volume (liters)');

The plots show us clusters of dispensed soda during certain parts of a day. We see that there are no peaks greater than 0.1 liters on March 29th. Co-incidentally, March 29th is a Saturday indicating that the soda machine was operated only during weekdays.

## Noise Elimination

We see in the plot that there are a lot of low amplitude (< 0.1 liter) measurement values. To see these low amplitude values better, let us re-plot the measurement data related to only Coca-Cola with a focus on measurement values between 0 and 600 mL.

stem(timeStamps,cokeData); datetick('x','mm/dd'); title('Coca-Cola'); ylabel('Volume (liters)'); ylim([0,0.06]);

We see that there are a lot measurement values between 0.01 to 0.03 liters in the above graph. For this setup we will assume that a valid dispense of drink requires at least 0.1 liter to be dispensed. This assumption is based on the soda machine behavior and the general soda drinking pattern. This assumption allows us to assign all measured values less than 0.1 liters to be noise and thereby eliminate them. Based on this assumption, we see that `cokeData` contains plenty of non-zero values which do not indicate a drink dispense. These values bias the end result, therefore need to be eliminated.

While low pass filtering can be used to remove some of the noise, the challenge with filtering out all the noise is the selection of a filter which can retain the data in the spikes while eliminating all the low amplitude clutter. The spikes indicating actual dispense of soda have a wide range of frequency components which overlap with high frequency components of noise. Therefore, filter design poses a significant challenge to eliminate only the clutter, while still retaining the signal content. The key point to remember here is that the actual height of the spike due to a dispense needs to be retained for our analysis.

To deal with this problem, a simpler but more effective method is to set all values less than 0.1 to zero. This does not affect the magnitude of the spikes due to an actual dispense and only eliminates the clutter efficiently. Let us start with Coca-cola data. The following command sets all values of `cokeData` less than 0.1 to zero.

cokeData(cokeData < 0.1) = 0; stem(timeStamps,cokeData); datetick('x','mm/dd'); title('Coca-Cola'); ylabel('Volume (liters)');

We see that all values of `cokeData` less than 0.1 have been eliminated while retaining the actual spikes without any distortion. Let us now apply this technique to Diet Coke and Powerade data as well to eliminate any low amplitude noise.

dietCokeData(dietCokeData < 0.1) = 0; poweradeData(poweradeData < 0.1) = 0;

## Hourly Consumption Analysis

Now that the data is noise free, it is ready for further analysis to identify consumption patterns. Since we are interested in identifying patterns at either hourly or daily granularity of time, we aggregate the data in `cokeData`, `dietCokeData` and `poweradeData`. We sum all the measurement values in each hour to aggregate the measured data having a granularity of seconds to that of an hour. Summing the data is appropriate since we are interested in total volume of drink consumed in each hour. We have written the `changeGranularity` function to perform the required aggregation. The `changeGranularity` function requires a time series (data and timestamps), the required granularity of the aggregated time series and the aggregation function as inputs. The function outputs a new time series and the appropriate timestamps:

[cokeHourly, hourlyTimeStamps] = changeGranularity(timeStamps,cokeData,'hour','sum');

To the `changeGranularity` function we provided Coca-Cola time series data (`cokeData`, `timeStamps`), required granularity level - 'hour' and the abstraction function – 'sum' as inputs. The output is a new time series (`cokeHourly` and `hourlyTimeStamps`) at hourly granularity. In this case, we used 'sum' as the abstraction function since we wanted to calculate the total drink dispensed during each hour. You can also use other abstraction functions like {'min', 'max', 'mean', 'median', 'first'} instead to suit your application.

Let us graph the new time series for hourly consumption of Coca-Cola along with the temperature variation across the days of interest:

subplot(2,1,1); stem(datenum(hourlyTimeStamps),cokeHourly); ylabel('Volume (liters)'); datetick('x','mm/dd'); subplot(2,1,2); plot(DateTime,TemperatureF); datetick('x','mm/dd'); ylabel('Temperature (F)');

On graphing the new time series for the hourly consumption of Coca-Cola, we see that the maximum hourly consumption of Coca-Cola is decreasing across the days of data collection. This is very surprising given that the temperature was actually increasing across these days. This information tells us that Coca-Cola consumption was not driven by outside temperature during the observation period. We resample Diet Coke and Powerade data as well to an hourly granularity:

[dietCokeHourly] = changeGranularity(timeStamps,dietCokeData,'hour','sum'); [poweradeHourly] = changeGranularity(timeStamps,poweradeData,'hour','sum');

## Outlier Elimination

The data now is cleaned of noisy low amplitude values and also resampled to an hourly granularity. Along with the low amplitude noise, we also noticed during testing that if the sensors setup was accidentally moved during daily cleaning, then this would lead to a system malfunction where a large amount of soda consumption would be falsely reported. Another behavior we noticed was that when a drink was dispensed, if the sensor setup had been disturbed, then the flex sensor voltage would gradually decrease over a few minutes instead of instantaneously dropping to the baseline value. These two error conditions are easier to identify at an hourly granularity than at a granularity of seconds. Now that the measurement data has been resampled to an hourly granularity, we can remove these false measurement values. These false values, if not removed, will adversely affect the eventual conclusions.

Based on some initial tests we ran, it is safe to assume that a maximum of 5 liters of drink is consumed in one hour. Therefore any value greater than that will be set to zero. Let us start with Coca-Cola data:

cokeHourly(cokeHourly > 5) = 0;

Since we are interested in analyzing consumption patterns in Diet Coke and Powerade as well, we apply the same logic to `dietCokeHourly` and `poweradeHourly` as well:

dietCokeHourly(dietCokeHourly > 5) = 0; poweradeHourly(poweradeHourly > 5) = 0;

## Daily Analysis

With the data now void of noise and outliers, we can view the data at a weekly granularity to check if there are any trends across the different days of the week. To achieve this, the time series data will be folded into days and represented as a two dimensional array with each row indicating an hour of the day and each column indicating a day of measurement values. We have written the `reshapeTS` function to perform the required reorganization. The `reshapeTS` function requires a time series (data and timestamps) as an input, and outputs the reshaped time series:

[cokeDaily,dailyTimeStamps] = reshapeTS(cokeHourly,hourlyTimeStamps);

To the `reshapeTS` function we provided the Coca-Coal time series data (`cokeHourly`, `hourlyTimeStamps`) and obtained the reshaped time series data (`cokeDaily`, `dailyTimeStamps`), which is a two dimensional array as described above.

With the drink consumption data in a two dimensional format, we can graph the data for the entire week to visualize the consumption across both the hours of each day and each day of the week:

figure; surf(cokeDaily,'FaceColor','interp'); xlabel('Measurement Day Index'); ylabel('Hours of the day'); zlabel('Volume in liters'); title('Coca-Cola Consumption'); set(gca,'YTick',1:24); set(gca,'YTickLabel',num2str([0:23]')); view([65,52]);

## Coca-Cola in the morning?

From the above graph it is hard to visualize any underlying patterns in the consumption of Coca-Cola across the days of observation. Therefore, we wanted to look for daily consumption patterns. `cokeDaily` has data for the entire day (24 hours). We can break it into two segments, one for the first 12 hours of the day and the next for the second 12 hours of the day:

cokeDailyMorn = cokeDaily(1:12,:); cokeDailyAft = cokeDaily(13:end,:);

We can compare Coca-Cola consumption for the morning vs afternoon to look for any consumption patterns across the days of measurement. We sum the 12 hour measurements in `cokeDailyMorn` and `cokeDailyAft` for each day of measurement:

cokeDailyMornSum = sum(cokeDailyMorn); cokeDailyAftSum = sum(cokeDailyAft);

Next we plot a bar chart for the morning and afternoon sums for each day next to each other. A bar graph is a chart with rectangular bars with lengths proportional to the values that they represent:

bar([cokeDailyMornSum',cokeDailyAftSum']); title('Coca-Cola: Morning vs Afternoon Consumption'); ylabel('Volume (liters)'); xlabel('Days'); legend('Before 12pm','After 12pm');

Interestingly, the graph indicates that MathWorkers in general drink more Coca-Cola in the morning as compared to afternoon. This is true regardless of the exterior temperature. We can check if this is true for the other two drinks as well. We would need to reshape the data into a two dimensional array first, as we did for `cokeHourly` and then finally graph the results:

Diet Coke

[dietCokeDaily,~] = reshapeTS(dietCokeHourly,hourlyTimeStamps); dietCokeDailyMornSum = sum(dietCokeDaily(1:12,:)); dietCokeDailyAftSum = sum(dietCokeDaily(13:end,:)); bar([dietCokeDailyMornSum',dietCokeDailyAftSum']); title('Diet Coke: Morning vs Afternoon Consumption'); ylabel('Volume (liters)'); xlabel('Days'); legend('Before 12pm', 'After 12pm');

Powerade

[poweradeDaily,~] = reshapeTS(poweradeHourly,hourlyTimeStamps); poweradeDailyMornSum = sum(poweradeDaily(1:12,:)); poweradeDailyAftSum = sum(poweradeDaily(13:end,:)); figure; bar([poweradeDailyMornSum',poweradeDailyAftSum']); title('Powerade: Morning vs Afternoon Consumption'); ylabel('Volume (liters)'); xlabel('Days'); legend('Before 12pm','After 12pm');

From the above graphs, we see that Diet Coke follows an inverse pattern. It is consumed more significantly in the afternoon than in the morning. A similar pattern is seen for Powerade consumption as well.

Next, we would like to gain more clarity into the hourly consumption patterns of these drinks across the days of interest. Note that in the above two dimensional array format, each row of the daily consumption data in the `cokeDaily`, `dietCokeDaily`, and `poweradeDaily` variables, corresponds to the total drink consumed at a particular hour of a day. We can arrive at the hourly consumption statistics across all days of observation by calculating the sum of elements in each row:

hourlySumCoke = sum(cokeDaily,2); hourlySumDietCoke = sum(dietCokeDaily,2); hourlySumPowerade = sum(poweradeDaily,2);

The variables `hourlySumCoke`, `hourlySumDietcoke`, and `hourlySumPowerade` now contain the total sum of each drink, in liters, consumed at each hour of a day. We use a bar chart to visualize the total hourly dispense of each drink as it allows for easy visualization of relative dispense amounts.

bar([hourlySumCoke,hourlySumDietCoke,hourlySumPowerade],'stacked'); xlabel('Hour of the day'); ylabel('Volume (liters)'); title('Total Drink dispensed'); legend('Coca-Cola','Diet Coke','Powerade');

An interesting pattern seen in the graph is that the soda machine is used the most between 10am and 5pm on the observed days. Additionally, it is also seen that Coca-Cola consumption decreases through the day while Diet Coke consumption increases through the day - verifying our previous analysis.

## Counting Calories, and Caffeine

To calculate the number of calories, and caffeine provided by each drink during a particular hour of a day we use the following information found on the internet to calculate the hourly dispense of each:

- 1 liter of Coca-Cola contains about 96 mg of caffeine and 395 calories
- 1 liter of Diet Coke contains about 126 mg of caffeine and 11 calories
- 1 liter of Powerade contains about 0 mg of caffeine and 231 calories

We first look at the calories dispensed during each hour across the days of observation for each drink:

hourlyCaloriesCoke = hourlySumCoke * 395; hourlyCaloriesDietCoke = hourlySumDietCoke * 11; hourlyCaloriesPowerade = hourlySumPowerade * 231; bar([hourlyCaloriesCoke,hourlyCaloriesDietCoke,hourlyCaloriesPowerade],'stacked'); xlabel('Hour of the day'); ylabel('Calories'); title('Total Calories dispensed'); legend('Coca-Cola','Diet Coke','Powerade');

Finally, the total caffeine dispensed during each hour across the days of observation for each drink:

hourlyCaffeineCoke = hourlySumCoke * 96; hourlyCaffeineDietCoke = hourlySumDietCoke * 126; hourlyCaffeinePowerade = hourlySumPowerade * 0; bar([hourlyCaffeineCoke,hourlyCaffeineDietCoke,hourlyCaffeinePowerade],'stacked'); xlabel('Hour of the day'); ylabel('Caffeine (milligms)'); title('Total Caffeine dispensed'); legend('Coca-Cola','Diet Coke','Powerade');

## Answers to our questions

The graphs help us find answers to the questions we started with: From the 'Total Calories dispensed' graph we can find the calories dispensed at a particular hour. We see that Coca-Cola provides the largest amount of calories at every hour, even though the number of liters of Coca-Cola is comparable to the number of liters of Diet Coke. Interestingly, Coca-Cola supplied the largest amount of calories as compared to the other drinks at 10 am in the morning. A fun fact is that at 10 am across all days of observation, Coca-Cola supplied as many calories as 24 bowls of cereal! We also see that Powerade is the least consumed drink during the days of interest. Further, we can also see that most consumption of each drink is between the hours of 8 am and 5 pm, indicating the great work-life balance of soda drinking MathWorkers.

As one would expect, we see that the caffeine delivered by the soda machine is significantly larger between 9am – 10 am and between noon - 2 pm, indicating the caffeine boost required to keep up the high performance levels pre and post lunch for soda drinking MathWorkers.

As you might have figured out already, this application can be adapted to any drink dispensing system such as a beer dispenser in a bar or frozen yogurt dispenser. In addition to monitoring consumption levels, this setup can also be extended to measure in real-time the total amount of drink or frozen yogurt dispensed. This in turn can be used to send out alerts before the machine runs out. Go ahead, extend this application to your requirements. Learn more about analysis and visualization capabilities offered by MathWorks tools by clicking here.

We would like to remind you that to reproduce the analysis entailed in the article, you do not need to set up the hardware. Live data from our setup is available on channel 7040 of ThingSpeak.com. Click here to access all the MATLAB code used in this article.

If you do not have access to MATLAB and would like to get a trial copy, please click here. To purchase a copy of MATLAB please click here. If you are a student, then click here to purchase MATLAB for student use.