# Analyzing Experimental Data and Making Pretty Plots in MATLAB

## Overview

Are you a student or a researcher interested in using MATLAB? Do you want to use MATLAB for exploratory data analysis? Have you ever wondered how you could plot publication-quality plots and figures in MATLAB?

### Highlights

- Importing and cleaning your data
- Analyzing your experimental data
- Generating publication-quality plots and animations

### About the Presenters

Spandhana Gonuguntla, PhD

Education Technical Evangelist MathWorks India Pvt. Ltd.

Spandhana Gonuguntla is an Education Technical Evangelist at MathWorks. She works with faculty and researchers across India to enable effective use of MATLAB and Simulink in their curriculum and research. She specializes in data analytics, mathematical modelling, machine learning, and deep learning workflows. She holds a PhD in Chemical and Biomolecular Engineering from National University of Singapore and a B.Tech in Chemical Engineering from National Institute of Technology, Surathkal. Her PhD work focused on smart actuating systems using mathematical models.

Lakshminarayan Viju Ravichandran, PhD

Education Technical Evangelist MathWorks India Pvt. Ltd.

Dr. Lakshminarayan Viju Ravichandran leads the Education Technical Evangelist team at MathWorks India Private Limited and works with universities in India focusing on the application of MATLAB and Simulink in curriculum development and research. His interests lie in finding synergies between industry and academia. He has over 7 years of experience in academic research and 7 years in the industry. Prior to joining MathWorks, Viju worked as a Research Associate at the Arizona State University, Tempe and was involved in the development and design of signal processing algorithms, analysis of biological data, and architectural implementation of signal processing algorithms. As a post-doctoral fellow at Emory University, Atlanta, he worked at the Department of Radiology and Imaging Sciences developing algorithms in image processing for medical data. He has published and reviewed papers in multiple peer-reviewed conferences and journals.

He holds a Bachelor’s Degree in Electronics and Communication Engineering from Bangalore Institute of Technology, Bangalore and a Master’s degree and a PhD degree in Electrical Engineering from Arizona State University, Tempe.

**Recorded: 19 Oct 2020**

Here with me, I have Mr. Akhil, Deb, and Prafulpi who will be supporting the Q&A panel. A little bit about myself-- my background is in chemical engineering. And I studied my bachelors and my PhD in chemical engineering. My research itself was at the intersection of, well, chemical engineering, material science, and biomedical engineering. So apologies in advance if you end up seeing a lot of examples in these domains.

So for today's agenda, we'll be looking at some of the challenges that experimentalists face when they work with data and want to do data analysis and plotting in MATLAB. We'll also look at a few types of plots in MATLAB, including some more advanced types. We'll see how you can easily customize your figures in MATLAB and what the best practices are for data analysis and visualization.

So MATLAB is basically a programming environment and a high-level programming language. It is designed for engineers and scientists, with powerful math, graphics, and programming capabilities. I'll give a number of demos today on how to use MATLAB and how you can script your data analysis in MATLAB.

So what else is MATLAB used for? MATLAB has a number of toolboxes, especially in the space of machine learning, data analytics, deep learning, image processing, and so on. We have a number of webinars. I think there is an upcoming one on optimization, and the next one is on data science and machine learning. So if you're interested, you can register for that webinar series. And you can learn more about it there.

Overall, MATLAB is a very powerful environment for researchers and scientists, so that they can quickly process data, quickly generate plots, and have a script in the Live Editor that captures the entire task. But before that, let me explain to you one small project that I carried out in my PhD study. So here, we were developing a handheld [? UE ?] spectrometer.

So this particular spectrometer, we had designed it from scratch. And we were specifically trying to see how the spectrometer performed for one particular experiment, that is, identifying the pH of solutions. So instead of using a pH meter, we wanted to use the spectrometer and see how well it performed.

This is a standard experiment, which perhaps many experimentalists do in their own research. That is, you have to generate a calibration curve. You have to correlate your output variable, which in my case was the absorbance measured by my spectrometer, with the independent variable, which was the alkali concentration that I was using.

So basically, the experiment was like this. I added some dyes. And I wanted to measure the intensity of that dye using a handheld [? UE ?] spectrometer. And this gives me an understanding of the dynamic range that the instrument performs in and the errors that the instrument was giving. And basically, I wanted to study my instrument better.

So what we did-- first and foremost, we had to design this experiment. How does one design an experiment? This is a very crucial step for any experimentalist, designing your experiment well. If you have a good experimental design, that is half the task done. Then you don't have to go back and redo some of the experiments or the data points that you missed, and so on.

So typically, if you're working in, say, the biology space and you are working with concentrations, you probably do a two-fold serial dilution. For me, I wanted to have a linear plot, that is, on a linear axis. And a two-fold dilution wouldn't make much sense for me because the points would only be evenly spaced on a log axis.

So what I did was specifically identify my buffers, and then calibrate my pH solutions so that I had equally spaced data points. So once I had decided that my experiment would have this many data points and that they would be equally spaced, I added the required amount of dye to quantify what the absorbance value was. So let's see, in the form of a live script, how we can do a simple data analysis like this using MATLAB.

So we are going to build a calibration curve. And most of my data at that point was being taken through an app and was being exported into Excel, which is the standard data format for most of the instruments. So my data was all in Excel at that point. And let me show you what that Excel data looked like.

So this is the Excel sheet that I had generated. So I had a number of runs done for a particular concentration-- number of repeats, I mean-- repeat 1, repeat 2, repeat 3-- because every data point I took I had to change the tape and take it again, so that there was no interference from any other external factors.

So for a given concentration, I had three different data points that were being generated and stored in an Excel format. And this particular data set is now an Excel sheet. Of course, you can go ahead and do mean, median, standard deviation. You can find out all of this. Excel is quite a powerful environment like that.

But what I wanted to do was I wanted to scale my application. So today, I am working with one particular assay, where I had to work with just OH minus concentration. But in future, I could end up getting hundreds of such Excel sheets. So it doesn't make sense to me to manually identify what are the mean, median, standard deviation and manually plot all of this. And that was the reason I chose MATLAB. And I thought it was a better idea to choose MATLAB because of this form of automation that I could do in future and scale my application.

So first and foremost, I had to load my data from the Excel sheet into MATLAB. Loading Excel sheet data into MATLAB is very, very simple using this particular command known as readtable. So I can read my Excel sheet data, which was in this experimentaldata.xlsx. And I have also chosen to read all the variable names, that is, the column headers.

So you can see the column names, which were OH minus concentration and the three repeats. All the data has now been loaded in this output over here, and now we can go ahead and do any computation on this.
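A minimal sketch of this loading step follows. Because the original spreadsheet is not available here, the sketch first writes a small stand-in file with made-up numbers; the column names `Concentration`, `Repeat1`, `Repeat2`, and `Repeat3` and the temporary file name are assumptions for illustration, not the names used in the talk.

```matlab
% Hypothetical stand-in for the speaker's spreadsheet (all values are made up).
Concentration = [0.1; 0.2; 0.3; 0.4; 0.5];   % OH- concentration in M
Repeat1 = [0.11; 0.21; 0.30; 0.41; 0.48];    % absorbance, repeat 1
Repeat2 = [0.10; 0.19; 0.31; 0.40; 0.50];    % absorbance, repeat 2
Repeat3 = [0.12; 0.20; 0.29; 0.39; 0.51];    % absorbance, repeat 3
demoFile = [tempname '.xlsx'];               % temporary file, not the real one
writetable(table(Concentration, Repeat1, Repeat2, Repeat3), demoFile);

% Read the sheet back; the first row becomes the table's variable names.
experimentalData = readtable(demoFile, 'ReadVariableNames', true);

% Inspect what was loaded.
disp(experimentalData.Properties.VariableNames)
head(experimentalData, 3)
```

With a real file, only the `readtable` call is needed, pointed at the spreadsheet's path.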

But before this, I wanted to visualize whether my experimental data was good enough before I could continue with the analysis. So I just want to plot a scatter plot. This is the first form of plot scientists typically do, just to visualize where their data is, what it looks like, and whether they're working in the right dynamic range or not. So this kind of information is very critical at this point, so that you don't spend effort in analyzing your data before you understand whether your experiment has actually been somewhat successful.

So here, using this particular dropdown menu, which is available in the Live Script, Live Editor tab in the Insert section, I'm going to explore three different trials that I have done, all three different replicates that I have, the repeats. So this is the result that I obtained. I can see that my experimental data is somewhat saturating at the high concentration levels. And let me check what happens with the repeats 2 and 3.

So repeat 2 is somewhat fairly linear. And repeat 3 is also fairly linear, perhaps. Just give me a moment while it runs. Yeah, also fairly linear-- so I know that my experiment is somewhat successful. There are no visible outliers or anything.

And this experiment can now be continued for analysis. It can be taken for analysis. But before that, what is a good practice in any form of data representation or visualization?

One good practice is that, first, you need to have a title for your graph. That is very easy to put in using MATLAB live scripts. To add a title, you just call the function title. And then you specify what title you want to give to your plot.

And then the x-axis and y-axis. Most of the undergraduate students that I was TAing almost always forgot to label their axes. So irrespective of whom you are showing this data to, it is very important that they know what the x-axis is and what the units are. All this is very important. So you can put it in right away using the xlabel and ylabel functions. You can use the appropriate labels here. For me, the x-axis is OH minus concentration, which is defined in molar units, and my y-axis is absorbance, which is typically represented as AU, arbitrary unit, depending on the instrument specifications and so on.
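As a small sketch of these labeling practices, the following plots one repeat as a scatter plot and adds a title and axis labels. The data values and the variable names `concentration` and `absorbance` are made up for illustration.

```matlab
% Made-up data for illustration only.
concentration = [0.1 0.2 0.3 0.4 0.5];        % OH- concentration in M
absorbance    = [0.11 0.21 0.30 0.41 0.48];   % absorbance in AU

figure
scatter(concentration, absorbance, 'filled')

% Title and axis labels, including units.
title('Absorbance vs. OH^- concentration (repeat 1)')
xlabel('OH^- concentration (M)')
ylabel('Absorbance (AU)')
```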

So now that I have three different plots, what is the best way to represent these three replicates? The best way is to compute the mean and the standard deviation and show them in my scatter plot. So let's see how we can compute the mean and standard deviation.

In this case, computing the mean is very, very simple because MATLAB is a high-level programming language. You don't have to write a for loop to read all this data. You don't need a for loop, essentially, because the way MATLAB is designed, any data that comes into MATLAB is treated as a matrix.

So here, this particular table is read as a matrix with some characters, some numerics, and so on. So it's fairly simple for MATLAB to analyze. Without having any for loop, I can ask MATLAB to compute the mean of this particular table, Experimental Data, over columns 2 to 4, that is, the three repeats, for all the rows.

So I can compute the mean and I can compute the standard deviation. And then I can plot them again on a scatter plot. So this is what a scatter plot with just the mean values looks like. And again, since we also have standard deviation information, which is very important for my research because I want to know how much variability exists in my instrument, it is important to plot this.

So to plot error bars in MATLAB, you can use the errorbar function and specify the x-data, the y-data, the standard deviation, and the marker type. So here, this 'o' specifies the circle that is being plotted over here. Similarly, you can get various types of markers, like square or triangle. Whatever marker you want, you can specify that. And we have the error bars. I have grid on here. But if you don't want to have a grid, you can just use the grid off command. And again, the x-label, the y-label, and, of course, the title of this plot are essential.
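The mean, standard deviation, and error-bar steps can be sketched as follows, assuming the three repeats sit in columns of a numeric matrix with one row per concentration. The values and variable names are made up; with a table loaded by readtable, the same matrix could be extracted with curly-brace indexing such as `experimentalData{:, 2:4}`.

```matlab
% Made-up data: one row per concentration, one column per repeat.
concentration = [0.1; 0.2; 0.3; 0.4; 0.5];   % OH- concentration in M
repeats = [0.11 0.10 0.12;
           0.21 0.19 0.20;
           0.30 0.31 0.29;
           0.41 0.40 0.39;
           0.48 0.50 0.51];                  % absorbance in AU

% Row-wise statistics across the three repeats, with no for loop.
meanAbs = mean(repeats, 2);     % mean along dimension 2 (across columns)
stdAbs  = std(repeats, 0, 2);   % sample standard deviation along dimension 2

% Mean values as circles, with the standard deviation as error bars.
figure
errorbar(concentration, meanAbs, stdAbs, 'o')
grid on
title('Mean absorbance with standard deviation')
xlabel('OH^- concentration (M)')
ylabel('Absorbance (AU)')
```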

So at this point, I have somewhat analyzed my data. I think this is quite good. And I want to get a calibration curve. What is a calibration curve? I want to calibrate the output of my instrument for this particular assay, which was the concentration of OH- that I was studying. So I want to calibrate my instrument for studying this particular OH- ion concentration.

So how do I do that? I can fit a line through these points, and then find out what the equation of the best-fit line is. And using that, I can interpolate. If I have an unknown sample, I can easily measure its absorbance using my spectrometer. And then I can work out what the concentration of OH- should be for that unknown sample.

So this is the usefulness of any calibration curve. Especially if you're working in the instrumentation field, you would be aware that calibration curves are almost essential, irrespective of which instrument you're working with. Here, I'm taking the example of a spectrometer. It could be a pH meter. It could be any form of meter. So you always need a calibration curve to convert any reading that your instrument gives out into a more useful reading that you can comprehend.

So to do this in MATLAB, we are going to use an app called the Curve Fitting app. Here, I'm on the toolstrip at the Apps tab in MATLAB. There is an app called the Curve Fitting app. This is fairly useful for fitting any type of data. So let me give you a brief walkthrough of how to use this Curve Fitting app.

So first and foremost, I need to choose what my x-data is. Here, I select the variable xdata as my x-data and the variable ydata as my y-data. So I need to choose the appropriate variables that I'm working with. And I can choose the function that I want to fit.

So in this dropdown menu, I can choose any type of function. For my particular application, I need to use-- I need to fit a line here. So line is polynomial of degree 1. So I can choose this to fit a line here.

On the left here, I can see the function that I'm trying to fit. That is, f(x) is equal to p1 times x plus p2, where p1 and p2 are the parameters, that is, the slope and intercept of this line. And MATLAB will automatically evaluate for you what these parameters are and also what the goodness of fit is. Here, I can see that my r squared is 0.99, which is excellent, a very good fit.
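For readers without the app at hand, the same straight-line model f(x) = p1*x + p2 can be sketched with base MATLAB functions. This is not the code the app generates; it uses `polyfit` and computes r squared by hand, on made-up data.

```matlab
% Made-up calibration data.
x = [0.1; 0.2; 0.3; 0.4; 0.5];        % OH- concentration in M
y = [0.11; 0.20; 0.30; 0.40; 0.497];  % mean absorbance in AU

% Least-squares line: p(1) is the slope p1, p(2) is the intercept p2.
p = polyfit(x, y, 1);

% Goodness of fit: R^2 = 1 - SS_residual / SS_total.
yFit  = polyval(p, x);
ssRes = sum((y - yFit).^2);
ssTot = sum((y - mean(y)).^2);
rSquared = 1 - ssRes / ssTot;

fprintf('f(x) = %.4f*x + %.4f, R^2 = %.4f\n', p(1), p(2), rSquared);
```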

So I can export this entire curve fit as code. And I can put it in my live script. Alternatively, I can experiment with different types of polynomials.

If you're working with other types of fitting, you have Gaussian fitting, exponential fitting, power fit, Weibull fit, and so on. There are a number of types of fits. And also, if you're working with a more complicated equation, which is very custom to your own application, or you're building a model which is very different and cannot be described by any of these standard functions, you can define your own equation in this particular type called Custom Equation.

So once I have this fitted data, what I can do is go to File and automatically generate code, so I can use it the next time. So I can go and generate code. And this is what the generated code looks like. So you have a function that is automatically generated for you. And this particular function is named createFit. And it's taking in the x-data and y-data as I specified from my MATLAB live script.

And also, since I had chosen a linear fit, it has already chosen the fit type as poly1, which is a polynomial of degree 1. And then the output of this particular function will be the fit result and the goodness of fit. So if I want to save this particular function or this auto-generated code, what I can do is just save it in my current folder by pressing the Save option, and then saving it over here.

So let's see. So I have a function here, createFit.m. And if I want to call this function, all I have to do is take this line and put it in my MATLAB script. So I can have this particular output. And my output will now be shown as the variables fit result and goodness of fit.
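The generated function itself is not reproduced in the talk, so the following is only a sketch of an equivalent direct call, assuming the Curve Fitting Toolbox is installed. It uses the toolbox's `fit` function with the `'poly1'` fit type and returns a fit result and a goodness-of-fit structure, as described above; the data and variable names are made up.

```matlab
% Requires Curve Fitting Toolbox. Made-up data as column vectors.
xdata = [0.1; 0.2; 0.3; 0.4; 0.5];        % OH- concentration in M
ydata = [0.11; 0.20; 0.30; 0.40; 0.497];  % mean absorbance in AU

% Fit a degree-1 polynomial f(x) = p1*x + p2.
[fitresult, gof] = fit(xdata, ydata, 'poly1');

% Coefficients and goodness of fit.
coeffs = coeffvalues(fitresult);   % [p1 p2]
fprintf('p1 = %.4f, p2 = %.4f, R^2 = %.4f\n', coeffs(1), coeffs(2), gof.rsquare);

% Plot the data together with the fitted line.
figure
plot(fitresult, xdata, ydata)
xlabel('OH^- concentration (M)')
ylabel('Absorbance (AU)')
```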

So this is the output that I have. Of course, at this point, I have not yet specified the correct labels and so on. I can do that fairly easily. I have already shown how that can be done. So what I'm more interested in is how we can make use of this linear fit for unknown concentration data.

Say you have been supplied with an unknown pH solution and you want to use your spectrometer to identify what the pH is. You can easily measure the absorbance value. And then you can use this particular function. So now, you have the y-value known. And you can estimate what this unknown x is. It is fairly simple using MATLAB.
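One way to sketch this inverse step: once the line f(x) = p1*x + p2 is known, a measured absorbance y gives the unknown concentration as x = (y - p2) / p1. The calibration data and the measured value below are made up for illustration.

```matlab
% Made-up calibration data and a straight-line fit through it.
x = [0.1; 0.2; 0.3; 0.4; 0.5];        % known OH- concentrations in M
y = [0.11; 0.20; 0.30; 0.40; 0.497];  % measured mean absorbance in AU
p = polyfit(x, y, 1);                 % p(1) = slope p1, p(2) = intercept p2

% Hypothetical absorbance reading for an unknown sample.
measuredAbsorbance = 0.35;

% Invert the calibration line to estimate the unknown concentration.
unknownConcentration = (measuredAbsorbance - p(2)) / p(1);
fprintf('Estimated OH- concentration: %.3f M\n', unknownConcentration);
```

The estimate is only meaningful inside the calibrated range, here 0.1 M to 0.5 M, since the line has not been checked outside it.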

So far, we've seen how useful scatter plots are and what the best practices for plotting are-- the axis labels, titles, and so on. So now, let's get back to our PowerPoint presentation for a second experiment that we are going to do. If you have any questions at this point, please put them in the Q&A section. And we'll take a look at them at the end.

So in my second experiment, I want to design an experiment to study the effect of two brands of fertilizers. I have fertilizer of brand A and of brand B. Now, I have to design my experiment. How do I get started?

First and foremost, I need to decide how much of this fertilizer I am going to add to my pot of tomato plant. So I need to first decide what quantity I need of this particular fertilizer. That is quite important. So I have to identify this amount. And say I have one pot here, where I will test brand A, and another pot for testing brand B. Is this a good experiment?

You can put your answer in the chat. Is this a good experiment? I have one pot for checking whether brand A is good and one pot for brand B. Is this a good experimental design?

OK, this experiment is not ideal at this point because it doesn't have any replicates. How do I know that the measurement I get is useful, or that I can believe in it? So you need replicates. At any point when you're designing any experiment, you need to have, say, at least three replicates.

So I'm going to have four pots. And I'm going to do the same experiment on all of them. I'm going to add brand A fertilizer in all of these, brand B fertilizer in all of these, of the same quantity, and so on. Is this a good experiment now, where I'm testing for brand A versus brand B and I have replicates? Is this a good experiment?

OK, I'm getting some answers which say that this is a good experiment. Yes, this is much better than what I had previously done. But still, this is still lacking. We need something known as a control group.

A control group is basically that group of an experiment where you're going to carry out the same protocol that you have for testing the effect of these fertilizers. But your control group will not have any fertilizer. This particular control group will follow the same protocol. You will water the plants as many times as you water them for brand A and brand B. And it will run for the same period of time. And these plants will receive the same sunlight, everything the same.

What is different? We are skipping the fertilizer aspect. And why is this important? This is a very crucial step in the experimental design, having a control group. This is important because you want to verify whether you need a fertilizer in the first place.

And is your fertilizer outperforming your natural plant growth? Is it enhancing the plant growth? Or is it suppressing the plant? This kind of information is important for an experimentalist to have. So it is always important to have what is known as a control group.

So in this particular experiment, we will have a negative control. In controls, there is something known as a negative control group and a positive control group. So this particular control, where we are skipping the main ingredient that we are testing for, is known as the negative control.

Alternately, there's something known as a positive control, which you might be very familiar with if you're working in, say, testing antimicrobials, for example. So I worked in a research lab where I had to test out antibacterial agents. Now, when we were testing out these antibacterial agents, we have a bacterial stock, stock solution, where the bacteria is growing healthily, hopefully. And we are synthesizing so many of these compounds, say, a new type of antibiotics which will be then tested on this bacterial stock.

So what we'll do-- we will have a test where we are testing the antibiotic agent 1, agent 2, and so on, and, of course, one group where we just want to monitor the bacterial growth and have no agent at all. That is the negative control. What will be the positive control, in this case?

The positive control is that group where I am testing with a known antibiotic. So in this case, the positive control is very important because we need to know whether the bacterial stock is appropriate or not, that is, whether it's growing as it should or whether there are some elements in the assay which are basically ruining the entire experiment. You will get to know this when you have the appropriate control group. So this is very, very important.

So you must have heard about this type of plot, if you are from a biology background. You can see a lot of these in publications, a lot of bar plots essentially. So when and where should you use these bar plots? And what is the importance of using them? Why not use a scatter plot? These are some of the questions that might be running in your mind.

Say I used a scatter plot for this example. I can very well use a scatter plot. I'll have three data points. And then I'll have an error bar for each of these three data points because I'm grouping them according to, say, a control group, brand A, and brand B fertilizer. And presenting something like this is essentially very difficult.

I mean, it's not very easy for a reader to see this and right away get the impression that a particular type of fertilizer is acting very well. And also, when you think about publishing it, the publisher will tell you not to have so much white space. So it becomes very challenging to represent it in a scatter plot.

So the best form of representation for this type of data is a bar plot, which can represent your experiment with brand A, brand B, and no treatment. So let me show you the experimental data for this particular experiment that we conducted. So again, this experiment was recorded using Excel. And then we have imported it into MATLAB. So let me show you what the Excel sheet looks like.

As described, even recording your observations in an easy manner, in a manner in which you can understand, and also make it useful for computation is also very essential. So you can't have them in, say, different cells and then probably look at different cells manually and compute your averages. If it's a small experiment, that works. But if you're going to work with a large number of experiments-- and even though you're recording it manually, you need to make sure that you follow some form of formal format to keep these observations and compute them in a faster manner.

So what we've done-- we have three groups. Three groups-- one group of plants with no treatment at all; a second group, which was treated with brand A fertilizer; third group with brand B fertilizer. And these are all recorded row-wise. And we have plant 1. So this is, say, plant species 1, which was treated with the brand A fertilizer, a replicate, and so on.

So eventually, I can compute my mean and standard deviation. Fairly simple, right? Row-wise-- sorry, I can do my computation column-wise to identify what the mean of all these cells is and what the standard deviation of all these is. This is fairly simple.

Now, I can either do it in MATLAB or in Excel. For this particular experiment, it was very simple to do it in Excel. So I have computed it in Excel. Now, let's get this data into MATLAB.

Again, what command will you use? Very simple, readtable can take everything. It can read your titles. It can read your cells with characters, with numerics, with date times, and all types of formats. So you don't need to worry about whether a particular format is supported by MATLAB or not.

Now, I have these plants 1, 2, 3, 4. And then I've already computed the mean and standard deviation. So first and foremost, before I start to plot, I need to identify what my x-axis is and what my y-axis is. To identify the axes, I'm going to create a variable, xdata here, to show that I have a picture in my mind of what I need.

We want to have-- in the x-axis, we are going to differentiate based on the groups, group of treatment-- so either no treatment, or brand A treatment, or brand B treatment, and so on. So here, we have represented this, identified the x-data, identified the y-data, and then, of course, labeled our y-data and plotted this along with the error bar.

It's very simple to plot the error bar as well. You have to just specify your x-data, y-data, and what the error is. Error, in this case, was the standard deviation, which we extracted from the table.
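A sketch of this bar plot with error bars follows. The plant heights are made up, and the layout (one row per plant, one column per treatment group) is an assumption, as are all variable names. Numeric x positions are used for the bars so that the error bars line up with them, and the group names are then applied as tick labels.

```matlab
% Made-up plant heights in cm: rows are plants 1-4, columns are groups.
groupNames = {'No treatment', 'Brand A', 'Brand B'};
heights = [14 20 17;
           15 22 18;
           16 21 19;
           15 21 18];

% Column-wise statistics: one mean and one standard deviation per group.
meanHeight = mean(heights, 1);
stdHeight  = std(heights, 0, 1);

% Bars at x = 1, 2, 3 with black error bars and no connecting line.
xPos = 1:numel(groupNames);
figure
bar(xPos, meanHeight)
hold on
errorbar(xPos, meanHeight, stdHeight, 'k', 'LineStyle', 'none')
hold off

xticks(xPos)
xticklabels(groupNames)
ylabel('Plant height (cm)')
title('Effect of fertilizer brand on plant height')
```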

So this is one type of useful plot. And the bar graph also shines when you have to compare the same treatment on, say, n types of plants. So now, let's take an example of doing the same experiment. But this time, we are going to look at two different types of plant species. One is our tomato plants. And the second one is the pumpkin species. The experimental design, again, will remain the same, where we are going to treat with 0.5 grams of brand A, 0.5 grams of brand B fertilizer, and have a control group with no fertilizer.

And now, we are going to run this experiment, and you can easily represent this data using this type of bar graph. This is called a grouped bar chart. This is very useful. You can go up to three or four species. And it'll still make sense to have this type of grouped bar plot.

What happens if you have-- or if you're testing 10 or 20 such species? It becomes fairly cramped, I assume, in your x-axis. And a reader cannot easily read this graph. It will become very challenging for a reader to read it, in my opinion. So what kind of graph will you then use if you have 10 or 20 species per group?

In this case, what makes sense is what is known as a stacked bar plot. So if you have 10 or 20 species, you can represent them fairly simply using a stacked bar plot. And how do you read this stacked bar plot? So you can read that, without any treatment, tomatoes grew about 15 centimeters, pumpkin creepers grew about 30 centimeters, and so on. And similarly, you can go on adding stacks to this particular plot. And you can also identify which fertilizer performs best, and for which species, using a stacked bar plot.
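The difference between the two layouts can be sketched side by side. In the matrix below, rows are treatments and columns are species; the no-treatment values of 15 cm and 30 cm follow the talk, while the remaining numbers and all variable names are made up.

```matlab
% Growth in cm: rows are treatments, columns are species.
treatments = {'No treatment', 'Brand A', 'Brand B'};
species    = {'Tomato', 'Pumpkin'};
growth = [15 30;
          21 38;
          18 34];

figure

% Grouped bar chart: one cluster per treatment, one bar per species.
subplot(1, 2, 1)
bGrouped = bar(growth);
xticklabels(treatments)
ylabel('Growth (cm)')
legend(species, 'Location', 'northwest')
title('Grouped')

% Stacked bar chart: one bar per treatment, one segment per species.
subplot(1, 2, 2)
bStacked = bar(growth, 'stacked');
xticklabels(treatments)
ylabel('Growth (cm)')
legend(species, 'Location', 'northwest')
title('Stacked')
```

The only change between the two calls is the `'stacked'` option, so it is cheap to try both and keep whichever reads better for the number of species at hand.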

So these are a few types of plots which perhaps you will come across in your everyday research-- the scatter plot and the bar plot. I think at least 70% to 80% of all plots that are being plotted in publications are perhaps these, the scatter plots and the bar plots. Now, let's look at a different type of study.

Say you are looking at identifying the metric that is blood glucose concentration for diabetic patients versus healthy patients. And here, you have 50 samples. You are conducting a fairly large study. You have 50 samples. You're drawing blood of 50 individuals and identifying what the blood glucose concentration is. But this set of individuals are diabetic and this set of individuals are not diabetic.

So if you are using a standard bar plot, you cannot represent more information than just the mean and the standard deviation. But when you're working with large groups of data, it is important to know how this data is scattered. Some form of descriptive statistics on the data itself needs to be visible on the plots. So what do you think is the best way to represent this particular data?

So let me introduce to you what is known as a box plot. A box plot is a type of plot where you can represent some form of descriptive statistics in your plot. So here, we were looking at diabetic patients and healthy patients. And then we were measuring the blood glucose concentration.

So for this particular experiment, using box plot, you can identify where the median is of diabetic patients, what is the median of the blood glucose concentration, what is the maximum and minimum concentration that was observed among your sample space, and what is the upper quartile and the lower quartile. Using this, you can identify how skewed your data is. You can also identify what the outliers are, and so on.
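A sketch of such a two-group box plot is shown below, assuming the Statistics and Machine Learning Toolbox (which provides `boxplot`) is installed. The glucose values are simulated with `randn` purely for illustration; they are not measurements, and the group means and spreads are arbitrary choices.

```matlab
% Requires Statistics and Machine Learning Toolbox for boxplot.
rng(0)   % fixed seed so the simulated numbers are repeatable

% Simulated blood glucose in mg/dL for 50 people per group (not real data).
healthy  = 100 + 15 * randn(50, 1);
diabetic = 230 + 25 * randn(50, 1);

% One box per column: median line, quartile box, whiskers, outlier markers.
figure
boxplot([healthy, diabetic], 'Labels', {'Healthy', 'Diabetic'})
ylabel('Blood glucose concentration (mg/dL)')
title('Simulated blood glucose by group')

% The same summary values can also be printed directly.
fprintf('Median: healthy %.1f, diabetic %.1f mg/dL\n', median(healthy), median(diabetic));
fprintf('Range:  healthy %.1f to %.1f, diabetic %.1f to %.1f mg/dL\n', ...
    min(healthy), max(healthy), min(diabetic), max(diabetic));
```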

So why is this type of data important? I think many years ago, WHO had given out blood glucose as a representative variable for identifying whether a person is diabetic or not. And what they specified is that, if the blood glucose concentration is above 200 milligrams per deciliter, you can classify that person as diabetic. If they are healthy, their blood glucose concentration should be below 140 milligrams per deciliter.

So how do you think they came up with these metrics? Of course, they would have conducted a number of studies and recruited a lot of people to identify what the blood glucose concentration of the two groups of people should be. And finally, they identified that, if they were using blood glucose concentration as the measure, there seems to be no overlap between the diabetic group and the healthy group.

This is very important for them because, if there was any form of overlap, this particular value of the blood glucose concentration is not the right metric to look at. So you should look at something else, maybe some other markers, some other antibodies perhaps, or something else. But just measuring glucose concentration is not the right way.

So this type of plot is very, very useful to identify how your data is distributed across the entire sample space. You can do some basic form of pre-processing. So you can visualize the summary statistics, identify outliers, and also compare distributions across various groups.

So now, let's look at the next experiment that we are going to conduct. I don't know how many of you here are from chemistry. But this is a surface hydrogenation reaction. So this is a research group working on catalysis. And they want to identify the right catalyst for hydrogenation.

So they're testing various metals-- platinum, palladium, nickel, and so on. They're testing all these five types of metals. And as you know, in a surface hydrogenation reaction, some form of adsorption takes place, and then the catalyst catalyzes this reaction, where you're basically adding hydrogen to your unsaturated hydrocarbon.

So let's see how this research group went ahead to identify what the right catalyst was. In my opinion, this is a best case for using a box plot because you can identify how your experiment performed for these various classes or various of your catalyst. And you can represent it in one sheet.

You can also study which ones have the lowest variance. You can study which ones have the highest variance. And you can also study which group is more appropriate to you, based on a simple plot representation of a box plot. So let's see how we can plot something like this in MATLAB.

So here again, the data again comes from an Excel sheet. So let's see what this Excel sheet actually consists of. So again, if you remember, what is the best practice? You will have repeats in your experiments. Here, we have about 15 repeats for the same experiment. And we are testing these five types of catalysts.

This is my data. We load the data using readtable. And then plotting is as simple as specifying the columns that need to be plotted. So in the Command Window, use the function boxplot and specify the columns from your table. This is catalyst data. Specify that you want to plot all the rows and second column onwards.

So second column onwards, until the last column. That is specified by 2 to end. And then you can specify what labels you want to give for your x-axis. The catalyst labels come from this particular column. So it's as simple as this.

So you can see these plus marks. These are basically the outliers. You can specify the y-label manually because, by default, that is not picked up. And the title of your graph is Effect of Catalyst on the Yield Percentage.

So now, I have plotted my box plot. It was just four lines of code, if you imagine. So is there an easy way to plot where you don't have to write any lines of code? In MATLAB, you can see there's this Plots tab, which works if you choose the right variables that you want to plot. For example, let me show an example with y-data and x-data here.
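The four lines described above might look roughly like the sketch below. It uses synthetic numbers in place of the Excel sheet, and the table layout (one row per catalyst, one column per repeat), the variable names, and the commented-out file name are all assumptions made for illustration. `boxplot` requires Statistics and Machine Learning Toolbox.

```matlab
% Synthetic stand-in for the Excel sheet (assumed layout):
% one row per catalyst, 15 repeat columns of yield percentages.
rng(0);                                          % reproducible random numbers
catalysts = {'Platinum'; 'Palladium'; 'Rhodium'; 'Ruthenium'; 'Nickel'};
yield = 85 + 5*randn(numel(catalysts), 15);      % made-up yield values
catalystData = [table(catalysts, 'VariableNames', {'Catalyst'}), ...
                array2table(yield)];

% With a real file, the table would instead come from something like:
% catalystData = readtable('catalystData.xlsx');   % hypothetical file name

% boxplot draws one box per column, so the repeats are transposed
% to give one box per catalyst.
figure
boxplot(catalystData{:, 2:end}', 'Labels', catalystData.Catalyst)
ylabel('Yield (%)')
title('Effect of Catalyst on the Yield Percentage')
```

If the sheet were instead laid out with one column per catalyst, the transpose would not be needed.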

So if you specify the right type of variables-- say, for example, here I'm choosing these two variables-- this Plot tab gets activated. And without even having to write any lines of code, you can directly choose some simple plots right here-- scatter plot, bar plot, area plot, line plots, pie charts, histograms. Everything is available right here.

In MATLAB 2020b, there are even further improvements to plotting. You can see bubble plots, Pareto charts, and so on as well. So there are a number of plotting tools that are available within this Plots tab. So you can just select your x-data, y-data. And then you can just specify this, or choose this and create a plot. It looks just as simple as this.

And then you can go here. This is known as the Property Inspector. And then you can specify a lot of things, like whether you want ticks or not, whether you want minor ticks, major ticks, grid lines. And the title of your plot, the labels, everything can be specified in this Property Inspector. This is a very handy way.

Once you have specified all of that, you can go to File and generate code to get the code for all the manual steps that you did. And then once you've generated this code, for any further plots that you have, you don't have to manually write any lines of code. So MATLAB makes it very, very easy for you to get started with plotting.

Now, as I said, we looked at how our research group working on catalysis was identifying the right type of catalyst for them. So we saw that they were studying five types of catalyst. That is platinum, palladium, rhodium, ruthenium, and nickel. And they had to pick the right one.
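The same settings can also be written by hand. The following is a minimal sketch of what such code could look like, not the actual output of the Generate Code option; the data and label text are placeholders.

```matlab
% Placeholder data standing in for the selected x-data and y-data.
x = linspace(0, 10, 50);
y = sin(x);

fig = figure('Visible', 'off');   % hidden figure, so this also runs headless
plot(x, y)
ax = gca;

% Properties that the Property Inspector exposes as check boxes and fields.
ax.XMinorTick = 'on';             % minor ticks on the x-axis
ax.YMinorTick = 'on';             % minor ticks on the y-axis
grid(ax, 'on')                    % major grid lines
xlabel(ax, 'x-data')
ylabel(ax, 'y-data')
title(ax, 'Example plot')
```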

They also have an industry collaborator. And the collaborator also wants to identify the right type of catalyst, which can be scaled up to an industrial problem, an industrial reactor. So what are the considerations that you think the industrial collaborator should keep in mind?

The industrial collaborator will definitely need very low variance because they don't want to have very high variance in the output of the experimental data. And of course, the catalyst should work in their reactor conditions, and so on.

Now, let's look at these two catalysts. The maximum of ruthenium, for example, is much higher than the maximum of nickel. So if I were to just look at the maximum yield percentage, if I had to just pick a value based on the maximum yield percentage of one trial that I got, that would be very wrong. I would be picking ruthenium, which is not right. That is not the right way to choose the catalyst.

So that's the reason we have to look at the replicates. That's the reason we have to look at the number of replicates that you are taking. And now, using all the replicates information, we have identified that nickel is perhaps the best catalyst.

Now, suppose you tell your industry collaborator that you have done all these experiments and you think nickel has worked very well. It is giving you a yield percentage close to 92%, so they should choose nickel in their reactors. The next question they ask you is, look, I want to have a yield percentage of 90% or higher.
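The point about replicates can be made concrete by putting the maximum next to statistics that use every repeat. This is a minimal sketch with made-up numbers; the means and spreads chosen for the two catalysts are assumptions for illustration, not the webinar's data.

```matlab
% Made-up replicate data: 15 repeats per catalyst.
rng(1);
nickel    = 92 + 1.5*randn(15, 1);   % assumed: high, consistent yield
ruthenium = 80 + 8.0*randn(15, 1);   % assumed: lower, widely spread yield

% Compare the single best trial with statistics over all replicates.
summaryStats = table( ...
    [max(nickel);    max(ruthenium)], ...
    [median(nickel); median(ruthenium)], ...
    [std(nickel);    std(ruthenium)], ...
    'VariableNames', {'Max', 'Median', 'Std'}, ...
    'RowNames', {'Nickel', 'Ruthenium'});
disp(summaryStats)
```

A catalyst with a wide spread can produce a very high single trial, so the median and standard deviation are the more informative columns when choosing between the two.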

So our industrial reactors cannot work with yields of 85% or 87%. We want to have reactors to operate at the yields of 90% or higher. What is the probability that this particular catalyst has the yield percentage of at least 90%?

So what type of experiment will you carry out to answer them? Now that we have zeroed in on the catalyst type, you have to conduct n number of experiments, sometimes ranging from 50 to 100. When you're looking at giving them a reasonable answer of how reliable it is when I say 90% or more, you need to have a lot of experiments done.

Say we're going to do about 50 experiments using nickel alone. And we want to study what this distribution looks like. So we've done 50 experiments using just one catalyst. And this is the data that we're getting. For example, we have somewhere around 82, we have somewhere around 100, and so on. This is a scatter plot of the experiment that we just conducted.

Now, using this, we have to tell them what percentage of the time, or with what probability, the catalyst will give a yield of at least 90%. So how do you compute this? You can do this fairly simply using what is known as the Distribution Fitter app in MATLAB.

But before that, if I just look at this, I think some of these are outliers. So I want to do some basic pre-processing even before I get to the Distribution Fitter app. So to do a basic type of pre-processing, you can go to MATLAB Live Editor in the Insert tab. There is something known as Live Task.

In this Live Task, you can do basic data pre-processing in these sections. So you can consider this as an interactive task that you can setup. So here, you have tasks for cleaning missing data, or interpolating, and cleaning outlier data, and so on. So for this particular experiment, I'm looking at cleaning outlier data.

So I can choose this to insert a Live Task. So here, this section pops up, where I'm going to clean my outlier data. And I'm going to store this in a variable called cleandata. And I'm going to specify all this yield percentage, my y-axis, which is in nickel yield, as the input data. And we are going to use this Live Task to clean our data.

So you can specify the type of cleaning method. It could be either filling outliers, or just removing all the outliers. If you want to fill outliers, you can specify what type of interpolation you want to carry out. It could be linear interpolation, or spline interpolation, and so on. You can specify that based on your data pre-processing technique.

So in my experiment, I've identified that there were three outliers. They're marked here as x's. You can see there's one more somewhere over here. So these three outliers can be removed. Or I can fill these three outliers using a spline interpolation technique, and then save the result in my cleandata variable.

Now, I want to study the distribution of this clean data. So I'm going to call the Distribution Fitter. Either I can call it from the command line, or I can call it from the Apps section here. You can go to Apps, and then go here. You should be able to see the Distribution Fitter in the Math, Statistics and Optimization subsection.

So let's see how to fit this data and answer our industry collaborator: what is the probability that our catalyst is yielding above 90%? So first and foremost, this is what the Distribution Fitter app looks like. And we need to first load the data. What is our data here? It's the clean data. So I'm going to select this tab. And I'm going to choose my cleandata variable.
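The Live Task generates code behind the scenes. A hand-written equivalent could look like the sketch below, which uses `filloutliers` with spline filling. The data are synthetic, the three injected outlier values are invented for illustration, and the default detection method of `filloutliers` is an assumption, since the talk does not say which detection method was selected.

```matlab
% Synthetic stand-in for the 50 nickel yield measurements.
rng(2);
nickelYield = 92 + 2*randn(50, 1);
nickelYield([7 23 41]) = [70 120 65];   % inject three obvious outliers

% Detect outliers (default rule: more than three scaled median absolute
% deviations from the median) and replace them by spline interpolation.
[cleandata, isOutlier] = filloutliers(nickelYield, 'spline');

% Show raw data, flagged points, and the cleaned series.
fig = figure('Visible', 'off');
plot(nickelYield, 'o'); hold on
plot(find(isOutlier), nickelYield(isOutlier), 'x', 'MarkerSize', 10)
plot(cleandata, '-')
legend('Input data', 'Outliers', 'Cleaned data')
ylabel('Yield (%)')
```

To drop the flagged points instead of filling them, `rmoutliers(nickelYield)` can be used in place of `filloutliers`.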

And automatically, it's showing me some preview. It's automatically deciding the number of bins. And it's automatically plotting a distribution function for me-- or distribution fit for me. So here, you can see my data is now grouped into various bins. And we are seeing what is the probability density of that grouping.

And let's see what type of fit is best here. So I'm going to create a new fit. When I specify this new fit, there are various types of distribution fits that you can specify. So here, you can see there is an exponential fit, which is not essentially a distribution fit. Anyway, there is a gamma type of fit, and then there's normal distribution fit, and so on.

If you observe this data, you can see that this is somewhat skewed to the right. So having a distribution fit which does not take into account the skewness is not a good idea. So let's see two types of fit and analyze which one is the right distribution fit for you. So let's try out this normal distribution fit. And then I'm going to click Apply. And here, you can see we have a distribution fit fitted. The red line is the distribution fit.

And then you can also analyze the fit parameters. So in this particular distribution fit, you can see that the covariance for your mu value and for your sigma is quite small. You can also see here that the standard error for your mu and sigma is quite small for the normal distribution.

Let's see if we can get something lesser. Let's try out gamma distribution, perhaps, and click Apply. So what type of fit is this? Is this good? The covariance number is very large here. The standard error also seems to be very large. So I think normal distribution is a better distribution fit. So we are going to stick with normal distribution, click Apply. And then we are going to answer to our industry collaborators based on the normal distribution fit that we have.
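The comparison made in the app can also be scripted. The sketch below fits both distributions with `fitdist` (Statistics and Machine Learning Toolbox) and pulls out the parameter standard errors. The data are synthetic, and adding the negative log-likelihood as a second comparison criterion is an assumption beyond what the talk shows.

```matlab
% Synthetic stand-in for the cleaned nickel yield data.
rng(3);
cleandata = 92 + 2*randn(50, 1);

% Fit the two candidate distributions.
pdNormal = fitdist(cleandata, 'Normal');
pdGamma  = fitdist(cleandata, 'Gamma');

% Standard errors are the square roots of the diagonal of the
% parameter covariance matrix reported for each fit.
seNormal = sqrt(diag(pdNormal.ParameterCovariance));   % [se(mu); se(sigma)]
seGamma  = sqrt(diag(pdGamma.ParameterCovariance));    % [se(a);  se(b)]

% Negative log-likelihood: lower means the fit explains the data better.
nllNormal = negloglik(pdNormal);
nllGamma  = negloglik(pdGamma);

fprintf('Normal: se = [%g %g], negloglik = %g\n', seNormal, nllNormal);
fprintf('Gamma:  se = [%g %g], negloglik = %g\n', seGamma,  nllGamma);
```

Standard errors of different parameters are on different scales, so comparing them relative to the parameter estimates, alongside the likelihood, gives a fairer picture than comparing the raw numbers.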

Now, I'm going to close this session-- sorry, close the fit part of it. And let's go to what's in this Manage Fit. In this Manage Fit, you can basically see what type of fits you have, what's the distribution, and also the confidence bounds, and so on. Let's evaluate this particular distribution fit. So I have this fit, which is over here. So let's see how the distribution fit, or the cumulative probability looks like.

So if you can see here, this is the fit result of my f of x. You can visualize either the probability density function, or you can have the cumulative distribution as well. So you can see here, from 90 and above, you will have all these probabilities. So if I want to tell my collaborator the probability that my yield percentage is at least 90%, I need to add up all these probabilities.

Alternatively, I can also choose the cumulative distribution function. And I can set this x bound to 100 because 110 doesn't really make sense. 110% yield doesn't make sense. So I'm going to choose between 90 and 100. So you can see this particular value is 0.8805. This is the plot for that.

So what is the significance of this particular plot? Between 90 and 100, this is the probability. Basically, the area under the graph of the probability density function gives me this probability. If you have the cumulative distribution function instead, you read it from the y-values over here.

So that is my final answer. That is 88%. I can be sure that 88% of the time nickel will be the right choice of catalyst for you, for getting you at least 90% yield. That is the answer. And we got that using MATLAB. Fairly simple, we've studied somewhat how to do distribution fitting in MATLAB and so on.
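The same evaluation can be done at the command line with `fitdist` and `cdf`. This is a minimal sketch on synthetic data, so the number it prints will not match the 0.8805 from the demo; the variable names are illustrative.

```matlab
% Synthetic stand-in for the cleaned nickel yield data.
rng(3);
cleandata = 92 + 2*randn(50, 1);

% Fit a normal distribution to the yields.
pd = fitdist(cleandata, 'Normal');

% Probability that the yield lies between 90% and 100%:
% P(90 <= X <= 100) = F(100) - F(90), with F the fitted CDF.
pBetween90and100 = cdf(pd, 100) - cdf(pd, 90);

% Without the upper bound: P(X >= 90) = 1 - F(90).
pAtLeast90 = 1 - cdf(pd, 90);

fprintf('P(90 <= yield <= 100) = %.4f\n', pBetween90and100);
fprintf('P(yield >= 90)        = %.4f\n', pAtLeast90);
```

The bounded and unbounded values differ only by the small tail of the fitted normal distribution above 100%, which is why capping the range at 100 is a reasonable choice for a yield.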

Now, let's get back to our presentation and see, in MATLAB, there are a number of types of plots. So far, we have seen three types of plots. Given the time duration we have for this webinar, we couldn't cover a lot of types of plots that are available in MATLAB. However, this particular link here can take you to at least, I think, 40 or more types of plots that are there in MATLAB. So you can please take a look at this link.

A few of these that I picked out that you might be interested in are surface plots, mesh plots, quiver plots, and stair-step plots. All of these are pretty much, I think, what I have used. So I figured this might be of relevance to you.

So we've discussed a lot about experimental design and visualizations today. As a summary, I would like to tell you a few best practices for experimental design and visualization based out of my own experience. So first and foremost, design your experiments appropriately. If you have the right experimental design, that is almost half the experiment done.

Then you need to have replicates, appropriate controls, everything. You think about your experiment in its entirety. And then when you go on to plotting and visualizing these, most publications have a requirement in their guide to use a sans serif typeface.

What is sans serif? Basically, look at the serifs in this particular line, which I have written in Times New Roman. I think this is somewhat standard in all research labs. You will see everybody writes in Times New Roman. All the papers and publications are written like this because the edges over here are quite useful to guide your eye along the line. And they are useful when you're reading papers. It's very useful when you're reading lots of text.

And sans serif, on the other hand, doesn't have these sharp edges that you can see. It's very rounded. And this type of typeface is used specifically for making reading easier.

So typically, for plots, journals request you to try Arial, Calibri, Helvetica, different types of fonts which are sans serif in nature. And for labels, the font size should typically be between 6 and 9 points at least. This I got from a science guide.

And one important thing that most researchers need is exporting your figures. You need to export your figures in a format which can support high resolution. And typically, it can be exported in EPS format, which is a vectorized format, which can support good resolution. And MATLAB allows you to export it into EPS format fairly easily.
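The font and export recommendations can be applied in a few lines. The sketch below assumes Arial is installed (MATLAB substitutes another font otherwise) and writes to the system temporary folder under a made-up file name. It uses `print` with the `-depsc` option for color EPS; newer releases also offer `exportgraphics` for vector output.

```matlab
% Placeholder data for the figure.
x = linspace(0, 2*pi, 100);

fig = figure('Visible', 'off');   % hidden figure, so this also runs headless
plot(x, sin(x), 'LineWidth', 1)
ax = gca;

% Sans serif typeface and a label size within the 6 to 9 point range.
ax.FontName = 'Arial';            % assumption: Arial is available
ax.FontSize = 8;
xlabel('x')
ylabel('sin(x)')

% Export as color EPS, a vector format that scales without pixelation.
outFile = fullfile(tempdir, 'example_figure.eps');   % hypothetical name
print(fig, outFile, '-depsc');
```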

The main takeaways for today's webinar-- thank you for staying for so long. And out of this entire webinar, if you didn't understand some parts of the demos, that's totally fine. If you can remember these points, this is more than enough. So statistical computations using any type of data sets-- large data sets, small data sets, any type of data sets-- is less time consuming using MATLAB.

So you can work with data sets fairly easily. MATLAB has built-in vectorization, so you can avoid for loops, and so on. MATLAB is built on the matrix as its fundamental unit. And any data that is stored in matrices can be read in a jiffy in MATLAB.
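As a small illustration of what vectorization means here, the sketch below computes the mean yield per catalyst twice, once with nested for loops and once with a single call on the whole matrix. The data are made up.

```matlab
% Made-up data: 15 repeats (rows) for 5 catalysts (columns).
rng(4);
yields = 85 + 5*rand(15, 5);

% Loop version: accumulate each column element by element.
loopMeans = zeros(1, 5);
for c = 1:5
    total = 0;
    for r = 1:15
        total = total + yields(r, c);
    end
    loopMeans(c) = total / 15;
end

% Vectorized version: one call operates on every column at once.
vecMeans = mean(yields, 1);
```

Both give the same result up to floating-point rounding; the vectorized form is shorter and leaves the looping to MATLAB's built-in matrix routines.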

Second, if you're working with multiple files, sometimes your instruments can give out maybe hundreds of Excel sheets at a time. And working with Excel sheets in, say, hundreds of those is very, very difficult. You need to automate your scripts. You need to write some code to pull all this data, run some basic analysis. And you need to automate these scripts, pull this data from various locations, and so on. And in MATLAB, scripting is easier for you.
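A script for pulling data from many files usually combines `dir` with a loop over `readtable`. The sketch below first writes three small CSV files into a temporary folder so that it is self-contained; the folder name, file names, and column names are all invented. The same pattern applies to `.xlsx` files, since `readtable` reads those as well.

```matlab
% Create a few example files (stand-ins for instrument output).
dataDir = fullfile(tempdir, 'demo_runs');        % hypothetical folder
if ~exist(dataDir, 'dir')
    mkdir(dataDir);
end
for k = 1:3
    T = table((1:5)', 85 + 5*rand(5, 1), 'VariableNames', {'Trial', 'Yield'});
    writetable(T, fullfile(dataDir, sprintf('run%02d.csv', k)));
end

% Find every matching file and run the same basic analysis on each one.
files = dir(fullfile(dataDir, 'run*.csv'));
meanYield = zeros(numel(files), 1);
for k = 1:numel(files)
    T = readtable(fullfile(files(k).folder, files(k).name));
    meanYield(k) = mean(T.Yield);
end

% Collect the per-file results in one table.
results = table({files.name}', meanYield, 'VariableNames', {'File', 'MeanYield'});
disp(results)
```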

And sometimes, when you're working with collaborators especially and you have these hundreds of Excel sheets, you need to have some form of documentation to see what you're doing throughout your data analysis. And this type of documentation sometimes probably lives in separate Word documents, or some other separate format entirely, or a separate platform entirely sometimes. But in MATLAB, you can have one platform to provide all this-- the coding, as well as the documentation, and so on-- using just the live scripts alone.

And then finally, you will need a tool which is easy and gives you a lot of flexibility to define custom properties and figures. And in MATLAB, you can use the Property Inspector for a point-and-click interface. Or you can manually write lines of code to change the properties of your figures. And this is very, very simple in MATLAB. And we just saw how that can be done. And more than anything else, all your data collection, analysis, visualization, everything stays on one single platform, which is MATLAB.

So thank you so much for staying till now. I will look up a few questions, if they're unanswered, and answer them now. But before that, the next webinar is on advanced statistics. So you'll be learning more about hypothesis testing and what are the different types of optimization techniques that you can use, design of experiments, all of this. If you're interested, please join us next time. And it's on October 28.

And we have this webinar series ongoing until November 23. We have topics in bringing in basic image processing to data science, machine learning, and deep learning as well. So looking forward to seeing you there. And I'm sure all of you are already aware MATLAB is not just a desktop platform anymore. It is also available online. And also, there is a cloud location for storing all your MATLAB files called MATLAB Drive. Without having to install or download anything you can access the latest version of MATLAB on any device using MATLAB Online.

And finally, if you are interested to get started or are motivated to learn more about this based on what you saw today, please get started with MATLAB Onramp. This is a free two-hour course which runs in the browser, provides you hands-on practice, and, of course, a completion certificate at the end.

And once you've completed your MATLAB Onramp, go to this particular course. This is a fairly involved course, with 17 to 21 hours of content, on MATLAB for data processing and visualization. I request you to take this course. It might be very useful to you to understand the nitty gritty of how you can do your data processing and visualization.

And if you are not sure whether your institute has a campus-wide license and you want to know whether you have access to MATLAB, you can check on this particular portal. If you want to talk to us more about your research and get some advice from MATLAB staff, please reach out to us. You can reach out to us at the email IDs that are specified over here. So we'll be happy to talk to you further and help you out in your own research.
