By the end of this talk, my hope is that you will get more than an overview of machine learning—you’ll have an understanding of the many machine learning algorithms available in MATLAB and you will see how MATLAB is a very good interactive environment for evaluating and choosing the best machine learning algorithm for your problem.

Why should we care about machine learning? The discipline of data analytics, which we’ll talk about in the next slide, is actually driven by machine learning. In machine learning, one trains an algorithm with initial data and then uses that resulting model or knowledge to predict outcomes or classify new information.

So, what is data analytics, and why is machine learning critical in this area? Data analytics solutions let organizations improve business outcomes by allowing them to identify patterns and build predictive models around them. The data involved may come from multiple sources. It may be streamed from sensors in a car or agricultural equipment. It may be read from a database or some sort of a file.

Data analytics has some key challenges: One may be overloaded with data, lots of observations, lots of variables, right? Now, one may want to use the information in the data to aid in decision making. When we want to make sense of the data using analytics and then integrate that knowledge with enterprise-wide systems for decision making, machine learning is the advanced analytics tool that lets you extract intrinsic value out of your data.

Machine learning has applications in a variety of industries, ranging from financial services to pharmaceuticals. Applications vary from tumor detection and drug discovery to recognizing patterns in images, videos, or audio data. Anywhere there is a need to predict something, machine learning is likely to be suitable.

There are some inherent challenges associated with machine learning, though. It’s a very complex field of study. There are so many predictors, so many algorithms, where should I even start? Which one is the most suitable for my needs? Typically, machine learning may entail black-box modeling. Is that suitable for my application? Do I have the technical know-how to deal with it? And most important of all, it may take numerous iterations before you find a suitable model, or you may be working with large data sets, which take a lot of time to process. All these things have an impact on how long it takes for you to get to the results. Today, I want to highlight how MATLAB can be used to address all of these challenges.

MATLAB not only provides an interactive environment to visually explore the data and gain insight, but along the way, you might identify some very interesting questions to ask of your data. There are apps that help you explore the impact of different models on your data. The framework makes it easy to try different algorithms and measure their accuracy. One can also incorporate parallel computing paradigms to speed up computations. And not only that, once you’re happy with the models, you can integrate them in your larger data analytics and decision-making workflow.

Machine learning can be branched into unsupervised learning and supervised learning. Now, in the case of unsupervised learning, you group your data based on some similarity or characteristic. You may not have *a priori* knowledge about the groupings themselves. There are several clustering techniques available to help us with this endeavor.

In the case of supervised learning, one creates a model to predict a quantity of interest. We’ll call it output, or response variable. You have an expectation of what factors—called inputs or predictors—will have an impact on response, even though you may not have an understanding of the exact relationship. You would use the predictors and response data to train your model and use the model to make predictions, given only the input data.

If your response is discrete in nature, for instance, if you’re trying to classify tumor size as small, medium, or large, then it is a classification problem. If your response is continuous in nature, for instance, if you’re trying to predict electricity demand from a grid in kilowatt hours, then it is a regression problem. There are a variety of clustering algorithms available in MATLAB. Today, we’ll cover both partitional clustering techniques as well as overlapping clustering techniques.

Partitional clustering techniques assign each data point into a single cluster. For instance, K-means, hierarchical, self-organizing maps, etc. On the other hand, in case of overlapping clustering techniques, a data point may simultaneously belong to more than one cluster. It computes the probability of a data point belonging to a particular cluster. For instance, Gaussian mixture models, fuzzy C-means, etc. In our example, we’ll cover many of these techniques.

MATLAB has several algorithms for linear, nonlinear, and nonparametric regression, including neural networks, boosted and bagged decision trees. All of these techniques can be used to solve a classification problem as well. Additionally, some other algorithms are available for solving a classification problem. We will begin by solving a classification problem, leveraging many of the algorithms listed here.

Before we jump into our example, let us look at a typical workflow for supervised machine learning. You begin by importing the data, preprocessing it, and cleaning it up. You may spend some time exploring the data, perhaps creating some visualizations, and then find your variables of interest. In the next step, you would select a model and train it on the data. Then, you would measure the accuracy of the model so you can compare the different models you create. You may then select another algorithm and repeat the process. This step can be very iterative in nature. Once you select the best model, you can use it to make predictions on the new data set.

As we are moving through this workflow, we’ll look for opportunities to speed up our computations. Many of these tasks in the workflow can be computationally intensive.

Classification lets you predict the best group a new observation belongs to. In this example, we have data from a marketing campaign conducted by a bank. They were selling a term deposit product; they collected information around each contact with the customer. This included customer information such as their age, marital status, etc., and campaign information such as how and when they were contacted, how long the calls lasted, and so on. All these will be used as inputs to our model. Each observation was also labeled such as, yes, they bought the product, or no, they did not buy the product in the end. This is what we want to classify.

Classification lets you accurately group data you’ve never seen before. When we are done, MATLAB will produce a report like this. I’ve added comments describing all the steps of my analysis. The code that is required is also embedded in this report. Any visualizations that were created are also part of this report.

In this case, we have our data in a CSV file. MATLAB has a lot of interactive tools to read in information from various file formats. Here, MATLAB figures out that this is a spreadsheet and has launched a very intuitive tool for me to work with it. Notice that my data contains both numeric information as well as text. Tables are an excellent container to work with heterogeneous data. I can bring in this data into my workspace with the click of a button.

In fact, this app, many other apps I will show you, and most other apps I will not get a chance to show you, have this capability of auto-generating code. I can ask MATLAB to give me this code that I can use to automate all the steps of the analysis. In fact, if I want to teach myself a bit of MATLAB, I can do that as well. I’m not going to save this file, as I’ve already done that.

Now, I can look at the data itself, but I personally like to look at charts, visualize the information. You can see right here that MATLAB has this capability to create simple line charts, bar charts, histograms, 2D graphics, 3D graphics, with the click of a button. Let’s plot a couple of variables—let’s say age and bank balance—with respect to whether people bought our product or not.

Here, I can get some insight into the data itself as I’m creating these visualizations. Perhaps I can ask myself a question, do people with higher bank balances not tend to buy our product? Maybe it’s something for us to investigate more. I can create these various visualizations to build my intuition around the data as well, and notice as I’m interacting with any of these components on the desktop, MATLAB is giving me the commands to help me learn some of those things.

Now, the information here in this variable contains both my predictors as well as my responses. My response variable that I’m trying to predict or classify, in this case, is in this last column, labeled “Y.” I can pull that information out and make a vector *Y*, and I’m going to create a matrix *X*, which will contain all the information about my predictors. The functions that I’ll subsequently use are going to make use of that data. Now, instead of typing everything here, I’m going to open up a script where I’ve done all the work.

I’m going to execute this code section by section. Here in this first section, I’m making use of this auto-generated function to pull in this information from the file. The next section is where I break my data into a training set and a test set, making use of functions like cvpartition. The training set will be used to train the models and calibrate their parameters, and the test set will be used for out-of-sample evaluation of the models for comparison purposes.
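
As a rough sketch of that split (the variable names X and Y are placeholders for the prepared predictors and response, not names taken from the actual script):

```matlab
% Hold out 30% of the observations for out-of-sample testing.
% cvpartition stratifies by the class labels in Y, which helps
% preserve the yes/no proportions in both sets.
rng(1);                                  % for a reproducible split
cv     = cvpartition(Y, 'HoldOut', 0.3);
Xtrain = X(training(cv), :);
Ytrain = Y(training(cv));
Xtest  = X(test(cv), :);
Ytest  = Y(test(cv));

tabulate(Ytrain)   % check that class proportions are preserved
tabulate(Ytest)
```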

I can also look at the tabulated results, and I can see that this data is pretty skewed: roughly 90% of the people have chosen not to buy the product, and only about 10% of the people have actually bought it. And you can see, when I created my training set and test set, I have similar proportions in each. Since I have this data, I’m going to make use of this app to get started. I will start by building a neural-network-based model. There are a bunch of apps available in MATLAB to help me explore the depths of the different toolboxes and libraries that come along with the tools.

I’m going to start out with the neural network apps. The Neural Network Toolbox has a clustering app, a regression app, as well as an app to help me classify my data. This pattern recognition app does that for me. I can provide it my inputs, my predictors, and my outputs, my targets, or my response; the different observations are along the different rows. These algorithms further break down the data into training, validation, and testing sets. I’m going to go with the defaults, here. I’ll create a one-hidden-layer, 10-neuron network, again, sticking with the defaults.

Once I’m happy with my selections, I can train my network. In fact, as the training is going on, I can look at the changes, how it’s making progress, while the training is happening. I can wait for the training to complete, but here, I’m going to just stop the training in between. If I was not happy with the results, then I would retrain the network. If I was happy with the results, I can actually create functions to integrate with, let’s say, my Simulink model, or a larger application that I was trying to deploy. I can even generate a function to look at the weights and biases of my network. I’m not quite ready, so I’m going to move along.

Here, I can save my results to my workspace. I can even generate code to redo this work. Again, it’ll help me with the automation part. I can look at the code the way MATLAB is setting all those options, pretty much the things that I’d seen in those different screens.

I can look at this function in more detail in the documentation. This is a very typical documentation page with brief calling syntaxes, more elaborate description, the different training algorithms, and these different options. One of the options that I want to highlight here is this useParallel and the useGPU flag that you can set. These flags allow me to throw more hardware resources, if they are available, on the problem itself. What I did next in this particular script was I created a function out of the code that was auto-generated, and I made this little change where I set up my useParallel flag. I can execute the code, here. It’s going to speed up because it’s making use of multiple workers that are running on my machine.
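
The change described above might look like this (Xtrain and Ttrain are assumed names for the prepared inputs and 0/1 targets; the only addition being discussed is the 'useParallel' argument to train):

```matlab
% Create and train a pattern recognition network with one hidden
% layer of 10 neurons. train expects observations as columns,
% hence the transposes.
net = patternnet(10);
[net, tr] = train(net, Xtrain', Ttrain', 'useParallel', 'yes');

% Scores for the held-out data come straight from the network.
scores = net(Xtest');
```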

Once I have the trained network, I can use it to make predictions. You would notice that I can make these networks return a score. These scores can be used to create the predictions. Here, I’m making a simple assumption and rounding the scores: anything above 0.5 is going to be marked as people who are going to say yes in my out-of-sample data set; otherwise, they’re going to say no. And I can see that in 90% of the cases, I have it correct.

Perhaps I want to look at the confusion matrix, look individually, how am I doing there? This clearly shows that I’m able to predict the people who are saying no very, very well. However, the people who are saying yes, I’m doing a fairly poor job—only 31% of the time, I have that correct. If I was the person who was designing this marketing campaign, I would perhaps be more interested in the people who will say yes.
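
The per-class breakdown described here can be sketched with confusionmat (assuming, as an illustration, that scores and Ytest are the network scores and true 0/1 labels for the test set):

```matlab
% Threshold the network scores at 0.5 to get hard yes/no labels,
% then tabulate the confusion matrix against the true labels.
Ypred = scores > 0.5;
C = confusionmat(Ytest, double(Ypred));

% Per-class accuracy: diagonal counts divided by row totals.
classAccuracy = diag(C) ./ sum(C, 2);
```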

But why round the scores? Why use 0.5 as the threshold? Why not use 0.1 as the threshold? What happens if I do that? Wow. Quickly, I notice that 85% of the time, now, I’m able to correctly predict the people who are going to say yes. Of course, it comes at a cost: my prediction rate for the people who are going to say no goes down. Why even 0.1? I can generate these receiver operating characteristics. The perfcurve function here does all that; it tries different thresholds and generates this curve for me. I have the true positive rate versus the false positive rate on this chart. Now, it’s up to me how I make use of this chart to pick the different thresholds.
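
A minimal sketch of generating that curve (again assuming scores and Ytest hold the test-set scores and 0/1 labels, with 1 marking the "yes" class):

```matlab
% Sweep over thresholds and trace out the ROC curve; perfcurve
% also returns the thresholds themselves and the area under the curve.
[fpr, tpr, thresholds, auc] = perfcurve(Ytest, scores, 1);

plot(fpr, tpr)
xlabel('False positive rate')
ylabel('True positive rate')
title(sprintf('ROC curve (AUC = %.2f)', auc))
```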

Let’s say I was designing a bank marketing campaign which was fairly cheap. I just want to make sure that I get everybody who’s likely to say yes. I could pick up a point on this chart where I get 90% or 95% of the people who are likely to say yes, and of course, only 30% of the cases, I would get it wrong. On the other hand, if I was designing a marketing campaign which was very expensive in nature and I did not want to get anything wrong, I would pick up a threshold right here on this curve. I’ll get 40% of the people who are likely to say yes to my request, and I’ll get it only 5% of the times, perhaps, wrong. So, it depends on me, how I make use of these charts.

Now, as I write code here, MATLAB produces these context-sensitive tabs. And I can publish this code into a report. By default, MATLAB is going to generate an HTML file. I could have as easily generated a PDF or a Word document here. Notice my comments have turned into formatted text. MATLAB has automatically introduced a table of contents for me, and everything else, part of the analysis, has become a part of this report. And you can see how I’ve automated all the steps of the workflow here, from pulling in information to doing the analysis to auto-generating a report which I can share with other people.

Now, after this, I went ahead and tried all the different classification techniques that I had shown you on the slides. Instead of running through them here, I will open up a report where I’ve done all the work.

Notice that after neural networks, I tried all these different machine learning techniques, from logistic regression models to decision trees and bagged ensemble algorithms. The philosophy that I applied going through these was exactly the same. Let’s discuss one of these in more detail.

Let’s say we look at discriminant analysis. How do I even find out what different functions are available to help me with this machine learning task? This is, again, a good time to go to the documentation. I can search for things, and if I know where to look, I can find a whole section of the Statistics Toolbox dedicated to machine learning. Right now, we are interested in supervised learning techniques, so I can find all the different functions available right here. I can find all the functions, let’s say, associated with discriminant analysis that will help me with this task. There are complete, elaborate examples to walk me through this whole workflow as well. Another aspect of the documentation is this page which compares the different algorithms. You’ll find it very helpful when you’re deciding which algorithms to employ based on the data that you’re working with. There are some suggestions for picking ensemble algorithms as well.

So, here, as I’m applying each of these classification techniques, all I do is find the function to train a model on my data. The data that I prepared earlier can be used as it is with all the classification algorithms. I train the model, use that model to make a prediction for my out-of-sample data, or my test data, and then generate some sort of metric that I can utilize to compare the different algorithms. Here, I’m going to be making use of confusion matrices.
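
That train / predict / compare pattern might be sketched like this for two of the classifiers (the struct and its field names are just an illustrative way to collect results):

```matlab
% Every fitc* function accepts the same prepared Xtrain and Ytrain.
models.tree = fitctree(Xtrain, Ytrain);   % decision tree
models.lda  = fitcdiscr(Xtrain, Ytrain);  % discriminant analysis

names = fieldnames(models);
for k = 1:numel(names)
    Ypred = predict(models.(names{k}), Xtest);
    C = confusionmat(Ytest, Ypred);
    fprintf('%s per-class accuracy: %s\n', names{k}, ...
            mat2str(diag(C) ./ sum(C, 2), 3));
end
```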

I went through and applied all of these different techniques. Wherever available, I even employed parallel computing, the useParallel flag, for instance, with decision trees or ensemble learning. Then, I created this chart to compare my results. I want to quickly visualize all the results. These are all the default predictions from each of these classification models that I utilized, and I can quickly compare them right here. The dark blue bars are essentially corresponding to the people who said no, and the light blue bars are corresponding to the people who said yes. I may be interested more in the light blue bars towards the right side, and of course, you can imagine, if I chose another threshold based on the scores that were returned by each of these techniques, I can move these bars up and down as needed. This is just one snapshot of that performance curve for all of these techniques.

Similarly, I computed and plotted all the receiver operating characteristics, or ROC curves, for all of the techniques. And you can see that the different techniques behave differently in different regimes. So, for instance, in this regime, neural networks happen to be doing the best, whereas somewhere here, my bagged decision trees are doing better. I’m not forced to pick and choose one of these techniques. Perhaps I can combine the results from these different models and come up with my own classifier, which is better than these. And in this next section, that is exactly what I’m doing. I’m combining the scores: I’m taking averages, I’m creating another classifier by looking at the medians, and another using a voting scheme. And here, I’m looking at the results for that. And in the combined results, I see that in many cases I have a better classifier when I make use of the combo median technique for this particular dataset. There are certain regimes where my combo mean is actually doing better. And my voting technique did not really do well in this particular case. The point is, you can try these different techniques and use a combination of them as well, if you wanted. You don’t have to restrict yourself to what you’ve learned in perhaps the books that you’ve been reading.
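
The combinations described here are simple to express; as a sketch, assume allScores is an nObs-by-nModels matrix of "yes" scores collected from the individual models, and threshold is whatever cutoff you have chosen:

```matlab
% Ensemble by averaging or taking the median of the model scores,
% then thresholding the combined score.
comboMean   = mean(allScores, 2)   > threshold;
comboMedian = median(allScores, 2) > threshold;

% Majority vote over the individual hard predictions.
votes     = allScores > threshold;           % one column per model
comboVote = sum(votes, 2) > size(votes, 2) / 2;
```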

A lot of people don’t stop here. They also look at the impact of the different predictors on the models that they are creating. I will not go into too many details, but some of the things that are covered in this particular section are: how many trees should be sufficient for me to create a model that is equally effective when I’m applying these ensemble learning techniques? Perhaps I want to look at the relative importance of all the different predictors that I had in my data. Perhaps I want to create a more parsimonious model by making use of techniques such as sequential feature selection, and come up with a model which is almost as good as my full model. This will not only help me better understand the model that we are creating, but it’s going to be computationally less intensive as well, so I can get to the results faster. You can look at this in more detail at your own leisure.
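
Sequential feature selection, for instance, might be sketched like this (the decision-tree criterion is an illustrative choice, not necessarily the one used in the actual demo):

```matlab
% Criterion: number of misclassified test observations for a
% decision tree trained on the candidate feature subset.
critfun = @(Xtr, Ytr, Xte, Yte) ...
    nnz(Yte ~= predict(fitctree(Xtr, Ytr), Xte));

% Forward selection with 5-fold cross-validation; 'selected' is a
% logical mask over the columns of X.
opts = statset('Display', 'iter');
selected = sequentialfs(critfun, X, Y, 'cv', 5, 'options', opts);
```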

In this example, we went through a number of different classification techniques available in MATLAB, and I made use of the documentation to identify these techniques and understand what each of them is doing. I made use of these visualizations and apps along the way to help me with my results. And wherever possible, I made use of additional hardware resources by simply setting a flag.

The biggest benefit that you are getting when you’re making use of MATLAB is that you can focus on the modeling tasks. All these algorithms are already built in; you don’t have to worry about validating their accuracy. You have support available if you don’t completely understand what each of these things is doing, or how they should be utilized.

Let’s look at a clustering example as well. Clustering lets you segment data into groups based solely on similarities in your data. In the screenshot, this two-dimensional data has been clustered into different groups, identified by different colors. In the next example, we’ll do something similar, but our problem is much more complicated, as we’ll work with four or so predictors and many more data points. To begin with, it is hard to even visualize such data.

This example is based on a customer use case. Here, we want to cluster a bunch of bonds together based on some characteristics. We will look at partitional and overlapping clustering techniques. Another key challenge is to decide the number of clusters that naturally exist in the data. We’ll look at some ways to determine that.

So, let’s step through this code again, as we did earlier. I have some data already available that I’ve loaded here. I have some bonds information and a bunch of features associated with those bonds, including credit ratings and coupon rates and other things. Now, notice I have this first column, where I have corporate and municipal bonds. This is a categorical variable. And here, this column eight is rating, which contains credit ratings, where AAA is considered to be a better rating than, let’s say, a CCC. So, there is an order associated within the ratings.

When I have defined some of my variables as categorical and ordinal, I can very easily slice that data. For instance, here, I’m going to only work with corporate bonds which have been rated CC or higher. Now, I can again create different kinds of visualizations to look at the data itself, build some insight. Perhaps I can plot the coupon rate and the yield to maturity along with credit ratings, on a chart. And here, I can quickly see that my higher-rated bonds, like AAAs, have a much lower yield to maturity and a coupon rate versus poorly rated bonds, which are towards the right top corner side of this chart.
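
That kind of slicing might look like this; the table, variable names, and rating order below are illustrative stand-ins for the actual bond data:

```matlab
% Mark Type as categorical and Rating as ordinal, so ratings
% compare in credit-quality order (lowest to highest here).
bonds.Type   = categorical(bonds.Type);
bonds.Rating = categorical(bonds.Rating, ...
    {'CC','CCC','B','BB','BBB','A','AA','AAA'}, 'Ordinal', true);

% Keep only corporate bonds rated CC or higher; the >= comparison
% works because Rating is ordinal.
corp = bonds(bonds.Type == 'Corporate' & bonds.Rating >= 'CC', :);
```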

Now, instead of using all of the different features, I’m going to just pick out four features from this data set. I’m going to assume that coupon rate, yield to maturity, current yield, and the rating should be sufficient for me to cluster the data. Now, once I’ve prepared my data, this is what the matrix looks like. I have the different observations along the rows and the different features that I want to work with along the columns. Whenever you’re working with any data set, finally, once you have prepared the data, this is what it should look like: a matrix with observations and features.

Now, here, I went ahead and applied both partitional as well as overlapping clustering techniques. So, here, in this case, I applied K-means, hierarchical, and neural networks – self-organizing maps as the partitional clustering techniques. Partitional clustering techniques return the index to which each observation belongs, or the cluster number to which each observation belongs. On the other hand, overlapping clustering techniques return a matrix containing probabilities—the probability each observation belongs to a particular cluster. Here, I applied fuzzy C-means and Gaussian mixture models to this data.

First, I start with K-means. Now, when you’re applying clustering techniques, you have to define the similarity metric—what defines the closeness of the different observations. Typically, it’s also known as the distance metric. So, if I look at the documentation for K-means, you will notice that there are a series of different distance metrics that you can apply. And again, there are a lot of other options, including the useParallel option that I can apply to speed up computations.
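
As a sketch (Xb stands in for the prepared nObs-by-4 matrix of bond features):

```matlab
% Cluster into 3 groups; the 'Distance' option is the similarity
% metric discussed above, and replicates guard against poor local
% minima from random initialization.
opts = statset('UseParallel', true);
[idx, centroids] = kmeans(Xb, 3, ...
    'Distance',   'cityblock', ...   % one of several available metrics
    'Replicates', 5, ...
    'Options',    opts);
```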

I also wrote this helper function, plot bond clusters, that allows me to visualize the data. Now, here, you will notice it creates a three-dimensional visualization. I have just chosen three out of my four features to plot on this particular chart: the coupon rate, the yield to maturity, and the credit ratings. The different colors suggest the different clusters that each data point has been put in.

In a similar way, I can choose the function related to hierarchical clustering and I can cluster the data accordingly. In the case of hierarchical clustering, what MATLAB does is it takes the points which are close to each other and assigns them into a cluster. So, at the beginning, all of the points belong to their own cluster, and then the closer ones are put into another cluster, until all of the points come along and get clustered into a single cluster. Hence the name, “hierarchical clustering.” And the results can be seen here. Again, you can choose different similarity metrics while applying hierarchical clustering as well.
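
The hierarchical workflow described here is typically three calls (Xb again stands in for the prepared feature matrix):

```matlab
% Pairwise distances, then an agglomerative linkage tree, then cut
% the tree so that at most 3 clusters remain.
D    = pdist(Xb, 'euclidean');    % other distance metrics work too
tree = linkage(D, 'average');
idx  = cluster(tree, 'maxclust', 3);
```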

I’ve made use of the neural network clustering app to auto-generate all of this code that uses self-organizing maps for clustering this data. And notice, I’m making use of the same function to visualize my results. And I have another result for clustering, making use of SOMs. Neural networks also have something known as competitive layers that can be used for clustering as well.

In the case of overlapping clustering techniques, as I’ve mentioned earlier, it’s going to return—instead of a vector, it’s going to return a matrix—in this particular case, a matrix *U* that contains the probabilities. And I’m making use of fuzzy C-means, here. Now, if I look at *U*, it contains three rows corresponding to the three clusters I had requested and the probability that each observation is going to belong to a particular cluster right here.
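
A sketch of that call (fcm comes from the Fuzzy Logic Toolbox; Xb is the assumed feature matrix):

```matlab
% Fuzzy C-means returns cluster centers and a membership matrix U:
% one row per requested cluster, one column per observation.
nClusters = 3;
[centers, U] = fcm(Xb, nClusters);

% Hard assignment by highest membership, e.g. for visualization.
[~, idx] = max(U, [], 1);
```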

The first thing I did was just visualize the data: partition the data into each of these clusters based on the highest membership probability, and then use my same function to visualize the results. And I can see, based on the highest probability that a point belongs to a cluster, this is what the fuzzy C-means result looks like.

Of course, I may not want to use the results in that sense. I may want to visualize the probability planes themselves, and this visualization allows me to see that. I can use these probabilities depending on my needs. Perhaps I want to use the probability as weights, while I’m pricing these bonds based on the different characteristics of the different clusters. Similarly, I apply Gaussian mixture models. Again, all of these functions, information about these functions, is available in the documentation. And the first thing, again, I did was, based on the highest probabilities, partition the data, and then I can look at, again, the probability planes. Here, I’m looking more at a heat map of the probability planes.
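
The Gaussian mixture step might be sketched like this (the regularization value is an illustrative safeguard against ill-conditioned covariance estimates, not necessarily a setting from the demo):

```matlab
% Fit a 3-component Gaussian mixture, then get, for each bond, the
% posterior probability of belonging to each component.
gm  = fitgmdist(Xb, 3, 'RegularizationValue', 0.01);
P   = posterior(gm, Xb);     % nObs-by-3 probability matrix
idx = cluster(gm, Xb);       % hard assignment by highest posterior
```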

So far, I have been requesting these algorithms to give me three clusters. Do three clusters exist in my data naturally? Are there more, or are there fewer? You can ask any of these algorithms to give you X number of clusters, and they will, no matter whether that many clusters exist in the data or not. It is the responsibility of the user to figure out how many clusters naturally exist in the data.

One way: here, I’m looking at some of the partitional clustering techniques, K-means and hierarchical clustering, which return the distances used to partition the data between these points. On the top chart, for instance, I have plotted the distance metric associated with hierarchical clustering. I sorted all the points based on the cluster number and the distance metric and plotted them. I would expect that all the points within the same cluster are going to be close to each other, so the distance metric should be small; in this particular case, it should be a darker shade. Those same points should be farther away from the points in the other clusters, so the distance metric should be higher, and I would expect those points from the other clusters to be brighter in shade. So, I would actually expect a block-diagonal sort of pattern to appear in this chart, and that is what I see for both hierarchical as well as K-means clustering. So, I can visually see whether this many clusters naturally exist in my data or not.

For hierarchical clustering, what I can also do is generate dendrograms. A dendrogram is a plot that allows you to visually see the linkages of the different points in the cluster. These vertical lines are proportional to the distance metric that has been used while the cluster was generated.

Now, if there are naturally X number of clusters in the data, I would expect longer lines towards the top and shorter lines towards the bottom of this chart, because the points in the same cluster should be close to each other, and then, when you start connecting clusters whose points are far from each other, the lines should be longer.

Now, wherever I cut this dendrogram is going to determine how many clusters I’m going to get. For instance, on the Y-axis, if I cut the dendrogram around 0.9, I’ll get two clusters, whereas if I cut the dendrogram at 0.7, I’ll perhaps get three clusters. I’ve also put down the cophenetic correlation coefficient at the top of this chart. That is a coefficient that tells me how truly the length of these vertical lines is representative of the similarity metric that I have used. I would like to see this correlation coefficient to be high because that means that these lines are reliably representing the distance metric that I’m utilizing on this chart.
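
Both the dendrogram and the cophenetic correlation come from the linkage tree; a minimal sketch, again with Xb standing in for the prepared features:

```matlab
% Build the tree, then measure how faithfully the tree's merge
% heights preserve the original pairwise distances.
D    = pdist(Xb);
tree = linkage(D, 'average');
c    = cophenet(tree, D);

dendrogram(tree)
title(sprintf('Cophenetic correlation = %.2f', c))
```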

One can also make use of silhouette distances. So, in this particular case, I’ve made use of K-means clustering and computed the silhouette distances for the different number of clusters I was creating. For instance, here, if I make use of two clusters, I can see these points are sorted by the cluster number on this chart and the silhouette values are also ordered. So, the higher the silhouette value, the more confidence you have that this point is going to belong to cluster number one, right here. For these points, where the silhouette value is low, that tells me that these points have been marked by the algorithm into cluster number one, but the confidence isn’t that high. They’re pretty close to the points in the cluster number two, as well. And you can create these charts for as many clusters as you create.

So, next, what I did was I created 2 through 15 clusters and plotted the average silhouette values on this chart. And you can see here that the average values are fairly high for two, three, and four clusters, but they suddenly drop when I create five clusters out of my data. So, that tells me there are perhaps at most four clusters in this data, not more.
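
That sweep over cluster counts might be sketched like this:

```matlab
% Average silhouette value for 2 through 15 K-means clusterings;
% a sharp drop suggests the natural number of clusters was exceeded.
avgSil = zeros(14, 1);
for k = 2:15
    idx = kmeans(Xb, k, 'Replicates', 3);
    s   = silhouette(Xb, idx);
    avgSil(k - 1) = mean(s);
end

plot(2:15, avgSil, '-o')
xlabel('Number of clusters')
ylabel('Mean silhouette value')
```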

There are a few different clustering evaluation metrics available in the documentation. You can utilize these other evaluation metrics as well, to see how many clusters naturally exist in the data that you’re working with.

Again, in this particular case, I went through and applied a bunch of different unsupervised learning techniques and made use of visualizations to quickly understand better the whole analysis. All of these functions that I have been utilizing are actually white box. Essentially, I can open up any of these functions that I have utilized and look underneath the hood, the code, if I wanted. So, you can put break points in the code, step through the code, and exactly see how we have implemented everything. If you like what you see but there are some things that you would want to change, you can always save these functions with a different file name and apply your changes. And I was able to quickly go through these different techniques and get to my results.

So far, I showed you how machine learning in MATLAB lets you overcome challenges in productivity, in extracting knowledge from data, and in computation speed. We looked at how you can achieve high productivity through data preparation, interactive exploration, and visualization in MATLAB. The machine learning algorithms that are available provide both depth and breadth in classification, clustering, as well as regression techniques.

And you can further improve performance by leveraging parallel computing. Today, I did not get a chance to talk about deployment and enterprise-wide integration. MATLAB makes it possible to use your machine learning models in production via push-button deployment.

Also, by using machine learning solutions provided by MathWorks, you mitigate technology risk by getting access to high-quality libraries, technical support, training, as well as advisory services when needed. There are many machine learning resources available on mathworks.com. To begin, you can visit the machine learning page that has links to videos and introductory examples. Some of these webinars provide more detailed examples on regression, classification, and clustering. We also have several application-specific webinars for machine learning in life sciences, energy forecasting, and financial credit risk modeling.