Neural Network Toolbox
Clustering Data
Clustering data is another excellent application for neural networks. This process involves grouping data by similarity. For example, you might perform:
- Market segmentation by grouping people according to their buying patterns
- Data mining by partitioning data into related subsets
- Bioinformatic analysis by grouping genes with related expression patterns
Suppose that you want to cluster flower types according to petal length, petal width, sepal length, and sepal width [MuAh94]. You have 150 example cases for which you have these four measurements.
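As with function fitting and pattern recognition, there are three ways to solve this problem. This section covers two of them: command-line functions and the Neural Network Toolbox Clustering Tool GUI.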
Defining a Problem
To define a clustering problem, arrange the Q input vectors to be clustered as columns of an input matrix. For instance, you might want to cluster a set of 10 two-element vectors, arranged as a 2-by-10 matrix with one vector per column.
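As an illustration, the sketch below builds such an input matrix. The values are made up for this example and do not come from the toolbox documentation; only the layout (one vector per column) matters. The variable name inputs is likewise chosen here for illustration.

```matlab
% Hypothetical data: 10 two-element vectors, one per column (2-by-10).
% The numbers are illustrative only.
inputs = [0.1 0.2 0.9 1.0 0.15 0.95 0.5 0.55 0.45 0.05;
          0.2 0.1 0.8 0.9 0.25 0.85 0.5 0.45 0.60 0.15];

[R, Q] = size(inputs);   % R = elements per vector, Q = number of vectors
fprintf('%d vectors with %d elements each\n', Q, R);
```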
The next section demonstrates how to train a network from the command line, after you have defined the problem.
Using Command-Line Functions
- Use the flower data set as an example. The iris data set consists of 150 four-element input vectors.
- Load the data as follows:
load iris_dataset
This data set consists of input vectors and target vectors. However, you only need the input vectors for clustering.
- Create a network. For this example, you use a self-organizing map (SOM). This network has one layer, with the neurons organized in a grid. (For more information, see Self-Organizing Feature Maps.) When creating the network, you specify the number of rows and columns in the grid:
net = newsom(irisInputs,[6,6]);
- Train the network. The SOM network uses the default batch SOM algorithm for training.
net = train(net,irisInputs);
- During training, the training window opens and displays the training progress. To interrupt training at any point, click Stop Training.

- For SOM training, the weight vector associated with each neuron moves to become the center of a cluster of input vectors. In addition, neurons that are adjacent to each other in the topology should also move close to each other in the input space. The default topology is hexagonal; to view it, click SOM Topology from the network training window.

- In this figure, each of the hexagons represents a neuron. The grid is 6-by-6, so there are a total of 36 neurons in this network. There are four elements in each input vector, so the input space is four-dimensional. The weight vectors (cluster centers) fall within this space.
Because this SOM has a two-dimensional topology, you can visualize in two dimensions the relationships among the four-dimensional cluster centers. One visualization tool for the SOM is the weight distance matrix (also called the U-matrix).
- To view the U-matrix, click SOM Neighbor Distances in the training window.

- In this figure, the blue hexagons represent the neurons. The red lines connect neighboring neurons. The colors in the regions containing the red lines indicate the distances between neurons. The darker colors represent larger distances, and the lighter colors represent smaller distances.
A band of dark segments crosses from the lower-center region to the upper-right region. The SOM network appears to have clustered the flowers into two distinct groups.
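Once the network is trained, each input vector can be assigned to its winning neuron from the command line. The following is a minimal sketch, not the toolbox's prescribed workflow. It assumes that sim returns one column per input vector with a single nonzero entry marking the winning neuron, and that vec2ind converts those columns to neuron indices. The names clusterIdx, hits, and W are chosen here for illustration.

```matlab
% Setup repeated from the steps above so the snippet runs on its own.
load iris_dataset
net = newsom(irisInputs,[6,6]);
net = train(net,irisInputs);

outputs    = sim(net,irisInputs);   % one column per input vector
clusterIdx = vec2ind(outputs);      % index (1..36) of the winning neuron

% Count how many input vectors each neuron wins.
hits = zeros(1,36);
for k = 1:numel(clusterIdx)
    hits(clusterIdx(k)) = hits(clusterIdx(k)) + 1;
end

% Weight vectors (cluster centers): one row per neuron,
% one column per input element.
W = net.IW{1,1};
disp(size(W));
```

Neurons with a hit count of zero did not win any input vector. Their weight vectors still occupy positions in the four-dimensional input space, between the populated clusters.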
To get more experience in command-line operations, try some of these tasks:
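One such task is to reproduce the plots from the training window at the command line. The sketch below assumes that the SOM plotting functions plotsomtop, plotsomnd, and plotsomhits are available in your toolbox version and accept the arguments shown. Check the function reference for your release before relying on it.

```matlab
% Setup repeated from the steps above so the snippet runs on its own.
load iris_dataset
net = newsom(irisInputs,[6,6]);
net = train(net,irisInputs);

plotsomtop(net);               % SOM topology
plotsomnd(net);                % neighbor distances (U-matrix)
plotsomhits(net,irisInputs);   % number of input vectors won by each neuron
```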
Using the Neural Network Toolbox Clustering Tool GUI
- Open the Neural Network Toolbox Clustering Tool window with this command:
nctool

- Click Next. The Select Data window appears.
- Click Load Example Data Set. The Clustering Data Set Chooser window appears.
- In this window, select Simple Clusters, and click Import. You return to the Select Data window.
- Click Next to continue to the Network Size window, shown in the following figure.
- The size of the two-dimensional map is set to 10. This number represents one side of a two-dimensional grid, so the total number of neurons is 100. You can change this number in another run if you want.
- Click Next. The Train Network window appears.
- Click Train.
- The training runs for the maximum number of epochs, which is 200.
- Investigate some of the visualization tools for the SOM. Under the Plots pane, click SOM Sample Hits.
- This figure shows how many of the training data are associated with each of the neurons (cluster centers). The topology is a 10-by-10 grid, so there are 100 neurons. The maximum number of hits associated with any neuron is 22. Thus, there are 22 input vectors in that cluster.
- You can also visualize the SOM by displaying weight planes (also referred to as component planes). Click SOM Weight Planes in the Neural Network Toolbox Clustering Tool.

- This figure shows a weight plane for each element of the input vector (two, in this case). They are visualizations of the weights that connect each input to each of the neurons. (Darker colors represent larger weights.) If the connection patterns of two inputs are very similar, you can assume that the inputs are highly correlated. In this case, input 1 has connections that are very different from those of input 2.
- In the Neural Network Toolbox Clustering Tool, click Next to evaluate the network.

- At this point you can test the network against new data.
If you are dissatisfied with the network's performance on the original or new data, you can increase the number of neurons, or perhaps get a larger training data set.
- When you are satisfied with the network performance, click Next.
- Use the buttons on this screen to save your results.
- You now have the network saved as net1 in the workspace. You can perform additional tests on it, or put it to work on new inputs, using the function sim.
- If you click Generate M-File, the tool creates an M-file with commands that recreate, from the command line, the steps that you have just performed. Generating an M-file is a good way to learn how to use the command-line operations of the Neural Network Toolbox software.
- When you have saved your results, click Finish.
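To put the saved network to work on new inputs, you can call sim from the command line, as mentioned in the steps above. The following is a minimal sketch. It assumes that net1 was saved to the workspace under that name and was trained on two-element input vectors, as with the Simple Clusters data. If net1 is missing, the sketch trains a stand-in network, assuming that the example data loads from simplecluster_dataset into a variable named simpleclusterInputs. The variable newInputs and its values are hypothetical.

```matlab
% Use net1 from the Clustering Tool if present; otherwise train a
% stand-in 10-by-10 SOM (assumed data set and variable names).
if ~exist('net1','var')
    load simplecluster_dataset
    net1 = newsom(simpleclusterInputs,[10,10]);
    net1 = train(net1,simpleclusterInputs);
end

% Hypothetical new data: three two-element vectors, one per column.
newInputs = [0.2 0.8 0.5;
             0.3 0.9 0.1];

outputs = sim(net1,newInputs);   % one column per input vector
winners = vec2ind(outputs);      % winning neuron (1..100) for each vector
disp(winners);
```

Input vectors that map to the same neuron, or to neurons that are adjacent in the grid, are the ones the network treats as similar.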