This example shows how to feed an image to a convolutional neural network and display the activations of different layers of the network. Examine the activations and discover which features the network learns by comparing areas of activation with the original image. Channels in earlier layers learn simple features like color and edges, while channels in deeper layers learn complex features like eyes. Identifying features in this way can help you understand what the network has learned.
The example requires Neural Network Toolbox™, Image Processing Toolbox™, and the Neural Network Toolbox™ Model for AlexNet Network.
Load the popular pretrained network AlexNet. If Neural Network Toolbox™ Model for AlexNet Network is not installed, then the software provides a download link.
net = alexnet;
Read and show an image. Save its size for future use.
im = imread(fullfile(matlabroot,'examples','nnet','face.jpg'));
imshow(im)
imgSize = size(im);
imgSize = imgSize(1:2);
Display the layers of the network to see which layers you can look at. The convolutional layers perform convolutions with learnable parameters. The network learns to identify useful features, often with one feature per channel. Observe that the first convolutional layer has 96 channels.
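The code for this step does not appear in the text above. Displaying the Layers property of the network, as sketched here, produces the listing that follows:

```matlab
% Display the layer array of the pretrained network
net.Layers
```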
ans = 
  25x1 Layer array with layers:
     1   'data'     Image Input                   227x227x3 images with 'zerocenter' normalization
     2   'conv1'    Convolution                   96 11x11x3 convolutions with stride [4 4] and padding [0 0]
     3   'relu1'    ReLU                          ReLU
     4   'norm1'    Cross Channel Normalization   cross channel normalization with 5 channels per element
     5   'pool1'    Max Pooling                   3x3 max pooling with stride [2 2] and padding [0 0]
     6   'conv2'    Convolution                   256 5x5x48 convolutions with stride [1 1] and padding [2 2]
     7   'relu2'    ReLU                          ReLU
     8   'norm2'    Cross Channel Normalization   cross channel normalization with 5 channels per element
     9   'pool2'    Max Pooling                   3x3 max pooling with stride [2 2] and padding [0 0]
    10   'conv3'    Convolution                   384 3x3x256 convolutions with stride [1 1] and padding [1 1]
    11   'relu3'    ReLU                          ReLU
    12   'conv4'    Convolution                   384 3x3x192 convolutions with stride [1 1] and padding [1 1]
    13   'relu4'    ReLU                          ReLU
    14   'conv5'    Convolution                   256 3x3x192 convolutions with stride [1 1] and padding [1 1]
    15   'relu5'    ReLU                          ReLU
    16   'pool5'    Max Pooling                   3x3 max pooling with stride [2 2] and padding [0 0]
    17   'fc6'      Fully Connected               4096 fully connected layer
    18   'relu6'    ReLU                          ReLU
    19   'drop6'    Dropout                       50% dropout
    20   'fc7'      Fully Connected               4096 fully connected layer
    21   'relu7'    ReLU                          ReLU
    22   'drop7'    Dropout                       50% dropout
    23   'fc8'      Fully Connected               1000 fully connected layer
    24   'prob'     Softmax                       softmax
    25   'output'   Classification Output         crossentropyex with 'tench', 'goldfish', and 998 other classes
The Image Input layer specifies the input size. You can resize the image before passing it through the network, but the network can also process larger images. If you feed the network larger images, the activations also become larger. However, because the network is trained on images of size 227-by-227, it is not trained to recognize objects or features larger than that size.
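If you want to match the network's expected input size exactly, you can read the size from the first layer and resize the image before passing it in. This is a minimal sketch and not part of the original example; the variable names are illustrative:

```matlab
% Read the expected input size from the Image Input layer
% (227-by-227 for AlexNet) and resize the image to match
inputSize = net.Layers(1).InputSize(1:2);
imResized = imresize(im,inputSize);
```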
Investigate features by observing which areas in the convolutional layers activate on an image and comparing with the corresponding areas in the original images. Each layer of a convolutional neural network consists of many 2-D arrays called channels. Pass the image through the network and examine the output activations of the conv1 layer.
act1 = activations(net,im,'conv1','OutputAs','channels');
The activations are returned as a 3-D array, with the third dimension indexing the channel on the conv1 layer. To show these activations using the montage function, reshape the array to 4-D. The third dimension in the input to montage represents the image color. Set the third dimension to have size 1 because the activations do not have color. The fourth dimension indexes the channel.
sz = size(act1);
act1 = reshape(act1,[sz(1) sz(2) 1 sz(3)]);
Now you can show the activations. Each activation can take any value, so normalize the output using mat2gray. All activations are scaled so that the minimum activation is 0 and the maximum is 1. Display a montage of the 96 images on an 8-by-12 grid, one for each channel in the layer.
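The code for this step is missing from the text above. A sketch of the normalization and montage call described in the paragraph, assuming the 'Size' name-value pair of montage is used to lay out the 8-by-12 grid:

```matlab
% Scale all activations to [0,1] and display one image per channel
% on an 8-by-12 grid (96 channels in conv1)
montage(mat2gray(act1),'Size',[8 12])
```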
Each square in the montage of activations is the output of a channel in the conv1 layer. White pixels represent strong positive activations and black pixels represent strong negative activations. A channel that is mostly gray does not activate as strongly on the input image. The position of a pixel in the activation of a channel corresponds to the same position in the original image. A white pixel at some location in a channel indicates that the channel is strongly activated at that position.
Resize the activations in channel 32 to have the same size as the original image and display the activations.
act1ch32 = act1(:,:,:,32);
act1ch32 = mat2gray(act1ch32);
act1ch32 = imresize(act1ch32,imgSize);
imshowpair(im,act1ch32,'montage')
You can see that this channel activates on red pixels, because the whiter pixels in the channel correspond to red areas in the original image.
You can also try to find interesting channels by programmatically investigating channels with large activations. Find the channel with the largest activation using the max function, then resize and show its activations.
[maxValue,maxValueIndex] = max(max(max(act1)));
act1chMax = act1(:,:,:,maxValueIndex);
act1chMax = mat2gray(act1chMax);
act1chMax = imresize(act1chMax,imgSize);
imshowpair(im,act1chMax,'montage')
Compare to the original image and notice that this channel activates on edges. It activates positively on light left/dark right edges, and negatively on dark left/light right edges.
Most convolutional neural networks learn to detect features like color and edges in their first convolutional layer. In deeper convolutional layers, the network learns to detect more complicated features. Later layers build up their features by combining features of earlier layers. Investigate the conv5 layer in the same way as the conv1 layer. Calculate, reshape, and show the activations in a montage.
act5 = activations(net,im,'conv5','OutputAs','channels');
sz = size(act5);
act5 = reshape(act5,[sz(1) sz(2) 1 sz(3)]);
montage(imresize(mat2gray(act5),[48 48]))
There are too many images to investigate in detail, so focus on some of the more interesting ones. Display the strongest activation in the conv5 layer.
[maxValue5,maxValueIndex5] = max(max(max(act5)));
act5chMax = act5(:,:,:,maxValueIndex5);
imshow(imresize(mat2gray(act5chMax),imgSize))
In this case, the maximum activation channel is not as interesting for detailed features as some others, and shows strong negative (dark) as well as positive (light) activation. This channel is possibly focusing on faces.
In the montage of all channels, there are channels that might be activating on eyes. Investigate channels 3 and 5 further.
Many of the channels contain areas of activation that are both light and dark. These are positive and negative activations, respectively. However, only the positive activations are used because of the rectified linear unit (ReLU) that follows the conv5 layer. To investigate only the positive activations, repeat the analysis to visualize the activations of the relu5 layer.
act5relu = activations(net,im,'relu5','OutputAs','channels');
sz = size(act5relu);
act5relu = reshape(act5relu,[sz(1) sz(2) 1 sz(3)]);
montage(imresize(mat2gray(act5relu(:,:,:,[3 5])),imgSize))
Compared to the activations of the conv5 layer, the activations of the relu5 layer clearly pinpoint areas of the image that have strong facial features.
Check whether channels 3 and 5 of the relu5 layer activate on eyes. Input a new image with one closed eye to the network and compare the resulting activations with the activations of the original image.
Read and show the image with one closed eye and compute the activations of the relu5 layer.
imClosed = imread(fullfile(matlabroot,'examples','nnet','face-eye-closed.jpg'));
imshow(imClosed)
act5Closed = activations(net,imClosed,'relu5','OutputAs','channels');
sz = size(act5Closed);
act5Closed = reshape(act5Closed,[sz(1),sz(2),1,sz(3)]);
Plot the images and activations in one figure.
channelsClosed = repmat(imresize(mat2gray(act5Closed(:,:,:,[3 5])),imgSize),[1 1 3]);
channelsOpen = repmat(imresize(mat2gray(act5relu(:,:,:,[3 5])),imgSize),[1 1 3]);
montage(cat(4,im,channelsOpen*255,imClosed,channelsClosed*255));
title('Input Image, Channel 3, Channel 5');
You can see from the activations that both channels 3 and 5 activate on individual eyes, and to some degree also on the area around the mouth.
The network has never been told to learn about eyes, but it has learned that eyes are a useful feature to distinguish between classes of images. Previous machine learning approaches often manually designed features specific to the problem, but these deep convolutional networks can learn useful features for themselves. For example, learning to identify eyes could help the network distinguish between a leopard and a leopard print rug.