How to change the outputs of the neural network that the error function receives

Given a neural network with 2 output classes, there are 2 ways of assigning its real-valued output, M in [0,1], to binary classes.
  1. You have one target class: if M > 0.5, assign to class 1; else do not (equivalent to assigning to class 2).
  2. You have two target classes: if M_{1} > M_{2}, assign to class 1; else assign to class 2.
I would like my output to be biased s.t.
  1. if M > 0.8, assign to class 1; otherwise, to class 2. Incidentally (this is not the main question), would this be equivalent to:
  2. if 0.2*M_{1} > 0.8*M_{2}, assign to class 1; otherwise, to class 2?
I can change the final outputs in this way. However, I am running the network through some loops, and I would like my changes to be recognised within the network's architecture. Am I right that the error function minimises some sort of distance (perhaps least squares) between the output, M, and the target, y? In my case it will be minimising the distance to the old, unmodified output rather than to my threshold-adjusted output. Do you know how I can tell the error function about this new output? Many thanks for any help.

Accepted Answer

Greg Heath
Greg Heath on 29 Oct 2013
The classical approach for c classes is to minimize the Bayesian risk R, given prior probabilities Pi (i = 1:c), classification costs Cij >= 0 with Cii = 0, and input-conditional probability densities p(i|x):
R = sum(i=1:c) Pi*Ri % total risk
Ri = sum(j=1:c) Cij*p(j|x) % risk associated with the ith class being misclassified
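For illustration, a minimal numeric sketch of this minimum-risk rule for c = 2 (the cost matrix C and the posterior values p are made-up placeholders; in practice p would be your network's outputs):
C = [0 1; 4 0];    % C(i,j) = cost of assigning class i when the true class is j
p = [0.7; 0.3];    % network outputs interpreted as posteriors p(j|x)
Ri = C*p;          % Ri(i) = sum(j=1:c) C(i,j)*p(j|x)
[~, k] = min(Ri)   % assign to the class with the smallest conditional risk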
Hits from searching on "classification costs", with and without "greg":
                       w/ greg   w/o greg
  comp.ai.neural-nets     49       128
  MATLAB Newsgroup        15        32
  MATLAB Answers           2         6
For applications to the BioID data, search using "BioID":
  comp.ai.neural-nets   3
  MATLAB Newsgroup      5
  MATLAB Answers        1
Hope this helps.
Greg
  3 Comments
Greg Heath
Greg Heath on 1 Nov 2013
Real-world classification/pattern-recognition neural networks are those where different a priori probabilities and misclassification costs for multiple classes/categories can be quantified. The objective functions are mean-squared error or cross-entropy, for both exclusive categories (a single one and c-1 zeros) and non-exclusive categories (e.g., 0 or 1 for each of short, fat, bald, ugly).
MATLAB offers error weighting to deal with unequal priors, and introduced cross-entropy objective functions in R2013b.
For most of the classification problems given to beginning students, priors are assumed to be equal and misclassification costs are assumed to be equal. Consequently, they have no effect on the design. The objective is then to minimize the total number of misclassifications. However, if the classes do not have equal sizes, the tendency is for the classifier to sacrifice errors on the smaller classes in order to classify the larger classes better. In addition, misclassification of smaller classes may have more serious consequences (e.g., sheep and wolves). That is why the more complete designs using priors and costs are used.
In your case you want to use the weights 0.8 and 0.2. Why? a. You have 4 times as many examples in one class? b. The errors on one class are 4 times as expensive as errors on the other class? c. A combination of a and b?
The quantity "distance" has a precise meaning (see any linear algebra text). It is probably better to use the term difference or dissimilarity.
I suggest that you investigate target weighting
help mse
doc mse
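For example, a minimal sketch (mine, on toy data, and the 4:1 weighting is an arbitrary stand-in for the 0.8/0.2 bias) of error weighting via the EW argument of train:
[x,t] = simpleclass_dataset;          % toy stand-in for your data
net = patternnet(10);
net.performFcn = 'mse';               % mse supports error weights
ew = ones(1, size(t,2));              % one weight per example
ew(t(1,:) == 1) = 4;                  % weight class-1 errors 4x as heavily
net = train(net, x, t, {}, {}, ew);   % EW is the 6th argument of train
y = net(x);
perf = mse(net, t, y, ew)             % error-weighted performance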
Hope this helps.
Laura
Laura on 4 Nov 2013
Hi Greg,
Thanks, this certainly has been helpful.
The answer to your question is c. A combination of a and b.
I've realised the answer to my question is that I don't need to do anything to my error function. Error is calculated from the real-valued output, before it is assigned to binary classes; understandably so, since otherwise in my case t - y could only ever be 0 or ±1, not a very helpful thing to minimise.
However, you're taking my question in another direction which is really useful to me. I have actually already been using a cost matrix to deal with my class-imbalance problem. But using weights in the error function may be better, as the cost matrix is applied post-training and is therefore not useful for testing the network on new data, i.e. it just shifts the final results and doesn't change the network.
Are you suggesting that I may be able to weight the error function? Is there a standard way of doing this?
One problem is that, if I understand help mse correctly, mse is only applicable to newff, newcf, or newelm. I have found newff performs worse than patternnet on my data in its current state.
Many thanks.


More Answers (1)

Greg Heath
Greg Heath on 5 Nov 2013
Hi Greg,
%Thanks, this certainly has been helpful.
%The answer to your question is c. A combination of a and b.
% I've realised the answer to my question is that I don't need to do anything to my error function. Error is calculated from the real-valued output, before it is assigned to binary classes; understandably so, since otherwise in my case t - y could only ever be 0 or ±1, not a very helpful thing to minimise.
I think you meant to say abs(t - y), and 0 or 1. Although I am sure algorithms exist that minimize binary functions.
Assuming prior probabilities and classification costs are equal, you typically have one of two cases.
1. c exclusive classes, where targets are columns of the c-dimensional unit matrix. Outputs are interpreted as class posterior probabilities, conditional on the input (e.g., p(class = i | x)). The input is assigned to the class corresponding to the largest output.
a. SOFTMAX output transfer function, which guarantees that the outputs have the following probabilistic properties: they lie in (0,1), end points excluded, with unity sum.
b. LOGSIG output transfer function, which guarantees that the outputs lie in (0,1), end points excluded. However, the unit sum is not enforced. This is a moot point as long as the largest output can be identified. If probability estimates are required, the outputs are just divided by their sum. However, I don't remember those estimates being the same as the ones obtained via SOFTMAX.
c. PURELIN output transfer function, which only guarantees that the outputs have a unit sum. Frequently this is a moot point as long as the largest output can be identified. However, if probability estimates are desired, I don't remember finding a satisfactory way to obtain them (e.g., using softmax or logsig after the fact).
2. c non-exclusive classes, where targets are columns containing a mixture of zeros and ones. Outputs are still interpreted as class posterior probabilities, conditional on the input (e.g., p(class = i | x)). However, the unit sum constraint and SOFTMAX do not apply.
a. Either LOGSIG or PURELIN is used, and the input is assigned to those classes whose outputs exceed either 0.5 or a specified class-dependent threshold determined by other means (see the sketch below).
If MSE or CROSS-ENTROPY is used, all of the above output transfer functions yield "consistent" probability estimates (I "think" that means as N --> inf).
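To make 2a concrete, here is a minimal sketch (mine, with made-up output values, not from the thread) of applying a class-dependent threshold such as the 0.8 cutoff from the question to a single logsig output:
y = [0.95 0.60 0.83 0.10];      % hypothetical network outputs for 4 inputs
threshold = 0.8;                % class-dependent cutoff (the default rule uses 0.5)
assigned = 1 + (y <= threshold) % class 1 where y > 0.8, class 2 otherwise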
% However, you're taking my question in another direction which is really useful to me. I have actually already been using a cost matrix to deal with my class-imbalance problem. But using weights in the error function may be better, as the cost matrix is applied post-training and is therefore not useful for testing the network on new data, i.e. it just shifts the final results and doesn't change the network.
No.
Using prior probabilities is the classical way to deal with unbalanced classes. In the BioID threads I recall testing several approaches. The one that I liked best was to add duplicates (a little added noise probably helps) so that all classes are the same size. After the net is designed, prior probabilities and classification costs can be used to make minimum-risk classifications. However, I don't recall using costs.
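A rough sketch of that duplication trick on made-up imbalanced data (the 0.01 noise level is an arbitrary assumption; tune it to your inputs' scale):
x = [randn(2,20)+2, randn(2,80)-2];                       % placeholder inputs: 20 vs 80 examples
t = [ones(1,20), zeros(1,80); zeros(1,20), ones(1,80)];   % row 1 = minority class
iMin  = find(t(1,:) == 1);                      % minority-class columns
nNeed = sum(t(2,:) == 1) - numel(iMin);         % copies required to balance
iDup  = iMin(randi(numel(iMin), 1, nNeed));     % resample with replacement
xBal  = [x, x(:,iDup) + 0.01*randn(2,nNeed)];   % duplicates with a little noise
tBal  = [t, t(:,iDup)];                         % matching duplicated targets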
% Are you suggesting that I may be able to weight the error function? Is there a standard way of doing this?
Check MSE, SSE, MAE, and SAE for weighting options. If you have R2013b, also check the newly added CROSSENTROPY.
However, I recommend the equal-everything training approach I used in BioID. The effect of priors and costs can be added post-training.
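If you do have R2013b, switching the objective is a one-property change (a sketch on toy data; note that patternnet may already default to crossentropy in that release):
[x,t] = simpleclass_dataset;
net = patternnet(10);
net.performFcn = 'crossentropy';   % performance function added in R2013b
net = train(net, x, t);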
% One problem is that, if I understand help mse correctly, mse is only applicable to newff, newcf, or newelm.
False. I use it everywhere.
% I have found newff performs worse than patternnet on my data in its current state.
Apples and oranges: Newff is obsolete and uses different defaults. Is there a particular reason why you would use newff for classification instead of the newpr classifier?
Match defaults and the differences should be slight in spite of belonging to different generations.
Proper comparisons are:
  Obsolete   Current          Usage
  newfit     fitnet           Regression and Curve-fitting
  newpr      patternnet       Classification and Pattern-Recognition
  newff      feedforwardnet   Basic: called by the two above
fitnet and feedforwardnet are identical, except that fitnet also outputs plotfit (y vs t).
patternnet and feedforwardnet are identical, except that patternnet has:
1. tansig output instead of purelin
2. trainscg training function instead of trainlm
3. plotconfusion and plotroc instead of plotregression
You can check the differences between the corresponding obsolete functions below. However, for the obsolete functions you have to specify (x,t,H). If you do not specify H, the default H = [] is used, which results in a linear model. (For the current functions, nothing has to be specified at creation: H = 10 is the default, and (x,t) don't have to be specified until configure or train is invoked.)
clear all, clc
% Obsolete functions require (x,t,H) at creation:
[x,t] = simplefit_dataset;
net1 = newff(x,t,5);          % obsolete basic feedforward net
net1.layers{1}.transferFcn    % hidden-layer transfer function
net1.layers{2}.transferFcn    % output-layer transfer function
net1.trainFcn                 % default training function
net1.plotFcns                 % default plot functions
net2 = newfit(x,t,5);         % obsolete fitting net
net2.layers{1}.transferFcn
net2.layers{2}.transferFcn
net2.trainFcn
net2.plotFcns
[x,t] = simpleclass_dataset;
net = newpr(x,t,5);           % obsolete pattern-recognition net
net.layers{1}.transferFcn
net.layers{2}.transferFcn
net.trainFcn
net.plotFcns
It should be noted that the newpr results indicate that the newpr help and doc documentation contain serious errors.
  1 Comment
Ali Meghdadi
Ali Meghdadi on 22 Apr 2020
Edited: Ali Meghdadi on 22 Apr 2020
Hi Greg. Excellent answer. I was wondering if it is possible to put weights on false positives and false negatives, the same as the predefined misclassification cost array in random forest and SVM?
Misclassification cost, specified as a numeric square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i (at https://au.mathworks.com/help/stats/classificationsvm.html)
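For what it's worth, a minimal sketch of that Cost matrix with fitcsvm (Statistics and Machine Learning Toolbox; the data are toy values and the 4:1 penalty is arbitrary):
rng default
X = [randn(50,2)+1; randn(50,2)-1];   % toy two-class features
Y = [ones(50,1); 2*ones(50,1)];       % true class labels 1 and 2
C = [0 4; 1 0];                       % Cost(i,j): true class i, classified as j
mdl = fitcsvm(X, Y, 'Cost', C);       % penalise misclassifying class 1 four times as much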

