Test data Neural Network

Edo on 25 Oct 2012
Hi everyone,
I wrote code for a time-delay neural network that predicts an output y two steps ahead from a matrix of inputs X. I have 630 timesteps. Using the "Sequential order incremental training with learning functions" algorithm (the one that achieves the best results), I always get good results: low MSE (about 1% of the mean of my targets) and R = 0.9 for all three data sets. I divide the data by blocks 3:1:1, so I can regard the test data as an independent data set added later (by definition, the test set is independent of the training set), and I still get good performance.
Then I tried it this way: I took 4/5 of my data and gave them as my new inputs, leaving the last 1/5 as an independent data set for testing. I trained on those data without a test split (training 80%, validation 20%). In this way, that last 1/5 corresponds to the 20% of test data in my first simulation. Hence, when I tested the network on my independent dataset after it had been trained (outputtest = net(Inputtest)), I expected performance similar to the test data of the first simulation (low MSE, R = 0.9), but that never happened, no matter how many times I tried; I always get bad performance.
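Schematically, this is what I am doing (a rough sketch, not my exact code; the delays and hidden-layer size are illustrative):
Xs = con2seq(X);  Ts = con2seq(T);        % 27x630 inputs, 1x630 target -> sequences
% First simulation: train() splits the sequence itself into contiguous blocks
net1 = timedelaynet(1:2, 5);
net1.divideFcn = 'divideblock';
net1.divideParam.trainRatio = 0.60;       % 3:1:1
net1.divideParam.valRatio   = 0.20;
net1.divideParam.testRatio  = 0.20;
[xs, xi, ai, ts] = preparets(net1, Xs, Ts);
net1 = train(net1, xs, ts, xi, ai);
% Second try: hold out the last fifth myself; no test set inside train()
ncut = round(0.8 * numel(Xs));
net2 = timedelaynet(1:2, 5);
net2.divideFcn = 'divideblock';
net2.divideParam.trainRatio = 0.80;
net2.divideParam.valRatio   = 0.20;
net2.divideParam.testRatio  = 0.00;
[xs, xi, ai, ts] = preparets(net2, Xs(1:ncut), Ts(1:ncut));
net2 = train(net2, xs, ts, xi, ai);
[xs, xi, ai] = preparets(net2, Xs(ncut+1:end), Ts(ncut+1:end));
outputtest = net2(xs, xi, ai);            % this is where performance drops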
Therefore my question is: what exactly are the test data used for during training? If they are really independent, why didn't I get similar results? Both the test data and new independent data are supposed to use, for their inputs, the weights previously updated by the training, and compute the outputs straight away.
Is what I have written wrong? How can I get the same prediction performance on new data as on the test data?
Thank you!

Accepted Answer

Greg Heath on 25 Oct 2012
I assume target T is 1-dimensional.
What is the size of the X matrix?
Are X(i,:) and T stationary (e.g., are the ten 63-point means and variances of each variable time-invariant?)
I recommend standardizing X and T with zscore or mapstd.
What are the significant crosscorrelation function lags between X(i,:) and T?
Are you including zero input lag? Unfortunately, it is not a default.
MSE results are more accurately assessed by % of target variance.
MSE00 = mean(var(T',1)) %MSE of NAIVE constant model y = mean(T,2)
NMSE = MSE/MSE00 % Normalized MSE
R^2 = 1-NMSE % Fraction of target variance modeled (Search coefficient of determination in Wikipedia)
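In code, a minimal sketch (assuming a trained net, inputs X, and a row-matrix target T):
Y     = net(X);                  % network output
MSE   = mse(T - Y);              % mean-squared error of the net
MSE00 = mean(var(T', 1));        % MSE of the naive constant model
NMSE  = MSE / MSE00              % normalized MSE
R2    = 1 - NMSE                 % fraction of target variance modeled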
Have you replaced the 'dividerand' (correlation destroying) default division option with 'divideblock' or another divide option?
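For example, with the 3:1:1 ratios of the question:
net.divideFcn = 'divideblock';        % contiguous blocks preserve the time order
net.divideParam.trainRatio = 0.60;
net.divideParam.valRatio   = 0.20;
net.divideParam.testRatio  = 0.20;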
Hope this helps.
Thank you for formally accepting my answer.
Greg
  1 Comment
Greg Heath on 25 Oct 2012
In addition to my previous answer, let me answer your questions directly.
The test set is in no way involved in estimating weights.
The validation set only determines when to stop the weight estimation.
Initial weights are determined randomly. For each candidate value for number of hidden nodes, H, I typically design 10 nets from 10 different sets of initial weights.
Greg


More Answers (3)

Greg Heath on 30 Oct 2012
% Target 1d, X matrix has 27 columns for 630 timesteps.
That makes no sense. You must mean 27 rows ... the input and target must have the same number of columns ... the same number as the number of timesteps!
% Apparently, with Fitnet or other regression networks you would be right because you need to transpose those matrices. But with timedelay you take them in that form.
Incorrect. You can have gaps in the delays! For example you could have inputdelays = [ 0 2 4 8 13 ] and feedbackdelays = [1 2 5 8 10 ].
% That is good info, thank you. Unfortunately, if I include delays, the results worsen. That is why I am also using fitnet, looking for a regression between Output(t+2) and Inputs(t), and I get similar results without using delays. The main problem now is that my chart (with both networks) looks shifted 2 timesteps to the right (as if it were predicting output(t) instead of output(t+2)). In either case I have not used output(t) as an input. I don't know why this happens.
I'd have to see your code. Step-by-step design comparisons must use the same RNG initialization.
Remedies for poor test set performance vs. good training set performance include removing insignificant inputs, reducing the number of hidden nodes, using validation stopping, and using regularization via MSEREG instead of MSE.
% About hidden nodes, my network now has only 5 hidden nodes (no significant performance improvement if I increase that number, despite 27 inputs), so it is pretty simple.
The number of significant inputs, rather than the total number, should dictate the minimum practical number of hidden nodes.
% About validation stopping, yes I am using it, although I increased the maximum number of validation checks from 6 to 60 to reduce the risk of getting stuck in local minima.
No.
The purpose of validation stopping is to stop training when it is obvious that the training data does not sufficiently characterize the significant characteristics of the rest of the data. 6 is a reasonable number. Using 60 will result in a net that certainly will not perform well on validation data and, most likely, will not perform well on test data.
% (It is a complex system and it was often stopping before needed.)
No.
See comp.ai.neural-nets FAQ to better understand the purpose of validation stopping.
% About MSEREG, apparently it leads to worse results; the best performance criterion is MAE.
That doesn't sound right. You need to be specific and quantify what you mean by 'worse' and 'best'. It would also help to include relevant code.
% Which methods for eliminating insignificant inputs would you recommend?
A. Design a good net
1. Standardize the trn data ( ZSCORE or MAPSTD )
2. Normalize the val and tst data with the trn means and stdvs.
3. Choose ~10 or fewer candidate values for H = numhidden (0 <= H <= Hmax). If possible, choose Hmax small enough that Ntrneq > Nw, where
Ntrneq = numtrainingequations = Ntrn*O
Nw = net.numWeightElements = (I*NNZD+1)*H+(H+1)*O
NNZD = numnonzerodelays
4. Initialize the random number generator so that the experiment can be duplicated.
5. For each value of H, design Ntrials = 10 nets with random weight initializations, 'divideblock' and max_fail = 6.
6. Tabulate values of Nepochs, R2trn, R2val and R2tst in Ntrials x numH sized matrices.
R^2 = 1 - mse( T - Y ) / mean( var( T, 1, 2 ))
7. Choose a net that achieves a large R2val with a small H. (A minimal sketch of this design loop follows.)
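A minimal sketch of the loop in A, assuming matrix data X and T and a fitnet (Hvec, Ntrials, and the inline R2 computation are illustrative, not a fixed recipe):
Hvec    = [1 2 3 5 7 10];                 % candidate numbers of hidden nodes
Ntrials = 10;
R2trn = zeros(Ntrials, numel(Hvec));  R2val = R2trn;  R2tst = R2trn;
MSE00 = mean(var(T', 1));                 % naive-model reference MSE
rng(0)                                    % step 4: make the runs repeatable
for j = 1:numel(Hvec)
    for i = 1:Ntrials
        net = fitnet(Hvec(j));            % step 5: random initial weights
        net.divideFcn = 'divideblock';
        net.trainParam.max_fail = 6;      % validation stopping
        [net, tr] = train(net, X, T);
        Y  = net(X);
        R2 = @(ind) 1 - mse(T(:,ind) - Y(:,ind)) / MSE00;
        R2trn(i,j) = R2(tr.trainInd);     % step 6: tabulate
        R2val(i,j) = R2(tr.valInd);
        R2tst(i,j) = R2(tr.testInd);
    end
end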
B. Obtain a good subset of input variables
1. Sequentially randomize the row of only one variable, and obtain the resulting R^2
2. If the highest R2val is unacceptable, STOP. Otherwise remove the corresponding least significant variable and go back to 1.
3. Reasonable modifications are acceptable. For example
a. Use Ntrials randomizations for each variable in 1.
b. Continue training after removing each least significant variable. (A minimal sketch of this screening loop follows.)
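A minimal sketch of B, where trainNet stands for a hypothetical helper that designs a net on the given input rows as in A, and R2target is an illustrative acceptance threshold:
vars = 1:size(X, 1);                      % indices of surviving input rows
net  = trainNet(X(vars,:), T);            % hypothetical helper from part A
while numel(vars) > 1
    R2val = zeros(1, numel(vars));
    for k = 1:numel(vars)
        Xp = X(vars, :);
        Xp(k, :) = Xp(k, randperm(size(Xp, 2)));  % B.1: randomize one row
        Yp = net(Xp);
        R2val(k) = 1 - mse(T - Yp) / mean(var(T', 1));
    end
    [R2max, kworst] = max(R2val);         % input whose loss hurts least
    if R2max < R2target, break, end       % B.2: stop if unacceptable
    vars(kworst) = [];                    % drop the least significant input
    net = trainNet(X(vars,:), T);         % B.3.b: continue training
end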
Have fun,
Greg

Greg Heath
Greg Heath on 3 Nov 2012
One of the basic assumptions of NN model design is that the trn, val and tst data can be assumed to come from the same probability distribution.
I recommend that this be checked before trying to train, validate and test a model.
The quickest check is to plot the data and obtain summary stats like min, median, mean, std and max (Some may prefer percentiles or quartiles).
The plots of the target data you sent showed a significant difference between the training and testing data. For example (both mins and medians were 0.0050):
          ttrn      ttst
mean    0.0283    0.0119
std     0.0565    0.0156
max     0.3900    0.1000
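A quick way to get that check in MATLAB (ttrn and ttst being the training and test target rows):
stats = @(t) [min(t) median(t) mean(t) std(t) max(t)];
disp([stats(ttrn); stats(ttst)])          % one row of summary stats per data set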
Therefore, you need to change your tests of generalization to more appropriate data.
Hope this helps.
Greg

Edo on 26 Oct 2012
  • I assume target T is 1-dimensional.
  • What is the size of the X matrix?
Target 1d, X matrix has 27 columns for 630 timesteps.
  • Are X(i,:) and T stationary (e.g., are the ten 63-point means and variances of each variable time-invariant?)
Most of the inputs are stationary; the output, though, shows a slight trend, with the mean slowly decreasing.
  • I recommend standardizing X and T with zscore or mapstd.
I substituted mapminmax with mapstd and it improves a little bit, thanks.
  • What are the significant crosscorrelation function lags between X(i,:) and T?
Many lags are significant, as they are all cyclical variables showing the same trend every year: therefore I found many significant lags around either lag 26 (half year) or lag 52 (full year). But if I try input delays 2:54 and include all of them (as there is no way to select only a few of them), the performance drops. I am working now with only a one-step delay.
  • Are you including zero input lag? Unfortunately, it is not a default.
Apparently, if you want to predict 2 steps ahead (as in my case), you should set inputDelays = 2:i, so that y(t) = f[x(t-2) .. x(t-i)]. Still, I am not sure about that because the charts look shifted a bit, predicting the right values late.
  • Have you replaced the 'dividerand' (correlation destroying) default division option with 'divideblock' or another divide option?
Yes, I have used divideblock to keep all the test data at the end. That's why I have this problem: if I take the same dataset that was used as the test set during training and simulate the network on it (this time after training only on the first 80% of the data, obviously leaving the test data out), I don't get a similar response. Why? The training has been done on the same data; the only difference was testing the data "outside" the training procedure, after the training was completed (Y = net(Xtest)).
  • The test set is in no way involved in estimating weights.
That's what I knew, but apparently I always get better results with the test set inside the training process rather than using that data later.
Thanks for the reply.
  2 Comments
Greg Heath on 26 Oct 2012
GEH: I assume target T is 1-dimensional. What is the size of the X matrix?
% Target 1d, X matrix has 27 columns for 630 timesteps.
That makes no sense. You must mean 27 rows ... the input and target must have the same number of columns ... the same number as the number of timesteps!
size(X) = [ 27 630 ]
size(T) = [ 1 630 ]
Are X(i,:) and T stationary (e.g., are the ten 63-point means and variances of each variable time-invariant?)
% Most of the inputs are stationary, the output though shows a little trend with the mean slowly decreasing.
I recommend standardizing X and T with zscore or mapstd.
% I substituted mapminmax with mapstd and it improves a little bit, thanks
What are the significant crosscorrelation function lags between X(i,:) and T?
% Many lags are significant, as they are all cyclical variables showing the same trend every year: therefore I found many significant lags around either lag 26 (half year) or lag 52 (full year). But if I try input delays 2:54 and include all of them (as there is no way to select only a few of them), the performance drops.
Incorrect. You can have gaps in the delays! For example you could have inputdelays = [ 0 2 4 8 13 ] and feedbackdelays = [1 2 5 8 10 ].
% I am working now with only a one-step delay.
With 630 I/O pairs, and 0.7*630 = 441 used for training, you can easily afford to use more delays.
Are you including zero input lag? Unfortunately, it is not a default.
% Apparently, if you want to predict 2 steps ahead (as in my case), you should set inputDelays = 2:i, so that y(t) = f[x(t-2) .. x(t-i)]. Still, I am not sure about that because the charts look shifted a bit, predicting the right values late.
That is correct except the delay vectors can have gaps.
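For example, a sketch of a two-step-ahead design with gapped input delays (the delay values are illustrative):
inputDelays = [2 3 5 26 52];          % all >= 2, so y(t) = f(x(t-2), x(t-3), ...)
net = timedelaynet(inputDelays, 5);   % 5 hidden nodes, as in your net
net.divideFcn = 'divideblock';
Xs = con2seq(X);  Ts = con2seq(T);    % 27x630 and 1x630 matrices -> sequences
[xs, xi, ai, ts] = preparets(net, Xs, Ts);
net = train(net, xs, ts, xi, ai);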
Have you replaced the 'dividerand' (correlation destroying) default division option with 'divideblock' or another divide option?
% Yes, I have used divideblock to keep all the test data at the end. That's why I have this problem: if I take the same dataset that was used as the test set during training and simulate the network on it (this time after training only on the first 80% of the data, obviously leaving the test data out), I don't get a similar response. Why? The training has been done on the same data; the only difference was testing the data "outside" the training procedure, after the training was completed (Y = net(Xtest)).
This could be a case of overtraining an overfit network. There are more weights than necessary, resulting in an excess of degrees of freedom that is not constrained by the underlying deterministic I/O characteristic of the data. Instead, the excess degrees of freedom lead to fitting training-data noise and other idiosyncrasies that are not present in nontraining data. Remedies include removing insignificant inputs, reducing the number of hidden nodes, using validation stopping, and using regularization via MSEREG instead of MSE. See the spiel in the comp.ai.neural-nets FAQ on overfitting.
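A quick way to check for this (a sketch; tr is the training record returned by train):
Nw     = net.numWeightElements;            % total number of weights and biases
Ntrneq = numel(tr.trainInd) * size(T, 1);  % training equations, Ntrn*O
if Nw >= Ntrneq
    disp('More weights than training equations: overfitting is possible.')
end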
The test set is in no way involved in estimating weights.
% That's what I knew, but apparently I always get better results with the test set inside the training process rather than using that data later.
That's why the test set should not be involved in the training. Its purpose is to estimate performance on nondesign data.
% Thanks for the reply.
Are you using validation stopping and/or trying to minimize input nodes and/or hidden nodes?
Hope this helps.
Greg
Edo on 28 Oct 2012 (edited)
  • That makes no sense. You must mean 27 rows ... the input and target must have the same number of columns ... the same number as the number of timesteps!
Apparently, with Fitnet or other regression networks you would be right because you need to transpose those matrices. But with timedelay you take them in that form.
  • Incorrect. You can have gaps in the delays! For example you could have inputdelays = [ 0 2 4 8 13 ] and feedbackdelays = [1 2 5 8 10 ].
That is good info, thank you. Unfortunately, if I include delays, the results worsen. That is why I am also using fitnet, looking for a regression between Output(t+2) and Inputs(t), and I get similar results without using delays. The main problem now is that my chart (with both networks) looks shifted 2 timesteps to the right (as if it were predicting output(t) instead of output(t+2)). In either case I have not used output(t) as an input. I don't know why this happens.
  • Remedies include removing insignificant inputs, reducing the number of hidden nodes, using validation stopping and using regularization via MSEREG instead of MSE.
About hidden nodes, my network now has only 5 hidden nodes (no significant performance improvement if I increase that number, despite 27 inputs), so it is pretty simple. About validation stopping, yes I am using it, although I increased the maximum number of validation checks from 6 to 60 to reduce the risk of getting stuck in local minima (it is a complex system and it was often stopping before needed). About MSEREG, apparently it leads to worse results; the best performance criterion is MAE. Which methods for eliminating insignificant inputs would you recommend?

