Normalizing input data & dividing data for training - validation - test

Could you help me please? I have two questions about neural networks for solar irradiance forecasting. I used an MLP (fitting) model with one hidden layer, 7 inputs and 1 output (solar irradiation). My questions are the following: - Is it necessary to use the following commands to normalize my input data? (I use a sigmoid activation function in the hidden layer and a linear function in the output layer.)
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
Or can I just use the simple mathematical formula: In = (Inn - Imin)/(Imax - Imin),
while In: normalized input; Inn: non-normalized input?
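(For reference, a minimal MATLAB sketch of that manual scaling, assuming the inputs sit in a 7-by-N matrix with one sample per column as in the code further down; bsxfun keeps it compatible with older releases. Note that the toolbox default mapminmax maps to [-1,1] rather than [0,1].)
Imin = min(inputs,[],2);   % per-feature minimum (one value per row)
Imax = max(inputs,[],2);   % per-feature maximum
inputsNorm = bsxfun(@rdivide, bsxfun(@minus, inputs, Imin), Imax - Imin); % each row scaled to [0,1]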
- My second question is about dividing the data for training; this is my code for data division:
inputs = A'; % used for training
targets = B'; % used for training
inputsTesting=C'; % used for test unseen by neural network
targetsTesting=D'; %used for test unseen by neural network
% Setup Division of Data for Training, Validation, Testing
net.divideFcn = 'dividerand'; % Divide data randomly
net.divideMode = 'sample'; % Divide up every sample
net.divideParam.trainRatio = 75/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 10/100;% this is my problem !!!!!
% Create a Fitting Network
net = fitnet(numberOfHiddenNodes); % numberOfHiddenNodes = your chosen number of hidden layer nodes
% training
net.trainFcn = 'trainlm'; % Levenberg-Marquardt
[net,tr] = train(net,inputs,targets);
outputs = net(inputsTesting); % inputsTesting: unseen by neural network
perf = mse(net,targetsTesting,outputs); % targetsTesting: unseen by network
My question is: what does the command below mean? I think this command is unnecessary because I used test data unseen by the network. So what can I do about this mistake?
net.divideParam.testRatio = 10/100;
Does the neural network use 10% of the data it has already seen for testing?
Please help.
Best regards
  1 Comment
Greg Heath on 15 Feb 2015
Edited: Greg Heath on 15 Feb 2015
% REPLY 15FEB2015
% Normalization inputs data & dividing data for training - validation- test
% Asked by omar belhaj about 21 hours ago
%
% Could you help me please I have two questions about neural networks
% for solar irradiance forecasting. I used MLP model (Fitting) with one
% hidden layer, 7 inputs and 1 output (solar irradiation). My questions
% are the following : - It's necessary to use these following commands
% to normalize my inputs data ?? (I use a sigmoid function as activation
% function in hidden layer, and linear function in the ouput layer)
% net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
% net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
%
% Or I can just use the simple mathematical formula : In=(Inn-Imin)/(Imax-Imin)
%
% while In: normalized input ; Inn: No normalized input ???
Replace "while" with "where"
The current NN creation functions automatically use MAPMINMAX. So, as in my example below, you do not have to scale your data. On the other hand, you can use the above commands to either
a. Use MAPSTD (zero-mean/unit-variance) for standardization instead of MAPMINMAX
b. Remove scaling
Although I prefer standardization before training to
a. better deal with data errors and outliers, and
b. estimate significant delays for time-series design using correlation functions (help nncorr),
I just use ZSCORE and then let the program use the default MAPMINMAX.
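For concreteness, a sketch of those options with the standard toolbox syntax (zscore is a Statistics Toolbox function that works column-wise, hence the transposes; variable names follow the code above):
% a. standardize inside the net (MAPSTD) instead of min-max scaling
net.inputs{1}.processFcns  = {'removeconstantrows','mapstd'};
net.outputs{2}.processFcns = {'removeconstantrows','mapstd'};
% b. or remove scaling entirely:
%    net.inputs{1}.processFcns  = {'removeconstantrows'};
%    net.outputs{2}.processFcns = {'removeconstantrows'};
% or standardize beforehand with ZSCORE and keep the default MAPMINMAX:
inputs = zscore(inputs')';   % standardizes each of the 7 input rows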
% - Second question is about dividing data for training, this is my code about dividing :
% inputs = A'; % used for training
% targets = B'; % used for training
% inputsTesting=C'; % used for test unseen by neural network
% targetsTesting=D'; %used for test unseen by neural network
%
% % Setup Division of Data for Training, Validation, Testing
%
% net.divideFcn = 'dividerand'; % Divide data randomly
% net.divideMode = 'sample'; % Divide up every sample
% net.divideParam.trainRatio = 75/100;
% net.divideParam.valRatio = 15/100;
% net.divideParam.testRatio = 10/100;% this is my problem !!!!!
The current NN creation functions automatically use the default DIVIDERAND with the fractional breakdown of 70/15/15 for trn/val/tst.
Only the training data is used to change the weights. The validation data is used only to stop training before performance on nontraining data degrades. The net does not, in any way, use the test data for design. Therefore, there is no reason to use an extra "unseen" data set.
So, as in my example below, you do not have to explicitly divide your data.
However, I prefer to use DIVIDEBLOCK for timeseries to preserve the constant time delay correlations deduced from correlation functions.
% % Create a Fitting Network
%
% net=fitnet(Nubmer of nodes in haidden layer);
%
% % tarining
% net.trainFcn = 'trainlm'; % Levenberg-Marquardt
% [net,tr] = train(net,inputs,targets);
% outputs = net(inputsTesting); % inputs Testing: unseen by neural network
% perf = mse(net,targetsTesting,outputs); % targets Testing: unseen by network
%
% My question is what does mean this command below ???
% I think this command is unnecessary because i used data testing
% unseen by network?? !!! So what i can do about this
% mistake ?? !!!!
% net.divideParam.testRatio = 10/100;
% Neural network use 10% of data already seen for testing ??
The test data IN NO WAY influences the design! That is why it is called TEST data!!
Therefore there is no reason to explicitly hold out data for testing.
However, you can change the ratios and the type of division if you wish; just make sure the ratios add up to 1.
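For example, a sketch of an explicit division setup (the defaults already give 0.70/0.15/0.15 with DIVIDERAND; DIVIDEBLOCK is the block-wise alternative mentioned above):
net.divideFcn = 'divideblock';      % contiguous blocks rather than random samples
net.divideParam.trainRatio = 0.70;
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;  % the three ratios must sum to 1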


Accepted Answer

Greg Heath on 15 Feb 2015
Edited: Greg Heath on 15 Feb 2015
1. See the description and example used in the help fitnet and doc fitnet documentation
[x,t] = simplefit_dataset;
net = fitnet(10);
net = train(net,x,t);
view(net)
y = net(x);
perf = perform(net,y,t)
perf =
1.4639e-04
2. For some reason (BUG?) the numerical result is only given in doc fitnet, not in help fitnet.
3. HOWEVER, the result is scale dependent: since perf for fitnet is the mean-squared error mse(t-y), multiplying t by a positive number a will increase perf by a factor of a^2.
4. Therefore, normalize perf by the performance of the NAIVE CONSTANT MODEL y = constant, whose error is minimized when the constant is the mean of the target:
y00 = repmat( mean(t,2), 1, size(t,2))
MSE00 = mse( t - y00 )
MSE00 = mean(var(t',1)) % 8.3378
nperf = mse(t-y)/MSE00 % 1.7557e-05
5. Note that the only input is the number of hidden nodes H = 10.
6. HOWEVER, reading the corresponding documentation indicates that H = 10 is a default.
7. Therefore, the net creation statement can be replaced by
net = fitnet;
8. HOWEVER, there will be a different answer each time the code is run. This is because of
a. Default RANDOM data division 70/15/15
b. Default RANDOM initial weights
9. In order to duplicate results, initialize the RNG to the same initial state ( your choice ) BEFORE the train statement.
10. That is all you need to get a repeatable result.
11. HOWEVER, since the initial weights are random, there is no guarantee that the automatic choice is successful. In addition, in the general case, there is no guarantee that H = 10 is a good choice.
12. This can be mitigated by designing multiple nets in a double for loop over H = Hmin:dH:Hmax and Nweighttrials = 1:Ntrials, then choosing the net with the best validation-set performance (sketched just below).
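A minimal sketch of that double loop, assuming Hmin, dH, Hmax and Ntrials have been chosen beforehand (the bookkeeping variable names are illustrative; selection uses the validation-set MSE stored in the training record):
rng(0)                                 % fix the RNG state for repeatability
bestvperf = Inf;
for H = Hmin:dH:Hmax
    for trial = 1:Ntrials
        net = fitnet(H);               % a fresh net gives new random initial weights
        [net, tr] = train(net, x, t);
        if tr.best_vperf < bestvperf   % keep the design with the best validation MSE
            bestvperf = tr.best_vperf;
            bestnet   = net;
            bestH     = H;
        end
    end
end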
13. HOWEVER, the perf value obtained above combines the result for all of the data subsets: trn, val and tst.
14. To obtain separate results for each subset, use the training record tr from
rng('default') % Or your favorite RNG state
[ net tr y e ] = train( net, x, t);
% y = net(x); % output
% e = t-y; % error
NMSE = mse(e)/MSE00
NMSEtrn = tr.best_perf/MSE00 % BIASED: trn used to obtain weights
NMSEval = tr.best_vperf/MSE00 % BIASED: val used to stop training AND pick best of multiple designs
NMSEtst = tr.best_tperf/MSE00 % Use to obtain UNBIASED performance estimate
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[ x ,t ] = simplefit_dataset;
MSE00 = mean( var( t',1) ) % 8.3378
net = fitnet;
rng( 'default' )
[net tr y e] = train(net, x, t);
view(net)
NMSE = mse(e)/MSE00 % 1.7558e-05
NMSEtrn = tr.best_perf/MSE00 % 1.4665e-05
NMSEval = tr.best_vperf/MSE00 % 1.06e-05
NMSEtst = tr.best_tperf/MSE00 % 3.8155e-05
Hope this helps,
Thank you for formally accepting my answer
Greg
PS Many real world examples will require searches for H and initial weights. If you reuse the same net, be sure to use the function CONFIGURE for weight initialization.
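A sketch of that reuse pattern, with x and t as in the example above:
net = configure(net, x, t);   % re-dimensions the reused net and reinitializes weights and biases
[net, tr] = train(net, x, t);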
