RMSE of model with standardized input

Hadi Hadi on 13 Apr 2015
Commented: Star Strider on 13 Apr 2015
Hi all,
I am running a model that requires standardizing the input predictor and response variables. I want to calculate the RMSE in the original units of the response, as in the following script. I standardize both my training and test data using the mean and standard deviation of the training set. However, the magnitude of the RMSE is not what I expected. I would appreciate it if anyone could point out where I made a mistake in my script. Many thanks :)
% Calculate mean and sd of data_train
% data_train and data_test are matrices with the response in the 1st column
mean_train=mean(data_train); sd_train=std(data_train);
% Standardize data_train and data_test with mean and sd of data_train
zdata_train=(data_train-repmat(mean_train,[size(data_train,1) 1]))./ ...
repmat(sd_train, [size(data_train,1) 1]);
zdata_test=(data_test-repmat(mean_train,[size(data_test,1) 1]))./ ...
repmat(sd_train, [size(data_test,1) 1]);
xtrain=zdata_train(:,2:end); ytrain=zdata_train(:,1);
xtest=zdata_test(:,2:end); ytest=zdata_test(:,1);
% Run model with output test set predicted response (standardized) ymu_te
% Calculate RMSE in original y unit
ymu_te = ymu_te.*sd_train(:,1) + mean_train(:,1); % response in 1st column
RMSE_test=sqrt(mean((ytrain-ymu_te).^2));

Answers (1)

Star Strider on 13 Apr 2015
I’m not following what you’re doing. Your ‘zdata_test’ seems to use your training data in its calculation, so that may be giving you anomalous results. If you have the Statistics Toolbox, use the zscore function. If not, this works as well (tested against zscore):
z_score = @(data) bsxfun(@rdivide,bsxfun(@minus,data,mean(data)),std(data));
Your data must be arranged so that each variable is in a column and each observation is in a row.
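As a quick check of the anonymous function above, the following sketch applies it to a small made-up matrix and confirms that each column ends up with mean 0 and standard deviation 1. The matrix `X` and the variable names `col_means` and `col_stds` are illustrative assumptions, not data from the question.

```matlab
% Column-wise standardization without the Statistics Toolbox
z_score = @(data) bsxfun(@rdivide, bsxfun(@minus, data, mean(data)), std(data));

% Hypothetical data: 4 observations (rows) of 2 variables (columns)
X = [1 10; 2 20; 3 30; 4 40];
Z = z_score(X);

col_means = mean(Z);   % each entry should be 0 up to rounding error
col_stds  = std(Z);    % each entry should be 1 up to rounding error
```

Note that this function always uses the mean and standard deviation of the matrix it is given, so it cannot by itself apply training-set statistics to a test set.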
  2 Comments
Hadi Hadi on 13 Apr 2015
Hi Star, thanks for the reply. I standardize the test set with the mean and std of the training set because I read somewhere that this is the correct procedure for independent validation (the test set should not be 'seen' in any way!). Also, since the test set in each fold of my 10-fold CV contains very few cases, its mean and std may not be representative of the population, as far as I understand. By the way, I previously ran the same analysis standardizing the test set with its own mean and std, and transforming the predicted response from z-scores back to its original unit still gave me an unexpected magnitude. The problem is that I cannot compare the RMSE in z units with the RMSE from the other models I tested, which is in the original unit. Since the training set is different in each fold of the CV, I feel I should use the mean and std of each fold's training set to transform the standardized predicted response back to the original unit. Is this the right procedure? Thanks.
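The per-fold procedure described in this comment could be sketched roughly as follows for a single fold. Everything here is an assumption for illustration: the data are random, the least-squares fit is only a stand-in for the unspecified model, and `ymu_te_orig`, `RMSE_z` and `RMSE_check` are hypothetical names. The point of the sketch is the last step: the back-transformed predictions are compared with the untransformed test response `data_test(:,1)`. The script in the question instead compares them with `ytrain`, which is standardized and comes from the training set, and that may explain the unexpected magnitude.

```matlab
% Hypothetical data for one fold: response in column 1, predictors in the rest
rng(0);                                   % reproducible made-up data
data_train = [randn(50,1)*5 + 20, randn(50,2)];
data_test  = [randn(5,1)*5 + 20,  randn(5,2)];

% Training-set statistics (one value per column)
mean_train = mean(data_train);
sd_train   = std(data_train);

% Standardize both sets with the training statistics
zdata_train = bsxfun(@rdivide, bsxfun(@minus, data_train, mean_train), sd_train);
zdata_test  = bsxfun(@rdivide, bsxfun(@minus, data_test,  mean_train), sd_train);
xtrain = zdata_train(:,2:end);  ytrain = zdata_train(:,1);
xtest  = zdata_test(:,2:end);   ytest  = zdata_test(:,1);

% Stand-in model (an assumption): least squares on the standardized data
beta   = xtrain \ ytrain;
ymu_te = xtest * beta;                    % predictions in z units

% Back-transform the predictions with the same training statistics
ymu_te_orig = ymu_te .* sd_train(1) + mean_train(1);

% RMSE in original units: compare against the untransformed test response
RMSE_test = sqrt(mean((data_test(:,1) - ymu_te_orig).^2));

% Equivalent shortcut: RMSE in z units scaled by the training std of the response
RMSE_z     = sqrt(mean((ytest - ymu_te).^2));
RMSE_check = sd_train(1) * RMSE_z;
```

Because the same mean and standard deviation are used to standardize the test response and to back-transform the predictions, the mean cancels in the residual and `RMSE_test` equals `sd_train(1) * RMSE_z`. In a 10-fold CV, these statistics would be recomputed from each fold's training set.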
Star Strider on 13 Apr 2015
My impression is that the training data and test data should be individually standardised.
