Simulating mini_batches with shallow NN train function

I have large datasets: 200 x 840000 inputs and 6 x 840000 targets. How could I use train on, say, 5000 samples at a time across the entire dataset and keep performance up across the whole dataset, so that I don't necessarily have to handle all of the data at once? In other words, I want to mimic the mini-batch technique of deep training, but for shallow training on huge datasets. Below is what I have come up with.
rng('shuffle');

% Network and mini-batch settings.
neurons       = 12;
epochs        = 2;      % passes over the full training set
miniBatchSize = 3000;   % samples (columns) per call to train
parameters    = 6;      % number of target rows stacked at the bottom of the data matrix

% Random split of the combined input/target matrix into train and test portions.
[AllTrain, AllTest] = dividerand(GenTogAllData, 0.91, 0.09);

net = fitnet(neurons);
net.trainFcn = 'trainscg';
net.trainParam.showWindow = 1;
net.trainParam.epochs = 1;   % one training epoch per mini-batch call

tic
for i = 1:epochs
    % New random ordering of the training columns on every pass.
    randomNumbers = randperm(size(AllTrain, 2));
    % Step through the shuffled columns one mini-batch at a time.
    % Leftover columns (fewer than miniBatchSize) are simply skipped.
    for j = 1:miniBatchSize:(size(AllTrain, 2) - miniBatchSize + 1)
        cols      = randomNumbers(j : j + miniBatchSize - 1);
        miniBatch = single(AllTrain(:, cols));
        % Split the mini-batch back into inputs and targets.
        TrainI = miniBatch(1:end-parameters, :);
        TrainT = miniBatch(end-parameters+1:end, :);
        % Keep training the same network so the weights carry over between batches.
        net = train(net, TrainI, TrainT);
    end
end
toc
It runs much quicker per epoch, as I have defined them, than an epoch over the entire dataset, but the best performance I reach this way is never as good as when I let the network train on the entire dataset for a long time. I know this is batch training one epoch at a time; you can easily try adapt as well and establish your own performance criterion, and it still doesn't do as well.
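For reference, the adapt variant looks roughly like this. It is only a minimal sketch, not the exact script I ran: the chunk size and the running-error check are illustrative, and it assumes the network's default adapt settings.

% Sketch: incremental updates with adapt instead of repeated calls to train.
chunk = 3000;
for pass = 1:epochs
    idx = randperm(size(AllTrain, 2));
    for j = 1:chunk:(size(AllTrain, 2) - chunk + 1)
        cols = idx(j : j + chunk - 1);
        Xc = AllTrain(1:end-parameters, cols);
        Tc = AllTrain(end-parameters+1:end, cols);
        % adapt applies the network's incremental learning rules to this chunk.
        [net, ~, e] = adapt(net, Xc, Tc);
        chunkErr = mean(e(:).^2);   % running error, to roll into a stopping criterion
    end
end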
Is there a fundamental reason why we might not be able to do this? I will soon have more data than I can fit in RAM, and I want to reach the performance I know the shallow NN can reach across the entire dataset, but in smaller batches. This relates to another question I asked: I can't fit all 840000 samples on a GPU, but I can fit 300000. So how would I train on 300000 at a time and still keep performance across all 840000?
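To illustrate what I mean by chunking for the GPU, something along these lines is what I am imagining. It is a rough sketch only: the chunk size of 300000, the number of passes, and the 200-row input block are assumptions based on my data layout.

% Rough sketch: resume training the same net on GPU-sized chunks of the data.
gpuChunk = 300000;                       % samples that fit on the GPU at once
nInputs  = 200;                          % input rows at the top of AllTrain
for pass = 1:10                          % repeated passes over the shuffled data
    idx = randperm(size(AllTrain, 2));
    for j = 1:gpuChunk:(size(AllTrain, 2) - gpuChunk + 1)
        cols = idx(j : j + gpuChunk - 1);
        X = AllTrain(1:nInputs, cols);
        T = AllTrain(nInputs+1:end, cols);
        % 'useGPU' asks train to run this call on the GPU.
        net = train(net, X, T, 'useGPU', 'yes');
    end
end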
I know some of the deep NN tools could help here, and I am about to ask another question about how I might try that, but I want to keep this question focused on how to achieve this with a shallow NN, because I know the shallow NN performs well on this dataset and the deep NN tooling is its own beast.
Thank you in advance for any help here.
  2 Comments
Greg Heath on 27 Aug 2018
When you have very large datasets, an excellent approach is to FIRST consider reducing BOTH the number of samples and their dimensionality.
Consider a 1-D Gaussian distribution. How many random draws are necessary for an acceptable estimate of its mean and covariance matrix? How does that change for 2-D and 3-D?
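As a sketch of what that reduction could look like in code, using your variable names and your 200-input/6-target layout from above (the 99% variance cutoff, the 50000-sample subsample, and the use of the Statistics and Machine Learning Toolbox pca function are illustrative choices, not a prescription):

% Sketch: reduce dimensionality with PCA, then train on a random subsample.
X = AllTrain(1:200, :)';                          % observations in rows for pca
T = AllTrain(201:end, :);
[coeff, score, latent] = pca(X);                  % principal components of the inputs
k    = find(cumsum(latent) / sum(latent) >= 0.99, 1);   % components for 99% of the variance
Xred = score(:, 1:k)';                            % k x N reduced inputs
keep = randperm(size(Xred, 2), 50000);            % random subsample of the columns
net2 = fitnet(12);                                % fresh net sized to the reduced inputs
net2 = train(net2, Xred(:, keep), T(:, keep));
yAll = net2(Xred);                                % then check performance on all columns
perf = perform(net2, T, yAll);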
Hope this helps.
Greg
Harley Edwards on 27 Aug 2018
So I know some PCA should help reduce the memory footprint, but I intend to scale this up even further. I see what you mean: I shouldn't need 840000 points to train on the data, but assume I have 840000 USEFUL points. How can I achieve the behavior in question, simulating mini-batches while keeping acceptable performance across the entire set?
To answer your question directly, Greg, I found this link: https://stats.stackexchange.com/questions/59478/when-data-has-a-gaussian-distribution-how-many-samples-will-characterise-it
But I do not think it helps to achieve this simulated mini-batch behavior.
Most genuinely, thank you for your help.


Answers (0)
