Main Content

Shallow Neural Networks with Parallel and GPU Computing


For deep learning, parallel and GPU support is automatic. You can train a convolutional neural network (CNN, ConvNet) or long short-term memory networks (LSTM or BiLSTM networks) using the trainnet function, and choose the execution environment (CPU, GPU, multi-GPU, and parallel) using trainingOptions.

Training in parallel, or on a GPU, requires Parallel Computing Toolbox™. For more information on deep learning with GPUs and in parallel, see Deep Learning with Big Data on CPUs, GPUs, in Parallel, and on the Cloud.

Modes of Parallelism

Neural networks are inherently parallel algorithms. Multicore CPUs, graphical processing units (GPUs), and clusters of computers with multiple CPUs and GPUs can take advantage of this parallelism.

Parallel Computing Toolbox, when used in conjunction with Deep Learning Toolbox™, enables neural network training and simulation to take advantage of each mode of parallelism.

For example, the following shows a standard single-threaded training and simulation session:

[x, t] = bodyfat_dataset;
net1 = feedforwardnet(10);
net2 = train(net1, x, t);
y = net2(x);

The two steps you can parallelize in this session are the call to train and the implicit call to sim (where the network net2 is called as a function).

In Deep Learning Toolbox you can divide any data, such as x and t in the previous example code, across samples. If x and t contain only one sample each, there is no parallelism. But if x and t contain hundreds or thousands of samples, parallelism can provide both speed and problem size benefits.

Distributed Computing

Parallel Computing Toolbox allows neural network training and simulation to run across multiple CPU cores on a single PC, or across multiple CPUs on multiple computers on a network using MATLAB® Parallel Server™.

Using multiple cores can speed calculations. Using multiple computers can allow you to solve problems using data sets too big to fit in the RAM of a single computer. The only limit to problem size is the total quantity of RAM available across all computers.

To manage cluster configurations, use the Cluster Profile Manager from the MATLAB Home tab Environment menu Parallel > Manage Cluster Profiles.

To open a pool of MATLAB workers using the default cluster profile, which is usually the local CPU cores, use this command:

pool = parpool
Starting parallel pool (parpool) using the 'Processes' profile ... connected to 4 workers.

When parpool runs, it displays the number of workers available in the pool. Another way to determine the number of workers is to query the pool:


Now you can train and simulate the neural network with data split by sample across all the workers. To do this, set the train and sim parameter 'useParallel' to 'yes'.

net2 = train(net1,x,t,'useParallel','yes')
y = net2(x,'useParallel','yes')

Use the 'showResources' argument to verify that the calculations ran across multiple workers.

net2 = train(net1,x,t,'useParallel','yes','showResources','yes');
y = net2(x,'useParallel','yes','showResources','yes');

MATLAB indicates which resources were used. For example:

Computing Resources:
Parallel Workers
  Worker 1 on MyComputer, MEX on PCWIN64
  Worker 2 on MyComputer, MEX on PCWIN64
  Worker 3 on MyComputer, MEX on PCWIN64
  Worker 4 on MyComputer, MEX on PCWIN64

When train and sim are called, they divide the input matrix or cell array data into distributed Composite values before training and simulation. When sim has calculated a Composite, this output is converted back to the same matrix or cell array form before it is returned.

However, you might want to perform this data division manually if:

  • The problem size is too large for the host computer. Manually defining the elements of Composite values sequentially allows much bigger problems to be defined.

  • It is known that some workers are on computers that are faster or have more memory than others. You can distribute the data with differing numbers of samples per worker. This is called load balancing.

The following code sequentially creates a series of random datasets and saves them to separate files:

pool = gcp;
for i=1:pool.NumWorkers
  x = rand(2,1000);
  save(['inputs' num2str(i)],'x');
  t = x(1,:) .* x(2,:) + 2 * (x(1,:) + x(2,:));
  save(['targets' num2str(i)],'t');
  clear x t

Because the data was defined sequentially, you can define a total dataset larger than can fit in the host PC memory. PC memory must accommodate only a sub-dataset at a time.

Now you can load the datasets sequentially across parallel workers, and train and simulate a network on the Composite data. When train or sim is called with Composite data, the 'useParallel' argument is automatically set to 'yes'. When using Composite data, configure the network’s input and outputs to match one of the datasets manually using the configure function before training.

xc = Composite;
tc = Composite;
for i=1:pool.NumWorkers
  data = load(['inputs' num2str(i)],'x');
  xc{i} = data.x;
  data = load(['targets' num2str(i)],'t');
  tc{i} = data.t;
  clear data
net2 = configure(net1,xc{1},tc{1});
net2 = train(net2,xc,tc);
yc = net2(xc);

To convert the Composite output returned by sim, you can access each of its elements, separately if concerned about memory limitations.

for i=1:pool.NumWorkers
  yi = yc{i}

Combined the Composite value into one local value if you are not concerned about memory limitations.

y = {yc{:}};

When load balancing, the same process happens, but, instead of each dataset having the same number of samples (1000 in the previous example), the numbers of samples can be adjusted to best take advantage of the memory and speed differences of the worker host computers.

It is not required that each worker have data. If element i of a Composite value is undefined, worker i will not be used in the computation.

Single GPU Computing

The number of cores, size of memory, and speed efficiencies of GPU cards are growing rapidly with each new generation. Where video games have long benefited from improved GPU performance, these cards are now flexible enough to perform general numerical computing tasks like training neural networks.

For the latest GPU requirements, see the web page for Parallel Computing Toolbox; or query MATLAB to determine whether your PC has a supported GPU. This function returns the number of GPUs in your system:

count = gpuDeviceCount
count =


If the result is one or more, you can query each GPU by index for its characteristics. This includes its name, number of multiprocessors, SIMDWidth of each multiprocessor, and total memory.

gpu1 = gpuDevice(1)
gpu1 = 

  CUDADevice with properties:

                      Name: 'NVIDIA RTX A5000'
                     Index: 1
         ComputeCapability: '8.6'
            SupportsDouble: 1
             DriverVersion: 11.6000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 25553076224 (25.55 GB)
           AvailableMemory: 25153765376 (25.15 GB)
       MultiprocessorCount: 64
              ClockRateKHz: 1695000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

The simplest way to take advantage of the GPU is to specify call train and sim with the parameter argument 'useGPU' set to 'yes' ('no' is the default).

net2 = train(net1,x,t,'useGPU','yes')
y = net2(x,'useGPU','yes')

If net1 has the default training function trainlm, you see a warning that GPU calculations do not support Jacobian training, only gradient training. So the training function is automatically changed to the gradient training function trainscg. To avoid the notice, you can specify the function before training:

net1.trainFcn = 'trainscg';

To verify that the training and simulation occur on the GPU device, request that the computer resources be shown:

net2 = train(net1,x,t,'useGPU','yes','showResources','yes')
y = net2(x,'useGPU','yes','showResources','yes')

Each of the above lines of code outputs the following resources summary:

Computing Resources:
GPU device #1, GeForce GTX 470

Many MATLAB functions automatically execute on a GPU when any of the input arguments is a gpuArray. Normally you move arrays to and from the GPU with the functions gpuArray and gather. However, for neural network calculations on a GPU to be efficient, matrices need to be transposed and the columns padded so that the first element in each column aligns properly in the GPU memory. Deep Learning Toolbox provides a special function called nndata2gpu to move an array to a GPU and properly organize it:

xg = nndata2gpu(x);
tg = nndata2gpu(t);

Now you can train and simulate the network using the converted data already on the GPU, without having to specify the 'useGPU' argument. Then convert and return the resulting GPU array back to MATLAB with the complementary function gpu2nndata.

Before training with gpuArray data, the network’s input and outputs must be manually configured with regular MATLAB matrices using the configure function:

net2 = configure(net1,x,t);  % Configure with MATLAB arrays
net2 = train(net2,xg,tg);    % Execute on GPU with NNET formatted gpuArrays
yg = net2(xg);               % Execute on GPU
y = gpu2nndata(yg);          % Transfer array to local workspace

On GPUs and other hardware where you might want to deploy your neural networks, it is often the case that the exponential function exp is not implemented with hardware, but with a software library. This can slow down neural networks that use the tansig sigmoid transfer function. An alternative function is the Elliot sigmoid function whose expression does not include a call to any higher order functions:

(equation)	a = n / (1 + abs(n))

Before training, the network’s tansig layers can be converted to elliotsig layers as follows:

for i=1:net.numLayers
  if strcmp(net.layers{i}.transferFcn,'tansig')
    net.layers{i}.transferFcn = 'elliotsig';

Now training and simulation might be faster on the GPU and simpler deployment hardware.

Distributed GPU Computing

Distributed and GPU computing can be combined to run calculations across multiple CPUs and/or GPUs on a single computer, or on a cluster with MATLAB Parallel Server.

The simplest way to do this is to specify train and sim to do so, using the parallel pool determined by the cluster profile you use. The 'showResources' option is especially recommended in this case, to verify that the expected hardware is being employed:

net2 = train(net1,x,t,'useParallel','yes','useGPU','yes','showResources','yes')
y = net2(x,'useParallel','yes','useGPU','yes','showResources','yes')

These lines of code use all available workers in the parallel pool. One worker for each unique GPU employs that GPU, while other workers operate as CPUs. In some cases, it might be faster to use only GPUs. For instance, if a single computer has three GPUs and four workers each, the three workers that are accelerated by the three GPUs might be speed limited by the fourth CPU worker. In these cases, you can specify that train and sim use only workers with unique GPUs.

net2 = train(net1,x,t,'useParallel','yes','useGPU','only','showResources','yes')
y = net2(x,'useParallel','yes','useGPU','only','showResources','yes')

As with simple distributed computing, distributed GPU computing can benefit from manually created Composite values. Defining the Composite values yourself lets you indicate which workers to use, how many samples to assign to each worker, and which workers use GPUs.

For instance, if you have four workers and only three GPUs, you can define larger datasets for the GPU workers. Here, a random dataset is created with different sample loads per Composite element:

numSamples = [1000 1000 1000 300];
xc = Composite;
tc = Composite;
for i=1:4
  xi = rand(2,numSamples(i));
  ti = xi(1,:).^2 + 3*xi(2,:);
  xc{i} = xi;
  tc{i} = ti;

You can now specify that train and sim use the three GPUs available:

net2 = configure(net1,xc{1},tc{1});
net2 = train(net2,xc,tc,'useGPU','yes','showResources','yes');
yc = net2(xc,'showResources','yes');

To ensure that the GPUs get used by the first three workers, manually converting each worker’s Composite elements to gpuArrays. Each worker performs this transformation within a parallel executing spmd block.

  if spmdIndex <= 3
    xc = nndata2gpu(xc);
    tc = nndata2gpu(tc);

Now the data specifies when to use GPUs, so you do not need to tell train and sim to do so.

net2 = configure(net1,xc{1},tc{1});
net2 = train(net2,xc,tc,'showResources','yes');
yc = net2(xc,'showResources','yes');

Ensure that each GPU is used by only one worker, so that the computations are most efficient. If multiple workers assign gpuArray data on the same GPU, the computation will still work but will be slower, because the GPU will operate on the multiple workers’ data sequentially.

Parallel Time Series

For time series networks, simply use cell array values for x and t, and optionally include initial input delay states xi and initial layer delay states ai, as required.

net2 = train(net1,x,t,xi,ai,'useGPU','yes')
y = net2(x,xi,ai,'useParallel','yes','useGPU','yes')

net2 = train(net1,x,t,xi,ai,'useParallel','yes')
y = net2(x,xi,ai,'useParallel','yes','useGPU','only')

net2 = train(net1,x,t,xi,ai,'useParallel','yes','useGPU','only')
y = net2(x,xi,ai,'useParallel','yes','useGPU','only')

Note that parallelism happens across samples, or in the case of time series across different series. However, if the network has only input delays, with no layer delays, the delayed inputs can be precalculated so that for the purposes of computation, the time steps become different samples and can be parallelized. This is the case for networks such as timedelaynet and open-loop versions of narxnet and narnet. If a network has layer delays, then time cannot be “flattened” for purposes of computation, and so single series data cannot be parallelized. This is the case for networks such as layrecnet and closed-loop versions of narxnet and narnet. However, if the data consists of multiple sequences, it can be parallelized across the separate sequences.

Parallel Availability, Fallbacks, and Feedback

As mentioned previously, you can query MATLAB to discover the current parallel resources that are available.

To see what GPUs are available on the host computer:

gpuCount = gpuDeviceCount
for i=1:gpuCount

To see how many workers are running in the current parallel pool:

poolSize = pool.NumWorkers

To see the GPUs available across a parallel pool running on a PC cluster using MATLAB Parallel Server:

  worker.index = spmdIndex; = system('hostname');
  worker.gpuCount = gpuDeviceCount;
    worker.gpuInfo = gpuDevice;
    worker.gpuInfo = [];

When 'useParallel' or 'useGPU' are set to 'yes', but parallel or GPU workers are unavailable, the convention is that when resources are requested, they are used if available. The computation is performed without error even if they are not. This process of falling back from requested resources to actual resources happens as follows:

  • If 'useParallel' is 'yes' but Parallel Computing Toolbox is unavailable, or a parallel pool is not open, then computation reverts to single-threaded MATLAB.

  • If 'useGPU' is 'yes' but the gpuDevice for the current MATLAB session is unassigned or not supported, then computation reverts to the CPU.

  • If 'useParallel' and 'useGPU' are 'yes', then each worker with a unique GPU uses that GPU, and other workers revert to CPU.

  • If 'useParallel' is 'yes' and 'useGPU' is 'only', then workers with unique GPUs are used. Other workers are not used, unless no workers have GPUs. In the case with no GPUs, all workers use CPUs.

When unsure about what hardware is actually being employed, check gpuDeviceCount, gpuDevice, and pool.NumWorkers to ensure the desired hardware is available, and call train and sim with 'showResources' set to 'yes' to verify what resources were actually used.