I am working on a machine with a number of CPU cores (40) and a number of GPUs (4). I need to train a large number of shallow LSTM neural networks (~500,000), and would like to use my compute resources as efficiently as possible.
Here are the options I've come up with:
1) parpool('local') gives 40 workers max, which are the number of CPU cores available. Apparently parpool('local') does not provide access to the GPUs - is this correct? I can then use spmd to launch separate instances of trainNetwork across individual CPUs on my machine, and this runs 40 such instances at a time.
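For reference, the spmd approach I'm using looks roughly like this (XTrain, YTrain, and layers are placeholders for my actual data and network definition):

```matlab
% Sketch of option 1: one trainNetwork instance per CPU worker.
% XTrain, YTrain, and layers are placeholders for my real inputs.
pool = parpool('local', 40);            % one worker per CPU core
spmd
    opts = trainingOptions('adam', ...
        'ExecutionEnvironment', 'cpu', ...  % force CPU on every worker
        'Verbose', false);
    % Each worker trains its own strided subset of the ~500k networks.
    for k = labindex:numlabs:numel(XTrain)
        net = trainNetwork(XTrain{k}, YTrain{k}, layers, opts);
        % ... save or collect net here ...
    end
end
delete(pool);
```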
I have three questions about this:
First, is there a way to use both the GPUs and CPUs as separate laboratories (i.e., with different labindex values) in my spmd loop? Why do I not have a total of 44 available workers from parpool?
Second, is there a way to assign more than one CPU core to a particular lab? For example, could I divide my 40 cores into 8 groups of 5 and deploy a separate instance of trainNetwork to each of the 8 groups?
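What I have in mind for this second question is something like the following (I believe a NumThreads property exists on the cluster profile in newer releases, but I'm not certain it applies to my setup, so treat this as a guess):

```matlab
% Hypothetical: 8 workers, each allowed 5 computational threads,
% so each trainNetwork call could use a 5-core group.
c = parcluster('local');
c.NumThreads = 5;                 % threads per worker (newer releases only?)
pool = parpool(c, 8);             % 8 workers x 5 threads = 40 cores
```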
Third, given that I am using LSTMs, my 'ExecutionEnvironment' options are 'gpu', 'cpu', and 'auto'. It appears that the 'cpu' option uses more than one core at a time: each task takes roughly 6x longer under spmd than when I run a single instance of trainNetwork (with 'ExecutionEnvironment' set to 'cpu') by itself. This leads me to believe that a single instance of trainNetwork with 'ExecutionEnvironment' = 'cpu' uses more than one CPU core. Is this correct?
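One way I thought of to check this multithreading hypothesis is to compare maxNumCompThreads on the client session and on a pool worker; my understanding is that workers default to a single computational thread, which would explain the ~6x slowdown:

```matlab
% In the client session: typically reports all physical cores (40 here).
maxNumCompThreads

% On parallel pool workers: reportedly defaults to 1 thread each.
pool = parpool('local', 2);
spmd
    fprintf('Worker %d: %d thread(s)\n', labindex, maxNumCompThreads);
end
delete(pool);
```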
2) I can access the GPUs individually using gpuDevice, and I can run 4 instances of trainNetwork simultaneously on my 4 GPUs. This works well, with effectively linear speedup compared to using one GPU at a time, but it apparently does not take advantage of my CPUs.
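My current multi-GPU setup (option 2) looks roughly like this (again, XTrain, YTrain, and layers stand in for my actual data and network):

```matlab
% Sketch of option 2: one worker per GPU; each worker binds to its
% own device and trains its share of the networks on it.
pool = parpool('local', 4);       % matches my gpuDeviceCount
spmd
    gpuDevice(labindex);          % bind this worker to GPU #labindex
    opts = trainingOptions('adam', ...
        'ExecutionEnvironment', 'gpu', 'Verbose', false);
    for k = labindex:numlabs:numel(XTrain)
        net = trainNetwork(XTrain{k}, YTrain{k}, layers, opts);
    end
end
delete(pool);
```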
Ideally, I'd like a way to (1) test scaling across multiple CPUs for my particular trainNetwork problem, and (2) run multiple parallel instances of trainNetwork that use all of my hardware. The best option seems to be to let each GPU take a share of the trainNetwork instances in parallel, and to deploy groups of CPUs (of currently unknown optimal size) to handle the remaining instances.
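In rough pseudocode, the hybrid scheme I'm imagining would be something like this, assuming such a mixed CPU/GPU pool were even possible (which is essentially my question):

```matlab
% Hypothetical hybrid: first 4 labs use the GPUs, the rest use CPUs.
pool = parpool('local', 44);      % 40 CPU + 4 GPU? This is what I can't get.
spmd
    if labindex <= 4
        gpuDevice(labindex);      % GPU labs each grab one device
        env = 'gpu';
    else
        env = 'cpu';              % remaining labs train on CPU
    end
    opts = trainingOptions('adam', ...
        'ExecutionEnvironment', env, 'Verbose', false);
    for k = labindex:numlabs:numel(XTrain)
        net = trainNetwork(XTrain{k}, YTrain{k}, layers, opts);
    end
end
```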
Is there a way to do this?