I am using spmd to run computations in parallel on multiple GPUs in one workstation. Basically, each GPU does some calculation, the workers broadcast their results, each one updates its parameters, and the loop repeats. The problem is that using labSend (actually gplus, in my case) to aggregate and broadcast the results is pretty slow. It first pulls the results off the GPU into system memory, sends them to the other workers, and then uploads them to the other GPUs.
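For reference, here is a minimal sketch of the kind of loop I mean. It is not my real code: the array sizes, the step size, the iteration count, and the stand-in "calculation" are all placeholders, and it assumes the pool has exactly one worker per GPU.

```matlab
% Hypothetical sketch of the compute / aggregate / update loop.
% All sizes, constants, and the per-GPU "calculation" are placeholders.
nGpus = gpuDeviceCount;                 % GPUs available on this workstation
if isempty(gcp('nocreate'))
    parpool(nGpus);                     % assumption: one worker per GPU
end

nIter = 10;                             % placeholder iteration count
spmd
    gpuDevice(labindex);                % bind this worker to its own GPU
    w = zeros(1000, 1, 'gpuArray');     % parameters, replicated on every GPU
    for it = 1:nIter
        % Stand-in for the real per-GPU calculation.
        g = rand(1000, 1, 'gpuArray') - w;

        % Sum the partial results across workers; every worker gets the total.
        % This is the slow step described above.
        gSum = gplus(g);

        % Every worker applies the same update, so w stays in sync.
        w = w + 0.01 * (gSum / numlabs);
    end
end
```

Everything except the gplus call stays on the GPU, so the transfer inside gplus is what dominates each iteration.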
Thus, I would like to have a gplus() or labSend() that copies a gpuArray directly to the memory of another GPU on another worker.
Is this possible today? If not, is it something you are working on?