In order to speed up the code, I modified it so that it runs in parallel on my GPU using MATLAB's Parallel Computing Toolbox. In theory, I should get a huge speedup, since the computation at each point in the volume is independent of every other point. However, I only get a very slight speedup. I am relatively new to MATLAB's GPU computing interface, so I expect there is something wrong with my implementation.
Could someone point out where I went wrong and how the code should be modified to get the speedup I expected?
I attached the GPU and CPU versions (ad3d_gpu.zip) of the code to this post.