You might like to have a look at the following article:
In those results, the achieved transfer bandwidth tops out at about 5.7 GB/sec (send) and 4.0 GB/sec (gather). Whilst I can't give you a definitive answer as to why your measured transfer rates are so low and unreliable, here are a few points to consider:
- The second "wait(gpu)" inside your tight loop is not needed and will affect the results. Memory transfers from device to host (i.e. "gather") are always synchronized, so the call has already blocked until the data has arrived.
- You are measuring the speed of transferring data to/from the GPU (i.e. the speed of the PCI bus). This is not the same as the GPU memory bandwidth (as suggested by the question title), which is much, much higher (>90 GB/sec for your GPU and even higher for a recent GPU).
- It is nearly impossible to accurately measure the transfer bandwidth from within MATLAB. What you are actually timing here is the time taken to allocate some space (on the GPU in the first case, in host memory in the second), to perform the data transfer and to assign a MATLAB variable. These extra steps take some (hopefully small) amount of time, which lowers the bandwidth you measure.
- Some of the variability may come from other processes using the PCI bus. Running your OS in a highly stripped-down mode with no network etc. might help.
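
To make the points above concrete, here is a minimal sketch of how the two transfers could be timed without a hand-written loop or extra "wait(gpu)" calls. It is not necessarily the code from the article; it assumes Parallel Computing Toolbox with a supported GPU and a MATLAB release that provides gputimeit, and the 128 MB payload size is an arbitrary choice of mine.

```matlab
% Sketch: time host->GPU ("send") and GPU->host ("gather") transfers.
% Assumes Parallel Computing Toolbox and a supported GPU.
gpu = gpuDevice();
fprintf('Device: %s\n', gpu.Name);

numBytes    = 2^27;             % 128 MB payload (arbitrary choice)
numElements = numBytes / 8;     % a double occupies 8 bytes

hostData = rand(numElements, 1);               % lives in host memory
gpuData  = rand(numElements, 1, 'gpuArray');   % lives in GPU memory

% gputimeit calls the function several times and takes care of
% synchronizing with the GPU, so no explicit wait(gpu) is needed.
sendTime   = gputimeit(@() gpuArray(hostData));
gatherTime = gputimeit(@() gather(gpuData));

% The timings still include allocation and variable-assignment overhead,
% so treat these figures as lower bounds on the raw bus bandwidth.
fprintf('Send:   %.2f GB/sec\n', numBytes / sendTime   / 1e9);
fprintf('Gather: %.2f GB/sec\n', numBytes / gatherTime / 1e9);
```

Repeating this for a range of payload sizes should show the bandwidth rising with size and then levelling off, since fixed overheads dominate small transfers.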
If you try the code from the article and still see much lower results, let me know. Note, however, that you are not really measuring your GPU here; you are simply measuring how busy your PCI bus is and how well MATLAB can throw data at it. It's an important measure, but it's not usually the most important one, so long as you do plenty of calculations with your data once you've put it on the GPU. If you want to know more about your GPU's calculation performance, you might like to take GPUBench for a spin: