If the kernel does little work, the overhead of memcpy operations and
kernel launches can offset any performance gain. Consider working on a larger sample set
(thus increasing the loop size). To detect this condition, look at the
nvvp report.
Do more work in the loop or increase the sample set size.
If the loop body uses too many local or temporary variables, it causes high
register pressure in the per-thread register file. You can detect this condition by
running in GPU safe-build mode. The nvvp report also shows it.
Consider using different block sizes in the coder.gpu.kernel pragma.
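
As a minimal sketch of trying a different block size, the function below places the coder.gpu.kernel pragma immediately before a loop and requests 4 blocks of 256 threads for 1024 iterations. The function name scaleAndAdd, the fixed 1-by-1024 input size, and the chosen dimensions are assumptions made for illustration only; suitable values depend on the target GPU and on how many registers the loop body needs.

```matlab
function out = scaleAndAdd(a, b) %#codegen
% Hypothetical entry-point function (not from the original text).
% Assumes a and b are 1-by-1024 double vectors.
out = zeros(1, 1024);

% Request 4 blocks of 256 threads each (4 * 256 = 1024 iterations).
% To experiment with a different block size, change both arguments
% together, for example coder.gpu.kernel([8,1,1], [128,1,1]).
coder.gpu.kernel([4,1,1], [256,1,1]);
for i = 1:1024
    out(i) = 2 * a(i) + b(i);
end
end
```

After regenerating the code with each setting, compare the nvvp reports to see whether the register pressure and run time change.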