Follow-up questions to ask:
1. Do you observe the delay when creating gpuArrays and doing some basic calculations, or only when calling gpuDevice(1)?
2. Are the timings accurate? GPU operations run asynchronously, so use either gputimeit or wait(gpuDevice) before reading the timer.
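To separate the two cases in question 1, the device selection and a basic calculation can be timed independently. The following is a minimal sketch, not a prescribed procedure: the matrix size and variable names are arbitrary assumptions, and it should be run in a fresh MATLAB session, since the delay is expected on the first GPU call.

```matlab
% Sketch: time device selection and a basic gpuArray calculation separately.
% Matrix size (4096) and all variable names are illustrative assumptions.

% 1) Time the device selection itself. gpuDevice(1) selects and resets device 1.
tic;
dev = gpuDevice(1);
tSelect = toc;

% 2) Time a basic calculation with gputimeit, which runs the function
%    several times and synchronizes with the GPU before reporting.
A = rand(4096, 'gpuArray');          % allocate directly on the GPU
tGputimeit = gputimeit(@() A * A);

% 3) Time the same calculation with tic/toc. The explicit wait calls make
%    sure pending asynchronous GPU work is not missed by the timer.
wait(dev);                           % drain pending work before starting
tic;
B = A * A;
wait(dev);                           % block until the multiplication finishes
tTicToc = toc;

fprintf('gpuDevice(1):       %.4f s\n', tSelect);
fprintf('gputimeit:          %.4f s\n', tGputimeit);
fprintf('tic/toc with wait:  %.4f s\n', tTicToc);
```

If only the first number is large, the delay is tied to device initialization rather than to the calculations.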
If the issue is observed only with gpuDevice(1), refer to the related article (also described below):
Pascal architecture cards show this issue because the default CUDA cache is too small to hold all of our libraries once they have been re-optimized for the unfamiliar architecture. To fix this, set the environment variable "CUDA_CACHE_MAXSIZE" on the machine to a high value such as 1GB. By default, "CUDA_CACHE_MAXSIZE" is 32MB.
On Windows, you can do this under Properties > Advanced system settings > Environment Variables. To set the cache to 1GB, use CUDA_CACHE_MAXSIZE 1073741824. Our tests suggest that the cache needs 445MB on Linux, but we are not sure about Windows. If 1GB is too much disk space to lose, you can reduce this number to 500MB. The computer may need to be restarted and one gpuDevice call executed for the cache to be saved correctly, after which the performance drop becomes a one-time cost.
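After changing the setting and restarting, it can help to confirm that MATLAB actually sees the new value. The sketch below only reads the variable with getenv and prints the byte counts for the two sizes mentioned above; the variable names are illustrative, and it assumes that 1GB and 500MB are meant as binary multiples (the 1073741824 figure above is 1024^3).

```matlab
% Sketch: check whether CUDA_CACHE_MAXSIZE is visible to this MATLAB session.
% Assumption: sizes are binary multiples, matching the 1073741824 value above.
oneGB      = 1024^3;            % 1073741824 bytes
fiveHundMB = 500 * 1024^2;      % 524288000 bytes

raw = getenv('CUDA_CACHE_MAXSIZE');   % returns '' when the variable is not set
if isempty(raw)
    fprintf('CUDA_CACHE_MAXSIZE is not set; the default cache size applies.\n');
else
    bytes = str2double(raw);
    fprintf('CUDA_CACHE_MAXSIZE = %s bytes (about %.0f MB)\n', raw, bytes / 1024^2);
end

fprintf('Value for 1GB:   %d\n', oneGB);
fprintf('Value for 500MB: %d\n', fiveHundMB);
```

If the variable shows as not set, MATLAB was probably started before the change took effect, so restart MATLAB (or the computer) and check again.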
Pascal cards are officially supported by CUDA 8.0 and higher, so this performance drop will be visible in any MATLAB release built with a CUDA version earlier than 8.0.