Description 
3D trilinear interpolation on the GPU.

Format:
vi = interp3_gpu(x, y, z, v, xi, yi, zi);
(Same calling convention as Matlab's interp3: VI = interp3(X,Y,Z,V,XI,YI,ZI);
see http://www.mathworks.com/help/matlab/ref/interp3.html)
The inputs "x", "y" and "z" must be 1D arrays holding the grid coordinates of the 3D matrix "v" along each dimension (see the example). The query points "xi", "yi" and "zi" have no such requirement.
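To make the role of the 1D grid vectors concrete, here is a minimal CPU sketch of trilinear interpolation for a single query point. It is only an illustration: the function name trilinear_point is hypothetical, it is not the GPU kernel shipped with this package, and it assumes that "v" uses the meshgrid layout of interp3 (size [numel(y), numel(x), numel(z)]) and that queries outside the grid should return NaN.

```matlab
function vq = trilinear_point(x, y, z, v, xq, yq, zq)
% TRILINEAR_POINT  Hypothetical CPU reference for one query point.
% x, y, z : increasing 1D grid vectors (any orientation, at least 2 nodes)
% v       : values, size [numel(y), numel(x), numel(z)] (meshgrid layout)
% Returns NaN when the query lies outside the grid (an assumption).
vq = NaN;
if xq < x(1) || xq > x(end) || yq < y(1) || yq > y(end) || ...
        zq < z(1) || zq > z(end)
    return
end

% Lower node of the enclosing cell along each axis, clamped so that
% a query on the last grid node still has an upper neighbour.
ix = min(find(x <= xq, 1, 'last'), numel(x) - 1);
iy = min(find(y <= yq, 1, 'last'), numel(y) - 1);
iz = min(find(z <= zq, 1, 'last'), numel(z) - 1);

% Fractional position inside the cell: 0 at the lower node, 1 at the upper.
tx = (xq - x(ix)) / (x(ix + 1) - x(ix));
ty = (yq - y(iy)) / (y(iy + 1) - y(iy));
tz = (zq - z(iz)) / (z(iz + 1) - z(iz));

% Interpolate along x on four cell edges, then along y, then along z.
c00 = v(iy,     ix, iz    ) * (1 - tx) + v(iy,     ix + 1, iz    ) * tx;
c10 = v(iy + 1, ix, iz    ) * (1 - tx) + v(iy + 1, ix + 1, iz    ) * tx;
c01 = v(iy,     ix, iz + 1) * (1 - tx) + v(iy,     ix + 1, iz + 1) * tx;
c11 = v(iy + 1, ix, iz + 1) * (1 - tx) + v(iy + 1, ix + 1, iz + 1) * tx;
c0  = c00 * (1 - ty) + c10 * ty;
c1  = c01 * (1 - ty) + c11 * ty;
vq  = c0 * (1 - tz) + c1 * tz;
end
```

Because only the node positions along each axis are needed, passing x(1,:,1), y(:,1,1) and z(1,1,:) from a meshgrid-style grid is sufficient.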

Example:
(Same as Matlab's example, with minor change.)
[x,y,z,v] = flow(10);
[xi,yi,zi] = meshgrid(.1:.25:10, -3:.25:3, -3:.25:3);
vi = interp3_gpu(x(1,:,1),y(:,1,1),z(1,1,:),v,xi,yi,zi); % Get the correct x/y/z grid positions.
slice(xi,yi,zi,double(vi),[6 9.5],2,[-2 .2]), shading flat % Matlab's slice function doesn't accept single precision, so convert to double before displaying.
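
One way to check the output is to compare it with Matlab's interp3 on the query grid from the interp3 documentation example. The helper below is a sketch. The name compare_with_interp3 is hypothetical, and the interpolation routine is passed in as a function handle, so nothing here depends on the internals of interp3_gpu. The gather call is an assumption that covers the case where the result comes back as a gpuArray.

```matlab
function maxErr = compare_with_interp3(interpFcn)
% COMPARE_WITH_INTERP3  Hypothetical helper: largest absolute difference
% between interpFcn and Matlab's interp3 (linear method) on the flow data.
% interpFcn is called as interpFcn(x, y, z, v, xi, yi, zi) with 1D grid
% vectors x, y, z, matching the interp3_gpu calling convention.
[x, y, z, v] = flow(10);
[xi, yi, zi] = meshgrid(.1:.25:10, -3:.25:3, -3:.25:3);

% Result under test, using the 1D grid positions along each dimension.
vi = interpFcn(x(1,:,1), y(:,1,1), z(1,1,:), v, xi, yi, zi);

% CPU reference from Matlab itself.
ref = interp3(x, y, z, v, xi, yi, zi, 'linear');

% gather returns ordinary arrays unchanged and copies gpuArrays to the host.
d = abs(double(gather(vi)) - ref);
maxErr = max(d(:));
end
```

Usage: maxErr = compare_with_interp3(@interp3_gpu). If the GPU result is single precision, a small nonzero difference from rounding is expected rather than exact agreement.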

Result:
Can be 20 times faster than Matlab interp3.

Note:
(1) The first call is slower because the kernel has to be built. Subsequent calls are fast because the kernel is kept as a persistent variable inside the function.
(2) If you want a 32-bit version, rebuild the PTX file (nvcc -ptx interp3_cuda.cu). Make sure you have CUDA installed. Afterwards, clear the function (for example, clear interp3_gpu) so that the persistent kernel variable is rebuilt on the next call.
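
The caching behaviour described in notes (1) and (2) can be sketched as follows. This shows the general persistent-variable pattern, not the actual code inside interp3_gpu. The name cached_kernel and the builder argument are hypothetical, and the PTX file name in the usage note is assumed to be the default output of the nvcc command above.

```matlab
function k = cached_kernel(buildFcn)
% CACHED_KERNEL  Hypothetical sketch of the persistent-variable pattern.
% buildFcn is a function handle that performs the expensive build; it is
% only called while no cached object exists.
persistent kernel
if isempty(kernel)
    kernel = buildFcn();   % slow path: runs on the first call only
end
k = kernel;                % fast path: reuse the cached object
end
```

With the Parallel Computing Toolbox, a call could look like k = cached_kernel(@() parallel.gpu.CUDAKernel('interp3_cuda.ptx', 'interp3_cuda.cu')). After rebuilding the PTX file, clear cached_kernel discards the cached object so the next call builds it again, which is why note (2) asks for a clear.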
