Parallel matrix operations on GPU using arrayfun, how to accelerate for big matrix?
I am attempting to use the GPU to accelerate a for-loop problem.
The process is as follows:
for an n^2-by-n^2 matrix,
step 1: reshape each column (n^2 elements) to an n-by-n matrix,
step 2: apply a 2-D discrete sine transform,
step 3: reshape the result back into the n^2-by-n^2 matrix.
My computer currently takes 4.23 seconds for this operation. My question is: how can I improve the speed?
The code is like this:
% input
n_square=10000;
A= rand(n_square, n_square);
tic
B=bigMatrixMuliplication(A);
toc
% core function
function output = bigMatrixMuliplication(input)
n_square = size(input,1);
n = sqrt(n_square);
output = zeros(n_square, n_square);
for ii = 1:n_square
    output(:,ii) = reshape(dst2(reshape(input(:,ii), n, n)), n_square, 1); % steps 1-3
end
end
% 1-D discrete sine transform
function y = dst1(x)
n = size(x,1);
m = size(x,2);
y = [zeros(1,m); x];
y = -imag(fft(y, 2*n+2));
y = sqrt(2/(n+1)) * y(2:n+1,:);
end
% 2-D discrete sine transform
function y = dst2(x)
y = dst1(dst1(x).').';
end
One of my ideas is using parfor, but the result is similar, also about 5 seconds. Another idea is to split the matrix into a 3-D array, for example the 10000-by-10000 matrix into 10000 matrices of size 100-by-100, and use the GPU to process those matrices in parallel. But arrayfun on the GPU is even slower than on the CPU. Does anyone have an idea how to do this? Thank you very much!
A  = rand(100, 100, 5000);
A1 = gpuArray.rand(100, 100, 5000);
idx = 1:5000;
tic
B = arrayfun(@(i) dst2(A(:,:,i)), idx, 'UniformOutput', false);  % CPU
time1 = toc
tic
B = arrayfun(@(i) dst2(A1(:,:,i)), idx, 'UniformOutput', false); % GPU slices
time2 = toc
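For reference, the per-column loop can often be removed entirely: fft accepts a dimension argument, so the 1-D DST can be applied to a whole n-by-n-by-n^2 stack in one call, on CPU or gpuArray input alike. A minimal sketch (the function names dstAllColumns and dstDim1 are mine, not toolbox functions):

```matlab
function output = dstAllColumns(input)
% Batched 2-D DST: treat each column of input as an n-by-n image,
% stack the images along the third dimension, and apply the 1-D DST
% along both page dimensions via dimension-aware fft.
n_square = size(input,1);
n = sqrt(n_square);
X = reshape(input, n, n, n_square);                   % n x n x n^2 stack
X = dstDim1(X);                                       % DST down each column
X = permute(dstDim1(permute(X, [2 1 3])), [2 1 3]);   % DST along each row
output = reshape(X, n_square, n_square);
end

function y = dstDim1(x)
% 1-D DST along the first dimension of a 3-D array,
% same construction as dst1 in the question.
n = size(x,1);
y = cat(1, zeros(1, size(x,2), size(x,3)), x);
y = -imag(fft(y, 2*n+2, 1));
y = sqrt(2/(n+1)) * y(2:n+1, :, :);
end
```

Passing a gpuArray into dstAllColumns keeps all the work on the GPU without any arrayfun over slices, which is where the per-slice overhead comes from.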
Answers (2)
I believe you can accelerate this considerably using a combination of my File Exchange tools FUNC2MAT and KronProd.
First, obtain a 1D DST matrix
M=gpuArray( func2mat(@dst1,ones(n,1),'doSparse',0) ); % n x n DST matrix
Then build a KronProd object to apply it in 2D
K=KronProd({M,M});
and now simply multiply
input=gpuArray(input);
tic;
output = K*input;
toc
Note that the relevant performance measure is the time for the multiplication alone. The construction of the K-object is a one-time cost.
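If you would rather avoid the File Exchange dependency, the same structure can be written out directly: since dst1(X) = D*X with D = dst1(eye(n)), the 2-D transform is dst2(X) = D*X*D.', and applying K = kron(D,D) to each column of the input is exactly that page-wise product. A sketch using pagemtimes (requires R2020b or newer; dst1 is the function from the question; works on gpuArray input as well):

```matlab
n = 100;
D = dst1(eye(n));              % n-by-n 1-D DST matrix (dst1 from the question)
A = rand(n^2, 200);            % a few columns for illustration
X = reshape(A, n, n, []);      % one n-by-n page per column of A
Y = pagemtimes(pagemtimes(D, X), 'none', D, 'transpose');  % D*X*D.' per page
B = reshape(Y, n^2, []);       % back to column form
```

The single 2-D matrix D is implicitly expanded across all pages of X, so no loop is needed.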
1 Comment
Here's a simplified experiment mimicking your data sizes. The execution times that I get on my GPU (Titan X) are well below 4.23 sec.
>> M=gpuArray(rand(100));
>> K=KronProd({M,M});
>> input=gpuArray.rand(1e4);
>> gputimeit(@() K*input)
ans =
0.2863
Jan on 2 Oct 2017
I've played with the code. Inlining the contents of dst1 into dst2, or inlining both functions into bigMatrixMuliplication, might save some milliseconds. But the main work remains in -imag(fft(y,2*n+2)) and in copying values between arrays.
It is a pity to compute the full fft if you crop the result afterwards. Perhaps a hand-made fft variant can avoid this.
The columns are independent of each other, so I'd assume that parfor accelerates the code. Please post how you have implemented it and on how many cores you ran it. A limitation might be that fft is multi-threaded already; then all cores are loaded and cannot be used for further parallelization.
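For reference, a minimal parfor version of the original column loop could look like this (assuming a pool has been opened with parpool, and reusing dst2 from the question); as Jan notes, it may not help if fft already saturates the cores:

```matlab
function output = bigMatrixParfor(input)
% Column-wise 2-D DST with the loop distributed over parallel workers.
n_square = size(input,1);
n = sqrt(n_square);
output = zeros(n_square, n_square);
parfor ii = 1:n_square
    page = reshape(input(:,ii), n, n);               % step 1
    output(:,ii) = reshape(dst2(page), n_square, 1); % steps 2-3
end
end
```

Each column is an independent slice of output, so the parfor loop satisfies the sliced-variable rules without extra communication.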