Improving performance of X = A \ B.

Question

4 votes

profileSolvers_1b.m

Last edit: section "EDIT 11/08/2016"

Dear all,

How would you improve the performance in the computation of:

X = A \ B ;

where A is a n x n non-symmetric sparse matrix, B a n x m array, and X a n x m array? It represents m solutions corresponding to m emission scenarios in my model (compartmental model of size n). Typical values for n are in the range 1e5-1e6, and m can be adjusted: I have e.g. 1e4 scenarios that I solve by group/block. I provide an example with n~1e5 and m=50 which is good for profiling (~3s on my laptop): MAT-File, stored externally because it is slightly larger than 5MB.

Array B is generally dense, and matrix A has a density of ~6.5e-5 and the following structure:

Given the size in memory of these arrays (~11MB for A and 1KB for B), I thought that they could be transferred to the GPU and gathered at little cost, but when I try I get:

 Error using  \ 
 Sparse gpuArrays are not supported for this function.

I saw other threads that mention that \ is not yet? supported for sparse arrays (e.g. ref1, ref2).

I tried using TomLab TLSQR but with little success (TomLab may have another tool for this, but I didn't find it).

I also tried using Tim Davis' SuiteSparse UMFPACK. I think that MATLAB implemented part of this code, so I wasn't surprised to see the same performance between MATLAB MLDIVIDE and UMFPACK. I saw that SuiteSparse can be compiled with GPU support, but I wanted to ask here before going this way.

I didn't explore other, maybe more algebraic options, transformations, etc yet. Does anything jump to your mind? The A matrix is defined by block (6 x 6, not delineated here but "somehow visible"). The filling of each block can vary, but the density is rather stable.

Thank you,

Cedric

(MATLAB 2016b Windows/Linux)

EDIT 10/31/2016

I have to add that I implemented explicitly a parallel approach working with cell arrays of blocks of both X and B, i.e. something like:

 X_blocks = cell( size( B_blocks )) ;
 parfor k = 1 : numel( B_blocks )
    X_blocks{k} = A \ B_blocks{k} ;
 end

but I thought that, given what is implemented in MLDIVIDE is a Multi-frontal method, the parallelization could happen at a much lower level using either/both a parallel pool or/and a GPU device(?) Or is it possible to use distributed ( ref ) in some way to avoid repeated factorization of A ( ref )?

EDIT 11/02/2016

After giving a try to iterative solvers (see Sean part of the thread), I went on trying direct approaches. I am attaching a test script that profiles MLDIVIDE, MLDIVIDE parallel by block, FACTORIZE (by Tim Davis), FACTORIZE parallel by block, and an explicit implementation of what factorize does for non-symmetric sparse matrices ( LU -> L, U, P, Q, R and then relevant series of MLDIVIDE). I obtain the following results:

 Test simple MLDIVIDE (\) on large B ..
  - Total time       : 14.5s
  - Max rel. residual: 6.6e-16
 Test parallel MLDIVIDE (\) on cell B ..
  - Total time       : 11.0s
  - Max rel. residual: 6.6e-16
 Test simple FACTORIZE on large B ..
  - Time factorize   : 1.0s
  - Total time       : 10.5s
  - Max rel. residual: 3.1e-15
 Test parallel FACTORIZE on cell B ..
  - Time factorize   : 1.0s
  - Total time       : 9.0s
  - Max rel. residual: 3.1e-15
 Test simple LU, etc on large B ..
  - Time factorize   : 0.9s
  - Total time       : 10.2s
  - Max rel. residual: 3.1e-15
 Test parallel LU, etc on cell B ..
  - Time factorize   : 0.9s
  - Total time       : 8.9s
  - Max rel. residual: 3.1e-15

where the rational for using FACTORIZE and the explicit formula (LU, etc) is that matrix A is the same for all blocks (when parallelized), and should not be repeatedly factorized. This approach works best (especially with my real, much larger, B arrays), but I'd still be interested to know if it is possible to do better with iterative approaches and possibly on GPU(?)

EDIT 11/08/2016

I performed quick tests overnight that I am summarizing here. I observed that the block/parallel approach ( X{k}=Q*(U\(L\(P*(R\B_cell{k})))) in a PARFOR loop, after explicitly factorizing A ) is almost always more time efficient than direct MLDIVIDE ( solving X=A\B in one shot), and I wanted to test it as a function of the system size and the number of columns in B. My test script loops through system sizes from 100 to 90,000, and through numbers of columns of blocks of B from 1 to 100 ( B itself has 200 columns). For each system size, I measure the time for solving X=A\B in one shot, and the time for solving it by block for the aforementioned block sizes. The code for solving by block is as follows:

 B_cell = mat2cell( B, B_nRows, Bblock_nCols ) ;
 X      = cell( size( B_cell )) ;
 parfor k = 1 : numel( B_cell )
     X{k} = Q * (U \ (L \ (P * (R \ B_cell{k})))) ;
 end 
 X = [X{:}] ;

where factors of A were obtained as follows outside of the loop that iterates through block sizes:

[L, U, P, Q, R] = lu( A ) ;

I obtain the following results:

Aside from the fact that it is a valid, lengthy approach for generating a sunset landscape, it strikes me that there is no clear pattern with block size. I thought that I would observe something function of the size of the cache, because the maximum block size is ~9e4*100*8/1e6 -> 72MB > 8MB Xeon 1505M cache. The second interesting point is that direct solving is almost always slower than an explicit block/PARFOR approach (plot on the right). Finally, it seems that overall "PARFOR-ing" a lot with small bock sizes is most efficient (bottom plot: for each system size, I compute times/max(times), and I plot the mean over all system sizes).

This result is a bit counter-intuitive to me; I thought that I would get a pattern that would allow me to determine the maximum block size for each system size, as a function of the amount of cache in the CPU, and that this max block size would be optimal regarding time efficiency of the block/PARFOR approach.

Any thoughts/insights?

OTHER REFERENCES

Ref : Jill mentions factorization of A.
Ref : Jill mentions distributed.speye.
Ref : Doc of SPPARMS, with mention of the UMFPACK parameter (60 times slower than default in my case).
Ext ref : The Multifrontal Method for Sparse Matrix Solution: Theory and Practice, Joseph W. H. Liu, SIAM Review, Vol. 34, No. 1 (Mar., 1992), pp. 82-109.

1 Comment
Show -1 older comments Hide -1 older comments

Cedric on 22 Sep 2017

Similar thread for the symmetric case: https://www.mathworks.com/matlabcentral/answers/357776-is-it-possible-to-speed-up-solving-ax-b-for-multiple-b-s-by-pre-computing-the-cholesky-factorization

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Sean de Wolski on 31 Oct 2016

0 votes

Rather than calling \, you can call bicg and gmres on sparse gpuArrays.

6 Comments
Show 4 older comments Hide 4 older comments

zawye aung on 2 Oct 2020

Open in MATLAB Online

Hello! Sir, i have a trouble. Could you hlep me.

N = 4 ;
A = rand(N,N) ;B = rand(N,1) ;
% U = A\B;disp(U);
% B_blocks = mat2tiles(B,100,1);
% B_blocks = mat2cell(B, [2000 2000 2000 2000 2000], [10000 1]);
% celldisp(C1);
% A_cell = mat2cell(A, [2 2], [2 2]);
B_cell = mat2cell(B, [2 2], [1]);
 X      = cell( size( B_cell )) ;
 parfor k = 1 : numel( B_cell )
     X{k} = A\ B_cell{k} ;
 end 
 X = cell2mat(X);
 disp(X);
%%%%%%%%%%%%%%%%%%%

Error using manager (line 10)

Matrix dimensions must agree.

Walter Roberson on 16 Nov 2020

You are trying to \ between an N x N array, A, and an (N-divided-by-something) by 1 array B_cell{k} .

Sign in to comment.

Improving performance of X = A \ B.

1 Comment
Show -1 older comments Hide -1 older comments

Answers (1)

6 Comments
Show 4 older comments Hide 4 older comments

Categories

Products

Tags

Community Treasure Hunt

Improving performance of X = A \ B.

1 Comment Show -1 older comments Hide -1 older comments

Answers (1)

6 Comments Show 4 older comments Hide 4 older comments

Categories

Products

Tags

See Also

Community Treasure Hunt

1 Comment
Show -1 older comments Hide -1 older comments

6 Comments
Show 4 older comments Hide 4 older comments