Asked by Cedric Wannaz
on 28 Oct 2016

Last edit: section "EDIT 11/08/2016"

Dear all,

How would you improve the performance in the computation of:

X = A \ B ;

where A is a n x n non-symmetric sparse matrix, B a n x m array, and X a n x m array? It represents m solutions corresponding to m emission scenarios in my model (compartmental model of size n). Typical values for n are in the range 1e5-1e6, and m can be adjusted: I have e.g. 1e4 scenarios that I solve by group/block. I provide an example with n~1e5 and m=50 which is good for profiling (~3s on my laptop): MAT-File, stored externally because it is slightly larger than 5MB.

Array B is generally dense, and matrix A has a density of ~6.5e-5 and the following structure:

Given the size in memory of these arrays (~11MB for A and 1KB for B), I thought that they could be transferred to the GPU and gathered at little cost, but when I try I get:

Error using \

Sparse gpuArrays are not supported for this function.

I tried using TomLab TLSQR but with little success (TomLab may have another tool for this, but I didn't find it).

I also tried using Tim Davis' SuiteSparse UMFPACK. I think that MATLAB implemented part of this code, so I wasn't surprised to see the same performance between MATLAB MLDIVIDE and UMFPACK. I saw that SuiteSparse can be compiled with GPU support, but I wanted to ask here before going this way.

I didn't explore other, maybe more algebraic options, transformations, etc yet. Does anything jump to your mind? The A matrix is defined by block (6 x 6, not delineated here but "somehow visible"). The filling of each block can vary, but the density is rather stable.

Thank you,

Cedric

(MATLAB 2016b Windows/Linux)

EDIT 10/31/2016

I have to add that I implemented explicitly a parallel approach working with cell arrays of blocks of both X and B, i.e. something like:

X_blocks = cell( size( B_blocks )) ;

parfor k = 1 : numel( B_blocks )

X_blocks{k} = A \ B_blocks{k} ;

end

but I thought that, given what is implemented in MLDIVIDE is a Multi-frontal method, the parallelization could happen at a much lower level using either/both a parallel pool or/and a GPU device(?) Or is it possible to use distributed ( ref ) in some way to avoid repeated factorization of A ( ref )?

EDIT 11/02/2016

After giving a try to iterative solvers (see Sean part of the thread), I went on trying direct approaches. I am attaching a test script that profiles MLDIVIDE, MLDIVIDE parallel by block, FACTORIZE (by Tim Davis), FACTORIZE parallel by block, and an explicit implementation of what factorize does for non-symmetric sparse matrices ( LU -> L, U, P, Q, R and then relevant series of MLDIVIDE). I obtain the following results:

Test simple MLDIVIDE (\) on large B ..

- Total time : 14.5s

- Max rel. residual: 6.6e-16

Test parallel MLDIVIDE (\) on cell B ..

- Total time : 11.0s

- Max rel. residual: 6.6e-16

Test simple FACTORIZE on large B ..

- Time factorize : 1.0s

- Total time : 10.5s

- Max rel. residual: 3.1e-15

Test parallel FACTORIZE on cell B ..

- Time factorize : 1.0s

- Total time : 9.0s

- Max rel. residual: 3.1e-15

Test simple LU, etc on large B ..

- Time factorize : 0.9s

- Total time : 10.2s

- Max rel. residual: 3.1e-15

Test parallel LU, etc on cell B ..

- Time factorize : 0.9s

- Total time : 8.9s

- Max rel. residual: 3.1e-15

where the rational for using FACTORIZE and the explicit formula (LU, etc) is that matrix A is the same for all blocks (when parallelized), and should not be repeatedly factorized. This approach works best (especially with my real, much larger, B arrays), but I'd still be interested to know if it is possible to do better with iterative approaches and possibly on GPU(?)

EDIT 11/08/2016

I performed quick tests overnight that I am summarizing here. I observed that the block/parallel approach ( X{k}=Q*(U\(L\(P*(R\B_cell{k})))) in a PARFOR loop, after explicitly factorizing A ) is almost always more time efficient than direct MLDIVIDE ( solving X=A\B in one shot), and I wanted to test it as a function of the system size and the number of columns in B. My test script loops through system sizes from 100 to 90,000, and through numbers of columns of blocks of B from 1 to 100 ( B itself has 200 columns). For each system size, I measure the time for solving X=A\B in one shot, and the time for solving it by block for the aforementioned block sizes. The code for solving by block is as follows:

B_cell = mat2cell( B, B_nRows, Bblock_nCols ) ;

X = cell( size( B_cell )) ;

parfor k = 1 : numel( B_cell )

X{k} = Q * (U \ (L \ (P * (R \ B_cell{k})))) ;

end

X = [X{:}] ;

where factors of A were obtained as follows outside of the loop that iterates through block sizes:

[L, U, P, Q, R] = lu( A ) ;

I obtain the following results:

Aside from the fact that it is a valid, lengthy approach for generating a sunset landscape, it strikes me that there is no clear pattern with block size. I thought that I would observe something function of the size of the cache, because the maximum block size is ~9e4*100*8/1e6 -> 72MB > 8MB Xeon 1505M cache. The second interesting point is that direct solving is almost always slower than an explicit block/PARFOR approach (plot on the right). Finally, it seems that overall "PARFOR-ing" a lot with small bock sizes is most efficient (bottom plot: for each system size, I compute times/max(times), and I plot the mean over all system sizes).

This result is a bit counter-intuitive to me; I thought that I would get a pattern that would allow me to determine the maximum block size for each system size, as a function of the amount of cache in the CPU, and that this max block size would be optimal regarding time efficiency of the block/PARFOR approach.

Any thoughts/insights?

OTHER REFERENCES

- Ref : Jill mentions factorization of A.
- Ref : Jill mentions distributed.speye.
- Ref : Doc of SPPARMS, with mention of the UMFPACK parameter (60 times slower than default in my case).
- Ext ref : The Multifrontal Method for Sparse Matrix Solution: Theory and Practice, Joseph W. H. Liu, SIAM Review, Vol. 34, No. 1 (Mar., 1992), pp. 82-109.

Answer by Sean de Wolski
on 31 Oct 2016

Rather than calling \, you can call bicg and gmres on sparse gpuArrays.

Cedric Wannaz
on 2 Nov 2016

Thank you Sean and Joss. I tried iterative solvers GMRES, PCG, BICG, and BICGSTAB, but none is converging (even with maxit = 1000).

I tried a few conditioners but I have not been even close to successful. Given the filling displayed above, do you see any "obvious" sound approach for conditioning?

In the mean time, I went on with direct approaches, as explained in my edited question.

Royi Avital
on 14 Aug 2018

How does MATLAB's `pcg()` compares to Intel MKL's PCG implementation performance wise?

Cedric Wannaz
on 16 Aug 2018

Royi, you should post this as a new question, because this is an old thread.

Sign in to comment.

Opportunities for recent engineering grads.

Apply Today
## 1 Comment

## Cedric Wannaz (view profile)

Direct link to this comment:https://www.mathworks.com/matlabcentral/answers/309690-improving-performance-of-x-a-b#comment_486790

Sign in to comment.