Most Efficient Way to Construct the Matrices to Extract the Lower and Upper Triangle from a Vectorized Matrix

Given a matrix X and its vector form, I am after the most efficient way to build the matrices L and U which extract the lower and upper triangles from X.
In MATLAB code it would be something like this:
clear();
numRows = 3;
numCols = numRows;
mX = randn(numRows, numCols);
vX = mX(:);
% The lower triangle is at linear indices 2, 3, 6
mL = [0, 1, 0, 0, 0, 0, 0, 0, 0; ...
     0, 0, 1, 0, 0, 0, 0, 0, 0; ...
     0, 0, 0, 0, 0, 1, 0, 0, 0];
% The upper triangle is at linear indices 4, 7, 8
mU = [0, 0, 0, 1, 0, 0, 0, 0, 0; ...
     0, 0, 0, 0, 0, 0, 1, 0, 0; ...
     0, 0, 0, 0, 0, 0, 0, 1, 0];
% Masks built with true() so a zero entry in mX can't drop an index
assert(isequal(mL * vX, mX(tril(true(numRows), -1))));
assert(isequal(mU * vX, mX(triu(true(numRows), 1))));
I am after a sparse representation of mU and mL, built in the most efficient way.
My current implementation is given by:
function [ mLU ] = GenerateTriangleExtractorMatrix( numRows, triangleFlag, diagFlag )

EXTRACT_LOWER_TRIANGLE = 1;
EXTRACT_UPPER_TRIANGLE = 2;
INCLUDE_DIAGONAL       = 1;
EXCLUDE_DIAGONAL       = 2;

switch(diagFlag)
    case(INCLUDE_DIAGONAL)
        numElements = 0.5 * numRows * (numRows + 1);
        diagIdx     = 0;
    case(EXCLUDE_DIAGONAL)
        numElements = 0.5 * (numRows - 1) * numRows;
        diagIdx     = 1;
end

vJ = zeros(numElements, 1);

if(triangleFlag == EXTRACT_LOWER_TRIANGLE)
    elmntIdx = 0;
    for jj = 1:numRows
        for ii = (jj + diagIdx):numRows
            elmntIdx = elmntIdx + 1;
            vJ(elmntIdx) = ((jj - 1) * numRows) + ii;
        end
    end
elseif(triangleFlag == EXTRACT_UPPER_TRIANGLE)
    elmntIdx = numElements + 1;
    for jj = numRows:-1:1
        for ii = (jj - diagIdx):-1:1
            elmntIdx = elmntIdx - 1;
            vJ(elmntIdx) = ((jj - 1) * numRows) + ii;
        end
    end
end

mLU = sparse(1:numElements, vJ, 1, numElements, numRows * numRows, numElements);

end
Is there a more efficient way to generate vJ without extensive memory allocation (in order to allow generating really large matrices)?
Thank You.

24 Comments

Won't guarantee most efficient, but how about this?
function triangles
numRows = 1000;
numCols = numRows;
mX = randn(numRows, numCols);
vX = mX(:);
li = (1:numRows) < (1:numRows)'; % strict lower-triangle mask
ln = sum(li(:));
mL = sparse(1:ln, find(li), 1, ln, numRows*numRows);
ui = (1:numRows) > (1:numRows)'; % strict upper-triangle mask
un = sum(ui(:));
mU = sparse(1:un, find(ui), 1, un, numRows*numRows);
assert(isequal(mL * vX, mX(li)));
assert(isequal(mU * vX, mX(ui)));
end
time test:
>> timeit(@triangles)
ans =
0.0827 % sec
This is similar to what I have now. But it allocates a lot, and using find() is probably not the most efficient way. Think of the case numRows = 50000: you created two square matrices of size 50,000 x 50,000. I am after something that creates the indices vector more efficiently.
I will say, after I read your comment the problems with my first suggestion are now pretty dang obvious...
At the risk of proposing another dud, I'll offer this:
function T = func(numRows)

numCols = numRows;
mX = randn(numRows, numCols);
vX = mX(:);

[mL, mU] = triangles;

assert(isequal(mL * vX, mX(logical(tril(mX, -1)))));
assert(isequal(mU * vX, mX(logical(triu(mX, 1)))));

    function [mL, mU] = triangles
        N  = (numRows*(numRows-1))/2;
        N2 = numRows*numRows;
        idx = zeros(N,1);
        begin = 1;
        for i = 1:numRows
            range = ((i-1)*(numRows+1) + 2):(i*numRows);
            n = numel(range);
            idx(begin:begin+n-1) = range;
            begin = begin+n;
        end
        mL = sparse(1:N, idx, 1, N, N2);
        mU = sparse(1:N, N2+1-flip(idx), 1, N, N2);
    end

T = timeit(@triangles);

end
>> func(10000)
ans =
3.7114 % sec
>> func(20000)
ans =
27.0066 % sec
Hi Tommy,
Look at my updated question. I added my own solution.
As you can see I allocate the minimal amount of data. For 1000 elements it is slower than yours. But I wonder what happens for really large cases.
Is there a more efficient way to generate vJ without extensive allocation of memory (In order to allow generating really large matrices)?
Even if there were, it doesn't look like you can get more than about 25% reduction in run time. As a quick profiling of the code will tell you, more than 75% of the run time is consumed by the call to sparse().
Well, 25% of a lot of time is a lot to gain :-). Still, take it as a challenge: I wonder how efficient we can get. There is probably a pattern that allows doing it faster.
Well, 25% of a lot of time is a lot to gain
Is it, though? Whether it runs for 12 hours or 9 hours, you're still going to be letting it run overnight.
You say you have never tried to shave ~10% off the run time of a program :-)? Anyhow, take it as a challenge. Could we find a better-performing pattern to generate vJ?
How will these be used downstream? Are you actually going to be doing the mL * vX or mU * vX multiplies downstream? (In which case you probably want mL and mU to be double sparse.) Or would you want mU and mL to be logical sparse? (Less storage, as long as you don't subsequently turn it into a double sparse downstream.)
@James, you actually raised a good point. It might be simpler to use the logical data class. They will be used as multipliers of vectors in the definition of an optimization problem. I wonder if optimizers (think MOSEK or Gurobi) support sparse matrices which are not double or single, especially in the case of multiplication by a vector of a different class.
If you actually do the multiply, I suspect (although I have not tested) that the logical is first converted to a double before the multiply is done. If true, this would completely negate the original savings from the logical storage (and actually cost you more memory). So double sparse storage is probably best if you do the multiply.
That being said, why bother even doing these multiplies downstream if you already know the answer? I would just have a routine generate the answer to mU * vX directly without generating mU or doing the multiply, but maybe that is not an option for you if you need to pass this to an optimizer that is programmed to do a multiply. Is that the situation?
I will look into a mex routine that will generate mU or mL directly in sparse format.
If they are to be used in an optimization problem, are you certain you need the operation in matrix form? Why would operator form not be good enough?
Yes @James, this is what I think happens. But I think it is a feature worth recommending to solver developers: allow matrices in logical form, as many constraints are basically selecting items.
@Matt, do you know of big solvers which support operator form instead of matrices? Take linprog() or quadprog() for instance; I don't think they accept anything but matrices to express linear constraints.
Take linprog() or quadprog() for instance, I don't think they accept anything but matrices to express linear constraints.
This might be an instance where you would benefit from using fmincon, with a nonlinear constraint function to express the constraint even though it is really linear. This would allow you to implement mL*x and mU*x in operator form. You would of course supply an analytical gradient and, if possible, a feasible initial guess, x0.
Eliminating the need for mL and mU is, in any case, the only thing that is likely to change the game significantly. Although, I now wonder whether the one-time step of building the mL and mU matrices is significant next to the time taken for the optimization iterations to converge. Building vJ would be still less significant. Your 25% improvement is now a 2% improvement at best, I would guess.
@Matt, there are many cases for using those matrices; I just gave one. Anyhow, I think the subject of the question is about building them faster. Anyone could use them the way they need.
Regarding fmincon(): solving a problem which has a dedicated solver with a general solver is usually a really bad choice. Not to mention defining a linear constraint as a nonlinear one (think of the time spent calculating the derivative when it is so well defined).
For an optimization problem, it appears to me that you would know the form of the constraint up front, so the sizes are set and the form of mU or mL doesn't change during the iterations. So this boils down to an up front memory allocation issue, and not really a speed issue.
@Matt, there are many cases for using those matrices.
I can't think of any. You can pursue this for hypothetical interest if you want, of course.
Regarding fmincon(): solving a problem which has a dedicated solver with a general solver is usually a really bad choice.
The situations when that is true are those where calculating the objective and derivatives are faster in matrix form than in operator form. For large problems that won't always be the case, because the computational cost of implementing things in matrix form can start to outweigh the benefits of using a specialized algorithm.
Not to mention defining a linear constraint as a nonlinear one (think of the time spent calculating the derivative when it is so well defined).
I don't think mU and mL are helpful for defining either linear or nonlinear constraints. A linear constraint on the lower triangle of your unknown matrix X will always be of the form sum(T.*X, 'all')<=b, where T is some lower triangular matrix that you know in advance. The matrix form of the constraint gradient is simply T(:), which doesn't require mL at all to set up.
For non-linear constraints c(mL*X)<=0 on the lower triangular part of X, the gradient can be expressed as mL.'*gradc(mL*X), but this could be implemented efficiently and without mL as follows:
B=tril(true(numRows),-1);
Bd=double(B);
g=Bd;
g(B)=gradc(X(B));
So, for this, you really only need to pre-compute B and Bd, which can be done with much less time and memory allocation than mL:
N=3000;
tic;
B=tril(true(N),-1);
Bd=double(B);
toc
%Elapsed time is 0.056492 seconds.
tic;
mL=GenerateTriangleExtractorMatrix( N, 1, 2);
toc
%Elapsed time is 0.264385 seconds.
>> whos B Bd mL
  Name      Size                  Kilobytes  Class           Attributes
  B         3000x3000                  8790  logical
  Bd        3000x3000                 70313  double
  mL        4498500x9000000          140602  double          sparse
@Matt,
You know what, let's give it a try. I will build a problem which involves some constraint on the triangle of a matrix. We'll solve it with fmincon(), writing the constraint efficiently as you suggested (you will write such a function in MATLAB), and compare it against a commercial solver which asks for the constraint in matrix form. If the commercial solver is faster, the matrix form is useful; if not, it means that using a general solver with a simpler form of the constraint is better. It will be interesting. I agree.
We'll compare it against a commercial solver which asks for the constraint in matrix form... If the commercial solver is faster, the matrix form is useful.
But as I mentioned in my last comment, even providing linear constraints in matrix form should not require mL and mU. So I don't know what this comparison could prove. If you wanted to do the comparison for non-linear constraints, that might be interesting...
Commercial solvers, at least as far as I can see, require constraints in matrix form.
Even linprog() and quadprog() are like that.
Yes, but the matrix you provide should not require mL to set up. Each row A(i,:) of the constraint matrix should simply be T_i(:).', where T_i is some row-dependent lower triangular matrix the same size as your unknown matrix X. I don't see why you would need mL as part of the process for generating the T_i. You should be able to set them up more efficiently in the space of X.
I am not sure what you mean, but mL and mU are exactly that. Each of their rows extracts a single element of the vectorized matrix. I guess by T_i(:) you meant the triangular matrix where a single index of the triangle is 1 and the rest are zero. This is exactly mU and mL from above.
But that would mean your constraints are of the form mL*X(:)<=b. But since each row of mL contains only a single non-zero element, this means the constraint is equivalent to a simple bound X(j)<=b. In Matlab, you would never have to construct a matrix to represent such a constraint. You would use the vector input arguments lb and ub to specify those. I assume Gurobi has something similar.
@Matt, I know that. Whenever I can, I use other features of the solver. But I have cases where I need those extractors in matrix form. I appreciate the dialogue; I think others who read it will gain something. I still hope someone will bring a different point of view on the pattern of vJ, though I guess @James' solution is practically as good as it gets (I'd also appreciate anything that makes it even faster).


 Accepted Answer

Here is a mex routine that generates the sparse double matrices mL and mU directly, so no wasted memory in creating them. Seems to run about 3x-5x faster than m-code for somewhat large sizes.
/* S = GenerateTriangleExtractorMatrixMex(numRows,triangleFlag,diagFlag)
*
* S = double sparse matrix
* numRows = integer > 0
* triangleFlag = 1 , extract lower triangle
* 2 , extract upper triangle
* diagFlag = 1 , include diagonal
* 2 , exclude diagonal
* where
*
* M = an numRows X numRows matrix of non-zero terms
* assert(isequal(S * M(:), M(logical(tril(M, -1))))); % for lower
* assert(isequal(S * M(:), M(logical(triu(M, 1))))); % for upper
*
* Programmer: James Tursa
* Date: 2020-April-22
*/
#include "mex.h"

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize numRows, triangleFlag, diagFlag, numElements;
    mwIndex *Ir, *Jc;
    mwIndex i, j, k, m;
    double *pr;

    if( nrhs != 3 || !mxIsNumeric(prhs[0]) || !mxIsNumeric(prhs[1]) || !mxIsNumeric(prhs[2]) ||
        mxGetNumberOfElements(prhs[0]) != 1 || mxGetNumberOfElements(prhs[1]) != 1 ||
        mxGetNumberOfElements(prhs[2]) != 1 ) {
        mexErrMsgTxt("Need three numeric scalar inputs");
    }
    if( nlhs > 1 ) {
        mexErrMsgTxt("Too many outputs");
    }
    numRows      = mxGetScalar(prhs[0]);
    triangleFlag = mxGetScalar(prhs[1]);
    diagFlag     = mxGetScalar(prhs[2]);
    if( numRows < 1 ) {
        mexErrMsgTxt("Invalid numRows, should be > 0");
    }
    if( triangleFlag != 1 && triangleFlag != 2 ) {
        mexErrMsgTxt("Invalid triangleFlag, should be 1 or 2");
    }
    if( diagFlag != 1 && diagFlag != 2 ) {
        mexErrMsgTxt("Invalid diagFlag, should be 1 or 2");
    }
    if( diagFlag == 1 ) {
        numElements = numRows * (numRows + 1) / 2; /* include diagonal */
    } else {
        numElements = (numRows - 1) * numRows / 2; /* exclude diagonal */
    }
    plhs[0] = mxCreateSparse(numElements, numRows*numRows, numElements, mxREAL);
    pr = (double *) mxGetData(plhs[0]);
    Ir = mxGetIr(plhs[0]);
    Jc = mxGetJc(plhs[0]);
    Jc[0] = 0;
    diagFlag--;
    k = 0;
    m = 1;
    if( triangleFlag == 1 ) { /* Lower */
        for( j=0; j<numRows; j++ ) {
            for( i=0; i<numRows; i++ ) {
                if( i >= j+diagFlag ) {
                    *pr++ = 1.0;
                    *Ir++ = k++;
                    Jc[m] = Jc[m-1] + 1;
                } else {
                    Jc[m] = Jc[m-1];
                }
                m++;
            }
        }
    } else { /* Upper */
        for( j=0; j<numRows; j++ ) {
            for( i=0; i<numRows; i++ ) {
                if( i+diagFlag <= j ) {
                    *pr++ = 1.0;
                    *Ir++ = k++;
                    Jc[m] = Jc[m-1] + 1;
                } else {
                    Jc[m] = Jc[m-1];
                }
                m++;
            }
        }
    }
}
You mex the routine as follows (you need a supported C compiler installed):
mex GenerateTriangleExtractorMatrixMex.c
And some test code:
% GenerateTriangleExtractorMatrix_test.m
n = 300;

disp('m-code timing')
tic
GenerateTriangleExtractorMatrix(10000,1,1);
toc

disp('mex code timing')
tic
GenerateTriangleExtractorMatrixMex(10000,1,1);
toc

for k=1:n
    numRows      = ceil(rand*5000+100);
    numCols      = numRows;
    triangleFlag = (rand<0.5) + 1;
    diagFlag     = (rand<0.5) + 1;
    Mm = GenerateTriangleExtractorMatrix(numRows,triangleFlag,diagFlag);
    Mx = GenerateTriangleExtractorMatrixMex(numRows,triangleFlag,diagFlag);
    if( ~isequal(Mm,Mx) )
        error('Not equal');
    end
end
disp('Random tests passed')
With a sample run:
>> GenerateTriangleExtractorMatrix_test
m-code timing
Elapsed time is 9.964882 seconds.
mex code timing
Elapsed time is 1.901741 seconds.
Random tests passed

4 Comments

Nice! I was not aware how easy it is to work with sparse matrices in MEX with MATLAB. I wish I could use these goodies in production C code without a dependency on MATLAB. Or maybe I can? Could I create a standalone DLL with the mex.h library?
"... I was not aware how easy it is to work with sparse Matrices in Mex with MATLAB ..."
Well, I would not say it was easy ... I just happen to have a lot of experience with them so I know how to write correct C code for it. What is easy to do is mess it up and create invalid sparse matrices.
Regarding non-MATLAB code, you could write generic C code for the above sparse format without it being tied to an mxArray. Or it wouldn't be too hard to morph it into a different sparse format. Just depends on what you need (or more precisely, what the optimizer code will accept).
@James, I meant that I want to use all the abstractions of the MATLAB C API, just in my own C files: not for MEX but for general computing. Yet I guess MATLAB blocks that kind of use.
By the way, I tried optimizing the code:
if( triangleFlag == 1 ) { // Lower Triangle
    for( jj = 1; jj < numRows + 1; jj++ ) {
        for( ii = 1; ii < jj + diagFlag; ii++ ) {
            ll++;
            Jc[ll] = Jc[ll - 1];
        }
        for( ii = jj + diagFlag; ii < numRows + 1; ii++ ) {
            ll++;
            Jc[ll] = Jc[ll - 1] + 1;
            vV[kk] = 1.0;
            Ir[kk] = kk;
            kk++;
        }
    }
} else { // Upper Triangle
    for( jj = 1; jj < numRows + 1; jj++ ) {
        for( ii = 1; ii < jj + 1 - diagFlag; ii++ ) {
            ll++;
            Jc[ll] = Jc[ll - 1] + 1;
            vV[kk] = 1.0;
            Ir[kk] = kk;
            kk++;
        }
        for( ii = jj + 1 - diagFlag; ii < numRows + 1; ii++ ) {
            ll++;
            Jc[ll] = Jc[ll - 1];
        }
    }
}
But for some reason even removing the branching inside the loop didn't improve results.
Really Nice! If nothing comes up I will mark this as an answer. Thank You!


More Answers (2)

Another approach to consider is to use my MatrixObj class
to construct an object that has the same effect as the operations mL*X and mL.'*Y, but doesn't require you to actually build the matrix,
N=5000;

tic;
mL0 = GenerateTriangleExtractorMatrix( N, 1, 2);
toc
%Elapsed time is 0.678702 seconds.

tic;
B  = tril(true(N),-1);
Bd = double(B(:));
mL = MatrixObj;
mL.Params.B  = B;
mL.Params.Bd = Bd;
mL.Ops.mtimes   = @(obj,z) z(obj.Params.B);
mL.Trans.mtimes = @mtimesT;
toc;
%Elapsed time is 0.086228 seconds.

function out = mtimesT(obj,z)
    out = obj.Params.Bd;
    out(obj.Params.B) = z;
end
In addition to requiring less time to construct, you can verify that it gives the same results as multiplications with mL and mL.',
>> X=rand(N^2,1); isequal(mL0.'*(mL0*X),mL.'*(mL*X))
ans =
logical
1
but with considerably less memory consumption:
>> whos mL mL0
  Name      Size                  Kilobytes  Class        Attributes
  mL        1x1                      219739  MatrixObj
  mL0       12497500x25000000        390586  double       sparse
My current solution:
(This is the same GenerateTriangleExtractorMatrix function already listed in the updated question above.)
I like that memory allocation is kept to a minimum.
I wonder if there is a more efficient way to generate vJ. It is trivial to remove the inner loop and just count the number of elements per column, yet in MATLAB that would mean each iteration allocates memory (as we don't have iterators).

2 Comments

The two methods are fairly similar - I also like that yours minimizes memory allocation. I ran a few simple timing tests (plots not reproduced here):
I didn't dare try higher than 20,000 for numRows. It seems that your code may perform better at higher values of numRows. In the second case (calculating both the upper and lower triangles) I had your code running both sets of for loops, one after the other (shown in red). In green is the result from your code if only the first set of loops runs, and you recognize that vJ for one triangle is easy to determine if you have vJ for the other triangle (N^2+1-flip(vJ)). So the only thing I'll conclude from this is: if you will eventually calculate both the lower and upper triangle matrices for a given size, it might be better to calculate them together and only find vJ once. I suppose it depends on how expensive N^2+1-flip(vJ) is.
@Tommy, really liked your analysis. Yes, when dealing with sparse matrices the whole point is making sure allocation is kept to a minimum. I agree that if one wants both, it is better to do the trick you mentioned.
Let's see if someone can think of a different pattern to populate vJ which is more efficient.

