Why do tall arrays not work with "lasso"?

I get an error when I call the "lasso" function on a tall array backed by a datastore.
I use tall arrays for feature engineering, which produces a very large number of features: about 500,000. At that point the tall arrays are written to disk. Afterwards, I read the data back from the datastore with the tall command. The written data is about 60 GB.
X_tall = tall(ds);
lambda = [0.5 0.75 1];
[B,FitInfo]=lasso(X_tall(:,2:end),X_tall(:,1),'Lambda',lambda,'Standardize',true);
Error:
Error using tall/gather (line 50)
Unable to access intermediate data in the temporary folder, most likely because it ran out of space. Clear space on the local
drive, or avoid operations that reorder tall arrays (such as SORT or indexing with a tall numeric column vector).
Learn more about errors encountered during GATHER.
Error in tall/lasso (line 73)
[xtx, xty, yty, n, p, muX, muY, sigX] = gather(xtx,xty,yty, n, p, muX, muY, sigX);
Error in lasso_20211103 (line 128)
[B,FitInfo]=lasso(X_tall(:,2:end),X_tall(:,1),'Lambda',lambda,'Standardize',true);
Caused by:
  Error in adding keys and values.
    Value max size exceeded for database
    C:\Users\blabla\AppData\Local\Temp\tp066aa292_7229_45f4_8cc9_a736b269cadb\ExecutionTask_tp3c5c16cd_c3a2_4347_b8d3_9500514faccb_720.db.
    (string or blob too big)

Accepted Answer

MathWorks Support Team on 6 Dec 2021
The tall framework is not designed to work with wide data that does not fit in memory. For some other tall functions you may be able to get it to work by modifying the partition size. The lasso function, however, needs to compute the covariance matrix of the predictors, and the size of that matrix grows with the square of the number of predictors.
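To give a sense of scale, here is a rough back-of-the-envelope calculation. It assumes the roughly 500,000 features mentioned in the question, double-precision storage, and that the `xtx` term in the error trace is a dense p-by-p matrix; none of this is confirmed beyond what the post and trace show.

```matlab
% Rough size estimate for a dense p-by-p matrix in double precision.
% p is taken from the question ("about 500,000 features"); it is an
% approximation, not a measured value.
p = 5e5;                    % number of predictors (assumed)
bytesPerDouble = 8;         % size of one double-precision value
numElements = p^2;          % entries in a p-by-p matrix
totalBytes = numElements * bytesPerDouble;

fprintf('%.1f TB\n', totalBytes / 1e12);   % prints "2.0 TB"
```

That is far more than the 60 GB of input data, which would be consistent with the temporary folder running out of space.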
Here we recommend using FITRLINEAR without tall arrays, with the SPARSA solver. FITRLINEAR does not try to compute the whole covariance matrix. Your predictor matrix may well fit in local memory. Even if the array is almost as large as the available memory, FITRLINEAR can cope with it, because it was designed for these cases: it does not create extra copies of the input array or use derived arrays of similar or larger size.
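A minimal sketch of that recommendation might look as follows. The synthetic `X` and `y` are stand-ins for the real data (assumed here to fit in memory), and the `lambda` values are simply copied from the question; whether they are suitable for fitrlinear on the real data is not something this sketch establishes.

```matlab
% Synthetic stand-in for the real in-memory data (assumption).
rng(0);
n = 200;                          % observations
p = 1000;                         % predictors
X = randn(n, p);
betaTrue = zeros(p, 1);
betaTrue(1:5) = [3; -2; 1.5; 1; -1];
y = X*betaTrue + 0.1*randn(n, 1);

lambda = [0.5 0.75 1];            % values taken from the question

% Lasso-penalized least squares using the SpaRSA solver.
Mdl = fitrlinear(X, y, ...
    'Learner', 'leastsquares', ...
    'Regularization', 'lasso', ...
    'Solver', 'sparsa', ...
    'Lambda', lambda);

% Mdl.Beta has one column of coefficients per lambda value.
nnzPerLambda = sum(Mdl.Beta ~= 0, 1);
disp(nnzPerLambda)
```

If the effect of 'Standardize',true from the original lasso call is wanted, the columns of X would have to be scaled by the user before fitting.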
If the whole predictor matrix does not fit in memory, use a hierarchical approach. For example, run FITRLINEAR separately on each half of the predictors, keep half of the predictors from each run, and then run a third FITRLINEAR on the combined selected predictors.
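One possible reading of this hierarchical approach is sketched below. The helper names `loadColumns` and `selectHalf` are hypothetical, the in-memory `Xfull` only stands in for data that would really be read from disk block by block, and ranking predictors by the absolute value of their coefficients is an assumption about what "pick half of the predictors" means (it is scale dependent unless the columns are standardized first).

```matlab
% Demo data; in practice loadColumns would read columns from disk.
rng(0);
n = 200;  p = 400;
Xfull = randn(n, p);
y = Xfull(:, [3 250]) * [2; -3] + 0.1*randn(n, 1);
loadColumns = @(idx) Xfull(:, idx);   % hypothetical column loader
lambda = 0.05;                        % single illustrative value

% Stage 1: fit each half of the predictors separately.
half1 = 1:floor(p/2);
half2 = (floor(p/2) + 1):p;
keep1 = selectHalf(loadColumns(half1), y, half1, lambda);
keep2 = selectHalf(loadColumns(half2), y, half2, lambda);

% Stage 2: refit on the combined survivors.
keep = [keep1, keep2];
Mdl = fitrlinear(loadColumns(keep), y, ...
    'Learner', 'leastsquares', 'Regularization', 'lasso', ...
    'Solver', 'sparsa', 'Lambda', lambda);
selected = keep(Mdl.Beta ~= 0);       % original column indices kept
disp(selected)

function keep = selectHalf(X, y, idx, lambda)
% Fit a lasso model on one block and keep the half of its columns
% with the largest absolute coefficients (assumed ranking rule).
    Mdl = fitrlinear(X, y, ...
        'Learner', 'leastsquares', 'Regularization', 'lasso', ...
        'Solver', 'sparsa', 'Lambda', lambda);
    [~, order] = sort(abs(Mdl.Beta), 'descend');
    nKeep = ceil(numel(idx) / 2);
    keep = sort(idx(order(1:nKeep)));
end
```

Only one block of columns needs to be in memory at a time, at the cost of possibly discarding predictors that matter only in combination with columns from the other half.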
For large in-memory data that consumes more than 50% of the RAM on your computer, fitrlinear is the better choice. The lasso function always standardizes the data, which doubles the memory footprint; fitrlinear does not standardize.

Release

R2021b