Problem with huge epinions data-set

shayan asadpoor on 5 Feb 2016
Commented: Walter Roberson on 7 Feb 2016
Hello everyone. I have the Epinions data set, which has 124000 users and 750000 items. I'm implementing a recommendation system to solve the cold-start problem. In the first phase I need to convert the data set to an M-dimensional format, where M is the total number of items: that means 124000 rows and 750000 columns. It's a really huge matrix. I tried zeros(124000,750000), but a memory error occurred. I tried sparse(124000,750000), which was OK, but filling the matrix takes about 2 days to finish (because of the huge dimensions of the data set). What can I do now? The next phase is performing k-means clustering on this M-dimensional data set. Thanks
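[Editor's note: one common cause of a multi-day fill time is assigning into a sparse matrix one element at a time, which reallocates on every insertion. A minimal sketch of the usual fix, assuming the ratings have been loaded as a numeric array `data` whose columns are [user, item, rating] (that variable name and layout are assumptions, not from the thread):]

```matlab
% Build the 124000-by-750000 user/item rating matrix in one call
% from (row, col, value) triplets, instead of looping S(u,i) = r.
% `data` is assumed to be ~13e6-by-3: [userID, itemID, rating].
users   = data(:, 1);
items   = data(:, 2);
ratings = data(:, 3);
S = sparse(users, items, ratings, 124000, 750000);
```

sparse(i,j,v,m,n) sorts and inserts all triplets at once, so construction takes seconds rather than days.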
  2 Comments
Walter Roberson on 7 Feb 2016
In a duplicate question the poster asks:
I'm working on a recommendation system to solve the cold-start problem in MATLAB. First of all, I got the extended version of the Epinions data set, which is really big: about 13 million ratings, 120000 users, and 755000 items. My problem is that working on this data set needs a lot of memory, so I can't do anything with it. I asked my supervisor and he said I should use sampling, but how can I sample 13 million rows? Every statement I run on this data set takes about a week to finish. What can I do now? Thanks everyone
Walter Roberson on 7 Feb 2016
In order for us to suggest sampling methods, we need to know something about how the data is stored, and whether any kind of sorting has already been applied (because that will affect how random selection should be done).
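[Editor's note: if the 13 million rating rows can be treated as independent observations with no meaningful order, a uniform sample is a one-liner. A hedged sketch, assuming the ratings sit in a numeric array `data` with one rating per row (the variable name and sample size are assumptions):]

```matlab
% Draw a uniform random sample of 1 million of the ~13 million
% rating rows, without replacement.
n      = size(data, 1);        % ~13e6 rows
idx    = randperm(n, 1e6);     % 1e6 distinct row indices
sample = data(idx, :);
```

If the file is already sorted (e.g. by user or by date), a uniform row sample still works, but stratified sampling per user may preserve the rating distribution better.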


Answers (2)

the cyclist on 5 Feb 2016
If I calculate correctly, that is approximately a 700 GB data set as a full matrix (124000 × 750000 doubles at 8 bytes each is about 744 GB). I don't think it is possible to hold that data set in MATLAB memory all at once.
I can imagine some solutions that would allow you to stay within MATLAB: for example, some combination of random sampling and PCA to reduce your data set.
  2 Comments
shayan asadpoor on 5 Feb 2016
Thanks for your answer. The normal Epinions data set is 350 MB: 13000000 rows and 4 columns. I have to convert it to 124000 rows and 750000 columns, because the next step is clustering. I can't reduce or sample the data set, or the result will not be correct; I need all users and all item ratings. I'm implementing this according to a 2013 paper, and the author performs this phase, so I have to do it too.
the cyclist on 5 Feb 2016
OK. I'll try to say a few helpful things.
  • I do not believe that you will be able to fit that large data set into MATLAB memory all at once.
  • The other data set you mention, 13000000 x 4, is smaller by a factor of about 1700 and fits into MATLAB.
  • I believe there are variants of the k-means method that process the data in batches instead of all at once, and that they give a close approximation to running on the entire data set in one batch. (I don't know whether MATLAB implements those algorithms; I doubt it, though.)
  • If you must do this all in one batch, you might need to use something other than MATLAB.
  • You might have to let go of the idea of a "correct" result. K-means is already an approximate technique, with expected error, and you already have sampling error in your users. The question is whether the additional error you introduce is too large. You could run the algorithm multiple times, each with a different random sample, to see how big that error is.
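[Editor's note: the batch-style k-means mentioned above is usually called mini-batch k-means (Sculley, 2010). MATLAB's built-in kmeans does not implement it, but a minimal sketch, simplified and not from the thread, might look like this (pdist2 requires the Statistics and Machine Learning Toolbox):]

```matlab
% Minimal mini-batch k-means sketch (simplified from Sculley, 2010).
% X: n-by-d data matrix (may be sparse); k: number of clusters.
function C = minibatch_kmeans(X, k, batchSize, nIter)
    n = size(X, 1);
    C = full(X(randperm(n, k), :));      % init centers from random rows
    counts = zeros(k, 1);                % per-center assignment counts
    for t = 1:nIter
        B = full(X(randperm(n, batchSize), :));  % draw one mini-batch
        D = pdist2(B, C);                % batchSize-by-k distances
        [~, a] = min(D, [], 2);          % nearest center per point
        for i = 1:batchSize
            c = a(i);
            counts(c) = counts(c) + 1;
            eta = 1 / counts(c);         % per-center learning rate
            C(c, :) = (1 - eta) * C(c, :) + eta * B(i, :);
        end
    end
end
```

Each center is nudged toward the batch points assigned to it, with a step size that shrinks as the center accumulates points, so the centers converge close to what full-batch k-means would produce.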



Walter Roberson on 6 Feb 2016
You will need to use something like mapreduce.
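[Editor's note: a hedged sketch of what that could look like with MATLAB's datastore/mapreduce, processing the ratings file out of core. The file name 'ratings.csv' and its column names are assumptions, not from the thread; this example only counts ratings per item, as a template for per-key aggregation:]

```matlab
% Count ratings per item without loading the whole file into memory.
ds = tabularTextDatastore('ratings.csv', ...
    'VariableNames', {'user', 'item', 'rating'});
result = mapreduce(ds, @mapPerItem, @reducePerItem);

function mapPerItem(data, ~, intermKV)
    % Each chunk: emit (itemID, count-in-this-chunk) pairs.
    [items, ~, g] = unique(data.item);
    counts = accumarray(g, 1);
    addmulti(intermKV, num2cell(items), num2cell(counts));
end

function reducePerItem(key, valIter, outKV)
    % Sum the per-chunk counts for one item.
    total = 0;
    while hasnext(valIter)
        total = total + getnext(valIter);
    end
    add(outKV, key, total);
end
```

The same map/reduce skeleton works for per-user or per-item sums and means, which is typically what the matrix-construction and clustering phases need.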
