Problem with huge epinions data-set

shayan asadpoor on 5 Feb 2016
Commented: Walter Roberson on 7 Feb 2016
Hello everyone. I have the Epinions data set, which has 124000 users and 750000 items. I'm implementing a recommendation system to solve the cold-start problem. In the first phase I need to convert the data set to an M-dimensional format, where M is the total number of items: that means 124000 rows and 750000 columns. It's a really huge matrix. I tried zeros(124000,750000), but a memory error occurred. I tried sparse(124000,750000), which was OK, but filling the matrix takes about 2 days to finish (because of the huge dimensions of the data set). What can I do now? The next phase is performing k-means clustering on this M-dimensional data set. Thanks
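[Editor's note: one common cause of a multi-day fill time is assigning into a sparse matrix one element at a time, which reallocates on every insertion. A minimal sketch of the usual fix, assuming the ratings have been loaded as a numeric array `data` whose columns are [user, item, rating] (that variable name and layout are assumptions, not from the thread):]

```matlab
% Build the 124000-by-750000 user/item rating matrix in one call
% from (row, col, value) triplets, instead of looping S(u,i) = r.
% `data` is assumed to be ~13e6-by-3: [userID, itemID, rating].
users   = data(:, 1);
items   = data(:, 2);
ratings = data(:, 3);
S = sparse(users, items, ratings, 124000, 750000);
```

sparse(i,j,v,m,n) sorts and inserts all triplets at once, so construction takes seconds rather than days.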
  2 Comments
Walter Roberson on 7 Feb 2016
In a duplicate question the poster asks:
I'm working on a recommendation system to solve the cold-start problem in MATLAB. First of all, I got the extended version of the Epinions data set, which is really big: about 13 million ratings, 120000 users, and 755000 items. My problem is that working on this data set needs a lot of memory, so I can't do anything with it. I asked my supervisor and he said I should use sampling, but how can I sample 13 million rows? Every statement I run on this data set takes about a week to finish. What can I do now? Thanks everyone
Walter Roberson on 7 Feb 2016
In order for us to suggest sampling methods, we need to know something about how the data is stored, and whether any kind of sorting has already been applied (because that will affect how random selection should be done).
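[Editor's note: if the 13 million rating rows can be treated as independent observations with no meaningful order, a uniform sample is a one-liner. A hedged sketch, assuming the ratings sit in a numeric array `data` with one rating per row (the variable name and sample size are assumptions):]

```matlab
% Draw a uniform random sample of 1 million of the ~13 million
% rating rows, without replacement.
n      = size(data, 1);        % ~13e6 rows
idx    = randperm(n, 1e6);     % 1e6 distinct row indices
sample = data(idx, :);
```

If the file is already sorted (e.g. by user or by date), a uniform row sample still works, but stratified sampling per user may preserve the rating distribution better.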


Answers (2)

the cyclist on 5 Feb 2016
If I calculate correctly, that is approximately a 700 GB data set as a full matrix (124000 × 750000 doubles at 8 bytes each is about 744 GB). I don't think it is possible to hold that data set in MATLAB memory all at once.
I can imagine some solutions that would allow you to stay within MATLAB: for example, some combination of random sampling and PCA to reduce your data set.
  2 Comments
shayan asadpoor on 5 Feb 2016
Thanks for your answer. The normal Epinions data set is 350 MB: 13000000 rows and 4 columns. I have to convert it to 124000 rows and 750000 columns, because the next step is clustering. I can't reduce or sample the data set, or the result will not be correct; I need all users and all item ratings. I'm implementing this according to a 2013 paper, and the author performs this phase, so I have to do it too.
the cyclist on 5 Feb 2016
OK. I'll try to say a few helpful things.
  • I do not believe that you will be able to fit that large data set into MATLAB memory all at once.
  • The other data set you mention, 13000000 x 4, is smaller by a factor of about 1700 and fits into MATLAB.
  • I believe there are variants of the k-means method that process the data in batches instead of all at once, and that they give a close approximation to running on the entire data set in one batch. (I don't know whether MATLAB implements those algorithms; I doubt it, though.)
  • If you must do this all in one batch, you might need to use something other than MATLAB.
  • You might have to let go of the idea of a "correct" result. K-means is already an approximate technique, with expected error, and you already have sampling error in your users. The question is whether the additional error you introduce is too large. You could run the algorithm multiple times, each with a different random sample, to see how big that error is.
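[Editor's note: the batch-style k-means mentioned above is usually called mini-batch k-means (Sculley, 2010). MATLAB's built-in kmeans does not implement it, but a minimal sketch, simplified and not from the thread, might look like this (pdist2 requires the Statistics and Machine Learning Toolbox):]

```matlab
% Minimal mini-batch k-means sketch (simplified from Sculley, 2010).
% X: n-by-d data matrix (may be sparse); k: number of clusters.
function C = minibatch_kmeans(X, k, batchSize, nIter)
    n = size(X, 1);
    C = full(X(randperm(n, k), :));      % init centers from random rows
    counts = zeros(k, 1);                % per-center assignment counts
    for t = 1:nIter
        B = full(X(randperm(n, batchSize), :));  % draw one mini-batch
        D = pdist2(B, C);                % batchSize-by-k distances
        [~, a] = min(D, [], 2);          % nearest center per point
        for i = 1:batchSize
            c = a(i);
            counts(c) = counts(c) + 1;
            eta = 1 / counts(c);         % per-center learning rate
            C(c, :) = (1 - eta) * C(c, :) + eta * B(i, :);
        end
    end
end
```

Each center is nudged toward the batch points assigned to it, with a step size that shrinks as the center accumulates points, so the centers converge close to what full-batch k-means would produce.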



Walter Roberson on 6 Feb 2016
You will need to use something like mapreduce.
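[Editor's note: a hedged sketch of what that could look like with MATLAB's datastore/mapreduce, processing the ratings file out of core. The file name 'ratings.csv' and its column names are assumptions, not from the thread; this example only counts ratings per item, as a template for per-key aggregation:]

```matlab
% Count ratings per item without loading the whole file into memory.
ds = tabularTextDatastore('ratings.csv', ...
    'VariableNames', {'user', 'item', 'rating'});
result = mapreduce(ds, @mapPerItem, @reducePerItem);

function mapPerItem(data, ~, intermKV)
    % Each chunk: emit (itemID, count-in-this-chunk) pairs.
    [items, ~, g] = unique(data.item);
    counts = accumarray(g, 1);
    addmulti(intermKV, num2cell(items), num2cell(counts));
end

function reducePerItem(key, valIter, outKV)
    % Sum the per-chunk counts for one item.
    total = 0;
    while hasnext(valIter)
        total = total + getnext(valIter);
    end
    add(outKV, key, total);
end
```

The same map/reduce skeleton works for per-user or per-item sums and means, which is typically what the matrix-construction and clustering phases need.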
