word count matrix problem

Willem
Willem on 10 Nov 2013
Commented: Willem on 15 Nov 2013
Can anyone see how I can correct this code for the wordCount matrix? I am counting the unique words across all the files, and I also have a 2469 (unique words) by 160 (reviews) matrix. I have attached a snippet of the matrix for preview.
The problem is that I am completely stuck on how to allocate the word counts to each individual review. What happens at the moment is that the total count appears in the first column and the rest are zero. I would very much appreciate it if someone could have a look at my code and see if they can find the problem. It is probably a really stupid error, but I just cannot see it; I have tried loads of methods, and this appears to be the best one so far (for me at least).
clear all;
% Collects requested files from a specified folder and inserts them into an array
fpath = ('C:\Users\Willem\Documents\MATLAB\fold1');
% Returns an error if folder is not found
if ~isdir(fpath)
    errorMessage = sprintf('Error: The following folder does not exist:\n%s', fpath);
    uiwait(warndlg(errorMessage));
    return;
end
files = dir(fullfile(fpath,'*.oneline'));
nfiles = length(files);
docArray = {};
data = [];
% Separates each file's data strings into individual columns within the matrix
for k = 1:nfiles
    thisdata = importdata(fullfile(fpath,files(k).name)); % imports the data into the matrix array
    nrow = length(thisdata); % extend number of rows if needed
    docArray(1:nrow,end+1) = thisdata(:); % displays each review per column
    data = [data; importdata(fullfile(fpath,files(k).name))]; % creates single column array of all the words
end
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
for j = 1:length(uniqueWords)
    counter = 0;
    for l = 1:length(data)
        if isequal(uniqueWords{j},data{l})
            counter = counter +1;
        end
    end
    wordCount(j) = counter;
end
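The symptom described above is consistent with how MATLAB treats a single subscript on a matrix: `wordCount(j)` is a linear index, and as long as j is no larger than the number of rows it always lands in column 1. A minimal sketch with a made-up 3-by-2 matrix (the size and the values are illustrative only, not taken from the real data):

```matlab
% Toy 3-by-2 count matrix (size chosen only for illustration)
wordCount = zeros(3, 2);

% One subscript is a linear index, which walks down column 1 first
wordCount(2) = 5;        % same element as wordCount(2, 1)

% Two subscripts address the row (word) and the column (review)
wordCount(2, 2) = 7;

disp(wordCount)
```

So a count meant for review k needs both subscripts, row for the word and column for the review.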
  7 Comments
Willem
Willem on 11 Nov 2013
However, I have found a way of counting the first file into column 1, but I am still unable to find a way of looping it so that each of the other files goes into its own column, up to k = 160. Code below (changed data for docArray):
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
thisreview = docArray(:,k);
for j = 1:length(uniqueWords)
    counter = 0;
    for l = 1:length(docArray)
        if isequal(uniqueWords{j}, docArray{l})
            counter = counter +1;
        end
    end
    wordCount(j) = counter;
end
Willem
Willem on 12 Nov 2013
I am now extremely frustrated with this code, because no matter how many times I try I still end up back at the same code (even with the hints and generous help of others).
I would be extremely grateful if somebody could PLEASE show me how to loop the word comparison count over each file, so that it returns the values to a new column for each review.
clear all;
% Collects requested files from a specified folder and inserts them into an array
fpath = ('C:\Users\Willem\Documents\MATLAB\fold1');
% Returns an error if folder is not found
if ~isdir(fpath)
    errorMessage = sprintf('Error: The following folder does not exist:\n%s', fpath);
    uiwait(warndlg(errorMessage));
    return;
end
files = dir(fullfile(fpath,'*.oneline'));
nfiles = length(files);
docArray = {};
data = [];
% Separates each file's data strings into individual columns within the matrix
for k = 1:nfiles
    thisdata = importdata(fullfile(fpath,files(k).name)); % imports the data into the matrix array
    nrow = length(thisdata); % extend number of rows if needed
    docArray(1:nrow,end+1) = thisdata(:); % displays each review per column
    data = [data; thisdata]; % creates single column array of all the words
end
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
for j = 1:length(uniqueWords)
    if max(uniqueWords{j} ~= ' ')
        for l = 1:length(docArray)
            if strcmp(docArray(l), uniqueWords{j})
                wordCount(j) = wordCount(j) +1;
            end
        end
    end
end
This is what I get so far.
Sorry in advance if asking this annoys anyone, but I have spent countless hours trying to get to grips with it, and as a MATLAB newbie it's doing my head in that I just don't know how to solve it.


Answers (3)

Cedric
Cedric on 11 Nov 2013
Edited: Cedric on 11 Nov 2013
Here is an alternative and probably simpler way of counting occurrences (it is a one-line solution once you update the call to UNIQUE to return its third output):
>> words = { 'john', 'jim', 'john', 'john', 'james', 'john', 'james' } ;
>> [uniqueWords,~,ic] = unique( words )
uniqueWords =
    'james'    'jim'    'john'
ic =
     3     2     3     3     1     3     1
>> counts = accumarray( ic(:), 1 )   % ic(:) forces a column whichever orientation UNIQUE returns
counts =
     2
     1
     4
  3 Comments
Walter Roberson
Walter Roberson on 12 Nov 2013
The poster would like to have a per-review count of each unique word.
The adapted code would probably use ismember() on each review, because not every review will contain every unique word and the row order becomes important for the output.
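One way this suggestion might look, as a sketch rather than a definitive implementation. The three short reviews are invented stand-ins for the real files, and `reviews`, `allWords`, `nWords` and `nDocs` are hypothetical names. The second output of ismember() gives the row of uniqueWords for each word of a review, and accumarray() with an explicit size keeps a zero row for every word the review does not contain:

```matlab
% Toy data standing in for the real reviews (hypothetical words)
reviews = { {'good'; 'film'; 'good'}, ...
            {'bad'; 'film'}, ...
            {'good'; 'bad'; 'bad'; 'bad'} };

% Vocabulary over all reviews, as in the question
allWords    = vertcat(reviews{:});
uniqueWords = unique(allWords);          % sorted: 'bad', 'film', 'good'

nWords = numel(uniqueWords);
nDocs  = numel(reviews);
wordCount = zeros(nWords, nDocs);

for k = 1:nDocs
    % loc(i) is the row of uniqueWords that matches word i of review k
    [tf, loc] = ismember(reviews{k}, uniqueWords);
    % Add a 1 into each matched row; the size argument keeps rows for
    % words that do not occur in this review
    wordCount(:, k) = accumarray(loc(tf), 1, [nWords 1]);
end

disp(wordCount)
```

Each review costs one ismember() call here, instead of one comparison per word pair.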
Willem
Willem on 12 Nov 2013
Thanks to both of you, but even with your advice and hints I am still just going around in circles. I can only put it down to my not getting it; sorry for my stupidity. I hope I have not been a pain to either of you.



Walter Roberson
Walter Roberson on 10 Nov 2013
Hint:
thisreview = docArray(:,k);
if isequal(uniqueWords{j}, thisreview{l})
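A minimal sketch of how the hint might be carried through, assuming docArray holds one review per column with shorter reviews padded by empty cells (which is what the import loop in the question produces). The toy words are invented, and the outer loop over k together with the two-subscript assignment are the additions:

```matlab
% Toy stand-in for docArray: one review per column, short reviews
% padded with empty cells
docArray = { 'good', 'bad',  'good'; ...
             'film', 'film', 'bad' ; ...
             'good', [],     'bad' };
uniqueWords = {'bad'; 'film'; 'good'};

nfiles = size(docArray, 2);
wordCount = zeros(numel(uniqueWords), nfiles);

for k = 1:nfiles                          % one column per review
    thisreview = docArray(:, k);
    for j = 1:numel(uniqueWords)
        counter = 0;
        for l = 1:numel(thisreview)
            % isequal is false for the empty padding cells
            if isequal(uniqueWords{j}, thisreview{l})
                counter = counter + 1;
            end
        end
        wordCount(j, k) = counter;        % row j (word), column k (review)
    end
end

disp(wordCount)
```

The counter is reset for every (word, review) pair and written with two subscripts, so each review fills its own column.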
  5 Comments
Willem
Willem on 11 Nov 2013
I see now. I will have another crack at it this evening, thank you.
Willem
Willem on 11 Nov 2013
I have made some progress but have still not found a successful solution. I can get the counts for selected files, or for the first or last file in the folder, but they all end up in the first column. I am failing to find a way of showing all the files together in the matrix, each in its own column. It seems to be beyond my abilities at present; I will keep trying or change method. Thank you for your time, Walter. You have been really helpful, but I dare not take up any more of your time on this issue for fear of annoying you.



Willem
Willem on 14 Nov 2013
I worked out the answer to the question with only minimal changes to my original code, for anyone who wishes to take note of it. Be warned, though: it loops through all the unique words (about 2400 of them) and, for each one, through every row and every column (160 columns) of the document array, comparing the unique word with the word in each cell. Whenever a match is found it adds one to that word's count for that column, and the counts are returned in a word count matrix. This takes quite some time to complete, about 10-15 minutes in total. If anyone can think of a way to make this method more efficient I would be glad to hear it, as I have a much larger matrix to complete (4x larger) and do not wish to spend this long waiting for the word counts on a total of 800 files.
testCount = zeros(numel(trainTerms), a); % preallocates a matrix of zeros of the required size
for j = 1:numel(trainTerms)
    for l = 1:size(docArray1, 1)
        for ll = 1:size(docArray1, 2)
            if strcmp(docArray1{l, ll}, trainTerms{j})
                testCount(j, ll) = testCount(j, ll) + 1;
            end
        end
    end
end
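On the efficiency question, one loop-free possibility is sketched below. It is only a sketch under assumptions: docArray1 is taken to hold one review per column, padded with empty cells, the toy words are invented, and `col`, `isWord`, `words` and `row` are hypothetical names. A single ismember() call finds the term row for every word, and a single accumarray() call adds 1 at each (term, document) position:

```matlab
% Toy stand-ins (hypothetical): one review per column, padded with []
docArray1 = { 'good', 'bad',  'good'; ...
              'film', 'film', 'bad' ; ...
              'good', [],     'bad' };
trainTerms = {'bad'; 'film'; 'good'};

nDocs = size(docArray1, 2);

% Column (document) index of every cell, laid out like docArray1
[~, col] = ndgrid(1:size(docArray1, 1), 1:nDocs);

% Keep only the cells that actually hold a word
isWord = cellfun(@ischar, docArray1);
words  = docArray1(isWord);
col    = col(isWord);

% Row of trainTerms for each word (0 when the word is not a term)
[tf, row] = ismember(words, trainTerms);

% Add 1 at (row, col) for every matched word, all in one call
testCount = accumarray([row(tf), col(tf)], 1, [numel(trainTerms), nDocs]);

disp(testCount)
```

The work then grows with the total number of words rather than with terms times rows times columns, which should matter more as the matrix gets larger.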
  8 Comments
Walter Roberson
Walter Roberson on 14 Nov 2013
Suppose you switched around the two cellstrings?
Willem
Willem on 15 Nov 2013
I have discovered the following line, which takes the tf matches, pairs them with their corresponding indexes, and counts them automatically. It is quick, but I am not sure how best to loop it over every document without slowing it down.
stringCount(:,a) = cellfun(@(x) sum(ismember(thisDataFold1,x)), trainTerms)
I am able to run it in my current loop instead of strcmp(), but it takes just as long, so I am guessing my loops are what is causing the delays. Judging by the results I get when I cut the process short (Ctrl+C) at different times, it may be looping more than once for each check (e.g. 5 minutes could show 3 columns, while 2 minutes could show 33 columns). I have taken another look at the code but cannot see the error.
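Because the cellfun line already visits every term and every word of one document, it probably needs only a single loop over the documents rather than the three nested loops. A hedged sketch with invented toy data: `thisData` stands in for thisDataFold1, and strcmp() is swapped in for ismember() here (both return one logical per word of the document when compared against a single term):

```matlab
% Toy stand-ins (hypothetical): one review per column, padded with []
docArray1 = { 'good', 'bad',  'good'; ...
              'film', 'film', 'bad' ; ...
              'good', [],     'bad' };
trainTerms = {'bad'; 'film'; 'good'};

nDocs = size(docArray1, 2);
stringCount = zeros(numel(trainTerms), nDocs);

for a = 1:nDocs                                   % the only loop needed
    thisData = docArray1(:, a);
    thisData = thisData(cellfun(@ischar, thisData));   % drop padding cells
    % One count per term: how many words of this document equal the term
    stringCount(:, a) = cellfun(@(x) sum(strcmp(thisData, x)), trainTerms);
end

disp(stringCount)
```

If the same line sits inside the inner loops over words and rows, the whole column is recomputed on every pass, which would fit the uneven progress described above.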

