Read text files in different folders and sub-folders on the D:\ partition and count occurrences of a word?

I have some text files in different folders named folder 1, folder 2, folder 3, etc., all on the D: partition.
I also have a list of particular words, and I need to count how many times each of those words appears in each text file (the word occurrences). The list of words to be counted is stored in a text file named dictionary.
I would be thankful for any help. This is the simple code I tried:
function fileList = getAllFiles(dirName)
  wordList={'D:\dictionary.txt'}; %list of keywords you want to count
  wordCount=nan(size(wordList)); %The counts will be here
  dirData = dir(dirName); %# Get the data for the current directory
  dirIndex = [dirData.isdir]; %# Find the index for directories
  fileList = {dirData(~dirIndex).name}'; %'# Get a list of the files
  if ~isempty(fileList)
    fileList = cellfun(@(x) fullfile(dirName,x),... %# Prepend path to files
                       fileList,'UniformOutput',false);
  end
  subDirs = {dirData(dirIndex).name}; %# Get a list of the subdirectories
  validIndex = ~ismember(subDirs,{'.','..'}); %# Find index of subdirectories
                                              %# that are not '.' or '..'
  for iDir = find(validIndex) %# Loop over valid subdirectories
    nextDir = fullfile(dirName,subDirs{iDir}); %# Get the subdirectory path
    fileList = [fileList; getAllFiles(nextDir)]; %# Recursively call getAllFiles
    wordCount( filelist);
  end
end

Answers (1)

Walter Roberson on 13 Aug 2015
wordList={'D:\dictionary.txt'} does not read the file D:\dictionary.txt; it only creates a cell array containing the file name. You need to have your code read the contents of the file and split it into words. The best way of doing that depends upon the file format and upon whether it is possible for a "word" to have a space in it. For example, some people write
alright
and some people (more properly) write
all right
and you need to decide whether the "all right" version is to count as a single word or as two words. But as we will see later there is no point in doing it in this routine.
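As a minimal sketch of reading the word list, assuming the dictionary file holds one entry per line (the real format of D:\dictionary.txt is not shown in the question), the file can be read with fileread and split on line breaks only. Under that assumption "all right" stays a single entry. The temporary file and the variable names here are hypothetical stand-ins.

```matlab
% Sketch only: assumes one word or phrase per line in the dictionary file.
% A temporary file stands in for D:\dictionary.txt so the example runs anywhere.
dictFile = [tempname '.txt'];
fid = fopen(dictFile, 'w');
fprintf(fid, 'alright\nall right\nfolder\n');
fclose(fid);

rawText  = fileread(dictFile);                      % whole file as one char vector
wordList = regexp(rawText, '\r?\n', 'split');       % split on line breaks, not spaces
wordList = strtrim(wordList);                       % remove stray leading/trailing blanks
wordList = wordList(~cellfun(@isempty, wordList));  % discard empty lines

disp(wordList)
delete(dictFile);                                   % clean up the temporary file
```

If every space-separated token should count as its own word instead, splitting on whitespace (for example with strsplit) would be the alternative choice.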
Then you initialize your counter for each word. That is fine the way it is, but as we will see later there is no point in doing it in this routine.
Just before the end of your last loop, you have
wordCount( filelist);
That statement attempts to index the numeric counter array at locations defined by the cell array of strings that is filelist. That is going to fail: you cannot index an array with a cell array.
Even if you could index with a cell array, notice that your "for" loop adds on to filelist each time, so on each iteration of the loop you would be accessing locations that you had already accessed. That probably is not efficient.
Notice that if there are no subdirectories then you never use wordCount( filelist) on the filelist that is built up. If you thought that you were counting the words in each file, then think again.
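The indexing failure described above can be reproduced in isolation. This is a small sketch with made-up file names; the exact error message varies between MATLAB releases, so it is only displayed, not assumed. Note also that MATLAB names are case-sensitive, so filelist as typed in the question is a different name from fileList.

```matlab
% Sketch: indexing a numeric array with a cell array of strings raises an error.
wordCount = nan(1, 3);              % numeric counter array, as in the question
fileList  = {'a.txt'; 'b.txt'};     % hypothetical file names

try
    wordCount(fileList);            % not a valid index: fileList is a cell array
    indexingWorked = true;
catch err
    indexingWorked = false;
    disp(err.message)               % message text depends on the MATLAB release
end
```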
Now notice that you call the routine you are in recursively and add the results to the file list. So clearly the result of the routine is expected to be just a file list. The whole point of the routine is to just get the names of the files. If it were counting the words as it went, then you would need a different data structure for the output, not just a list of files.
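For comparison, here is a sketch of a routine whose only job is to return file names. It swaps the hand-written recursion for the '**' wildcard of dir, which is an assumption about the MATLAB version: the wildcard and the folder field are available from R2016b onward, which is later than this thread. The temporary folder tree is a hypothetical stand-in for the folders on D:\.

```matlab
% Sketch: list every .txt file under a root folder, including sub-folders.
% Assumes MATLAB R2016b or later (recursive '**' wildcard and the folder field).
rootDir = tempname;                               % stand-in for D:\
mkdir(fullfile(rootDir, 'folder 1', 'sub'));
mkdir(fullfile(rootDir, 'folder 2'));
sampleFiles = {fullfile(rootDir, 'folder 1', 'a.txt'), ...
               fullfile(rootDir, 'folder 1', 'sub', 'b.txt'), ...
               fullfile(rootDir, 'folder 2', 'c.txt')};
for k = 1:numel(sampleFiles)
    fclose(fopen(sampleFiles{k}, 'w'));           % create empty placeholder files
end

listing  = dir(fullfile(rootDir, '**', '*.txt')); % recursive search
fileList = fullfile({listing.folder}, {listing.name})';  % full paths, one per row

disp(fileList)
rmdir(rootDir, 's');                              % remove the temporary tree
```

The output is just a cell array of paths, with no word list and no counters involved.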
So why are you creating wordCount as a matrix at all in this routine? Why are you worried about the name of the file with the list of words here, and why would you worry about reading the list of words here?
Your routine should have nothing to do with the word list and should have nothing to do with counting. Instead, your routine should just return the file names. And then you should have some other routine that worries about counting the words for each of the file names it has been told to work on.
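A separate counting step might then look like the following sketch. It takes a list of file names and a word list as given and fills a files-by-words matrix. Matching whole words and ignoring case are assumptions, since the question does not say how matches should be defined, and the file contents and names are invented for illustration.

```matlab
% Sketch: count occurrences of each word in each file, separately from listing files.
% Two small temporary files stand in for the file list returned by the listing routine.
tmpDir = tempname;
mkdir(tmpDir);
fileList = {fullfile(tmpDir, 'a.txt'); fullfile(tmpDir, 'b.txt')};
contents = {'the cat saw the dog', 'The dog slept'};
for k = 1:numel(fileList)
    fid = fopen(fileList{k}, 'w');
    fprintf(fid, '%s', contents{k});
    fclose(fid);
end

wordList = {'the', 'dog', 'bird'};    % would normally come from the dictionary file

% counts(i, j) holds the occurrences of wordList{j} in fileList{i}
counts = zeros(numel(fileList), numel(wordList));
for i = 1:numel(fileList)
    txt = lower(fileread(fileList{i}));           % assumption: case-insensitive matching
    for j = 1:numel(wordList)
        % \< and \> anchor the match to word boundaries (assumption: whole words only)
        pattern = ['\<' regexptranslate('escape', lower(wordList{j})) '\>'];
        counts(i, j) = numel(regexp(txt, pattern, 'match'));
    end
end

disp(counts)
rmdir(tmpDir, 's');                   % remove the temporary files
```

Keeping the counts in a matrix indexed by file and word is one possible output structure; a table or a struct array would serve equally well.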
