Code covered by the BSD License  

Rating: 5.0 (3 ratings) | Downloads (last 30 days): 29 | File Size: 2.8 KB | File ID: #27184

allwords


07 Apr 2010 (Updated )

Parse a sentence or any string into distinct "words"


File Information
Description

Sentence parsing can be done one word at a time using strtok. However, sometimes it is useful to (efficiently) extract all words into a cell array in one function call. The function allwords.m does exactly this.
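For comparison, the one-word-at-a-time approach with strtok might look like the sketch below. The separator set `delims` is an assumption chosen for illustration; it is not necessarily the default set used by allwords.

```matlab
% Sketch: collect every word by calling strtok repeatedly.
str = 'The quick brown fox jumped over the lazy dog.';
delims = [' .,;:?!' char([9 10 13])];   % assumed separators (illustrative only)

words = {};
remainder = str;
while true
    % strtok skips leading separators and returns the next token
    [tok, remainder] = strtok(remainder, delims);
    if isempty(tok)
        break                            % nothing left but separators
    end
    words{end+1} = tok; %#ok<AGROW>      % the cell array grows on every pass
end
```

The loop calls strtok once per word and grows the cell array on each pass, which is the overhead a single-call parser is meant to avoid.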

By default, spaces, other white space (tabs), carriage returns, and punctuation characters are all valid separators. In the example below, the string has a period at the end, as well as multiple spaces between some words.

str = 'The quick brown fox jumped over the lazy dog.';
words = allwords(str)
words =
  'The' 'quick' 'brown' 'fox' 'jumped' 'over' 'the' 'lazy' 'dog'
 
This utility also works on any integer vector. The default separators for numeric vectors are [-inf inf NaN], but you can assign any separators you desire. Here, I parse a vector of integers, using only NaN elements as the separator.

str = [1 2 4 2 inf 3 3 5 nan 4 6 5];
words = allwords(str,nan);
words{1}
ans =
     1 2 4 2 Inf 3 3 5

words{2}
ans =
     4 6 5
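To make the idea of splitting a numeric vector at separator elements concrete, here is a minimal sketch that reproduces the result above with diff and find. It illustrates the general technique only and is not taken from the allwords source; the variable names are hypothetical.

```matlab
% Sketch: split a numeric vector at NaN separators (not the allwords code).
v = [1 2 4 2 inf 3 3 5 nan 4 6 5];
issep = isnan(v);                 % true where an element is a separator

% Pad with a separator on each side so runs touching either end are found.
d = diff([1, issep, 1]);
starts = find(d == -1);           % first element of each non-separator run
stops  = find(d == 1) - 1;        % last element of each non-separator run

words = cell(1, numel(starts));
for k = 1:numel(starts)
    words{k} = v(starts(k):stops(k));
end
```

Because the separator mask is computed once and the run boundaries come from vectorized calls, the only loop is over the extracted words, not over individual elements.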
 
Finally, allwords is efficient. For example, on a random numeric string of length 1e6, allwords parses it into over 90000 distinct "words" in less than 0.5 seconds.

str = round(rand(1,1000000)*10);
tic
words = allwords(str,[0 10]);
toc
Elapsed time is 0.455194 seconds.
  
Over 90000 different words were extracted:

numel(words)
ans =
     90310
 
The longest word had length 104.

max(cellfun(@numel,words))
ans =
     104

Acknowledgements

Wordcount inspired this file.

MATLAB release: MATLAB 7.10 (R2010a)
Other requirements: There is nothing in this code that prevents its use back as far as version 6.5 or so.
Comments and Ratings (8)
20 Nov 2014 Lukas  
09 Apr 2010 John D'Errico

Jos - I'll concede that maybe allwords is not the best choice of names. Since this is kind of an extension of strtok, how about parsetok? alltoks? parsewords?

You had suggested parsearray, but it does not really operate on an array. But parsevec might make sense.

09 Apr 2010 Jos (10584)

A well-designed tool, which is useful in many ways! Yet I wonder if allwords is a proper name for this. May I suggest "splitarray", "parsearray"?

08 Apr 2010 Image Analyst

This is also a very nice program for splitting up a directory path.
str = 'C:\Program Files\MATLAB\work\UserExamples'
words = allwords(str,'\')
words =
'C:' 'Program Files' 'MATLAB' 'work' 'UserExamples'
So, for example, you could use this to get the name of a parent directory, or go up 2 directory levels, or go to the 3rd directory level, or even to build up the name of a file (such as an Excel file) based on the folder name (such as "Results for UserExamples.xls"). So I think it can be used in quite flexible ways.
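The uses listed above could be sketched as follows. To keep the snippet self-contained, it obtains the path parts with the built-in regexp 'split' option as a stand-in for the allwords call; that substitution, and the variable names, are assumptions made for illustration.

```matlab
% Sketch: work with the pieces of a directory path.
str = 'C:\Program Files\MATLAB\work\UserExamples';
parts = regexp(str, '\\', 'split');   % stand-in for allwords(str,'\')

parent = parts{end-1};                % name of the parent directory
% Rebuild the path two levels up from the last folder.
up2 = [sprintf('%s\\', parts{1:end-3}) parts{end-2}];
% Build a file name from the last folder name.
filename = ['Results for ' parts{end} '.xls'];
```

Note that regexp with 'split' returns empty strings between consecutive separators, so for paths with doubled backslashes the parts may differ from what a word parser returns.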

08 Apr 2010 John D'Errico

I'd need then to split the code for strings versus numeric vectors. Of course, I expect that strings are the more common use for this, so it makes sense to make it efficient for that case if possible.

08 Apr 2010 Jos (10584)

Nice function, especially for non-strings. For strings however, the use of, e.g., STRREAD is equally flexible and faster:

str = 'The ## quick brown dog jumXXXped over; the lazy fox.';
words = strread(str,'%s','whitespace',[' .,;:?!',char([9 13]) 'X#']) ;

07 Apr 2010 John D'Errico

A minor problem is that the suggested alternative fails to work for numeric vectors, so I would need two distinct engines, chosen according to the class of the input vector. Furthermore, while the above code works nicely to find white space, it would need some additional manipulation to handle a fully general case.

I imagine that I can improve the speed of this code, however, as most of the time is taken up by the latter part of the code. I'll play some more with various methods to see if I can find someplace to gain speed.

07 Apr 2010 Damien Garcia

Valuable code, but I would suggest something as simple as:
words = regexp(str,'\w+','match')
which seems to be faster and could be easily adapted to the above-mentioned syntaxes.

As an example, try the following:
-----------------------------------
str = 'The quick brown fox jumped over the lazy dog.';
strlong = repmat(str,1,10000);
words1 = allwords(strlong);
words2 = regexp(strlong,'\w+','match');
isequal(words1,words2)
-----------------------------------

Regards, D.G.

Updates
07 Apr 2010

Speed enhancement for character strings, plus I added a reference to the wordcount function.
