Code covered by the BSD License

5.0 | 3 ratings | 6 Downloads (last 30 days) | File Size: 2.8 KB | File ID: #27184 | Version: 1.1

# allwords

### John D'Errico

07 Apr 2010 (Updated )

Parse a sentence or any string into distinct "words"

File Information
Description

Sentence parsing can be done one word at a time using strtok. However, sometimes it is useful to (efficiently) extract all words into a cell array in one function call. The function allwords.m does exactly this.

By default, spaces, tabs, carriage returns, and punctuation characters are all valid separator characters. In the example below, I had a period at the end, as well as multiple spaces between some words.

str = 'The quick brown fox jumped over the lazy dog.';
words = allwords(str)
words =
'The' 'quick' 'brown' 'fox' 'jumped' 'over' 'the' 'lazy' 'dog'
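
For comparison, the one-word-at-a-time approach with strtok mentioned above might look like the following. This is only a sketch and is not taken from allwords.m. It uses strtok's default whitespace delimiters, so the trailing period stays attached to the last word.

```matlab
% Sketch: collect words one at a time with strtok (default whitespace delimiters).
str = 'The quick brown fox jumped over the lazy dog.';
words = {};
remain = str;
while true
    [tok, remain] = strtok(remain);   % next token, plus the rest of the string
    if isempty(tok)
        break                         % nothing left but delimiters
    end
    words{end+1} = tok;               %#ok<AGROW> grow the cell array one word at a time
end
```

Passing a second argument to strtok with extra delimiter characters (for example a period) would strip the punctuation as well.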

This utility can also work on any integer vector. The default separators for numeric vectors are [-inf inf NaN], but you can assign any separators you desire. Here, parse a string of integers, with only NaN elements as the separator.

str = [1 2 4 2 inf 3 3 5 nan 4 6 5];
words = allwords(str,nan);
words{1}
ans =
1 2 4 2 Inf 3 3 5

words{2}
ans =
4 6 5
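
The idea of splitting a numeric vector at separator elements can be sketched with a logical mask and diff. This is an illustration of the technique under the assumption that only NaN is a separator; it is not the code inside allwords.m, and the variable names are made up for the example.

```matlab
% Sketch: split a numeric vector into runs of non-separator elements.
vec = [1 2 4 2 inf 3 3 5 nan 4 6 5];
issep = isnan(vec);                        % assumption: NaN is the only separator

% Pad with separators on both ends, then look for transitions.
d = diff(double([true, issep, true]));
starts = find(d == -1);                    % separator -> word: first element of each word
stops  = find(d == 1) - 1;                 % word -> separator: last element of each word

% Extract each run into its own cell.
words = arrayfun(@(a, b) vec(a:b), starts, stops, 'UniformOutput', false);
```

Note that ismember does not match NaN against NaN, which is why the mask here is built with isnan rather than ismember.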

Finally, allwords is efficient. For example, on a random numeric string of length 1e6, allwords parses it into over 90000 distinct "words" in less than 0.5 seconds.

str = round(rand(1,1000000)*10);
tic
words = allwords(str,[0 10]);
toc
Elapsed time is 0.455194 seconds.

Over 90000 different words were extracted:

numel(words)
ans =
90310

The longest word had length 104.

max(cellfun(@numel,words))
ans =
104

Acknowledgements

Wordcount inspired this file.

MATLAB release: MATLAB 7.10 (R2010a)
Other requirements: There is nothing in this code that prevents its use back as far as version 6.5 or so.
Comments and Ratings (8)
20 Nov 2014 Lukas

09 Apr 2010 John D'Errico

Jos - I'll concede that maybe allwords is not the best choice of names. Since this is kind of an extension of strtok, how about parsetok? alltoks? parsewords?

You had suggested parsearray, but it does not really operate on an array. But parsevec might make sense.

Comment only
09 Apr 2010 Jos (10584)

A well-designed tool, which is useful in many ways! Yet I wonder if allwords is a proper name for this. May I suggest "splitarray", "parsearray"?

08 Apr 2010 Image Analyst

This is also a very nice program for splitting up a directory path.
str = 'C:\Program Files\MATLAB\work\UserExamples'
words = allwords(str,'\')
words =
'C:' 'Program Files' 'MATLAB' 'work' 'UserExamples'
So, for example, you could use this to get the name of a parent directory, go up 2 directory levels, go to the 3rd directory level, or even build up the name of a file (such as an Excel file) based on the folder name (such as "Results for UserExamples.xls"). So I think it can be used in quite flexible ways.
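
A few of the uses described in this comment could be sketched as follows. Because allwords is not a built-in, the sketch assumes regexp as a stand-in for allwords(str,'\') so that it runs on its own; the variable names are hypothetical.

```matlab
% Assumption: regexp replaces allwords(str,'\') here so the sketch is self-contained.
str = 'C:\Program Files\MATLAB\work\UserExamples';
parts = regexp(str, '[^\\]+', 'match');           % split on backslashes

parentName = parts{end-1};                        % name of the parent directory
upTwo = sprintf('%s\\', parts{1:end-2});          % path two levels up, with a trailing backslash
reportName = ['Results for ' parts{end} '.xls'];  % file name built from the folder name
```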

08 Apr 2010 John D'Errico

I'd need then to split the code for strings versus numeric vectors. Of course, I expect that strings are the more common use for this, so it makes sense to make it efficient for that case if possible.

Comment only
08 Apr 2010 Jos (10584)

Nice function, especially for non-strings. For strings however, the use of, e.g., STRREAD is equally flexible and faster:

str = 'The ## quick brown dog jumXXXped over; the lazy fox.';
words = strread(str,'%s','whitespace',[' .,;:?!',char([9 13]) 'X#']) ;

Comment only
07 Apr 2010 John D'Errico

A minor problem is that the suggested alternative fails to work for numeric vectors, so I would need to have two distinct engines that depend on the input vector class. Furthermore, while the above code works nicely to find white space, it would need some additional manipulation to handle a fully general case.

I imagine that I can improve the speed of this code, however, as most of the time is taken up by the latter part of the code. I'll experiment some more with various methods to see if I can find someplace to gain speed.

Comment only
07 Apr 2010 Damien Garcia

Valuable code, but I would suggest something as simple as:
words = regexp(str,'\w+','match')
which seems to be faster and could be easily adapted to the above-mentioned syntaxes.

As an example, try the following:
-----------------------------------
str = 'The quick brown fox jumped over the lazy dog.';
strlong = repmat(str,1,10000);
words1 = allwords(strlong);
words2 = regexp(strlong,'\w+','match');
isequal(words1,words2)
-----------------------------------

Regards, D.G.

Comment only
07 Apr 2010 1.1

Speed enhancement for character strings, plus I added a reference to the wordcount function.