File Exchange



version 1.1 (2.8 KB)

Parse a sentence or any string into distinct "words"




Sentence parsing can be done one word at a time using strtok. However, sometimes it is useful to (efficiently) extract all words into a cell array in one function call. The function allwords.m does exactly this.

By default, spaces, tabs, carriage returns, and punctuation characters are all treated as separators. The example below ends with a period and has multiple spaces between some words.

str = 'The quick brown fox jumped over the lazy dog.';
words = allwords(str)
words =
'The' 'quick' 'brown' 'fox' 'jumped' 'over' 'the' 'lazy' 'dog'
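
For comparison, the one-word-at-a-time approach mentioned above could be sketched as a loop over strtok. The separator set `delims` is an assumption chosen for this illustration and is not necessarily the default set used by allwords.

```matlab
% Sketch: collect every token by calling strtok repeatedly.
str = 'The quick brown fox jumped over the lazy dog.';
delims = [' .,;:?!' char([9 10 13])];  % assumed separators (space, punctuation, tab, LF, CR)

words = {};
remain = str;
while true
    % strtok skips leading delimiters and returns the next token
    [tok, remain] = strtok(remain, delims);
    if isempty(tok)
        break                           % nothing but delimiters (or nothing) left
    end
    words{end+1} = tok; %#ok<AGROW>     % grow the cell array one word at a time
end
disp(words)
```

The loop calls strtok once per word and grows the cell array as it goes, which is the overhead a single-call function is meant to avoid.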

This utility also works on any integer vector. The default separators for numeric vectors are [-inf inf NaN], but you can assign any separators you like. Here, a vector of integers is parsed with NaN as the only separator.

str = [1 2 4 2 inf 3 3 5 nan 4 6 5];
words = allwords(str,nan);
words{:}
ans =
1 2 4 2 Inf 3 3 5

ans =
4 6 5
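
One way a split like this could be done for numeric vectors is with a logical separator mask. This is only a sketch of the general idea, not the code inside allwords.m, and all variable names are illustrative.

```matlab
% Sketch: split a numeric vector at NaN elements using a separator mask.
v = [1 2 4 2 inf 3 3 5 nan 4 6 5];
isSep = isnan(v);                 % true where an element is a separator

% Pad with separators on both ends so runs touching the ends are found.
d = diff([1 isSep 1]);
starts = find(d == -1);           % first index of each run of non-separators
stops  = find(d == 1) - 1;        % last index of each run

% Cut out each run as its own "word".
words = arrayfun(@(a,b) v(a:b), starts, stops, 'UniformOutput', false);
words{:}
```

Because Inf is not in the separator mask here, it stays inside the first word, matching the output shown above.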

Finally, allwords is efficient. For example, on a random numeric string of length 1e6, allwords parses it into over 90000 distinct "words" in less than 0.5 seconds.

str = round(rand(1,1000000)*10);
tic
words = allwords(str,[0 10]);
toc
Elapsed time is 0.455194 seconds.

More than 90000 words were extracted, and the longest word had length 104.
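
Counts like these can be obtained from the returned cell array with numel and cellfun. The sketch below uses a stand-in mask-based splitter rather than allwords itself, a smaller vector, and a fixed random seed, so its numbers will not match the figures quoted above.

```matlab
% Sketch: count the words and find the longest one after splitting.
rng(0);                                   % fixed seed, only for repeatability
v = round(rand(1, 1e5) * 10);             % random integers in 0..10
seps = [0 10];                            % separator values, as in the example above

% Stand-in splitter (an assumption, not allwords): cut at any element in seps.
isSep  = ismember(v, seps);
d      = diff([1 isSep 1]);
starts = find(d == -1);
stops  = find(d == 1) - 1;

tic
words = arrayfun(@(a,b) v(a:b), starts, stops, 'UniformOutput', false);
toc

nWords  = numel(words)                    % number of words extracted
longest = max(cellfun(@numel, words))     % length of the longest word
```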

Comments and Ratings (10)

Toby Johnson

Great code, although I found one small quirk: it wouldn't detect \t (tabs) as delimiting characters. This was easily fixed by providing an array of sepchars with the \t included. For instance, my sepchars array looked something like this: sepchars = [' ' sprintf('\t')] for space- and tab-delimited searching.
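
A short check of the separator array described in this comment, as a sketch: sprintf expands the backslash escape into a real tab character, whereas a forward slash followed by t is just two ordinary characters. The final allwords call is left commented out because it needs allwords.m on the path.

```matlab
% Sketch: build a space-plus-tab separator array.
tab    = sprintf('\t');       % backslash escape: a single tab character, char(9)
notTab = sprintf('/t');       % forward slash: the two characters '/' and 't'

sepchars = [' ' tab];         % space and tab as separators

isTab = isequal(tab, char(9))
nChars = numel(notTab)

% Hypothetical usage, following the call form shown on this page:
% words = allwords(str, sepchars);
```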

Does just what it is supposed to do.


Lukas

John D'Errico


Jos - I'll concede that maybe allwords is not the best choice of names. Since this is kind of an extension of strtok, how about parsetok? alltoks? parsewords?

You had suggested parsearray, but it does not really operate on an array. But parsevec might make sense.

Jos (10584)


A well-designed tool, which is useful in many ways! Yet I wonder if allwords is a proper name for this. May I suggest "splitarray", "parsearray"?

Image Analyst


This is also a very nice program for splitting up a directory path.
str = 'C:\Program Files\MATLAB\work\UserExamples'
words = allwords(str,'\')
words =
'C:' 'Program Files' 'MATLAB' 'work' 'UserExamples'
So, for example, you could use this to get the name of a parent directory, go up 2 directory levels, go to the 3rd directory level, or even build the name of a file (such as an Excel file) from the folder name (such as "Results for UserExamples.xls"). So I think it can be used in quite flexible ways.
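
The uses listed in this comment could look roughly like the sketch below. To keep it self-contained, the split uses the built-in regexp as a stand-in for allwords(str,'\'), and the variable names are illustrative.

```matlab
% Sketch: work with the pieces of a Windows-style path.
str = 'C:\Program Files\MATLAB\work\UserExamples';

% Stand-in for allwords(str,'\'): match runs of non-backslash characters.
parts = regexp(str, '[^\\]+', 'match');

folderName = parts{end};                       % 'UserExamples'

% Parent directory: rejoin all but the last part with backslashes.
parentDir = sprintf('%s\\', parts{1:end-1});
parentDir = parentDir(1:end-1);                % drop the trailing backslash

% Two levels up: rejoin all but the last two parts.
twoUp = sprintf('%s\\', parts{1:end-2});
twoUp = twoUp(1:end-1);

thirdLevel = parts{3};                         % 3rd directory level
xlsName = ['Results for ' folderName '.xls'];  % file name built from the folder name
```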

John D'Errico


I'd need then to split the code for strings versus numeric vectors. Of course, I expect that strings are the more common use for this, so it makes sense to make it efficient for that case if possible.

Jos (10584)


Nice function, especially for non-strings. For strings however, the use of, e.g., STRREAD is equally flexible and faster:

str = 'The ## quick brown dog jumXXXped over; the lazy fox.';
words = strread(str,'%s','whitespace',[' .,;:?!',char([9 13]) 'X#']) ;

John D'Errico


A minor problem is that the suggested alternative fails to work for numeric vectors, so I would need two distinct engines, chosen according to the class of the input vector. Furthermore, while the above code works nicely to find white space, it would need some additional manipulation to handle a fully general case.

I imagine that I can improve the speed of this code, however, as most of the time is taken up by the latter part of the code. I'll experiment with various methods to see if I can find somewhere to gain speed.

Damien Garcia


Valuable code, but I would suggest something as simple as:
words = regexp(str,'\w+','match')
which seems to be faster and could be easily adapted to the above-mentioned syntaxes.

As an example, try the following:
str = 'The quick brown fox jumped over the lazy dog.';
strlong = repmat(str,1,10000);
words1 = allwords(strlong);
words2 = regexp(strlong,'\w+','match');

Regards, D.G.
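
As a sketch of how the regexp approach in this comment might be adapted to a custom separator set, the pattern can be a negated character class that lists the separators. The separator list below is taken from the STRREAD example elsewhere in these comments. The pattern itself is an illustration written for this page, not something tested against allwords.

```matlab
% Sketch: regexp with an explicit separator set instead of \w+.
str = 'The ## quick brown dog jumXXXped over; the lazy fox.';

% Match runs of characters that are NOT separators.
% Separators: space . , ; : ? ! tab carriage-return X #
pattern = '[^ .,;:?!\t\rX#]+';
words = regexp(str, pattern, 'match');
disp(words)
```

Unlike '\w+', this form treats only the listed characters as separators, so digits, underscores, and any other unlisted characters stay inside the words.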



Speed enhancement for character strings, plus I added a reference to the wordcount function.

MATLAB Release
MATLAB 7.10 (R2010a)

Inspired by: wordcount
