allwords (2.8 KB) by John D'Errico
Parse a sentence or any string into distinct "words"


Updated 07 Apr 2010

Sentence parsing can be done one word at a time using strtok. However, sometimes it is useful to (efficiently) extract all words into a cell array in one function call. The function allwords.m does exactly this.

By default, spaces, other white space (tabs), carriage returns, and punctuation characters are all valid separators. In the example below, the original string ends with a period and has multiple spaces between some words.

str = 'The quick brown fox jumped over the lazy dog.';
words = allwords(str)
words =
'The' 'quick' 'brown' 'fox' 'jumped' 'over' 'the' 'lazy' 'dog'
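
For comparison, the one-word-at-a-time approach with strtok mentioned above might look like the following. This is only a sketch: the delimiter set `delims` is an assumption chosen for illustration and is not necessarily the default set that allwords uses.

```matlab
% One-word-at-a-time parsing with strtok, for comparison with allwords.
% The delimiter set below is an assumption made for this illustration.
str = 'The quick brown fox jumped over the lazy dog.';
delims = [' .,;:?!' char([9 10 13])];   % space, punctuation, tab, LF, CR

words = {};
remainder = str;
while true
    [tok, remainder] = strtok(remainder, delims);
    if isempty(tok)
        break               % nothing but delimiters (or nothing at all) left
    end
    words{end+1} = tok;     %#ok<AGROW> the cell array grows on every pass
end
disp(words)
```

The loop calls strtok once per word and grows the cell array as it goes, which is the overhead a single vectorized call avoids.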

This utility also works on any integer vector. The default separators for numeric vectors are [-inf inf NaN], but you can assign any separators you like. Here, a vector of integers is parsed with NaN as the only separator, so the inf element is kept as part of a word.

str = [1 2 4 2 inf 3 3 5 nan 4 6 5];
words = allwords(str,nan);
words{:}
ans =
1 2 4 2 Inf 3 3 5

ans =
4 6 5
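
One way to picture this kind of splitting is with a logical mask of separator positions and diff to locate where each run starts and stops. The sketch below is an illustration under that assumption, not the code inside allwords.m; the variable names (`vec`, `seps`, `issep`, `starts`, `stops`) are made up for the example.

```matlab
% Split a numeric row vector into runs of non-separator elements.
% Illustrative sketch only; not taken from allwords.m.
vec  = [1 2 4 2 inf 3 3 5 nan 4 6 5];
seps = nan;                              % separators to split on

% ismember never matches NaN (NaN ~= NaN), so NaN is handled separately.
issep = ismember(vec, seps(~isnan(seps)));
if any(isnan(seps))
    issep = issep | isnan(vec);
end

% Pad both ends with a separator so every run has a start and a stop.
d = diff([1, issep, 1]);
starts = find(d == -1);                  % separator -> non-separator
stops  = find(d == 1) - 1;               % non-separator -> separator

words = cell(1, numel(starts));
for k = 1:numel(starts)
    words{k} = vec(starts(k):stops(k));
end
```

Consecutive separators produce no empty words in this scheme, because a run only starts where a non-separator follows a separator.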

Finally, allwords is efficient. For example, on a random numeric string of length 1e6, allwords parses it into over 90000 distinct "words" in less than 0.5 seconds.

str = round(rand(1,1000000)*10);
tic, words = allwords(str,[0 10]); toc
Elapsed time is 0.455194 seconds.

Over 90000 words were extracted, and the longest word had length 104.

Cite As

John D'Errico (2021). allwords, MATLAB Central File Exchange.

Comments and Ratings (11)

Raffaello Camoriano

Toby Johnson

Great code, although I found one small quirk: it wouldn't detect \t (tabs) as delimiting characters. This was easily fixed by providing an array of sepchars with the tab included. For instance, my sepchars array looked something like sepchars = [' ' sprintf('\t')] for space- and tab-delimited searching.

Jeremy Riousset

does just what it is supposed to do.


John D'Errico

Jos - I'll concede that maybe allwords is not the best choice of names. Since this is kind of an extension of strtok, how about parsetok? alltoks? parsewords?

You had suggested parsearray, but it does not really operate on an array. But parsevec might make sense.

Jos (10584)

A well-designed tool, which is useful in many ways! Yet I wonder if allwords is a proper name for this. May I suggest "splitarray", "parsearray"?

Image Analyst

This is also a very nice program for splitting up a directory path.
str = 'C:\Program Files\MATLAB\work\UserExamples'
words = allwords(str,'\')
words =
'C:' 'Program Files' 'MATLAB' 'work' 'UserExamples'
So, for example, you could use this to get the name of a parent directory, go up 2 directory levels, go to the 3rd directory level, or even build up the name of a file (such as an Excel file) based on the folder name (such as "Results for UserExamples.xls"). So I think it can be used in quite flexible ways.
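
The path manipulations described here could be sketched as follows. To keep the snippet runnable without the File Exchange download, the built-in regexp (with the 'split' option) stands in for allwords; the variable names are hypothetical.

```matlab
% Work with the components of a Windows path.
% regexp(...,'split') stands in for allwords(str,'\') in this sketch.
str   = 'C:\Program Files\MATLAB\work\UserExamples';
parts = regexp(str, '\\', 'split');            % split on literal backslashes

folderName = parts{end};                       % last folder in the path
parentDir  = strjoin(parts(1:end-1), '\\');    % up one level
upTwo      = strjoin(parts(1:end-2), '\\');    % up two levels
thirdLevel = strjoin(parts(1:3), '\\');        % first three components
xlsName    = ['Results for ' folderName '.xls'];
```

Note that strjoin interprets escape sequences in its delimiter, hence '\\' for a single backslash.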

John D'Errico

I'd need then to split the code for strings versus numeric vectors. Of course, I expect that strings are the more common use for this, so it makes sense to make it efficient for that case if possible.

Jos (10584)

Nice function, especially for non-strings. For strings however, the use of, e.g., STRREAD is equally flexible and faster:

str = 'The ## quick brown dog jumXXXped over; the lazy fox.';
words = strread(str,'%s','whitespace',[' .,;:?!',char([9 13]) 'X#']) ;

John D'Errico

A minor problem is that the suggested alternative fails to work for numeric vectors, so I would need two distinct engines, chosen by the class of the input vector. Furthermore, while the above code works nicely to find white space, it would need some additional manipulation to handle a fully general case.

I imagine that I can improve the speed of this code, however, as most of the time is taken up by the latter part of the code. I'll play some more with various methods to see if I can find someplace to gain speed.

Damien Garcia

Valuable code, but I would suggest something as simple as:
words = regexp(str,'\w+','match')
which seems to be faster and could be easily adapted to the above-mentioned syntaxes.

As an example, try the following:
str = 'The quick brown fox jumped over the lazy dog.';
strlong = repmat(str,1,10000);
words1 = allwords(strlong);
words2 = regexp(strlong,'\w+','match');

Regards, D.G.
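
One caveat when adapting the regexp approach: \w matches only letters, digits, and the underscore, so every other character acts as a separator. The sketch below shows that behavior, along with one possible way to name the separators explicitly using a negated character class. The test string and the separator set are assumptions made for illustration.

```matlab
% '\w+' treats every character outside [A-Za-z0-9_] as a separator.
s  = 'Don''t stop_now, R2010a!';
w1 = regexp(s, '\w+', 'match');
% w1 is {'Don', 't', 'stop_now', 'R2010a'}: the apostrophe splits "Don't"

% Alternative: list the separators explicitly in a negated character class.
% This separator set (space . , ; : ? !) is a hypothetical choice.
w2 = regexp(s, '[^ .,;:?!]+', 'match');
% w2 is {'Don''t', 'stop_now', 'R2010a'}
```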

MATLAB Release Compatibility
Created with R2010a
Compatible with any release
Platform Compatibility
Windows macOS Linux

Inspired by: wordcount
