Thread Subject: Words Segmentation

Subject: Words Segmentation

From: Andrew Wayne

Date: 15 Apr, 2008 07:50:03

Message: 1 of 11

Hello guys,
I want to extract words from a text line.
What is the best method using MATLAB?

All replies are appreciated.

Subject: Words Segmentation

From: Pekka

Date: 15 Apr, 2008 12:22:02

Message: 2 of 11

"Andrew Wayne" <ics2008_contact@yahoo.com> wrote in message
<fu1mnb$lld$1@fred.mathworks.com>...
> Hello guys......
> I want to Extract words from a text line,
> then what is the best method using MATLAB???
>
> All replies Are Appreciated...................

There are many ways. For example, assuming the words are separated
only by single spaces, this one is reasonably short to type:
str = 'word1 word2 word3';
words = regexp(str,' ','split')
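
A note on the "only spaces" assumption above: splitting on a single space leaves an empty entry wherever two spaces sit next to each other. Below is a minimal sketch of splitting on runs of whitespace instead. The sample string is made up for illustration and is not from the original post.

```matlab
% Sample line with a double space (illustrative input, not from the thread)
str = 'word1  word2 word3';

% Splitting on a single space leaves an empty entry between the two spaces
parts_single = regexp(str, ' ', 'split');

% Splitting on one or more whitespace characters avoids the empty entry;
% strtrim drops leading/trailing blanks that would otherwise give empties
parts_ws = regexp(strtrim(str), '\s+', 'split');

disp(numel(parts_single))   % 4 pieces, one of them empty
disp(numel(parts_ws))       % 3 words
```

If tabs or repeated blanks can occur in the input, the '\s+' form is probably the safer default.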

Subject: Words Segmentation

From: Abel Brown

Date: 15 Apr, 2008 13:40:06

Message: 3 of 11

"Pekka " <pekka.nospam.kumpulainen@tut.please.fi> wrote in
message <fu26l9$d0p$1@fred.mathworks.com>...
> "Andrew Wayne" <ics2008_contact@yahoo.com> wrote in message
> <fu1mnb$lld$1@fred.mathworks.com>...
> > Hello guys......
> > I want to Extract words from a text line,
> > then what is the best method using MATLAB???
> >
> > All replies Are Appreciated...................
>
> Many ways. For example assuming only spaces, this one is
> reasonably short to type
> str = 'word1 word2 word3';
> words = regexp(str,' ','split')
>
NOOOOO!!!! This is very slow!!!!

Check out "split" or "split2" on the MATLAB File Exchange.
The implementation in split2 uses textscan, which is much
faster than "strtok" or MATLAB's regexp.

Of course, if you only want to parse a few lines, then who
cares about the speed? But if you're going to parse 100000
lines, then textscan is your ONLY option!

You also need to be careful about converting strings to
numbers via str2double or the even slower str2num. These
functions are very slow. So if you can use textscan to
parse strings to strings and numbers to numbers, that'll save
you a lot of time!


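Example (str is the text line to split):

    cell_array = textscan(str,'%s');                  % split on whitespace
OR
    cell_array = textscan(str,'%s','delimiter',':');  % split on ':'

    cell_array = textscan(str,'%s','delimiter','/');  % split on '/'

    ...

    % textscan returns a 1x1 cell; unwrap it to get the cell array of words
    cell_array = cell_array{1};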
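
To make the two points above concrete (custom delimiters, and reading numbers as numbers rather than converting afterwards), here is a small self-contained sketch. The input strings and variable names are invented for illustration.

```matlab
% Illustrative inputs (not from the thread)
path_str = 'usr/local/bin';
num_str  = '10 20 30';

% Split on '/': textscan returns a 1x1 cell holding a column cell array
c = textscan(path_str, '%s', 'Delimiter', '/');
parts = c{1};

% Read the numbers directly as doubles with '%f',
% so no str2double/str2num pass is needed afterwards
c = textscan(num_str, '%f');
values = c{1};

disp(parts)
disp(values)
```

Whether this is actually faster than the alternatives for a given data set is worth timing with tic/toc rather than assuming.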

Subject: Words Segmentation

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 15 Apr, 2008 16:03:59

Message: 4 of 11

In article <fu1mnb$lld$1@fred.mathworks.com>,
Andrew Wayne <ics2008_contact@yahoo.com> wrote:
>Hello guys......
>I want to Extract words from a text line,
>then what is the best method using MATLAB???

>All replies Are Appreciated...................

What is a "word" ?

The below is an example I used a week ago for someone wanting
to separate out paragraphs. Which are the words in this?

  Mr. Todd E. Jones gave $3000. (!) in nickels, dimes, etc. to his
  No. 1 son at 4 7th Ave. N., NY. NY. USA., who exclaimed "What joy!
  Now I can buy 3 lbs. of St. Tropiz bananas... or can I?!"


--
  "Nothing recedes like success." -- Walter Winchell

Subject: Words Segmentation

From: Abel Brown

Date: 15 Apr, 2008 17:41:47

Message: 5 of 11

roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
message <fu2jlf$aia$1@canopus.cc.umanitoba.ca>...
> In article <fu1mnb$lld$1@fred.mathworks.com>,
> Andrew Wayne <ics2008_contact@yahoo.com> wrote:
> >Hello guys......
> >I want to Extract words from a text line,
> >then what is the best method using MATLAB???
>
> >All replies Are Appreciated...................
>
> What is a "word" ?
>
> The below is an example I used a week ago for someone wanting
> to separate out paragraphs. Which are the words in this?
>
> Mr. Todd E. Jones gave $3000. (!) in nickels, dimes, etc. to his
> No. 1 son at 4 7th Ave. N., NY. NY. USA., who exclaimed "What joy!
> Now I can buy 3 lbs. of St. Tropiz bananas... or can I?!"
>
>
> --
> "Nothing recedes like success." -- Walter Winchell

Words are runs of [a-zA-Z0-9]. That's pretty simple, yeah.

Or more specifically \w+ in Perl-speak (note that \w also matches the underscore).
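
For reference, a minimal sketch of what the \w+ definition does in MATLAB, using a fragment of the sample sentence quoted above. The 'match' option returns the matched substrings instead of the pieces between them.

```matlab
% Fragment of the sample sentence from the quoted message
str = 'nickels, dimes, etc. to his No. 1 son';

% Every maximal run of letters, digits, or underscores counts as a word;
% commas and periods are dropped
words = regexp(str, '\w+', 'match');

disp(words)
```

Note that "etc." and "No." come back without their periods, which is exactly the kind of case the question "What is a word?" is about.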

Subject: Words Segmentation

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 15 Apr, 2008 17:54:46

Message: 6 of 11

In article <fu2pcr$150$1@fred.mathworks.com>,
Abel Brown <brown.2179@osu.edu> wrote:
>roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
>message <fu2jlf$aia$1@canopus.cc.umanitoba.ca>...

>> What is a "word" ?

>words are [a-zA-Z0-9] That's pretty simple yeah.

>or more specifically \w+ in perl-speak

It isn't that simple in English. In English, there are
contractions, possessives, abbreviations, and some symbols,
all treated as words.

For example, for "its'", "it's", "etc.", and "&" the word
is everything inside the double-quote marks. The apostrophes
in "its'" and "it's" are not punctuation: they are part of the word.
Similarly, the period at the end of "etc." is part of the word.

If you are reading a document (especially a financial document)
and it has a capital N and a slightly-raised small o, then that
is the word "number" (or, depending on the context, "numbering").
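
A rough sketch of the difference for contractions, comparing \w+ with a pattern that keeps apostrophes inside a word. The second pattern is only an assumption for illustration: it does not handle a trailing apostrophe as in "its'", abbreviations such as "etc.", or symbols such as "&".

```matlab
% Illustrative sentence with two contractions (not from the thread)
str = 'It''s simple, isn''t it?';

% \w+ breaks each contraction at the apostrophe
naive = regexp(str, '\w+', 'match');

% Letters/digits, optionally followed by apostrophe-plus-letters groups,
% keep "It's" and "isn't" in one piece
aware = regexp(str, '[A-Za-z0-9]+(?:''[A-Za-z]+)*', 'match');

disp(naive)
disp(aware)
```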
--
   "Any sufficiently advanced bug is indistinguishable from a feature."
   -- Rich Kulawiec

Subject: Words Segmentation

From: Abel Brown

Date: 15 Apr, 2008 18:05:04

Message: 7 of 11

roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
message <fu2q56$jl7$1@canopus.cc.umanitoba.ca>...
> In article <fu2pcr$150$1@fred.mathworks.com>,
> Abel Brown <brown.2179@osu.edu> wrote:
> >roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
> >message <fu2jlf$aia$1@canopus.cc.umanitoba.ca>...
>
> >> What is a "word" ?
>
> >words are [a-zA-Z0-9] That's pretty simple yeah.
>
> >or more specifically \w+ in perl-speak
>
> It isn't that simple in English. In English, there are
> contractions, possessives, abbreviations, and some symbols,
> all treated as words.
>
> For example, for "its'", "it's", "etc.", and "&" the word
> is everything inside the double-quote marks. The apostrophes
> in "its'" and "it's" are not punctuation: they are part of the word.
> Similarly, the period at the end of "etc." is part of the word.
>
> If you are reading a document (especially a financial document)
> and it has a capital N and a slightly-raised small o, then that
> is the word "number" (or, depending on the context, "numbering").
> --
> "Any sufficiently advanced bug is indistinguishable from a feature."
> -- Rich Kulawiec

Then use Perl to pre-process the data before using MATLAB.
You can even call your Perl script from within MATLAB (I'm sure
you already know :).

I understand what you're talking about. It's just that the
original poster had a very basic question. Hence, a simple
answer ...

Subject: Words Segmentation

From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)

Date: 15 Apr, 2008 18:14:29

Message: 8 of 11

In article <fu2qog$g5t$1@fred.mathworks.com>,
Abel Brown <brown.2179@osu.edu> wrote:

>I understand what you're talking about. It's just the
>original poster had a very basic question. Hence, a simple
>answer ...

We don't know what the original poster meant by "word", so clarification
had to be sought. In this newsgroup, we -often- get "simple" questions
that are not so simple after all, because the poster has a particular
meaning of some phrase in mind and doesn't realize that the phrase
can mean a number of other things as well. It is not uncommon here
for people to change their questions and want to do something different
than they originally wanted, once alternative shades of meaning
are pointed out to them.

--
  "To all, to each! a fair good-night,
   And pleasing dreams, and slumbers light" -- Sir Walter Scott

Subject: Words Segmentation

From: Andrew Wayne

Date: 15 Apr, 2008 21:01:03

Message: 9 of 11

roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
message <fu2ra5$lab$1@canopus.cc.umanitoba.ca>...
> In article <fu2qog$g5t$1@fred.mathworks.com>,
> Abel Brown <brown.2179@osu.edu> wrote:
>
> >I understand what you're talking about. It's just the
> >original poster had a very basic question. Hence, a simple
> >answer ...
>
> We don't know what the original poster meant by "word", so clarification
> had to be sought. In this newsgroup, we -often- get "simple" questions
> that are not so simple after all, because the poster has a particular
> meaning of some phrase in mind and doesn't realize that the phrase
> can mean a number of other things as well. It is not uncommon here
> for people to change their questions and want to do something different
> than they originally wanted, once alternative shades of meaning
> are pointed out to them.
>
> --
> "To all, to each! a fair good-night,
> And pleasing dreams, and slumbers light" -- Sir Walter Scott

By "word" I meant the following. For example, given the text
line:
I Want tomorrow to go for shopping as usual

I want to extract each word so that I get:
I
Want
tomorrow
to
go
... etc.

Subject: Words Segmentation

From: Abel Brown

Date: 15 Apr, 2008 21:18:02

Message: 10 of 11

"Andrew Wayne" <ics2008_contact@yahoo.com> wrote in message
<fu352f$in$1@fred.mathworks.com>...
> roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) wrote in
> message <fu2ra5$lab$1@canopus.cc.umanitoba.ca>...
> > In article <fu2qog$g5t$1@fred.mathworks.com>,
> > Abel Brown <brown.2179@osu.edu> wrote:
> >
> > >I understand what you're talking about. It's just the
> > >original poster had a very basic question. Hence, a simple
> > >answer ...
> >
> > We don't know what the original poster meant by "word", so clarification
> > had to be sought. In this newsgroup, we -often- get "simple" questions
> > that are not so simple after all, because the poster has a particular
> > meaning of some phrase in mind and doesn't realize that the phrase
> > can mean a number of other things as well. It is not uncommon here
> > for people to change their questions and want to do something different
> > than they originally wanted, once alternative shades of meaning
> > are pointed out to them.
> >
> > --
> > "To all, to each! a fair good-night,
> > And pleasing dreams, and slumbers light" -- Sir Walter Scott
> I meant by word that for example we have the following text
> line:
> I Want tomorrow to go for shopping as usual
>
> Then I want to extract each word so that I get:
> I
> Want
> tomorrow
> to
> go.....etc

% do this (str holds your text line)
cell_array = textscan(str,'%s');
words = cell_array{1};   % one word per cell
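
Applied to the sentence from the question, the suggestion above might look like the following sketch. The variable names are chosen here for illustration.

```matlab
% The text line from the original poster's example
str = 'I Want tomorrow to go for shopping as usual';

% '%s' reads whitespace-separated tokens; the result is a 1x1 cell
% wrapping an N-by-1 cell array of words
c = textscan(str, '%s');
words = c{1};

% Print one word per line
for k = 1:numel(words)
    fprintf('%s\n', words{k});
end
```

Individual words are then available as words{1}, words{2}, and so on.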

Subject: Words Segmentation

From: sara

Date: 10 Feb, 2012 03:51:10

Message: 11 of 11

hi,

Did you find any way or any code to segment the text?

Thanks,
Sara


"Mohammed Ahmed" wrote in message <fu1mnb$lld$1@fred.mathworks.com>...
> Hello guys......
> I want to Extract words from a text line,
> then what is the best method using MATLAB???
>
> All replies Are Appreciated...................
