Read csv strings, keep or create surrounding whitespace

I have a list of stop words stored as a comma-separated list in a .txt file. The goal is to use that list to remove those words from some target text, but only when a given word (e.g. "and") appears by itself: remove "and", but don't turn "sand" into "s". To that end, I tried manually putting spaces around all the words in the list, so "a,able,about" became " a , able , about ". However, the textscan function stripped the spaces out. Is there a way to prevent it from doing that? Alternatively, if I use the original form of the list, can I tell textscan to surround each string with spaces?
Cedric on 20 Jun 2014
Edited: Cedric on 20 Jun 2014
Could you give an example, like a sample file, and indicate precisely what you want to achieve? This seems to be a task for REGEXPREP.

Accepted Answer

Cedric on 20 Jun 2014
Edited: Cedric on 20 Jun 2014
Here is an example that I can refine if you provide more information. It writes some keywords in upper case.
key = {'lobster', 'and'} ;
str = 'Lobster anatomy includes the cephalothorax which fuses the head and the thorax, both of which are covered by a chitinous carapace, and the abdomen. The lobster''s head bears antennae, antennules, mandibles, the first and second maxillae, and the first, second, and third maxillipeds. Because lobsters live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.' ;
for kId = 1 : length( key )
    % (?<!\w): the keyword must not be preceded by a word character, which
    % also allows a match at the very beginning of the string.
    % (?=(s |\W)): it must be followed by 's ' or by a non-word character.
    pat = sprintf( '(?<!\\w)%s(?=(s |\\W))', key{kId} ) ;
    str = regexprep( str, pat, upper( key{kId} ), 'ignorecase' ) ;
end
Running this, you get
>> str
str =
LOBSTER anatomy includes the cephalothorax which fuses the head AND the thorax, both of which are covered by a chitinous carapace, AND the abdomen. The LOBSTER's head bears antennae, antennules, mandibles, the first AND second maxillae, AND the first, second, AND third maxillipeds. Because LOBSTERs live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.
The REGEXP-based approach makes it possible to encode conditions such as: match a keyword
  • only if it is framed by non-alphanumeric characters (e.g. a comma),
  • unless the following character is an 's',
  • unless it is at the beginning of the string.
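For the original goal (removing stop words rather than upper-casing them), the same loop could be adapted roughly as follows. This is only a sketch under assumptions: the list text and the target string are made up for illustration, the plural 's' exception is dropped, and both boundaries are written as negative lookarounds so that a word is removed only when no word character touches it. In practice, listText could come from something like fileread on your own file.

```matlab
% Hypothetical stop-word list, as it might appear in the .txt file.
listText  = 'a,able,about,and' ;
stopWords = strtrim( strsplit( listText, ',' ) ) ;   % cell array of words

% Made-up target text for illustration.
str = 'Sand and a table about the bay' ;

for kId = 1 : numel( stopWords )
    % Escape any regexp metacharacters that a stop word might contain.
    word = regexptranslate( 'escape', stopWords{kId} ) ;
    % (?<!\w) and (?!\w): no word character directly before or after.
    pat  = sprintf( '(?<!\\w)%s(?!\\w)', word ) ;
    str  = regexprep( str, pat, '', 'ignorecase' ) ;
end

% Collapse the runs of spaces left behind by the removed words.
str = strtrim( regexprep( str, '\s+', ' ' ) ) ;
```

With this input, "and", "a" and "about" are removed, while "Sand" and "table" are left intact because the stop words inside them touch other letters. No spaces need to be added to the list itself.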
Ben on 23 Jun 2014
Ah, I hadn't realized that regexp functions don't do their work all at once, as strrep does. That should do it. Thank you so much!
Cedric on 23 Jun 2014
Edited: Cedric on 23 Jun 2014
You're welcome! Note that it could do its job all at once if you passed a single pattern that combines all the keywords with an OR operation. Yet it is often more efficient to apply a simple pattern several times than to apply one extra-long, complex pattern once. If you wanted to optimize, you should profile both options for your specific case.
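A minimal sketch of the single-pattern variant, assuming a short made-up test string and a negative lookbehind for the left boundary: the keywords are joined with | inside one capturing group, and a dynamic expression in the replacement upper-cases whichever keyword was matched.

```matlab
key = {'lobster', 'and'} ;
str = 'The lobster and the sand.' ;   % made-up example text

% Build one pattern with all keywords OR-ed together: (lobster|and)
alt = strjoin( key, '|' ) ;
pat = sprintf( '(?<!\\w)(%s)(?=s |\\W)', alt ) ;

% ${upper($1)} is evaluated for each match, with $1 being the matched keyword.
str = regexprep( str, pat, '${upper($1)}', 'ignorecase' ) ;
```

Here "lobster" and the standalone "and" are upper-cased in one call, and the "and" inside "sand" is skipped. Whether this is faster than the loop depends on the number of keywords and the length of the text, so timing both is the only reliable way to tell.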