Match Substring in Cell Array and Copy To New Array
Hello,
I'm attempting to match a substring in a 9k-by-1 cell array and copy the matching strings to a new array. Ideally I would also like to find where the substring starts in the string (i.e. the character index), but I think it's better to do it one bridge at a time....
Sample strings:
- school of medecine university of CA in Los Angeles
- engineering programs in california
- University of CA in Los Angeles degrees
- University of CA in Irvine nursing online studies
- rankings of universities in california
- school of mathematics UCLA
- financial aid UCLA
- UCLA minorities financial aid
Desired output for example: ucla_programs= [ school of medecine university of CA in Los Angeles school of mathematics UCLA ]
ucla_financial= [ financial aid UCLA UCLA minorities financial aid ]
So far I've tried various versions of regexpi, such as:
for j = 1:length(keywords)
    %keywords(j)
    if ~isempty(regexpi(keywords(j), 'UCLA', 'start'))
        keywords(j);
    end
end
But this doesn't work: it is supposed to return the relevant sets, yet it returns empty sets.
I know how to do string manipulation in Python/C++, but MATLAB isn't the most intuitive for me in this instance. I'd appreciate any suggestions/references you have.
Thank you,
Accepted Answer
per isakson
on 30 Jun 2013
Edited: per isakson
on 30 Jun 2013
I understand approximately half of your question. Not more because
- "substring start" doesn't match "ucla_financial= [ financial aid UCLA UCLA minorities financial aid ]". The example suggests that you search strings, which contain the two words: financial and UCLA, not a substring
- The example "ucla_programs= [ school of medecine university of CA in Los Angeles school of mathematics UCLA ]" is not helpful. The word, programs, doesn't appear in the selected strings.
Anyhow, here is an attempt
cac = { 'school of medecine university of CA in Los Angeles'
        'engineering programs in california'
        'University of CA in Los Angeles degrees'
        'University of CA in Irvine nursing online studies'
        'rankings of universities in california'
        'school of mathematics UCLA'
        'financial aid UCLA'
        'UCLA minorities financial aid' };
keywords = { { 'UCLA', 'financial' } };
for ii = 1 : size(keywords,1)
    for jj = 1 : size(cac,1)
        words = regexp( cac{jj}, '[ \.\,\?\!]+', 'split' );
        if all( ismember( keywords{ii}, words ) )
            disp( cac{jj} )
        end
    end
end
outputs
financial aid UCLA
UCLA minorities financial aid
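For the other half of the question (the character index at which the substring starts), here is a minimal sketch. It assumes the cac array above and a plain case-insensitive substring search rather than whole-word matching; the names starts, hit, ucla_strings and ucla_pos are made up for illustration.

```matlab
cac = { 'school of medecine university of CA in Los Angeles'
        'engineering programs in california'
        'University of CA in Los Angeles degrees'
        'University of CA in Irvine nursing online studies'
        'rankings of universities in california'
        'school of mathematics UCLA'
        'financial aid UCLA'
        'UCLA minorities financial aid' };

% With a cell array input and the 'once' option, regexpi returns one cell
% per string: the index of the first match, or [] when there is none.
starts = regexpi( cac, 'UCLA', 'start', 'once' );

hit          = ~cellfun( @isempty, starts );  % logical mask of matching strings
ucla_strings = cac( hit );                    % copy the matches to a new cell array
ucla_pos     = [ starts{hit} ];               % start index of the match in each string
```

For the sample data this selects the last three strings, with start positions 23, 15 and 1.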
11 Comments
John
on 30 Jun 2013
Hi isakson,
Thank you for your reply and the sample code. I apologize for the incoherence, and you are right about your assumption of searching two words. But I've hinted it in "one bridge at a time."
The first round I'd like to create array of relevant words of first degree (for example, UCLA), then break it down to sub arrays (i.e. financial aid ucla and university of california in los angeles school of medicine), I hope that clarifies.
per isakson
on 30 Jun 2013
I associated "one bridge at a time" with "start in the string", i.e that position in the string could be determined in a second round.
"words of first degree" I googled on that term and got hits related to "Machine translation methods", which I know next to nothing about. Thus, I ask: which are the criteria of "relevant words of first degree "?
John
on 30 Jun 2013
Hi isakson,
By first degree I mean the same exact word (for example a string that contains UCLA and the search finds UCLA to be grouped in a new array). Then second degree may contain strings such as "engineering" "financial aid" which will be grouped in new sub arrays.
I probably misused the "words of first degree" term....
Now I'm thinking of turning your code into a function that is called recursively: first for the first-degree words, returning the strings that form the first "line" of arrays (the more general ones), and then again to generate the more focused arrays (which will contain all strings that contain both financial aid and UCLA). Does this sound like a good approach?
Hi isakson,
I tried to convert your script to a function, but I get a 'cell' error. Here is your/my code: in the main file:
keywords = dataread('file', 'words.txt', '%s', 'delimiter', '\n');
search_term= { 'university' 'college' }; % i'd like it to be able to search one or more keywords, if possible
search_sub(keywords, search_term)
the function file:
function search_substring=search_sub(current_array, current_keyword)
for ii = 1 : size(current_keyword,1)
    for jj = 1 : size(current_array,1)
        words = regexpi( current_array{jj}, '[ \.\,\?\!]+', 'split' );
        if all( ismember( current_keyword{ii}, words ) )
            disp( current_array{jj} )
        end
    end
end
end
It accepts and runs great with one keyword to search. A couple of questions:
- Is there a way to filter out bogus strings, such as "university a" or "university s"? More specifically, I'd like to filter out any string that contains a single letter before or after the keyword. For example, the function shouldn't return "university a" or "university s".
- Can the function search for more than one word? Right now, for 'university' and 'college', it only runs once for 'university' and exits.
- How can I store the output of the function as an array in the main script? "university_arr= search_sub(keywords, search_term)" doesn't work.
Thank you very much for your help, btw very appreciated.
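One possible sketch for the first question in the list above (dropping strings where a lone letter sits directly before or after the keyword). The test strings, the variable names and the exact definition of "bogus" are assumptions for illustration, not something confirmed in this thread.

```matlab
strs = { 'university a'
         'university of CA'
         's university'
         'state university'
         'university s ranking' };   % hypothetical test strings
kw = 'university';

% A string is treated as bogus when a single letter stands alone as a word
% directly before the keyword (first alternative) or directly after it
% (second alternative).
pat = [ '(^|\s)[a-z]\s+' kw '(\s|$)|(^|\s)' kw '\s+[a-z](\s|$)' ];

first_hit = regexpi( strs, pat, 'start', 'once' );  % [] where the pattern is absent
is_bogus  = ~cellfun( @isempty, first_hit );
kept      = strs( ~is_bogus );                      % strings that survive the filter
```

With these test strings, kept holds 'university of CA' and 'state university'.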
per isakson
on 1 Jul 2013
"financial aid", a group of two words might cause a problem.
ismember is the choice for finding two or more words in a string.
strcmp is somewhat faster for searching for one word.
Yes, I guess so, but I fail to grasp the complete problem.
John
on 1 Jul 2013
The complete problem is to cluster similar strings, and these steps are meant to break down the array into smaller pieces to be analyzed more in depth (such as kmeans or Levenshtein distance). The first rounds of sorting are meant to give some contextual meaning to the strings.
I'm not interested in searching for 'financial aid', but rather in using the function to group together all strings that contain "university" and "college", with the result output to a vector/n*1 array in the main program.
These two are my main problems at the moment:
search_term= { 'university' 'college' };
And
arr=search_sub(keywords, search_term)
which gives an error:
Error in search_sub (line 2)
for ii = 1 : size(current_keyword,1)
Output argument "search_substring" (and maybe others) not assigned during call to "D:\Data\MATLAB\search_sub.m>search_sub".
Error in keyword_sorting (line 24)
arr=search_sub(keywords, search_term);
If I call the function as described in my previous comment there is no error, but calling it as `arr=search_sub(....)` causes the error.
per isakson
on 1 Jul 2013
Edited: per isakson
on 1 Jul 2013
This script
keywords = { 'school of medecine university of CA in Los Angeles'
'engineering programs in california'
'University of CA in Los Angeles degrees'
'University of CA in Irvine nursing online studies'
'rankings of universities in california'
'school of mathematics UCLA'
'financial aid UCLA'
'UCLA minorities financial aid' };
search_term = { 'university' 'college' };
search_sub( keywords, search_term )
returns
ans =
'school of medecine university of CA in Los Angeles'
where
function search_substring = search_sub( current_array, current_keyword )
search_substring = cell(0);
for ii = 1 : size(current_keyword,1)
    for jj = 1 : size(current_array,1)
        words = regexpi( current_array{jj}, '[ \.\,\?\!]+', 'split' );
        if all( ismember( current_keyword{ii}, words ) )
            search_substring = cat(1,search_substring,current_array(jj));
        end
    end
end
end
PS. Use textscan instead of dataread, as recommended (which MATLAB release do you use?).
There is a potential problem: ismember is case sensitive and there is no ismemberi. One workaround is to work in lower case. Replace
if all( ismember( current_keyword{ii}, words ) )
by
if all( ismember( lower(current_keyword{ii}), lower(words) ) )
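A small sketch of what the lower-case workaround changes, using one of the sample strings. The variable names kw, strict and relaxed are only illustrative.

```matlab
% Split one sample string into words, as in the function above.
words = regexp( 'University of CA in Los Angeles degrees', '[ \.\,\?\!]+', 'split' );
kw    = { 'university', 'degrees' };

% Case-sensitive test: 'university' does not equal 'University'.
strict  = all( ismember( kw, words ) );

% Same test after converting both sides to lower case.
relaxed = all( ismember( lower(kw), lower(words) ) );
```

Here strict is false and relaxed is true.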
You made small changes that corrupted my code - good night!
Hi isakson,
Thank you for the update, I've actually defined a global variable (as in the code below).
from the main file:
global matched_keyword
search_term= { 'university' 'college'};
search_sub(keywords, search_term);
university_arr=matched_keyword;
university_arr{1}
where the function:
function search_substring=search_sub(current_array, current_keyword)
%search_substring = cell();
counter=1;
global matched_keyword;
for ii = 1 : size(current_keyword,1)
    for jj = 1 : size(current_array,1)
        words = regexpi( current_array{jj}, '[ \.\,\?\!]+', 'split' );
        if all( ismember( current_keyword{ii}, words ) )
            %disp( current_array{jj} );
            matched_keyword{counter}=current_array{jj};
            counter= counter +1;
        end
    end
end
end
The last question I have, as I mentioned above: is there a way to make the function accept and run two search arguments, 'university' and 'college', so that it searches for both terms sequentially? Part of the issue I think is that
size(current_keyword,1)
ans =
1
where is suppose to return 2, is that correct?
John
on 1 Jul 2013
I'm actually calling the function
search_sub(keywords, (search_term)')
which transposes the search_term element, and now it searches through both elements.
Thank you very much for your help isakson.
per isakson
on 1 Jul 2013
Edited: per isakson
on 1 Jul 2013
This kind of "pair-programming" is prone to mistakes. It is only too easy to miss changes the other party makes. And the intentions behind the changes and additions are not properly communicated.
Comments on the code in the comment above
The naming of the argument in the call of and the definition of the function, search_sub, could be more clear:
search_sub( keywords , search_term );
function search_substring=search_sub( current_array, current_keyword)
"now it searches through both elements". I doubt that. Your code does not work as intended (by me). Anyhow, in search_term = {'university' 'college'}; do you intend an AND or an OR? I assumed AND.
"transposes the search_term element,". I cannot see how that transpose affects the behavior of the code
Avoid globals! Replace search_substring by matched_keyword as output argument
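To make the AND/OR distinction concrete, and to show results being returned as ordinary variables instead of through a global, here is a sketch under stated assumptions: it works in lower case, matches whole words only, and the names split_words, has_all, has_any, and_arr and or_arr are invented for this example.

```matlab
keywords = { 'school of medecine university of CA in Los Angeles'
             'engineering programs in california'
             'University of CA in Los Angeles degrees'
             'University of CA in Irvine nursing online studies'
             'rankings of universities in california'
             'school of mathematics UCLA'
             'financial aid UCLA'
             'UCLA minorities financial aid' };
search_term = { 'university', 'college' };

% Helper: lower-case a string and split it into words.
split_words = @(s) regexp( lower(s), '[ \.\,\?\!]+', 'split' );

% AND: every search term must be a word of the string.
has_all = cellfun( @(s) all( ismember( search_term, split_words(s) ) ), keywords );
% OR: at least one search term must be a word of the string.
has_any = cellfun( @(s) any( ismember( search_term, split_words(s) ) ), keywords );

and_arr = keywords( has_all );   % empty here: no string contains both words
or_arr  = keywords( has_any );   % strings 1, 3 and 4 of the sample data
```

Note that 'universities' in the fifth string is not matched, because the comparison is on whole words.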
Be aware that
- all( ismember( cas1, cas2 ) ) in the function search_sub assumes that both arguments are cell arrays of strings. current_keyword{ii} is a string in the function search_sub
- ismember is case sensitive
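The first point can be illustrated with a short sketch (variable names are illustrative): ismember returns a single logical when the first argument is one string, and one logical per element when it is a cell array of strings.

```matlab
words = { 'school', 'of', 'mathematics', 'UCLA' };

% A single string is treated as one item: the result is one logical.
one = ismember( 'UCLA', words );

% A cell array of strings is tested element by element.
two = ismember( { 'UCLA', 'financial' }, words );

% all() over the element-wise result gives the AND behaviour.
both = all( two );
```

Here one is true, two is [true false] and both is false.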
Finally
per isakson
on 1 Jul 2013
Edited: per isakson
on 1 Jul 2013
You write: "Part of the issue I think is that
size(current_keyword,1)
ans =
1
where is suppose to return 2, is that correct?"
I'm lost. The problem is that I don't fully understand your requirements and that you don't fully understand the code, which I propose.
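On the size question quoted above, a small sketch of the shape arithmetic may help, assuming search_term is built as a row cell array as in the earlier comment. The variable names are illustrative only.

```matlab
search_term = { 'university' 'college' };   % 1-by-2 row cell array

n_rows_row = size( search_term, 1 );    % number of rows of the row cell array
n_rows_col = size( search_term', 1 );   % number of rows after transposing
n_items    = numel( search_term );      % number of elements, whatever the shape
```

Here n_rows_row is 1, while n_rows_col and n_items are both 2. A loop written as for ii = 1 : numel(current_keyword) would visit both terms regardless of orientation; that is only a suggestion, not part of the code posted above.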