MATLAB Answers

How do I read the text between href tags and return the results in a cell array?

StuartG on 13 Jun 2016
Commented: Ana Alonso on 17 Dec 2019
I have an HTML web page saved as a text file. Below is an example of the portion of the text I am interested in:
<a href='/some/1056-text-stuff'>
I want to search the text document for every occurrence of the "<a href='/some/" pattern and extract the text between the quotes, i.e.
/some/1056-text-stuff
MATLAB's regexp offers the 'match' and 'tokens' options, but I am struggling to pick out the string cleanly. Ideally, I would like to search the document and return a cell array of strings listing all of the matches. Here is my current code:
str= fileread('C:\Users\Me\Documents\MATLAB\trial.txt'); %read in text file
urls = regexp(str, 'href=(\S+)(\s*)$', 'tokens', 'lineAnchors'); %find urls


Accepted Answer

Julian on 17 Jun 2016
You can try something like
>> RE='<a[\s]+href="(?<target>.*?)"[^>]*>(?<text>.*?)</a>';
>> list=regexp(html, RE, 'names')
For building and testing expressions, I can recommend this tool: https://www.regexbuddy.com/
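The pattern above expects double quotes around the href value, while the sample in the question uses single quotes. A minimal sketch of one possible adaptation follows, assuming the goal is a flat cell array of URL strings. The sample text in `html` and the modified expression are illustrative assumptions, not the expression the asker ended up using.

```matlab
% Hypothetical sample standing in for the saved page (not from the original post).
% Inside a MATLAB char vector, '' is an escaped single quote.
html = ['<a href=''/some/1056-text-stuff''>First</a>' char(10) ...
        '<a href="/some/2000-other">Second</a>'];

% Accept either quote character around the href value and capture
% everything up to the next quote.
RE = '<a\s+href=[''"]([^''"]*)[''"]';

% 'tokens' returns one 1x1 cell per match; [tok{:}] flattens them
% into a single 1xN cell array of char vectors.
tok  = regexp(html, RE, 'tokens');
urls = [tok{:}];
```

With this sample input, `urls` is a 1x2 cell array holding '/some/1056-text-stuff' and '/some/2000-other'. To restrict matches to links under /some/, the literal prefix could be added right after the opening quote class.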

  2 Comments

StuartG on 21 Jun 2016
Thank you very much, the regex command was giving me a lot of grief. I tailored your expression a little bit and it worked perfectly.
Ana Alonso on 17 Dec 2019
Hi there,
What do the (?<target>.*?) and (?<text>.*?) expressions correspond to?
I've never worked with HTML before and I'm just trying to scrape URLs from the HTML code.
Thanks!
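
As a rough illustration of the two expressions asked about: `(?<target>.*?)` and `(?<text>.*?)` are named capture groups. With the 'names' option, regexp returns a struct whose fields carry those names, so `target` holds the href value and `text` holds the link text between the opening and closing tags. The `.*?` part is a lazy match that stops at the first character that lets the rest of the pattern succeed. The sample `html` string below is a hypothetical input.

```matlab
% Hypothetical one-link sample (double-quoted, as the accepted pattern expects).
html = '<a href="/some/1056-text-stuff">Text stuff</a>';

% Pattern from the accepted answer, unchanged.
RE = '<a[\s]+href="(?<target>.*?)"[^>]*>(?<text>.*?)</a>';

% 'names' returns a struct (a 1xN struct array when there are N matches).
list = regexp(html, RE, 'names');

target = list.target;      % '/some/1056-text-stuff'
text   = list.text;        % 'Text stuff'

% With several matches, the URLs can be gathered into a cell array.
urls = {list.target};
```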
