Matlab extract url from html source

hinhthoi on 2 May 2012
Commented: Gobert on 13 Jun 2021
Hi, I am trying to extract all URLs from HTML source code. I used the strfind command to find "http" as the start of a URL and ".html", ".php", or ".png" as the end of it. After that I join each start to an end to form a complete URL.
But this gives very bad results, because the starts and ends usually get mixed up.
Is there an easier way to do this?
I'm thinking of a pattern search: a single command that returns all URLs that start with http:// and end with .html, .php, or .png.
The HTML source also contains URLs with other extensions, but I want to ignore all of those.
Thank you very much for any help.

Answers (3)

Jason Ross on 2 May 2012
I would do this using a series of regular expressions. Take a look at "Parsing Strings with Regular Expressions" on the following page for an example. It uses email addresses, but doing it for a URL is very similar since you know how it starts and ends, and you care about what's in between.

Walter Roberson on 2 May 2012
regexp(TheString, 'http://.*?\.(html|php|png)', 'match')
However, this cannot tell that (say) http://mathworks.com/scripts.htmlx/logo.png should extend to the .png rather than stop at the .html. To determine where a URI ends, you need to know which characters terminate a URI in your context, taking into account that sloppy pages often send URIs with embedded blanks, which is syntactically invalid...
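One way to act on the point about terminating characters is sketched below. It is only a sketch under stated assumptions: the terminator set (whitespace, quotes, and angle brackets) and the sample HTML are made up for illustration, and the https? prefix is an addition that also accepts https links. Because whitespace is treated as a terminator, a URI with embedded blanks would still be cut short.

```matlab
% Hypothetical sample input; in practice this would be the page source.
html = ['<a href="http://mathworks.com/scripts.htmlx/logo.png">logo</a> ' ...
        '<img src=''http://example.com/a.png''> ' ...
        '<link href="http://example.com/site.css">'];

% Assumed URL terminators: whitespace, double quote, single quote, < and >.
stop = '\s"''<>';

% Lazily consume non-terminator characters up to a wanted extension, and
% accept the extension only if a terminator (or the end of the text) follows.
expr = ['https?://[^' stop ']*?\.(html|php|png)(?=[' stop ']|$)'];

% Column cell array of the matched URLs.
urls = regexp(html, expr, 'match', 'ignorecase')';
```

With the lookahead in place, the .html inside scripts.htmlx is rejected because it is followed by x rather than a terminator, so the match runs on to logo.png. The .css link is skipped because no wanted extension appears before its closing quote.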

Abhisar Ekka on 13 Feb 2021
You can run this piece of code and it works.
html = webread("<----paste your url here ---->");
hyperlinks = regexp(html,'https?://[^"]+','match')'
Put your URL inside the webread call. webread downloads the page and returns its HTML source as text; it does not parse the HTML. regexp then matches the regular expression against that text and returns every http and https link in the page, each one running up to the next double quote.
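To narrow those links down to the extensions asked about in the question, the matches can be filtered afterwards. The following is a minimal sketch, not a definitive implementation: the sample HTML is a made-up stand-in for the text webread would return, and endsWith assumes MATLAB R2016b or later.

```matlab
% Hypothetical stand-in for the text returned by webread.
html = ['<a href="http://example.com/index.html">home</a> ' ...
        '<img src="https://example.com/logo.png"> ' ...
        '<link href="http://example.com/site.css">'];

% Every http/https link up to the next double quote, as in the answer above.
urls = regexp(html, 'https?://[^"]+', 'match');

% Keep only the links that end in one of the wanted extensions.
wanted = endsWith(urls, {'.html', '.php', '.png'}, 'IgnoreCase', true);
urls = urls(wanted)';
```

Filtering after the match keeps the regular expression simple, but a link with a query string after the extension (for example page.html?x=1) would be dropped by this check.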
  1 Comment
Gobert on 13 Jun 2021
How can one search the HTML of each page for email addresses? For example, how can the code below be made to work?
html = webread("https://edition.cnn.com");
hyperlinks = regexp(html,'https?://[^"]+','match')';
rgx ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
emails = regexpi(hyperlinks,rgx,'match')';
