How can I get all sub urls from host?

Statira Meshkin on 16 Jun 2015
Commented: Statira Meshkin on 16 Jun 2015
Hi,
I need to read all the data from a URL, but the problem is that my URLs look like this:
"http://abc.efg.hij/klmnopqr%5stu.nsf/0/00DB180072B825?OpenDocument" or
"http://abc.efg.hij/klmnopqr%5stu.nsf/a33b09a7270068d/cc24f38720!OpenDocument"
and the last part (the serial number) changes randomly. I just want a way to go to the main site (for example, "http://abc.efg.hij/klmnopqr%5stu.nsf/") and get all the URLs on it. (There are about a thousand sub-URLs on that main site.)
Please let me know if anyone can help.
Thank you in advance
Tara

Answers (2)

Walter Roberson on 16 Jun 2015
In the general case, enumerating the URLs that will be accepted by a site is not possible. URLs are evaluated by programs on the server, and need not correspond to an actual file.
You can urlread() or urlwrite() a specific URL and then parse the returned HTML to find anchors such as <a HREF=", extract each one, find the unique subset, and then iterate over each of them asking to fetch it in turn.

Statira Meshkin on 16 Jun 2015
Thank you Walter,
but my problem is that, because there are a lot of links I need to follow, I don't have enough time to go through all the sub-links on the main site by hand and copy each one from the address bar.
  1 Comment
Walter Roberson on 16 Jun 2015
You can urlread() or urlwrite() a specific URL (i.e., the main site), and then parse the returned HTML to find anchors such as <a HREF=", extract each one, find the unique subset, and then iterate over each of them asking urlread() or urlwrite() to fetch each in turn.
For example, an approximation (one that does not take HTML comments into account) would be:
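% Fetch the main page as text
SiteContents = urlread('http://abc.efg.hij/klmnopqr%5stu.nsf/');
% Crude match: every absolute http:// URL, up to the next double quote
ContainedURLs = regexp(SiteContents, 'http://[^"]+', 'match');
UniqueURLs = unique(ContainedURLs);
MinedData = cell(size(UniqueURLs));   % preallocate the results
for K = 1 : length(UniqueURLs)
    MinedData{K} = urlread(UniqueURLs{K});
end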
But watch out in case a URL is relative rather than absolute, and in case the site links to itself.
The regexp() pattern assumes that each URL extends to the first " that follows it. The crude code here also does not attempt to distinguish image tags from anchors: that's an enhancement for you to work out.
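
As a rough illustration of those caveats, here is a minimal sketch that keeps only anchor tags, turns relative links into absolute ones, and drops the link back to the main page. The HTML string is a made-up stand-in for what urlread() might return (the real page layout is not known), and the resolution step assumes the base URL ends in "/" and that no link uses "../", so treat it as a starting point rather than a complete URL resolver.

```matlab
% Hypothetical stand-in for the text returned by urlread(BaseURL)
BaseURL = 'http://abc.efg.hij/klmnopqr%5stu.nsf/';
SiteContents = ['<a HREF="0/00DB180072B825?OpenDocument">one</a>' ...
                '<img src="http://abc.efg.hij/logo.gif">' ...
                '<a href="/other.nsf/page">two</a>' ...
                '<a href="http://abc.efg.hij/klmnopqr%5stu.nsf/">self</a>' ...
                '<a href="0/00DB180072B825?OpenDocument">repeat</a>'];

% Capture the href value of <a ...> tags only (case-insensitive),
% so <img src="..."> is not picked up
Tokens = regexpi(SiteContents, '<a\s[^>]*?href\s*=\s*"([^"]*)"', 'tokens');
Links = cellfun(@(t) t{1}, Tokens, 'UniformOutput', false);

% Scheme and host part of the base URL, e.g. http://abc.efg.hij
Host = regexp(BaseURL, '^https?://[^/]+', 'match', 'once');

for K = 1 : numel(Links)
    Link = Links{K};
    if ~isempty(regexp(Link, '^https?://', 'once'))
        % already absolute: leave unchanged
    elseif strncmp(Link, '/', 1)
        Links{K} = [Host Link];      % relative to the host root
    else
        Links{K} = [BaseURL Link];   % relative to the main page (assumes trailing "/")
    end
end

% Remove duplicates, then drop the link that points back to the main page
UniqueURLs = unique(Links);
UniqueURLs = UniqueURLs(~strcmp(UniqueURLs, BaseURL));
```

Each entry of UniqueURLs could then be passed to urlread() in a loop like the one above. Quoting styles other than double quotes (single-quoted or unquoted href values) are not handled by this pattern.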
