MATLAB Answers



Asked by Priya
on 7 Jun 2013


I want to copy the contents of a web page into a file. Using urlread or urlwrite, I get the HTML code for the page; instead, I want the text content of that page stored in a text file or string array.


The question over there had to do with fetching the page and writing it to a file, for which I suggested urlwrite(), and to which you replied with the error message about proxies. But if you were still getting that proxy message, you would not know that the file comes out as HTML, so it seems you have solved the problem of fetching the HTML page and writing it to a file.

If I have misunderstood, then what do you see as the difference between that existing question and this question?

on 9 Jun 2013

Yes, the proxy error got resolved.

However, I don't want HTML code as my output; I want the contents of the web page saved in a text file.

Hence urlwrite or urlread is not working in this case.

urlread() and urlwrite() are doing what they are intended to do: fetching the page as-is. Processing the page afterwards is the responsibility of your code.

Do you blame your automobile for the fact that when you go grocery shopping, the automobile does not bring the groceries into the kitchen and take them out of the grocery bags?



1 Answer

Answer by Walter Roberson
on 7 Jun 2013

You will have to parse the text. The page you referenced before does not present the text in any simple way. Individual letters of the text are each surrounded by <font> tags that select the color for the letters.
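As a small illustration of that kind of markup, here is a sketch that removes per-letter <font> tags with regexprep(). The HTML fragment is made up for the example and is not taken from the page in question; the real page may nest or format its tags differently.

```matlab
% Hypothetical fragment in the style described above: one <font> tag per letter.
html = '<font color="#ff0000">H</font><font color="#00aa00">i</font>';

% Remove opening and closing <font ...> tags, keeping the letters between them.
txt = regexprep(html, '</?font[^>]*>', '', 'ignorecase');

disp(txt)
```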


on 7 Jun 2013

How shall I parse it?

Any of the standard techniques, including:

  • fread() a character at a time and have a bunch of ad-hoc code to figure out what to do with it
  • fread() a character at a time and use it to trigger a transition in a carefully constructed state machine
  • fgets() or fgetl() a line at a time and use basic string manipulation techniques such as find() or strfind() or ismember() or switch/case
  • fileread() or textscan() or fread() the entire file and use the basic techniques on the file that is now completely in memory
  • Use regexp() or regexprep() to process the file that is completely in memory
  • make a call to perl() with a perl script to do the work, perhaps having loaded in an HTML stripping routine from
  • On Linux or OS X machines, shell out to ed or sed or nawk to do the work
  • write a C program to do the work
  • write a lex grammar to do the tokenizing. Write a yacc routine to express the BNF and take appropriate actions
  • find some Java library that does for HTML roughly what is done for xml in

and so on.
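To make one of these options concrete, here is a minimal sketch of the regexprep() approach applied to a page that is already in memory. The function name html2text_sketch is hypothetical (it is not a built-in), the handful of entities it decodes is an arbitrary choice, and regular expressions are only a rough tool for HTML, so treat this as a starting point rather than a robust converter.

```matlab
function txt = html2text_sketch(html)
% HTML2TEXT_SKETCH  Rough HTML-to-text conversion (hypothetical helper).
%   Example, assuming the page has already been fetched into a string:
%       html = urlread(pageUrl);           % pageUrl is your own URL
%       txt  = html2text_sketch(html);
%       fid  = fopen('page.txt', 'w');     % output name is an assumption
%       fprintf(fid, '%s', txt);
%       fclose(fid);

    % Drop script and style blocks together with their contents.
    txt = regexprep(html, '<script[^>]*>.*?</script>', '', 'ignorecase');
    txt = regexprep(txt,  '<style[^>]*>.*?</style>',   '', 'ignorecase');

    % Drop HTML comments.
    txt = regexprep(txt, '<!--.*?-->', '');

    % Turn line breaks and paragraph ends into newlines.
    txt = regexprep(txt, '<br\s*/?>|</p>', char(10), 'ignorecase');

    % Remove every remaining tag, such as <font ...> and </font>.
    txt = regexprep(txt, '<[^>]*>', '');

    % Decode a few common entities; &amp; goes last so that text such as
    % '&amp;lt;' is not decoded twice.
    txt = strrep(txt, '&nbsp;', ' ');
    txt = strrep(txt, '&lt;',   '<');
    txt = strrep(txt, '&gt;',   '>');
    txt = strrep(txt, '&quot;', '"');
    txt = strrep(txt, '&amp;',  '&');

    % Collapse runs of spaces and tabs, then trim the ends.
    txt = regexprep(txt, '[ \t]+', ' ');
    txt = strtrim(txt);
end
```

Doing the whole job in memory keeps the code short, at the cost of being fooled by unusual markup (for example a '>' inside an attribute value); the state machine or a real HTML parser from the list above would handle such cases more reliably.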

on 9 Jun 2013

thanks a lot.

I am trying it with shell scripting.
