Extract Text Data from Files

This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts the text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.
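
As a rough guide to the choice described above, the following sketch picks a reader from the file extension. The file name and the extension lists are illustrative assumptions, not an exhaustive list of supported formats.

```matlab
% Hypothetical file name; only the extension matters for this sketch.
filename = "factoryReports.csv";
[~,~,ext] = fileparts(filename);
ext = lower(ext);

% Assumed extension lists for illustration; check the function
% documentation for the formats supported in your release.
if ismember(ext,[".txt" ".pdf" ".docx" ".html"])
    reader = "extractFileText";
elseif ismember(ext,[".csv" ".xlsx"])
    reader = "readtable";
else
    reader = "unsupported";
end
disp(reader)
```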

Text File

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

filename = "sonnets.txt";
str = extractFileText(filename);

View the first sonnet by extracting the text between the two titles "I" and "II".

start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = 
    "
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
     
      "

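The same pattern can be tried on a small string without the example file. In this sketch the text is made up, and the call to strip is an addition that trims the surrounding whitespace, including newline characters.

```matlab
% Hypothetical miniature document with Roman-numeral titles
% (not the contents of sonnets.txt).
str = " I" + newline + "First poem line." + newline + ...
    " II" + newline + "Second poem line.";

% extractBetween returns the text between the two boundaries,
% excluding the boundaries themselves.
start = " I" + newline;
fin = " II";
poem1 = extractBetween(str,start,fin);

% Remove leading and trailing whitespace from the result.
poem1 = strip(poem1)
```
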
For text files containing multiple documents separated by newline characters, use the readlines function.

filename = "multilineSonnets.txt";
str = readlines(filename)
str = 3×1 string
    "From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee."
    "When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."
    "Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee."

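Depending on the file, readlines can also return empty strings, for example for blank lines. The following sketch writes a small made-up file to the temporary folder (the file name and contents are assumptions) and removes the empty strings with logical indexing.

```matlab
% Write a small hypothetical file with one document per line.
filename = fullfile(tempdir,"exampleDocuments.txt");
fid = fopen(filename,'w');
fprintf(fid,'First document.\nSecond document.\n\nThird document.\n');
fclose(fid);

% Read one string per line, then drop the empty strings.
str = readlines(filename);
str(str == "") = []
```
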
Microsoft Word Document

Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.

filename = "exampleSonnets.docx";
str = extractFileText(filename);

View the second sonnet by extracting the text between the two titles "II" and "III".

start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
     
       And dig deep trenches in thy beauty's field,
     
       Thy youth's proud livery so gazed on now,
     
       Will be a tatter'd weed of small worth held:
     
       Then being asked, where all thy beauty lies,
     
       Where all the treasure of thy lusty days;
     
       To say, within thine own deep sunken eyes,
     
       Were an all-eating shame, and thriftless praise.
     
       How much more praise deserv'd thy beauty's use,
     
       If thou couldst answer 'This fair child of mine
     
       Shall sum my count, and make my old excuse,'
     
       Proving his beauty by succession thine!
     
         This were to be new made when thou art old,
     
         And see thy blood warm when thou feel'st it cold.
     
      "

The example Microsoft Word document uses two newline characters between each line. To replace these characters with a single newline character, use the replace function.

sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
       And dig deep trenches in thy beauty's field,
       Thy youth's proud livery so gazed on now,
       Will be a tatter'd weed of small worth held:
       Then being asked, where all thy beauty lies,
       Where all the treasure of thy lusty days;
       To say, within thine own deep sunken eyes,
       Were an all-eating shame, and thriftless praise.
       How much more praise deserv'd thy beauty's use,
       If thou couldst answer 'This fair child of mine
       Shall sum my count, and make my old excuse,'
       Proving his beauty by succession thine!
         This were to be new made when thou art old,
         And see thy blood warm when thou feel'st it cold.
      "

PDF Files

Extract text from PDF documents and data from PDF forms.

PDF Document

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.

filename = "exampleSonnets.pdf";
str = extractFileText(filename);

View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.

start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = 
    " 
       Look in thy glass and tell the face thou viewest 
       Now is the time that face should form another; 
       Whose fresh repair if now thou not renewest, 
       Thou dost beguile the world, unbless some mother. 
       For where is she so fair whose unear'd womb 
       Disdains the tillage of thy husbandry? 
       Or who is he so fond will be the tomb, 
       Of his self-love to stop posterity? 
       Thou art thy mother's glass and she in thee 
       Calls back the lovely April of her prime; 
       So thou through windows of thine age shalt see, 
       Despite of wrinkles this thy golden time. 
         But if thou live, remember'd not to be, 
         Die single and thine image dies with thee. 
     
     
      
       "

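One possible way to tidy text like this is to split it into lines, trim each line, and rejoin the result. The sketch below uses a short made-up excerpt in place of the extracted text. Note that it also removes the leading indentation of each line, which may not be what you want for verse.

```matlab
% Hypothetical excerpt imitating the PDF output: a space precedes
% each newline character, and blank lines surround the text.
sonnet3 = " " + newline + "Look in thy glass " + newline + ...
    "Now is the time " + newline + " " + newline;

% Split into lines, trim each line, and drop lines that are now empty.
textLines = strip(splitlines(sonnet3));
textLines(textLines == "") = [];

% Rejoin the cleaned lines with single newline characters.
sonnet3 = join(textLines,newline)
```
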
PDF Form

To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.

filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
         event_type: "Thunderstorm Wind"
    event_narrative: "Large tree down between Plantersville and Nettleton."
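
To work with the returned values, you can index into the struct by field name. The following sketch builds a struct by hand that imitates the output above, so it runs without the example PDF. The variable names narrative and fieldValues are illustrative.

```matlab
% Hypothetical struct imitating the readPDFFormData output shown above.
data = struct("event_type","Thunderstorm Wind", ...
    "event_narrative","Large tree down between Plantersville and Nettleton.");

% Access a single field using dot notation.
narrative = data.event_narrative;

% Collect the values of all fields into one string array.
fieldValues = string(struct2cell(data))
```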

HTML

Extract text from HTML files, HTML code, and the web.

HTML File

To extract text data from a saved HTML file, use extractFileText.

filename = "exampleSonnets.html";
str = extractFileText(filename);

View the fourth sonnet by extracting the text between the two titles "IV" and "V".

start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = 
    "
     Unthrifty loveliness, why dost thou spend
     Upon thy self thy beauty's legacy?
     Nature's bequest gives nothing, but doth lend,
     And being frank she lends to those are free:
     Then, beauteous niggard, why dost thou abuse
     The bounteous largess given thee to give?
     Profitless usurer, why dost thou use
     So great a sum of sums, yet canst not live?
     For having traffic with thy self alone,
     Thou of thy self thy sweet self dost deceive:
     Then how when nature calls thee to be gone,
     What acceptable audit canst thou leave?
     Thy unused beauty must be tombed with thee,
     Which, used, lives th' executor to be.
     "

HTML Code

To extract text data from a string containing HTML code, use extractHTMLText.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = 
    "THE SONNETS
     
     by William Shakespeare"

From the Web

To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 
    'Text Analytics Toolbox
     
     Analyze and model text data 
     
     Release Notes
     
     PDF Documentation
     
     Release Notes
     
     PDF Documentation
     
     Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.
     
     Get Started
     
     Learn the basics of Text Analytics Toolbox
     
     Text Data Preparation
     
     Import text data into MATLAB® and preprocess it for analysis
     
     Modeling and Prediction
     
     Develop predictive models using topic models and word embeddings
     
     Display and Presentation
     
     Visualize text data and models using word clouds and text scatter plots
     
     Language Support
     
     Information on language support in Text Analytics Toolbox'

Parse HTML Code

To find particular elements of HTML code, parse the code using htmlTree and use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with element name "A".

tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);

View the first 10 subtrees and extract the text using extractHTMLText.

subtrees(1:10)
ans = 
  10×1 htmlTree:

    <A class="skip_link sr-only" href="#skip_link_anchor">Skip to content</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>

str = extractHTMLText(subtrees);

View the extracted text of the first 10 hyperlinks.

str(1:10)
ans = 10×1 string
    "Skip to content"
    ""
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Get MATLAB"
    ""

To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.

attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string
    "#skip_link_anchor"
    "https://www.mathworks.com?s_tid=gn_logo"
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/solutions.html?s_tid=gn_sol"
    "https://www.mathworks.com/academia.html?s_tid=gn_acad"
    "https://www.mathworks.com/support.html?s_tid=gn_supp"
    "https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml"
    "https://www.mathworks.com?s_tid=gn_logo"

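Links that contain only an image have empty text, as the output above shows. A minimal sketch of pairing link text with targets and discarding the empty entries follows. The HTML snippet and the table variable names are made up for illustration.

```matlab
% Hypothetical HTML snippet (not the web page used above).
code = "<html><body><a href=""/a.html"">First</a>" + ...
    "<a href=""/logo.html""><img src=""logo.svg""/></a>" + ...
    "<a href=""/b.html"">Second</a></body></html>";

% Parse the code and find the hyperlink elements.
tree = htmlTree(code);
subtrees = findElement(tree,"A");

% Extract the link text and targets, then keep only the links with text.
linkText = extractHTMLText(subtrees);
linkTarget = getAttribute(subtrees,"href");
idx = linkText ~= "";
links = table(linkText(idx),linkTarget(idx),'VariableNames',{'Text','Target'})
```
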
CSV and Microsoft Excel Files

To extract text data from CSV and Microsoft Excel files, use readtable and extract the text data from the table that it returns.

Extract the table data from factoryReports.csv using the readtable function and view the first few rows of the table.

T = readtable('factoryReports.csv','TextType','string');
head(T)
                                 Description                                       Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38

Extract the text data from the Description column and view the first few strings.

str = T.Description;
str(1:10)
ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "Burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."

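Because the text is stored in a table, you can use the other columns to select a subset of it. The following sketch builds a small table by hand that imitates a few rows of the example data, then keeps only the descriptions in one category.

```matlab
% Hypothetical table imitating a few rows of factoryReports.csv.
Description = ["Fried capacitors in the assembler."
    "Burst pipe in the constructing agent is spraying coolant."
    "Mixer tripped the fuses."];
Category = ["Electronic Failure"; "Leak"; "Electronic Failure"];
T = table(Description,Category);

% Keep only the descriptions whose category is "Electronic Failure".
idx = T.Category == "Electronic Failure";
str = T.Description(idx)
```
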
Extract Text from Multiple Files

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to match all file names of this form. To use extractFileText as the read function, pass a function handle to fileDatastore using the 'ReadFcn' option.

location = "exampleSonnet*.txt";
fds = fileDatastore(location,'ReadFcn',@extractFileText);

Loop over the files in the datastore and read each text file.

str = [];
while hasdata(fds)
    textData = read(fds);
    str = [str; textData];
end

View the extracted text.

str
str = 4×1 string
    "  From fairest creatures we desire increase,↵  That thereby beauty's rose might never die,↵  But as the riper should by time decease,↵  His tender heir might bear his memory:↵  But thou, contracted to thine own bright eyes,↵  Feed'st thy light's flame with self-substantial fuel,↵  Making a famine where abundance lies,↵  Thy self thy foe, to thy sweet self too cruel:↵  Thou that art now the world's fresh ornament,↵  And only herald to the gaudy spring,↵  Within thine own bud buriest thy content,↵  And tender churl mak'st waste in niggarding:↵    Pity the world, or else this glutton be,↵    To eat the world's due, by the grave and thee."
    "  When forty winters shall besiege thy brow,↵  And dig deep trenches in thy beauty's field,↵  Thy youth's proud livery so gazed on now,↵  Will be a tatter'd weed of small worth held:↵  Then being asked, where all thy beauty lies,↵  Where all the treasure of thy lusty days;↵  To say, within thine own deep sunken eyes,↵  Were an all-eating shame, and thriftless praise.↵  How much more praise deserv'd thy beauty's use,↵  If thou couldst answer 'This fair child of mine↵  Shall sum my count, and make my old excuse,'↵  Proving his beauty by succession thine!↵    This were to be new made when thou art old,↵    And see thy blood warm when thou feel'st it cold."
    "  Look in thy glass and tell the face thou viewest↵  Now is the time that face should form another;↵  Whose fresh repair if now thou not renewest,↵  Thou dost beguile the world, unbless some mother.↵  For where is she so fair whose unear'd womb↵  Disdains the tillage of thy husbandry?↵  Or who is he so fond will be the tomb,↵  Of his self-love to stop posterity?↵  Thou art thy mother's glass and she in thee↵  Calls back the lovely April of her prime;↵  So thou through windows of thine age shalt see,↵  Despite of wrinkles this thy golden time.↵    But if thou live, remember'd not to be,↵    Die single and thine image dies with thee."
    "  Unthrifty loveliness, why dost thou spend↵  Upon thy self thy beauty's legacy?↵  Nature's bequest gives nothing, but doth lend,↵  And being frank she lends to those are free:↵  Then, beauteous niggard, why dost thou abuse↵  The bounteous largess given thee to give?↵  Profitless usurer, why dost thou use↵  So great a sum of sums, yet canst not live?↵  For having traffic with thy self alone,↵  Thou of thy self thy sweet self dost deceive:↵  Then how when nature calls thee to be gone,↵  What acceptable audit canst thou leave?↵    Thy unused beauty must be tombed with thee,↵    Which, used, lives th' executor to be."

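As an alternative to the loop, readall reads every file in the datastore in one call. The sketch below is one possible variant, not the approach used above. It creates two small made-up files in a temporary folder so that it does not depend on the example sonnet files.

```matlab
% Create two small hypothetical text files in a temporary folder.
folder = tempname;
mkdir(folder);
texts = ["First example document."; "Second example document."];
for i = 1:numel(texts)
    fid = fopen(fullfile(folder,"exampleDoc" + i + ".txt"),'w');
    fprintf(fid,'%s',texts(i));
    fclose(fid);
end

% Read all files in one call. By default, readall returns a cell array
% with one element per file.
fds = fileDatastore(fullfile(folder,"exampleDoc*.txt"),'ReadFcn',@extractFileText);
data = readall(fds);

% Concatenate the string scalars into a string array.
str = vertcat(data{:})
```
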