Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis.

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numerical representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.

Text Analytics Toolbox supports English and Japanese. Most Text Analytics Toolbox functions also work with text in other languages. For more information, see Language Support.

Functions

extractFileText: Read text from PDF, Microsoft Word, HTML, and plain text files
extractHTMLText: Extract text from HTML
readPDFFormData: Read data from PDF forms
writeTextDocument: Write documents to text file
htmlTree: Parsed HTML tree
findElement: Find elements in HTML tree
getAttribute: Read HTML attribute of root node of HTML tree
ismissing: Find HTML trees without values
tokenizedDocument: Array of tokenized documents for text analysis
erasePunctuation: Erase punctuation from text and documents
eraseTags: Erase HTML and XML tags from text
eraseURLs: Erase HTTP and HTTPS URLs from text
removeStopWords: Remove stop words from documents
removeShortWords: Remove short words from documents or bag-of-words model
removeLongWords: Remove long words from documents or bag-of-words model
removeWords: Remove selected words from documents or bag-of-words model
normalizeWords: Stem or lemmatize words
stopWords: List of stop words
decodeHTMLEntities: Convert HTML and XML entities into characters
lower: Convert documents to lowercase
upper: Convert documents to uppercase
context: Search documents for word occurrences in context
tokenDetails: Details of tokens in tokenized document array
addSentenceDetails: Add sentence numbers to documents
addPartOfSpeechDetails: Add part-of-speech tags to documents
addLemmaDetails: Add lemma forms of tokens to documents
addLanguageDetails: Add language identifiers to documents
addTypeDetails: Add token type details to documents
splitSentences: Split text into sentences
corpusLanguage: Detect language of text
abbreviations: Table of common abbreviations
topLevelDomains: List of top-level domains
bagOfWords: Bag-of-words model
bagOfNgrams: Bag-of-n-grams model
addDocument: Add documents to bag-of-words or bag-of-n-grams model
removeDocument: Remove documents from bag-of-words or bag-of-n-grams model
removeInfrequentWords: Remove words with low counts from bag-of-words model
removeInfrequentNgrams: Remove infrequently seen n-grams from bag-of-n-grams model
removeNgrams: Remove n-grams from bag-of-n-grams model
removeEmptyDocuments: Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
topkwords: Most important words in bag-of-words model or LDA topic
topkngrams: Most frequent n-grams
encode: Encode documents as matrix of word or n-gram counts
tfidf: Term Frequency–Inverse Document Frequency (tf-idf) matrix
join: Combine multiple bag-of-words or bag-of-n-grams models
docfun: Apply function to words in documents
plus: Append documents
replace: Find and replace substrings in documents
regexprep: Replace text in words of documents using regular expression
doclength: Length of documents in document array
doc2cell: Convert documents to cell array of string vectors
joinWords: Convert documents to string by joining words
string: Convert scalar document to string vector
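
As a rough illustration of how a few of the functions listed above fit together, the following sketch tokenizes a small string array, builds a bag-of-words model and a bag-of-n-grams model, and converts the word counts into a tf-idf matrix. The sample sentences are invented for illustration only and do not come from any shipped data set.

```matlab
% Hypothetical sample text; any string array of documents would work.
str = [
    "the pump failed to start"
    "the pump restarted after the valve was replaced"
    "the valve leaked coolant"];

% Split each string into word tokens.
documents = tokenizedDocument(str);

% Count word occurrences per document.
bag = bagOfWords(documents);

% List the three most frequent words in the model as a table.
tbl = topkwords(bag, 3);

% Weight the counts by inverse document frequency.
% M has one row per document and one column per vocabulary word.
M = tfidf(bag);

% Count n-grams per document (bigrams by default).
bagNgrams = bagOfNgrams(documents);
```

The matrix M can then be passed to other numeric tools, since each row is a numerical representation of one document.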

Examples and How To

Extract Text Data from Files

This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.
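
The body of such a preprocessing function might look like the sketch below. The choice and order of steps, the length thresholds, and the sample text are assumptions made for illustration; they are not taken from the linked example.

```matlab
% Hypothetical raw text; in practice this would come from files or a table.
rawText = [
    "The pump FAILED to start; operators restarted it twice."
    "Coolant leaking near the mixer, see https://example.com/report."];

% Remove URLs while the data is still plain text.
cleanText = eraseURLs(rawText);

% Tokenize, then convert every token to lowercase.
documents = tokenizedDocument(cleanText);
documents = lower(documents);

% Remove common stop words such as "the" and "to".
documents = removeStopWords(documents);

% Reduce words to a root form (stemming is the default for English).
documents = normalizeWords(documents);

% Erase punctuation tokens and punctuation inside words.
documents = erasePunctuation(documents);

% Assumed thresholds: drop words with 2 or fewer characters
% and words with 15 or more characters.
documents = removeShortWords(documents, 2);
documents = removeLongWords(documents, 15);
```

Wrapping these steps in a function that takes a string array and returns a tokenizedDocument array makes it easy to apply the same cleaning to training data and to new data.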

Parse HTML and Extract Text Content

This example shows how to parse HTML code and extract the text content from particular elements.
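
A minimal sketch of that workflow, assuming a small hand-written HTML string rather than a downloaded page, could be:

```matlab
% Hypothetical HTML snippet used only for illustration.
code = "<html><body><h1>Report</h1>" + ...
    "<p>Pump restarted.</p><p>Valve replaced.</p></body></html>";

% Parse the HTML code into a tree.
tree = htmlTree(code);

% Select every paragraph element using a CSS selector.
paragraphs = findElement(tree, "p");

% Extract the text content of each selected element as a string array.
str = extractHTMLText(paragraphs);
```

Here the selector "p" is just one choice; findElement accepts other CSS selectors, so the same pattern applies to headings, links, or elements with a particular class.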

Analyze Text Data Containing Emojis

This example shows how to analyze text data containing emojis.

Concepts

Language Support

Information on language support in Text Analytics Toolbox.

Japanese Language Support

Information on Japanese support in Text Analytics Toolbox.

Analyze Japanese Text Data

This example shows how to import, prepare, and analyze Japanese text data using a topic model.

Featured Examples