Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis.

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.

Functions

extractFileText - Read text from PDF, Microsoft Word, and plain text files
writeTextDocument - Write documents to text file
eraseTags - Erase HTML and XML tags from text
eraseURLs - Erase HTTP and HTTPS URLs from text
erasePunctuation - Erase punctuation from text and documents
decodeHTMLEntities - Convert HTML and XML entities into characters
normalizeWords - Remove inflections from words using the Porter stemmer
removeLongWords - Remove long words from documents or bag-of-words model
removeShortWords - Remove short words from documents or bag-of-words model
removeWords - Remove selected words from document or bag-of-words model
stopWords - Stop word list
upper - Convert documents to uppercase
lower - Convert documents to lowercase
docfun - Apply function to words in documents
replace - Find and replace substrings in documents
regexprep - Replace text in words of documents using regular expression
addDocument - Add documents to bag-of-words model
removeDocument - Remove documents from bag-of-words model
removeEmptyDocuments - Remove empty documents from tokenized document array or bag-of-words model
removeInfrequentWords - Remove words with low counts from bag-of-words model
topkwords - Most important words in bag-of-words model or LDA topic
encode - Encode documents as matrix of word counts
tfidf - Term Frequency–Inverse Document Frequency (tf-idf) matrix
context - Search documents for word occurrences in context
doclength - Length of documents in document array
doc2cell - Convert documents to cell array of string vectors
joinWords - Convert documents to string by joining words
string - Convert scalar document to string vector

Using Objects

tokenizedDocument - Array of tokenized documents
bagOfWords - Bag-of-words model
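
As a rough illustration of how these two objects fit together, the sketch below tokenizes a few sentences, builds a bag-of-words model from them, and queries the model with topkwords and tfidf from the function list above. The sample strings and variable names are invented for illustration and are not taken from the toolbox examples.

```matlab
% Toy input: three short, made-up documents (illustrative only).
textData = [
    "the pump failed after the motor overheated"
    "the motor was replaced and the pump restarted"
    "operators reported no further faults"];

% Split each string into words; the result is a 3-by-1 tokenizedDocument array.
documents = tokenizedDocument(textData);

% Count how often each vocabulary word occurs in each document.
bag = bagOfWords(documents);

% Table of the five most frequent words across the whole collection.
topWords = topkwords(bag, 5);

% tf-idf matrix: one row per document, one column per vocabulary word.
M = tfidf(bag);
```

The tokenizedDocument array keeps word order within each document, while the bagOfWords model keeps only per-document counts, so the order-sensitive cleanup steps are usually applied before the model is built.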

Topics

Extract Text Data From Files

This example shows how to extract the text data from text, Microsoft Word, PDF, CSV, and Microsoft Excel files and import it into MATLAB for analysis.
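
A minimal sketch of the plain-text case might look like the following. To stay self-contained it first writes a small temporary file and then reads it back with extractFileText; the file name and its contents are made up for illustration. The same function is listed above as also reading PDF and Microsoft Word files.

```matlab
% Create a small plain text file to read back (hypothetical name and contents).
filename = fullfile(tempdir, 'example_report.txt');
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', 'The pump failed after the motor overheated.');
fclose(fid);

% Read the whole file into a single string.
str = extractFileText(filename);

% Tokenize the extracted text so it is ready for preprocessing.
documents = tokenizedDocument(str);

% Remove the temporary file again.
delete(filename);
```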

Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.
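
One way such a function could be assembled from the functions listed above is sketched here. The function name preprocessText, the sample sentences, and the length thresholds (2 and 15 characters) are assumptions chosen for illustration; suitable steps and thresholds depend on the data.

```matlab
% Made-up sample input (illustrative only).
textData = [
    "The QUICK brown fox jumped over the lazy dogs!"
    "Operators reported overheating in the assembly lines."];

documents = preprocessText(textData);

% Local function (hypothetical name). In a script file, local functions
% must appear after the code that calls them.
function documents = preprocessText(textData)
    % Convert to lowercase so "The" and "the" count as the same word.
    cleanTextData = lower(textData);
    % Split the text into words.
    documents = tokenizedDocument(cleanTextData);
    % Erase punctuation from the tokens.
    documents = erasePunctuation(documents);
    % Remove common stop words such as "the" and "in".
    documents = removeWords(documents, stopWords);
    % Drop words with 2 or fewer characters and words with 15 or more.
    documents = removeShortWords(documents, 2);
    documents = removeLongWords(documents, 15);
    % Reduce words to their stems.
    documents = normalizeWords(documents);
end
```

Wrapping the steps in a single function makes it easier to apply exactly the same cleanup to training data and to any new text processed later.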

Create Simple Text Model for Classification

This example shows how to train a simple text classifier on word frequency counts using a bag-of-words model.
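
The overall workflow can be sketched roughly as follows, assuming Statistics and Machine Learning Toolbox is available for fitcecoc and predict. The four training sentences and their labels are toy data invented for illustration; a data set this small only demonstrates the mechanics and says nothing about real accuracy.

```matlab
% Toy training data (made up): short reports, each with a category label.
textTrain = [
    "pump motor overheated and stopped"
    "motor bearing failure caused overheating"
    "software update failed to install"
    "login screen froze after software update"];
labelsTrain = categorical(["mechanical"; "mechanical"; "software"; "software"]);

% Tokenize the training text and build a bag-of-words model from it.
documentsTrain = tokenizedDocument(lower(textTrain));
bag = bagOfWords(documentsTrain);

% Use the word counts as predictors: one row per document.
XTrain = bag.Counts;

% Train a classifier with linear learners on the word counts.
mdl = fitcecoc(XTrain, labelsTrain, 'Learners', 'linear');

% Encode new text with the same vocabulary, then predict its label.
textNew = "the motor overheated again";
documentsNew = tokenizedDocument(lower(textNew));
XNew = encode(bag, documentsNew);
predictedLabel = predict(mdl, XNew);
```

Encoding new text with encode and the training bag keeps the columns of XNew aligned with the columns of XTrain; words that are not in the training vocabulary do not contribute to the counts.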
