How to batch convert Microsoft Office Word documents to txt?

I would like to convert a list of Microsoft Word documents to text files so that I can extract the text content later.
I am trying things like
word = actxserver('Word.Application'); %this line is ok
document = word.Documents.Open('F:\mytext.docx'); %this line is ok
invoke(document,'SaveAs',{'test003.txt','wdTXT'}); %this line is not ok;
invoke(document,'SaveAs','test00.doc',1); % this line would be ok but it's not what I need.
but the third line (the SaveAs call with 'wdTXT') is not correct. Any suggestions?
Kind regards
Rodolphe, Zurich, Switzerland

Answers (1)

Image Analyst on 15 Apr 2023
Edited: Image Analyst on 15 Apr 2023
You have both the format and the function wrong. Pass a single filename to .SaveAs2 along with a numeric format code (16 in my demo). How do I know? I recorded a macro in Word, then opened the macro and saw how Word itself did it.
Try this. It worked for me. However, my demo converts from txt to docx. It should be straightforward to reverse it to go from docx to txt (plain text is format 2, wdFormatText, in the list linked in the code), though you may have to deal with pop-ups telling you that some information will be lost.
% Demo by Image Analyst: convert txt files into Microsoft Word .docx documents using the ActiveX server for Windows.
clc; % Clear the command window.
close all; % Close all figures (except those of imtool.)
clear; % Erase all existing variables. Or clearvars if you want.
workspace; % Make sure the workspace panel is showing.
format short g;
format compact;
% Create an instance of the ActiveX server.
word = actxserver('Word.Application');
% Make it visible so we might see any potential error messages.
word.visible = true;
% Process a sequence of text files, converting them to .docx files.
folder = pwd;
filePattern = fullfile(folder, '*.txt'); % Adapt as needed to whatever you want.
fileList = dir(filePattern); % Create a list of files to process.
% Get all filenames into one cell array. Filenames have the complete path (folder prepended).
allFileNames = fullfile(folder, {fileList.name});
numFiles = numel(allFileNames); % Count the files.
for k = 1 : numFiles
    % Get this file name.
    inputFullFileName = allFileNames{k};
    % Get the different file parts.
    [folder, baseFileNameNoExt, ext] = fileparts(inputFullFileName);
    fprintf('\nProcessing "%s"\n', inputFullFileName);
    % Open the text file in Word.
    document = word.Documents.Open(inputFullFileName);
    % If it hangs here, check for a (possibly hidden) dialog box asking you
    % to open the file as a copy because it's already open.
    % Also open the Task Manager with Ctrl-Shift-Esc and see if "Microsoft Word"
    % is in either the "Apps" list or "Background Processes" list.
    % If it is, end the process(es).
    fprintf(' Success opening "%s"\n', baseFileNameNoExt);
    % Create the output file name.
    outputFullFileName = fullfile(folder, [baseFileNameNoExt, '.docx']);
    fprintf(' Converting to "%s"\n', outputFullFileName);
    % Save it to a .docx format file. Recording a macro in Word shows this:
    % ActiveDocument.SaveAs2 FileName:="Calcite.docx", FileFormat:= _
    %     wdFormatXMLDocument, LockComments:=False, Password:="", AddToRecentFiles _
    %     :=True, WritePassword:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts _
    %     :=False, SaveNativePictureFormat:=False, SaveFormsData:=False, _
    %     SaveAsAOCELetter:=False, CompatibilityMode:=15
    % Get formats here: https://learn.microsoft.com/en-us/office/vba/api/word.wdsaveformat
    document.SaveAs2(outputFullFileName, 16); % 16 (wdFormatDocumentDefault) means the default DOCX format.
    % Close this particular document but not the Word server.
    document.Close;
end
% Delete the ActiveX server.
delete(word);
fprintf('Done processing %d files.\n', numFiles);
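
For the direction asked about in the question (docx to txt), reversing the demo might look roughly like the sketch below. It is untested across Word versions and makes a few assumptions that are not in the demo above: the function name docx2txtBatch and the helpers txtNameFor and quitWord are made up for illustration; the format code 2 is wdFormatText from the WdSaveFormat list linked in the demo; setting DisplayAlerts to 'wdAlertsNone' is assumed to be enough to suppress the "information will be lost" pop-ups; and files whose names start with "~$" are assumed to be Word's temporary lock files and are skipped. Pass an absolute folder, because Word resolves relative paths against its own working directory, not MATLAB's.

```matlab
function docx2txtBatch(folder)
% DOCX2TXTBATCH  Save every .docx file in FOLDER as plain text.
% Sketch only: requires Windows with Microsoft Word installed.
% The function name and both helpers below are hypothetical.
if nargin < 1
    folder = pwd; % Default to the current folder (an absolute path).
end
fileList = dir(fullfile(folder, '*.docx'));
% Assumption: names starting with "~$" are Word lock files, so skip them.
fileList = fileList(~startsWith({fileList.name}, '~$'));

wdFormatText = 2; % WdSaveFormat code for plain text.

word = actxserver('Word.Application');
% Quit Word and release the server even if an error interrupts the loop.
cleanupWord = onCleanup(@() quitWord(word));
% Assumption: this suppresses the conversion warnings mentioned above.
word.DisplayAlerts = 'wdAlertsNone';

for k = 1 : numel(fileList)
    inputFullFileName = fullfile(folder, fileList(k).name);
    outputFullFileName = txtNameFor(inputFullFileName);
    fprintf('Converting "%s" to "%s"\n', inputFullFileName, outputFullFileName);
    document = word.Documents.Open(inputFullFileName);
    % Single full filename plus a numeric format code, as in the demo.
    document.SaveAs2(outputFullFileName, wdFormatText);
    document.Close(0); % 0 = wdDoNotSaveChanges.
end
fprintf('Done processing %d files.\n', numel(fileList));
end

function outName = txtNameFor(inputFullFileName)
% Same folder and base name as the input, with a .txt extension.
[folder, baseName] = fileparts(inputFullFileName);
outName = fullfile(folder, [baseName, '.txt']);
end

function quitWord(word)
% Close the Word application, then release the COM object.
word.Quit;
delete(word);
end
```

Call it as docx2txtBatch('F:\myfolder') with whatever folder holds your documents. Giving SaveAs2 a full path matters: a bare name such as 'test003.txt' is saved relative to Word's current folder, which is usually not where MATLAB is working.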
