Extract Text from Multiple PDF files across Multiple Folders

Extracts the text from PDF documents that are saved across many folders
Updated 10 Sep 2019

Hello! I put this code together because my company had a backlog of individual quotes that we wanted to store in a single Excel file. I'm sharing it because we did not wish to pay for a conventional PDF text reader.

This code was put together primarily from two sources. I recommend reading the first link, because you must download the toolbox it provides for this code to work.
Extract text from single document*: https://www.mathworks.com/matlabcentral/fileexchange/19798-extract-text-from-a-pdf-document
Open files in multiple folders: https://www.mathworks.com/matlabcentral/answers/245959-how-to-read-text-files-from-different-sub-folders-in-a-folder

1. Download the code
2. Insert the file path for your PDFBox download (line 10)
3. Pre-allocate a number of cells for an approximate number of PDFs you are trying to read (line 15)
4. Change your output text file name if you wish (line 99)
5. The default is for all PDF files in the chosen directory to be read. If you only wish to open files with a certain file pattern, adjust line 48.
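
As a rough illustration of steps 2, 3 and 5, the setup could look something like the sketch below. This is not the submitted script: the variable names (pdfboxJar, rootDir, filePattern) are made up for the example, and the line numbers quoted above refer to the downloaded code, not to this snippet. The '**' wildcard in dir needs R2016b or later.

```matlab
% Illustrative settings; these names are assumptions, not taken from the original script.
pdfboxJar   = '';        % full path to the PDFBox .jar you downloaded (step 2)
rootDir     = pwd;       % top-level folder that holds the PDF subfolders
filePattern = '*.pdf';   % narrow this, e.g. 'Quote*.pdf', to read only some files (step 5)

% Put the PDFBox classes on MATLAB's dynamic Java class path.
if ~isempty(pdfboxJar)
    javaaddpath(pdfboxJar);
end

% List matching files in rootDir and every subfolder below it.
listing = dir(fullfile(rootDir, '**', filePattern));
listing = listing(~[listing.isdir]);   % drop folders whose names happen to match

% Build the full path of each file.
pdfPaths = arrayfun(@(f) fullfile(f.folder, f.name), listing, ...
    'UniformOutput', false);

% Pre-allocate one cell per PDF for the extracted text (step 3).
texts = cell(numel(pdfPaths), 1);

fprintf('Found %d file(s) matching %s\n', numel(pdfPaths), filePattern);
```

Counting the files first means the pre-allocation in step 3 no longer has to be an estimate.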

General heads-ups:
a. I'm not a prolific coder, so there's certainly some junk code in there!
b. Some users claim the PDF reader code doesn't work for them. It worked well for me the first time.
c. If you wish to write a separate text file for each PDF, move the file-write step into the loop above it
d. This does not process password-protected files
e. PDFBox will throw Java errors when you run this. You can ignore them.
f. I was not successful reading all my files. ~5% of them triggered the try/catch statement. Let me know if you figure out why!
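
For note f, one way to start working out why some files fail is to record the error message for each one instead of silently skipping it. The sketch below is only an assumption about how that could be done. readPdfText is a hypothetical stand-in (here it just calls fileread so the snippet runs without the toolbox), and the input list is a dummy; swap in the toolbox's reader function and your own list of paths.

```matlab
% Hypothetical stand-in for the toolbox's reader function; replace it with the real call.
% fileread only returns the raw file contents, it does not extract PDF text.
readPdfText = @(pdfPath) fileread(pdfPath);

% Dummy input list for the example; in practice this comes from the folder search.
pdfPaths = {fullfile(tempdir, 'this_file_does_not_exist.pdf')};

texts  = cell(numel(pdfPaths), 1);    % extracted text, one cell per file
failed = false(numel(pdfPaths), 1);   % true where reading raised an error
errors = cell(numel(pdfPaths), 1);    % error message for each failed file

for k = 1:numel(pdfPaths)
    try
        texts{k} = readPdfText(pdfPaths{k});
    catch err
        % Keep going, but remember which file failed and what the error said.
        failed(k) = true;
        errors{k} = err.message;
        texts{k}  = '';
    end
end

% Summarise the failures so that patterns are easier to spot.
for k = find(failed).'
    fprintf('Could not read %s: %s\n', pdfPaths{k}, errors{k});
end
fprintf('%d of %d file(s) failed\n', nnz(failed), numel(pdfPaths));
```

Grouping the recorded messages may show whether the failures share a cause, such as the password protection mentioned in note d.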

*Note: there are newer versions of the toolbox than the one linked here. I cannot confirm whether they are compatible or work better, although users appear to have had success with them.

Cite As

Samuel Veith (2024). Extract Text from Multiple PDF files across Multiple Folders (https://www.mathworks.com/matlabcentral/fileexchange/72706-extract-text-from-multiple-pdf-files-across-multiple-folders), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2019a
Compatible with any release
Platform Compatibility
Windows macOS Linux
Version Published Release Notes
1.0.0