4.0 | 13 ratings | 43 Downloads (last 30 days) | File Size: 164 KB | File ID: #19798 | Version: 1.0
Extract text from a PDF document

02 May 2008

(if you are lucky)

File Information
Description

This submission uses the PDFTextStripper class from Ben Litchfield's PDFBox Java library to extract text from a PDF document. To try it:
1. Download PDFBox library from http://sourceforge.net/projects/pdfbox/
2. Download FontBox library from http://sourceforge.net/projects/fontbox/
3. Modify the file paths in pdfParseDemo.m
4. Enable cell mode and step through pdfParseDemo.m

The code does not handle files whose 'Content Copying' permission is protected by a password; collaboration to remedy the issue is enthusiastically welcomed!
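
The steps above can be condensed into a small helper. This is a minimal sketch, not the code of pdfParseDemo.m: the function name extractPdfText is hypothetical, and the class names assume a PDFBox 1.8.x jar (the org.apache.pdfbox package names reported in the comments for pdfbox-1.8.12). Other PDFBox releases may use different package names or load methods. It also assumes both the PDFBox and FontBox jars were already added with javaaddpath.

```matlab
function pdfstr = extractPdfText(pdfFile)
% extractPdfText  Sketch: return the text of a PDF file as a char array.
%   Hypothetical helper, not part of the submission. Assumes the PDFBox
%   and FontBox 1.8.x jars are already on the Java class path.
%   Pass an absolute path: Java may resolve relative paths against a
%   different working directory than MATLAB's pwd.

% Fail early with a clear message if the file is not there.
if exist(pdfFile, 'file') ~= 2
    error('extractPdfText:fileNotFound', 'File not found: %s', pdfFile);
end

% Load the document directly instead of constructing an empty
% PDDocument first (assumed 1.8.x API: static PDDocument.load).
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(pdfFile);

% Close the document when this function exits, even if getText throws;
% this avoids the "You did not close the PDF Document" warning.
closer = onCleanup(@() pdfdoc.close());

% Extract the text and trim trailing whitespace.
reader = org.apache.pdfbox.util.PDFTextStripper;
pdfstr = deblank(char(reader.getText(pdfdoc)));
end
```

A call might look like pdfstr = extractPdfText('C:\docs\report.pdf'), where the path is only a placeholder. The onCleanup object is used here so that the close call does not depend on remembering it at every exit point.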

MATLAB release MATLAB 7.4 (R2007a)
MATLAB Search Path
/
Comments and Ratings (18)
29 Jun 2016 Rakshith Mukunda Rao

Works great with the modifications by Klaus Villforth. Big thumbs up to Dimitri Shvorob. Now I will move on to table data extraction from PDFs (I know that's an uphill task).

13 Jun 2016 Fabricio Pereira  
03 May 2016 Klaus Villforth

Works well with pdfbox-1.8.12.jar and fontbox-1.8.12.jar by setting
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument;
reader = org.apache.pdfbox.util.PDFTextStripper;

Since pdfbox needs fontbox, call javaaddpath for both libraries first.

Close the file with pdfdoc.close in order to prevent the warning: "You did not close a PDF Document"
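
A quick way to check this setup before running the demo is sketched below. The jar file names and their location in the current folder are assumptions; adjust them to wherever the libraries were saved. The class names are the 1.8.x names given in this comment. If a class is reported as not found, MATLAB will raise the 'Undefined variable "org"' error when the demo tries to use it.

```matlab
% Hypothetical jar locations; adjust to wherever the libraries were saved.
jars = {fullfile(pwd, 'pdfbox-1.8.12.jar'), ...
        fullfile(pwd, 'fontbox-1.8.12.jar')};

% Add only jars that exist on disk and are not yet on the dynamic class path.
onPath = javaclasspath('-dynamic');
for k = 1:numel(jars)
    if exist(jars{k}, 'file') == 2 && ~any(strcmp(jars{k}, onPath))
        javaaddpath(jars{k});
    end
end

% exist(name, 'class') returns 8 when MATLAB can resolve the class name.
required = {'org.apache.pdfbox.pdmodel.PDDocument', ...
            'org.apache.pdfbox.util.PDFTextStripper'};
missing = required(cellfun(@(c) exist(c, 'class') ~= 8, required));

if isempty(missing)
    disp('PDFBox classes found.');
else
    fprintf('Class not found: %s\n', missing{:});
end
```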

26 Apr 2016 Martin Pitt-Bradley

This looks extremely promising, but I am encountering the same error as Dogancan

19 Jan 2016 dogancan yüksel

Hi, I can't extract text from a PDF document:
Error in pdfParseDemo (line 12)
pdfdoc = org.pdfbox.pdmodel.PDDocument;

Undefined variable "org" or function "org.pdfbox.pdmodel.PDDocument".

What could the problem be? I don't know how to use Java.

Comment only
05 Jan 2016 Alba Schafer

Thanks for the submission.

But I have an error with this line:

Error in ==> Untitled at 20
pdfstr = reader.getText(pdfdoc) %#ok

It gives no more information, what could the problem be?

Thanks

Comment only
01 Jan 2015 Shahar

Thanks for the great submission.

If you want to get rid of the annoying 'You did not close the PDF Document' warning,
make sure you close the PDF after getting the final pdfstr, with pdfdoc.close (see below).

%% text 'unpadded'
pdfstr = deblank(pdfstr) %#ok

%% close [ADDED]
pdfdoc.close;

24 Aug 2014 azizullah khan

It causes a problem for me... An error occurred:
??? Java exception occurred:
java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser

at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)

at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)

at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)

at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)

Error in ==> Untitled at 20
pdfstr = reader.getText(pdfdoc) %#ok

java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

Kindly fix it for me...thanks

Comment only
30 Oct 2013 Holden

There is a really simple yet robust tool for extracting highlights and notes from your PDF files available at: http://www.sumnotes.net . Not only does it support various advanced features like selective extraction or predictive extraction, but it also allows you to save extracted highlights into TXT or DOC files. All desktop browsers and operating systems are supported. We are in the cloud, so no installation is needed. And yes, it is free. Try it out.

17 Sep 2013 Quan Wang

Nice work. It would be better if you can handle the java warnings. For example, you have "pdfdoc" variable defined for different tasks. You should use different variables. Also, you need to close the java object in your demo.

23 May 2013 Jud

This is a decent program, but if you are using Linux, there is a MUCH simpler way to accomplish the exact same thing.

Install the program "pdftotext", then use it inside of Matlab to convert a PDF to a text file. Then read in the text file. Here's how it might look:

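inputPDF = 'test.pdf';
outputfile = 'output.txt';
% Quote the file names so that paths containing spaces survive the shell
cmd = sprintf('pdftotext -raw "%s" "%s"', inputPDF, outputfile);
status = system(cmd);            % a nonzero status means pdftotext failed
if status ~= 0, error('pdftotext failed'); end
fid = fopen(outputfile);
% Read the file into a cell array, one entry per line of text
alltext = textscan(fid,'%s','Delimiter','\n');
fclose(fid);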

Comment only
23 May 2013 Jud

09 Nov 2012 Ergina

It worked fine for me, however how can I extract color information for the characters in pdf?

26 Jul 2011 mathworks2011

V Poor. Does not work.

The author even notes it does not work inside the m-file!

java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

02 Nov 2010 Brahim HAMADICHAREF

Update to the most recent version, 1.3.1; the error messages then
disappear when you use
pdfDoc.getDocument().close;

23 Jun 2010 Jonatan Olofsson

Apparently, pdfbox creates two objects, one on instantiation and one on loading. I got rid of the warning mentioned in the code by using "pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(filename)" directly (note the added ".apache", needed since newer releases of pdfbox).

Also, after the pdfdoc variable is created (inside a try..catch), "pdfdoc.close()" must also be called.

10 May 2008 Zach Koval

I am lucky, I guess. Worked ok, except the warning that Dimitri mentions.

05 May 2008 Dimitri Shvorob

See also this submission:
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=17839&objectType=file

Comment only
Updates
04 Apr 2016 1.0

BSD
