
Rating: 3.8 (4 ratings) | 30 downloads (last 30 days) | File Size: 164.1 KB | File ID: #19798

Extract text from a PDF document

by Dimitri Shvorob

 

02 May 2008 (Updated 05 May 2008)

(if you are lucky)


File Information
Description

The submission calls the PDFTextStripper class from Ben Litchfield's PDFBox Java library to extract text from a PDF document.

1. Download the PDFBox library from http://sourceforge.net/projects/pdfbox/
2. Download the FontBox library from http://sourceforge.net/projects/fontbox/
3. Modify the file paths in pdfParseDemo.m to match your system
4. Enable cell mode and step through pdfParseDemo.m
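
The steps above can be sketched roughly as follows. This is not the contents of pdfParseDemo.m, only a minimal illustration: the jar locations and the input file name are hypothetical, and the package names (org.apache.pdfbox...) assume a PDFBox 1.x release. Older releases use org.pdfbox instead, and the class layout may differ in other versions.

```matlab
% Hypothetical jar locations; replace with wherever you saved the downloads.
pdfboxJar  = 'C:\libs\pdfbox.jar';
fontboxJar = 'C:\libs\fontbox.jar';

% Put both libraries on MATLAB's dynamic Java class path.
javaaddpath(pdfboxJar);
javaaddpath(fontboxJar);

% Hypothetical input file.
pdfFile = 'C:\docs\sample.pdf';

% Assumption: PDFBox 1.x package names.
pdfDoc   = org.apache.pdfbox.pdmodel.PDDocument.load(pdfFile);
stripper = org.apache.pdfbox.util.PDFTextStripper();

% getText returns the document text; char() makes sure we hold a MATLAB string.
pdfText = char(stripper.getText(pdfDoc));

% Close the document explicitly once the text has been extracted.
pdfDoc.close();

disp(pdfText);
```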

The code does not handle files whose 'Content Copying' permission is protected by a password; collaboration to remedy the issue is enthusiastically welcomed!

MATLAB release: MATLAB 7.4 (R2007a)
Comments and Ratings (5)
05 May 2008 Dimitri Shvorob

See also this submission:
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=17839&objectType=file

10 May 2008 Zach Koval

I am lucky, I guess. Worked ok, except the warning that Dimitri mentions.

23 Jun 2010 Jonatan Olofsson

Apparently, PDFBox creates two objects, one on instantiation and one on loading. I got rid of the warning mentioned in the code by calling "pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(filename)" directly (note the added ".apache", which newer releases of PDFBox use).

Also, once the pdfdoc variable has been created (inside a try..catch), "pdfdoc.close()" must be called.
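
One possible way to write the pattern described in this comment is sketched below. The function name extractPdfText is hypothetical and not part of the submission; the sketch assumes the PDFBox and FontBox jars are already on the Java class path and that a PDFBox 1.x release (org.apache.pdfbox package names) is in use.

```matlab
function pdfText = extractPdfText(filename)
% extractPdfText  Hypothetical helper, not part of the original submission.
% Assumes the PDFBox 1.x and FontBox jars are already on the Java class path.

pdfDoc = [];
try
    % Load directly through the static method, as the comment suggests.
    pdfDoc   = org.apache.pdfbox.pdmodel.PDDocument.load(filename);
    stripper = org.apache.pdfbox.util.PDFTextStripper();
    pdfText  = char(stripper.getText(pdfDoc));
    pdfDoc.close();
catch err
    % If the document was opened before the failure, close it before
    % passing the error on to the caller.
    if ~isempty(pdfDoc)
        pdfDoc.close();
    end
    rethrow(err);
end
end
```

Closing in both branches means the document is released whether or not extraction succeeds, which is the condition the "You did not close the PDF Document" warning checks for.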

02 Nov 2010 Brahim HAMADICHAREF

Update to the most recent version, 1.3.1; the error messages then disappear when you use
pdfDoc.getDocument().close;

26 Jul 2011 mathworks2011

Very poor. Does not work.

The author even notes it does not work inside the m-file!

java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

Tag Activity for this File
Tag Applied By Date/Time
data import Dimitri Shvorob 22 Oct 2008 09:59:57
data export Dimitri Shvorob 22 Oct 2008 09:59:57
adobe acrobat Cristina McIntire 05 Feb 2009 13:25:39
pdf parse Cristina McIntire 05 Feb 2009 13:25:39
extract text Cristina McIntire 05 Feb 2009 13:25:39
