No BSD License  

3.6 | 8 ratings | 79 Downloads (last 30 days) | File Size: 164 KB | File ID: #19798

Extract text from a PDF document

02 May 2008 (Updated )

(if you are lucky)

File Information
Description

The submission calls the PDFTextStripper class of Ben Litchfield's PDFBox Java library to extract text from a PDF document.

1. Download the PDFBox library from http://sourceforge.net/projects/pdfbox/
2. Download the FontBox library from http://sourceforge.net/projects/fontbox/
3. Modify the file paths in pdfParseDemo.m
4. Enable cell mode and step through pdfParseDemo.m
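
As a rough illustration of what step 3 involves, the sketch below adds the two downloaded jars to MATLAB's dynamic Java class path and checks that a PDFBox class can be found. The folder and jar file names are hypothetical, and the `org.apache.pdfbox` package prefix is an assumption that holds for newer PDFBox releases only (older ones used `org.pdfbox`); the actual contents of pdfParseDemo.m may differ.

```matlab
% Hypothetical folder and jar names; adjust them to match your downloads.
libDir = 'C:\pdfbox';
jars = {fullfile(libDir, 'pdfbox.jar'), fullfile(libDir, 'fontbox.jar')};

% Add each jar to the dynamic Java class path unless it is already there.
for k = 1:numel(jars)
    if ~any(strcmp(javaclasspath('-dynamic'), jars{k}))
        javaaddpath(jars{k});
    end
end

% exist(..., 'class') returns 8 when the name resolves to a class.
% Assumption: the release in use lives under org.apache.pdfbox.
if exist('org.apache.pdfbox.pdmodel.PDDocument', 'class') ~= 8
    warning('PDFBox classes were not found on the Java class path.');
end
```

If the warning fires, the jar paths or the package prefix probably do not match the installed PDFBox release.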

The code does not handle files whose 'Content Copying' permission is protected by a password; collaboration to remedy the issue is enthusiastically welcomed!

MATLAB release: MATLAB 7.4 (R2007a)
Comments and Ratings (10)
30 Oct 2013 Holden

There is a really simple yet robust tool for extracting highlights and notes from your PDF files, available at http://www.sumnotes.net. Not only does it support various advanced features like selective extraction or predictive extraction, but it also allows you to save extracted highlights into TXT or DOC files. All desktop browsers and operating systems are supported. We are in the cloud, so no installation is needed. And yes, it is free. Try it out.

17 Sep 2013 Quan Wang

Nice work. It would be better if you could handle the Java warnings. For example, you have the "pdfdoc" variable defined for different tasks; you should use different variables. Also, you need to close the Java object in your demo.

23 May 2013 Jud

This is a decent program, but if you are using Linux, there is a MUCH simpler way to accomplish the exact same thing.

Install the program "pdftotext", then use it inside of Matlab to convert a PDF to a text file. Then read in the text file. Here's how it might look:

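inputPDF = 'test.pdf';
outputfile = 'output.txt';
% Quote the file names so that paths containing spaces survive the shell
cmd = ['pdftotext -raw "', inputPDF, '" "', outputfile, '"'];
status = system(cmd);
if status ~= 0
    error('pdftotext failed with exit status %d', status);
end
fid = fopen(outputfile);
alltext = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
alltext = alltext{1};  % unwrap: one line of text per cell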

09 Nov 2012 Ergina

It worked fine for me. However, how can I extract color information for the characters in a PDF?

26 Jul 2011 mathworks2011

Very poor. Does not work.

The author even notes inside the m-file that it does not work!

java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

02 Nov 2010 Brahim HAMADICHAREF

Update to the most recent version (1.3.1); the error messages then disappear when you use
pdfDoc.getDocument().close;

23 Jun 2010 Jonatan Olofsson

Apparently, PDFBox creates two objects, one on instantiation and one on loading. I got rid of the warning mentioned in the code by calling "pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(filename)" directly (note the added .apache, which is part of the package name in newer releases of PDFBox).

Also, once the pdfdoc variable has been created (inside a try..catch), "pdfdoc.close()" must be called.
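
The load-and-close pattern described in this comment might look roughly like the sketch below. `extractPdfText` is a hypothetical helper name, not part of the submission. It assumes the PDFBox and FontBox jars are already on the Java class path and that the release in use exposes `PDFTextStripper` under `org.apache.pdfbox.util`; other releases may use a different package.

```matlab
function txt = extractPdfText(filename)
% extractPdfText  Hypothetical helper: return the text of a PDF as a char array.
% Assumes PDFBox is on the Java class path under the org.apache.pdfbox prefix.

% Load the document directly, as suggested in the comment above.
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(filename);
try
    % Assumption: PDFTextStripper lives in org.apache.pdfbox.util here.
    stripper = org.apache.pdfbox.util.PDFTextStripper();
    txt = char(stripper.getText(pdfdoc));
catch err
    % Close the document even when extraction fails, then report the error.
    pdfdoc.close();
    rethrow(err);
end
% Closing the document avoids the "You did not close the PDF Document" warning.
pdfdoc.close();
end
```

It could then be called as `txt = extractPdfText('test.pdf');`, where `test.pdf` is a placeholder file name.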

10 May 2008 Zach Koval

I am lucky, I guess. Worked OK, except for the warning that Dimitri mentions.

05 May 2008 Dimitri Shvorob

See also this submission:
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=17839&objectType=file
