Extract text from a PDF document

version 1.0 (164 KB)

(if you are lucky)

3.93333
15 Ratings

45 Downloads

The submission calls the PDFTextStripper class of Ben Litchfield's PDFBox Java library to extract text from a PDF document.
1. Download the PDFBox library from http://sourceforge.net/projects/pdfbox/
2. Download the FontBox library from http://sourceforge.net/projects/fontbox/
3. Modify the file paths in pdfParseDemo.m
4. Enable cell mode and step through pdfParseDemo.m

The code does not handle files whose 'Content Copying' permission is protected by a password; collaboration to remedy the issue is enthusiastically welcomed!

Comments and Ratings (20)

Harivinod N

Gauvain

Hello!
I only use part of this code to extract data from the PDF, and I added a "pdfdoc.close" to avoid the error message. Here is the code I am using:

function [pdfstr] = pdftotext(filename)

% filename is the path to the pdf file

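%% create the document and text-stripper objects
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument;
reader = org.apache.pdfbox.util.PDFTextStripper;
pdfdoc.close; % close the empty document created above so it does not trigger the warning
%% load the file (load is a static method and returns a new document object)
pdfdoc = pdfdoc.load(filename);
pdfdoc.isEncrypted; % result is discarded here; assign it to a variable to test for encryption

%% text, with plenty of padding
pdfstr = reader.getText(pdfdoc);

pdfdoc.close; % close the loaded document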

end

Finally, I write pdfstr to a text file using this short code:

% write pdfstr in the file text.txt
fid = fopen('text.txt','w');
fprintf(fid,'%s',pdfstr);
fclose(fid); %close text.txt

Hope this will help some of you!

Thank you Dimitri for this code!

Works great with the modifications by Klaus Villforth. Big thumbs up to Dimitri Shvorob. Now I will move on to table data extraction from PDFs (I know that's an uphill task).

Works well with pdfbox-1.8.12.jar and fontbox-1.8.12.jar by setting
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument;
reader = org.apache.pdfbox.util.PDFTextStripper;

Since pdfbox needs fontbox, call javaaddpath for both libraries first.

Close the file with pdfdoc.close in order to prevent the warning: "You did not close a PDF Document"
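
The setup described above might be sketched as follows. This is a minimal sketch, assuming the PDFBox and FontBox 1.8.x jars; the jar locations and the file name example.pdf are hypothetical placeholders, not paths taken from the submission.

```matlab
% Hypothetical jar locations: adjust to wherever the downloads were saved.
% Add both jars in one call, and do it before creating other variables,
% since MATLAB may run "clear java" when the dynamic class path changes.
javaaddpath({fullfile(pwd, 'pdfbox-1.8.12.jar'), ...
             fullfile(pwd, 'fontbox-1.8.12.jar')});

filename = 'example.pdf';   % hypothetical input file

% Load the document through the static method, then extract its text.
pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(java.io.File(filename));
reader = org.apache.pdfbox.util.PDFTextStripper;
pdfstr = char(reader.getText(pdfdoc));   % make sure the result is a MATLAB char array

% Close the document to avoid the "did not close" warning.
pdfdoc.close();
```

Loading through the static method avoids creating a separate empty PDDocument that would also need closing.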

This looks extremely promising, but I am encountering the same error as Dogancan.

Hi, I can't extract text from a PDF document.
Error in pdfParseDemo (line 12)
pdfdoc = org.pdfbox.pdmodel.PDDocument;

Undefined variable "org" or function "org.pdfbox.pdmodel.PDDocument".

What could the problem be? I don't know how to use Java.
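
One way to narrow down this kind of error is to check whether MATLAB can see the class at all. The sketch below assumes only what this thread reports: the demo uses the org.pdfbox package prefix, while newer PDFBox releases use org.apache.pdfbox. The variable names are illustrative.

```matlab
% Candidate class names: the old package prefix (used by the demo) and the newer one.
candidates = {'org.pdfbox.pdmodel.PDDocument', ...
              'org.apache.pdfbox.pdmodel.PDDocument'};

% exist(name, 'class') returns 8 when MATLAB can resolve the name as a class.
found = cellfun(@(name) exist(name, 'class') == 8, candidates);

if any(found)
    fprintf('Available: %s\n', candidates{found});
else
    % Neither class is visible, so the jar is probably not on the Java class path.
    fprintf('No PDFBox class found. Dynamic class path entries:\n');
    disp(javaclasspath('-dynamic'));
end
```

If neither name is found, adding the jars with javaaddpath is the likely fix. If only the org.apache name is found, the package prefix in the demo needs updating.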

Alba Schafer

Thanks for the submission.

But I have an error with this line:

Error in ==> Untitled at 20
pdfstr = reader.getText(pdfdoc) %#ok

It gives no more information, what could the problem be?

Thanks

Shahar

Thanks for the great submission.

If you want to get rid of the annoying 'You did not close the PDF Document' warning,
make sure you close the PDF after getting the final pdfstr, with pdfdoc.close (see below).

%% text 'unpadded'
pdfstr = deblank(pdfstr) %#ok

%% close [ADDED]
pdfdoc.close;

It causes a problem for me. An error occurred:
??? Java exception occurred:
java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser

at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)

at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)

at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)

at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)

Error in ==> Untitled at 20
pdfstr = reader.getText(pdfdoc) %#ok

java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

Kindly fix it for me. Thanks.

Holden

There is a really simple yet robust tool for extracting highlights and notes from your PDF files available at: http://www.sumnotes.net . Not only does it support various advanced features like selective extraction or predictive extraction, but it also allows you to save extracted highlights into TXT or DOC files. All desktop browsers and operating systems are supported. We are in the cloud, so no installation is needed. And yes, it is free. Try it out.

Quan Wang

Nice work. It would be better if you could handle the Java warnings. For example, you have the "pdfdoc" variable defined for different tasks. You should use different variables. Also, you need to close the Java object in your demo.

Jud

This is a decent program, but if you are using Linux, there is a MUCH simpler way to accomplish the exact same thing.

Install the program "pdftotext", then use it inside of Matlab to convert a PDF to a text file. Then read in the text file. Here's how it might look:

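inputPDF = 'test.pdf';
outputfile = 'output.txt';
% Quote both paths so file names containing spaces survive the shell.
cmd = sprintf('pdftotext -raw "%s" "%s"', inputPDF, outputfile);
status = system(cmd);
if status ~= 0, error('pdftotext failed with exit status %d', status); end
fid = fopen(outputfile);
alltext = textscan(fid,'%s','Delimiter','\n'); % alltext{1} holds one cell per line
fclose(fid);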

Jud

Ergina

It worked fine for me. However, how can I extract color information for the characters in a PDF?

mathworks2011

Very poor. Does not work.

The author even notes it does not work inside the m-file!

java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

Update to the most recent version, 1.3.1. The error messages
then disappear when you use
pdfDoc.getDocument().close;

Apparently, PDFBox creates two objects, one on instantiation and one on loading. I got rid of the warning mentioned in the code by calling "pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(filename)" directly (note the added .apache, which newer releases of PDFBox use).

Also, after the pdfdoc variable is created (inside a try..catch), "pdfdoc.close()" must also be called.
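
The load-then-always-close pattern described here might look like the sketch below. It assumes a PDFBox 1.8.x jar is already on the Java class path; the file name example.pdf and the opened flag are illustrative, not part of the submission.

```matlab
filename = 'example.pdf';   % hypothetical input file
opened = false;             % tracks whether a document needs closing

try
    % Static load: no separate empty PDDocument is instantiated first.
    pdfdoc = org.apache.pdfbox.pdmodel.PDDocument.load(java.io.File(filename));
    opened = true;
    reader = org.apache.pdfbox.util.PDFTextStripper;
    pdfstr = char(reader.getText(pdfdoc));
catch err
    % Close the document if it was opened before the failure, then rethrow.
    if opened
        pdfdoc.close();
    end
    rethrow(err);
end

% Normal path: close the document once the text has been extracted.
pdfdoc.close();
```

Closing in both the error path and the normal path keeps the finalizer warning from appearing even when text extraction fails partway.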

Zach Koval

I am lucky, I guess. Worked OK, except for the warning that Dimitri mentions.

Updates

1.0

BSD

MATLAB Release
MATLAB 7.4 (R2007a)
