How to read a PDF file in MATLAB?

I want to read a PDF file, make some changes to it, and then save the result to Excel. I have tried my best but fail every time. I need your help; any effort will be greatly appreciated. Thanks in advance.

20 Comments

What kind of changes do you want to make to the PDF that you wish to then save to Excel? What is the code that you have written so far?
I want to capture some data, and I haven't written any code so far. My first step is to read the PDF file. Thanks for the comments.
Geoff Hayes, thanks for the comments. Please just give me a clue as to how I can read PDF files. I am waiting for your response.
azizullah - I noticed that you looked at Dimitri Shvorob's extract text from PDF on the MATLAB File Exchange, but you had some problems with it. Did you download the two libraries that are needed for this submission, and modify the pdfParseDemo.m file as per the author's instructions?
One of the comments on the above submission indicates that there is a utility called pdftotext that you may be able to call from within the MATLAB code. Have you looked into this?
What is your goal with this? It might be that Matlab is not the best tool for this.
Yes, I have done what was required, but pdfParseDemo still gives me a problem.
Thanks, Jose-Luis. My goal is to capture data from PDF files and save the captured data to Excel.
Is there just one PDF file, or several? What data in particular are you looking for in the pdf - a table of numeric data, some text, or ..?
Why go through MATLAB at all? Use Excel directly. A quick Google search will tell you how to import PDFs into Excel.
I have thousands of PDF files to get data from, and doing it manually is very difficult. That is why I am using MATLAB. Thanks.
Have you considered using pdftotext? Or any other converter, to HTML for example? Supposing that you are able to convert the file to text, what would you be looking in it for? Is there just one page of data that you need or one line from each page or..?
You might want to provide an example of a PDF that you wish to extract data from, and indicate which data in the file you want.
@azizullah khan: You wrote "but pdfParsedemo makes a problem with me...". Please explain the problems. Your question is much too vague to be answered efficiently.
The problem with pdfParseDemo: when I run the code, the following error appears:
??? Java exception occurred:
java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
Error in ==> Untitled at 20
pdfstr = reader.getText(pdfdoc) %#ok
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
Geoff Hayes: I have attached the PDF file that I want to read in order to extract account info and some other data. Please explain whether this is possible. Thanks.
Azizullah - you did not include an attachment.
As for the error, the AFMParser class is part of the FontBox library. Did you add the FontBox jar file path to your Java class path? I looked at the pdfParseDemo.m script, and while it doesn't have a command to do so, you probably should add one. So if you updated
javaaddpath('M:\My Documents\MATLAB\PDF Exercise\PDFBox-0.7.3\lib\PDFBox-0.7.3.jar')
to the path on your workstation that corresponds to PDFBox-0.7.3.jar (or whatever the jar file is), then you should add an equivalent statement for the FontBox
javaaddpath('whateverYourPathIsTo\FontBox-someVersionIds.jar')
(I don't know what the name of the jar is, so FontBox-someVersionIds.jar is just an example.)
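The suggestion above can be sketched as follows. This is only a rough illustration, assuming the PDFBox 0.7.3 package names that appear in the stack trace (org.pdfbox...). Both jar paths are placeholders, the FontBox jar name is still unknown, and example.pdf is a hypothetical input file. Closing the document explicitly should also avoid the "You did not close the PDF Document" warnings.

```matlab
% Sketch only: both jar paths are placeholders; point them at your own copies.
pdfboxJar  = 'M:\My Documents\MATLAB\PDF Exercise\PDFBox-0.7.3\lib\PDFBox-0.7.3.jar';
fontboxJar = 'whateverYourPathIsTo\FontBox-someVersionIds.jar';  % real jar name unknown
javaaddpath(pdfboxJar);
javaaddpath(fontboxJar);   % provides org.fontbox.afm.AFMParser

% 'example.pdf' is a hypothetical file name.
pdfdoc = org.pdfbox.pdmodel.PDDocument.load('example.pdf');
closeDoc = onCleanup(@() pdfdoc.close());   % close the document even if getText fails

reader = org.pdfbox.util.PDFTextStripper();
pdfstr = char(reader.getText(pdfdoc));      % convert java.lang.String to a char array
```

The onCleanup object closes the document when it is cleared or goes out of scope, so an exception during text extraction does not leave the document open.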
Yes, I did that as required. If there is any way to convert a PDF to Excel from MATLAB, kindly share it with me. For example, could MATLAB load the PDF into another program, have that program convert it to Excel, and get the output back? Is it possible for MATLAB to operate other software? Thanks.
Unfortunately, this is not something that I have considered and so am not aware of any other means of reading the pdf into MATLAB. You could always try the pdftotext program.
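MATLAB can run an external program through its system function, so calling pdftotext might look roughly like the sketch below. It assumes pdftotext is installed and on the system path; the file names are hypothetical.

```matlab
% Sketch only: assumes the pdftotext executable is installed and on the system path.
pdfFile = 'example.pdf';   % hypothetical input file
txtFile = 'example.txt';   % hypothetical output file

% -layout asks pdftotext to keep the physical layout of the page as far as it can,
% which tends to help when the data of interest sits in columns.
cmd = sprintf('pdftotext -layout "%s" "%s"', pdfFile, txtFile);
[status, msg] = system(cmd);
if status ~= 0
    error('pdftotext failed: %s', msg);
end

txt = fileread(txtFile);   % the whole converted file as one char vector
```

From here the text can be searched with regexp or strfind for whatever data is needed.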
Naftali on 15 Jun 2016
Edited: Naftali on 15 Jun 2016
I am no expert, but I could not find a way to read a PDF file into MATLAB. People here talk about text, but a PDF is usually a series of pictures. I open the document in the professional Adobe reader and export the pages, either with File/Save As or with Advanced/Export. This produces a PNG or JPEG file for each page of the document. From there it is easy in MATLAB: loop over the pages with the imread function.
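The loop described above could look something like this minimal sketch. The folder name exported_pages and the PNG extension are assumptions about how the pages were exported.

```matlab
% Sketch only: 'exported_pages' is a hypothetical folder holding one PNG per page.
pageDir = 'exported_pages';
files = dir(fullfile(pageDir, '*.png'));

pages = cell(numel(files), 1);
for k = 1:numel(files)
    % Each page is read as a pixel array, not as text.
    pages{k} = imread(fullfile(pageDir, files(k).name));
end
```

Note that this yields images only; getting text or numbers out of them would need a separate recognition step.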
pdf is effectively a programming language; you need to execute the commands in order to determine what the output is.
Following up with Naftali's comment, there is also a way to convert a PDF to an image file in MATLAB. See: https://www.mathworks.com/matlabcentral/answers/709623-how-can-i-convert-a-scanned-pdf-to-an-image-using-matlab


 Accepted Answer

Christopher Creutzig on 16 Oct 2017
Edited: Walter Roberson on 4 Nov 2017
Just for the record, Text Analytics Toolbox (new in R2017b) includes a function extractFileText that will extract text data from PDF (or MS Word) files.
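For the original goal (thousands of PDF files, captured data saved to Excel), a rough sketch using extractFileText might look like this. It assumes Text Analytics Toolbox is installed; the folder name, the output file name, and the regular expression for the account number are all hypothetical and would need to be adapted to the real documents.

```matlab
% Sketch only: requires Text Analytics Toolbox (R2017b or later).
folder = 'pdfs';                              % hypothetical folder of PDF files
files  = dir(fullfile(folder, '*.pdf'));

names    = strings(numel(files), 1);
accounts = strings(numel(files), 1);          % stays "" when nothing is found

for k = 1:numel(files)
    names(k) = files(k).name;
    txt = char(extractFileText(fullfile(folder, files(k).name)));

    % Hypothetical pattern: the word "Account" followed by the first run of digits.
    tok = regexp(txt, 'Account\D*(\d+)', 'tokens', 'once');
    if ~isempty(tok)
        accounts(k) = tok{1};
    end
end

T = table(names, accounts, 'VariableNames', {'File', 'Account'});
writetable(T, 'extracted.xlsx');              % hypothetical output workbook
```

Collecting everything into one table and writing it once keeps the Excel step out of the loop, which matters when there are thousands of files.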
