MATLAB Answers

How to extract data from a PDF file in MATLAB?

azizullah khan on 19 Sep 2014
Commented: Yue Zhao on 30 Jun 2019
I am looking for an algorithm that extracts data from a PDF file. For example, if the PDF contains the sentence "Account# 29", I want to extract 29 from it. If this is possible with fopen(), please share how. I tried pdftotext without success, and my own attempts with fopen() also failed. A solution based on fopen() would be preferable. Please share your experience with me. Thanks.

  6 Comments

azizullah khan on 19 Sep 2014
Have you tried modifying the PDF file? Have you tried the link I mentioned in my previous comment? I am stuck on extracting data from the PDF. I was using an external converter, but now I want to do it inside MATLAB. Please help me solve this problem. Thanks.
José-Luis on 19 Sep 2014
Yes, I have seen it and it doesn't work. In principle, it might work for trivial purposes like changing the font type, but I have no idea what kind of data you are trying to extract.
Writing a robust algorithm is a tall order.
azizullah khan on 20 Sep 2014
Sir, just give me a clue how this is possible. Suppose the PDF file contains:
Account# 345
I want to capture 345 from it. For example, I could use regexp() to extract only the numbers. Please help me. I have spent a lot of time on this without success; I have wasted almost a month on it. Thanks.
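A minimal sketch of the regexp() idea, assuming the text of the PDF has already been obtained as a character vector by some other means. The sample text and the variable names are made up for illustration; getting the text out of the PDF is the hard part and is not shown here.

```matlab
% Made-up sample text standing in for text already extracted from a PDF.
txt = sprintf('Statement\nAccount# 345\nBalance: 12.50');

% Capture the digits that follow "Account#" (optional whitespace allowed).
tok = regexp(txt, 'Account#\s*(\d+)', 'tokens', 'once');

if isempty(tok)
    accountNumber = NaN;                 % pattern not present in the text
else
    accountNumber = str2double(tok{1});  % convert the captured digits
end
```

With the 'once' option, regexp() returns the tokens of the first match only, so tok{1} holds the captured digits as text.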

Accepted Answer

Jan on 21 Sep 2014
Assume you have a PDF file that displays the string "Account# 345". Several details can impede the extraction of this string:
  • The contents can be compressed and/or encrypted, such that the string cannot be found in clear text inside the file.
  • Even without encryption or compression, the text need not be stored contiguously: in a valid PDF each character can be stored with its own position on the page, so the order in the file does not matter.
Consequently, searching for a string in a PDF is not reliable. This is why OCR software is frequently applied to add a layer containing the contents as searchable strings. But as long as you do not specify any details of your PDFs, we cannot guess whether they contain such strings.
Please note that your problem is not well defined, so suggesting solutions is still guesswork, although you have posted several related questions in this forum. Ultimately, the main problem is that somebody decided to store the data in PDF files, a format that is not well suited to the later extraction of strings. Creating a large and complicated workaround afterwards is inefficient. It would be more stable and faster to obtain the data in a more suitable format, such as a text file.
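To illustrate why a plain fopen() search only works in the simplest case, here is a hedged sketch. It writes a made-up one-line stand-in for an uncompressed PDF content stream to a temporary file, reads the bytes back, and searches them. The sample content and variable names are assumptions for illustration only; in a real PDF the stream is usually compressed or the characters are positioned individually, so the same search would find nothing.

```matlab
% Made-up stand-in for an uncompressed PDF content stream (assumption:
% the string is stored contiguously and in clear text).
sampleFile = [tempname '.pdf'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '%s\n', 'BT /F1 12 Tf 72 720 Td (Account# 345) Tj ET');
fclose(fid);

% Read the whole file as raw bytes and convert to a char row vector.
fid = fopen(sampleFile, 'r');
raw = fread(fid, Inf, 'uint8=>char')';
fclose(fid);
delete(sampleFile);

% Search the raw bytes; this fails for compressed or encrypted content.
tok = regexp(raw, 'Account#\s*(\d+)', 'tokens', 'once');
if isempty(tok)
    rawAccount = NaN;
else
    rawAccount = str2double(tok{1});
end
```

If rawAccount comes back as NaN for a real file, that is consistent with the two obstacles listed above and does not mean the string is absent from the displayed page.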

  5 Comments

azizullah khan on 25 Sep 2014
@José-Luis: Sorry, but to my mind a response is only positive once the problem is solved. Thanks for the comment. Sir, if you can spare some of your precious time, please post code showing how to use OCR in MATLAB and convert the data into cells. My main aims are:
1) Read an original PDF in MATLAB and convert the data into cells.
2) Read a scanned PDF in MATLAB and convert the data into cells.
If you can help me with these two steps, I will have no problem extracting the data from the cells.
Until now I have been using an external PDF-to-Excel converter, but I want to do the conversion inside MATLAB. I have also attached the original PDF. Please help me. I shall be thankful for this help from the core of my heart for the rest of my life. Thanks.
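For step 1, one possible route in newer releases is extractFileText from the Text Analytics Toolbox, which did not exist when this thread started. The sketch below assumes that toolbox is installed and that the PDF has a text layer. The file name statement.pdf is hypothetical, and the fallback text is made up so that the snippet still runs without the toolbox or the file.

```matlab
pdfFile = 'statement.pdf';   % hypothetical file name

if exist('extractFileText', 'file') && exist(pdfFile, 'file')
    % Requires Text Analytics Toolbox; returns the text as a string scalar.
    txt = char(extractFileText(pdfFile));
else
    % Made-up fallback text so the rest of the sketch can run.
    txt = sprintf('Account# 345\nName: Example');
end

% Split into lines, trim whitespace, drop empty lines, store as a column cell array.
textLines = strtrim(regexp(txt, '\r?\n', 'split'));
textLines = textLines(~cellfun(@isempty, textLines))';
```

For step 2, a scanned PDF has no text layer, so this route returns little or nothing. One option is to export each page to an image with an external tool and pass the image to ocr from the Computer Vision Toolbox, whose result exposes the recognized text in its Text property; that text can then be split into cells in the same way.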
Noam Greenboim on 25 May 2015
A possible workaround is to convert the PDF to an Excel file and then import that XLS file into MATLAB. This is a relatively good solution for PDFs that contain tables of data.
If you look at the Excel file first, you might find ideas for how to access the data you are interested in.
Yue Zhao on 30 Jun 2019
We get a matrix when we use imread on a picture. How do we get the matrix of a PDF?
