MATLAB Answers

How can I convert a scanned PDF to an image using MATLAB?

How can I import a scanned PDF into MATLAB and convert it to image files?
I tried to use extractFileText() from Text Analytics Toolbox, but it only works for native PDFs and not scanned PDFs:
>> extractFileText('example.pdf')
ans =
<missing>

Accepted Answer

MathWorks Support Team on 5 Jan 2021
MATLAB ships with the Apache PDFBox Java library, which can import and render PDF files. The following MATLAB function, PDFtoImg(), imports a scanned PDF and saves each page as a separate PNG file:
function images = PDFtoImg(pdfFile)
% Render each page of a PDF at 300 dpi and save it as a PNG file.
% Returns a string array with the full path of every PNG written.
import org.apache.pdfbox.*
import java.io.*
filename = fullfile(pwd,pdfFile);   % pdfFile is resolved relative to the current folder
jFile = File(filename);
document = pdmodel.PDDocument.load(jFile);
pdfRenderer = rendering.PDFRenderer(document);
count = document.getNumberOfPages();
images = strings(1,count);          % preallocate one output name per page
for ii = 1:count
    % PDFBox page indices are zero-based, hence ii-1
    bim = pdfRenderer.renderImageWithDPI(ii-1, 300, rendering.ImageType.RGB);
    images(ii) = filename + "-" + "Page" + ii + ".png";
    tools.imageio.ImageIOUtil.writeImage(bim, char(images(ii)), 300);
end
document.close()
end
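As a rough usage sketch, the function can be called on a PDF in the current folder and the resulting PNG files inspected. This assumes PDFtoImg.m is on the MATLAB path and uses 'example.pdf' as a placeholder file name with at least one page; the variable names are illustrative only.

```matlab
% Usage sketch: 'example.pdf' is a placeholder file in the current folder.
pngFiles = PDFtoImg('example.pdf');

for k = 1:numel(pngFiles)
    % Read the PNG metadata to confirm the file was written
    info = imfinfo(char(pngFiles(k)));
    fprintf('%s: %d x %d pixels\n', char(pngFiles(k)), info.Width, info.Height);
end

% Load the first page as a numeric array and display it
firstPage = imread(char(pngFiles(1)));
image(firstPage)
axis image off
```

The PNG files are written next to the PDF, with "-Page1.png", "-Page2.png", and so on appended to the full PDF file name.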
Notes:
1. Each PDF page must be rendered to its own image. For example, if “example.pdf” contains 13 pages, the function produces 13 images.
2. For subsequent OCR tasks, it is important to render the PDF pages at 300 dpi or higher:
>> bim = pdfRenderer.renderImageWithDPI(ii-1, 300, rendering.ImageType.RGB);
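For the OCR step mentioned in note 2, one possible follow-up is sketched below. It assumes Computer Vision Toolbox is installed (for the ocr function), that PDFtoImg.m is on the path, and that 'example.pdf' is a placeholder file name; it is not part of the original answer.

```matlab
% OCR sketch: requires Computer Vision Toolbox (assumption).
pngFiles = PDFtoImg('example.pdf');       % placeholder input file

pageText = strings(1, numel(pngFiles));   % one text entry per page
for k = 1:numel(pngFiles)
    I = imread(char(pngFiles(k)));        % load the rendered page
    results = ocr(I);                     % returns an ocrText object
    pageText(k) = string(results.Text);   % recognized text for page k
end

disp(pageText(1))                         % show the text of the first page
```

Recognition quality depends on the scan itself, so the 300 dpi setting is a starting point rather than a guarantee.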
  8 Comments
Patrick Fitzgerald on 18 Apr 2021 at 18:54
Note that in this related question this line (innermost IF block) throws the same error.
xObject = resources.getXObject(key);
Additionally, the import javax.imageio.ImageIO.* code used therein does let me call scanForPlugins() but does not fix anything.
