How can I extract images from a PDF using MATLAB?

I would like to extract embedded images from a native PDF file using MATLAB. How can I do this?

Accepted Answer

MathWorks Support Team on 11 Jan 2021
MATLAB ships with the Apache PDFBox Java library, which allows importing and processing PDF files. Use the following MATLAB function extractImagePDF() to extract images from a native PDF and save them as JPG files:
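function extractImagePDF(pdfFile)
% Extract embedded images from a native PDF in the current folder and
% save them as Img_1.jpg, Img_2.jpg, ... in the current folder.
import java.io.File
import javax.imageio.ImageIO
import org.apache.pdfbox.*
filename = fullfile(pwd,pdfFile);
jFile = File(filename);
document = pdmodel.PDDocument.load(jFile);
catalog = document.getDocumentCatalog();
pages = catalog.getPages();
iter = pages.iterator();
% running counter across all pages, so that images on later pages
% do not overwrite files written for earlier pages
i = 1;
% look for image objects on each page of the PDF
while (iter.hasNext())
    page = iter.next();
    resources = page.getResources();
    if isempty(resources)
        continue % page has no resources, so no images
    end
    imageIter = resources.getXObjectNames().iterator();
    % extract each image object from page and write to destination folder
    while (imageIter.hasNext())
        key = imageIter.next();
        if (resources.isImageXObject(key))
            xObject = resources.getXObject(key);
            img = xObject.getImage();
            % use an absolute path: Java resolves relative paths against
            % its own working directory, not MATLAB's current folder
            outputfile = File(fullfile(pwd, "Img_" + i + ".jpg"));
            ImageIO.write(img, "jpg", outputfile);
            i = i+1; % count only the objects that are images
        end
    end
end
document.close();
end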
Note that the above code will not work for scanned PDF files.
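A minimal usage sketch for the function above, assuming it is saved as extractImagePDF.m on the MATLAB path. The file name example.pdf is hypothetical; the function builds the full path from the current folder, so the PDF is expected to be there. The listing step only reports Img_*.jpg files that ended up in the current folder.

```matlab
% Hypothetical file name; replace with a PDF in the current folder
pdfName = 'example.pdf';
if isfile(pdfName)
    extractImagePDF(pdfName);
end
% List any extracted JPG files found in the current folder
files = dir('Img_*.jpg');
disp({files.name});
```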

Release

R2020b