pdf2img2pdf: A Python Script to Compress PDF Files
I ran into a new problem when making my PDFs unsearchable in my last post: the newly generated PDFs were much larger than the original ones. Although this wasn’t a surprise at all, the size problem did become more serious after I converted hundreds of them. Unfortunately, Method 1, the “Microsoft Print to PDF” tool, doesn’t seem to provide any image quality control to adjust the output file size. My Method 2, “pdf2img2pdf”, can do this, and it’s automatic, which makes it perfect for batch conversion.
Again, the concept is pretty simple. The pdf2img2pdf tool reads in a PDF, converts each page into an image object, and then saves all the images back to a single PDF. You have two places to adjust the image quality: 1) when reading in the PDF; 2) when saving to PDF. Both can be found in the highlighted code below. After some experiments, I found that the 2nd place has the deciding power in terms of the compression ratio, while keeping the 1st place high yields good image quality. Check out the analysis results at the end of this post.
from pdf2image import convert_from_path
from PIL import Image
import os


def pdf_to_images(pdf_path, image_format='jpeg', dpi=200):
    """
    Converts each page of a PDF to an image object.
    Returns a list of image objects.
    Args:
        pdf_path (str): Path to the input PDF file.
        image_format (str, optional): Format for the output images (e.g., 'jpeg', 'png'). Defaults to 'jpeg'.
        dpi (int, optional): DPI for the output images. Defaults to 200.
    """
    # 1st quality control: the DPI and JPEG options used when reading the pages.
    pages = convert_from_path(pdf_path, fmt=image_format, dpi=dpi, jpegopt={
        "quality": 80,
        "progressive": True,
        "optimize": True})
    return pages


def images_to_pdf(page_images, output_path, quality=60):
    """
    Converts multiple images to a single PDF file.
    Args:
        page_images: A list of image objects.
        output_path: The path to save the PDF file.
        quality (int, optional): Image quality used when saving. Defaults to 60.
    """
    images = [page.convert('RGB') for page in page_images]
    # 2nd quality control: the quality used when saving the pages to PDF.
    images[0].save(output_path, save_all=True, append_images=images[1:], quality=quality)


if __name__ == '__main__':
    pdf_directory = input("folder path to the input PDF files:")
    path_prefix = "img_"
    # The output folder is the input folder path with an "img_" prefix,
    # so this works best with a relative folder path.
    output_path = path_prefix + pdf_directory
    pdf_files = [os.path.join(pdf_directory, f) for f in os.listdir(pdf_directory) if f.lower().endswith('.pdf')]
    if pdf_files:
        if not os.path.exists(output_path):
            os.makedirs(output_path)
            print("created ", output_path)
        for pdf_file in pdf_files:
            print("processing:", pdf_file)
            page_images = pdf_to_images(pdf_file)
            images_to_pdf(page_images, path_prefix + pdf_file, 50)
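One thing to watch in the script above: the output location is built by gluing "img_" onto the front of whatever folder path was typed in, which behaves best with a relative folder name. Here is a minimal sketch of one possible alternative that prefixes only the last folder name, so an absolute path should work too. The helper name prefixed_output_path is my own invention for illustration and is not part of the script above.

```python
from pathlib import Path


def prefixed_output_path(pdf_file, prefix="img_"):
    """Build <parent folder>/<prefix + folder name>/<file name> for a PDF path.

    Hypothetical helper: it only computes the path, it does not create folders.
    """
    pdf_file = Path(pdf_file)
    source_dir = pdf_file.parent
    # Prefix only the last folder component so the rest of the path stays intact.
    output_dir = source_dir.parent / (prefix + source_dir.name)
    return output_dir / pdf_file.name


# Example: this only builds the path, no files are touched.
target = prefixed_output_path("scans/report.pdf")
print(target)
```

Before saving, the folder would still need to exist, for example with target.parent.mkdir(parents=True, exist_ok=True).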
I tested my code with different image quality levels during saving (the quality argument of images_to_pdf) while keeping the quality during reading (the jpegopt setting in pdf_to_images) at 80. The file sizes change dramatically with the quality level, and the image quality becomes quite poor at low levels. I guess I just need to find the best value to balance size and quality for each application.
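If you want to repeat this kind of comparison yourself, here is a rough sketch of how a sweep over the saving quality could look. The helper names (pdf_size_at_quality, sweep_quality, make_noise_page) are made up for this illustration, and the test page is just random noise so the snippet runs without a real PDF. In practice you would pass in the list returned by pdf_to_images instead, and the absolute sizes for noise will not be representative of scanned documents.

```python
import io
import os

from PIL import Image


def pdf_size_at_quality(page_images, quality):
    """Save the pages to an in-memory PDF and return its size in bytes."""
    images = [page.convert('RGB') for page in page_images]
    buffer = io.BytesIO()
    images[0].save(buffer, format='PDF', save_all=True,
                   append_images=images[1:], quality=quality)
    return len(buffer.getvalue())


def sweep_quality(page_images, levels=(10, 30, 50, 70, 90)):
    """Return a {quality: size_in_bytes} dict, one entry per quality level."""
    return {q: pdf_size_at_quality(page_images, q) for q in levels}


def make_noise_page(width=300, height=200):
    # Assumption: a random-pixel stand-in page, used only to keep the sketch self-contained.
    return Image.frombytes('RGB', (width, height), os.urandom(width * height * 3))


sizes = sweep_quality([make_noise_page(), make_noise_page()])
for q, size in sizes.items():
    print(f"quality={q}: {size / 1024:.1f} KB")
```

Because everything is written to memory, nothing is saved to disk until you have picked a quality level you are happy with.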






