pdf2img2pdf: A Python Script to Make PDFs Unsearchable by Converting Text-Based PDFs to Image-Based PDFs

It may sound a little weird that anyone would want to make their PDFs unsearchable. In fact, it is not an uncommon request: protecting your copyright, avoiding unwanted edits, or preventing search engines from indexing your content, just to name a few reasons. This post introduces two ways to make PDFs unsearchable.

This can be done by “printing” the PDF to a virtual printer that saves it as an image-based PDF, which is the most common method across PDF software such as Adobe Acrobat. Method 1 shows how to do it manually if you only have a handful of files to process. If you have a lot of PDFs, however, you will need a batch conversion tool, and the Python script in Method 2 will help you accomplish this.

Method 1: Manually Converting to an Image-Based PDF Using Microsoft Print to PDF

Open your PDF using Adobe Acrobat or any other PDF reader. Choose “Menu” > “Print” or hit “CTRL+P” to bring up the “Print” window, and select “Microsoft Print to PDF” as the printer. Click the “Advanced” button.

On the “Advanced Print Setup” window, make sure to check the “Print As Image” option. Click “OK” to exit this window, then click “Print” to print the text-based PDF to an image-based one.

Now let’s take a look at the result. The two PDFs, named “form_searchable.pdf” (left) and “form_unsearchable.pdf” (right) respectively, look very similar. However, the newly printed PDF (right) has lost its editability and can no longer be searched.

You might also notice that their file sizes differ significantly. In this case, the image-based unsearchable PDF is almost 5 times larger than the original text-based PDF. This is understandable because images take more space to store than text. If file size is a concern, please refer to my next post about compressing an image-based PDF.
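
To get a feel for why the image-based version grows, here is a rough back-of-the-envelope sketch. It is only an illustration under stated assumptions: the helper name raster_page_size is made up for this example, the page is assumed to be US Letter (8.5 x 11 inches), and the 200 DPI value simply mirrors the default used by the script in Method 2. It computes the pixel dimensions of one rendered page and its uncompressed RGB size.

```python
def raster_page_size(width_in, height_in, dpi=200, channels=3):
    """Return (width_px, height_px, raw_bytes) for one rendered page.

    Hypothetical helper for illustration only; assumes 8 bits per channel.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    raw_bytes = width_px * height_px * channels
    return width_px, height_px, raw_bytes


# Assumed example: a US Letter page rendered at 200 DPI.
w, h, raw = raster_page_size(8.5, 11, dpi=200)
print(f"{w} x {h} pixels, about {raw / 1_000_000:.1f} MB uncompressed")
```

JPEG compression shrinks this number considerably, so the saved file is much smaller than the raw figure, but every page still carries millions of pixels, whereas a text-based page mostly stores characters and font references.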

Method 2: Batch Converting to Image-Based PDFs Using a Python Script

The concept is the same, but the Python script handles the whole process automatically instead of you doing it by hand. The process consists of opening a PDF, reading all of its pages, rendering the pages into images, and saving all the images to a new PDF. I call it “pdf2img2pdf”, and the complete code is listed below. Once executed, the script prompts for the folder path containing the input PDFs. Newly generated PDFs are saved to a new folder whose name carries the “img_” prefix. The code depends on the pdf2image and Pillow packages (pdf2image in turn requires Poppler to be installed):

from pdf2image import convert_from_path
from PIL import Image
import os

def pdf_to_images(pdf_path, image_format='jpeg', dpi=200):
    """
    Converts each page of a PDF to an image object.

    Args:
        pdf_path (str): Path to the input PDF file.
        image_format (str, optional): Format for the rendered images (e.g., 'jpeg', 'png'). Defaults to 'jpeg'.
        dpi (int, optional): Rendering resolution in DPI. Defaults to 200.

    Returns:
        list: One image object per page.
    """
    pages = convert_from_path(
        pdf_path,
        fmt=image_format,
        dpi=dpi,
        jpegopt={"quality": 80, "progressive": True, "optimize": True},
    )
    return pages

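def images_to_pdf(page_images, output_path, quality=60, resolution=200):
    """
    Converts multiple images to a single PDF file.

    Args:
        page_images: A list of image objects.
        output_path: The path to save the PDF file.
        quality (int, optional): JPEG quality for the embedded page images. Defaults to 60.
        resolution (int, optional): DPI recorded in the PDF. Keep it equal to the
            rendering DPI so pages keep their physical size. Defaults to 200.
    """
    images = [page.convert('RGB') for page in page_images]
    # Without an explicit resolution, Pillow assumes 72 DPI and the pages come out oversized.
    images[0].save(output_path, save_all=True, append_images=images[1:],
                   quality=quality, resolution=resolution)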

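if __name__ == '__main__':

    pdf_directory = input("folder path to the input PDF files:").strip()
    path_prefix = "img_"
    # Place the output folder next to the input folder and prefix only the
    # folder name, so absolute paths such as C:\docs or /home/me/docs also work.
    parent_dir, folder_name = os.path.split(os.path.abspath(pdf_directory))
    output_path = os.path.join(parent_dir, path_prefix + folder_name)
    # Match the extension case-insensitively (.pdf, .PDF, ...).
    pdf_files = [os.path.join(pdf_directory, f) for f in os.listdir(pdf_directory) if f.lower().endswith('.pdf')]
    if pdf_files and not os.path.exists(output_path):
        os.makedirs(output_path)
        print("created ", output_path)

    for pdf_file in pdf_files:
        print("processing:", pdf_file)
        page_images = pdf_to_images(pdf_file)
        # Save under the same file name inside the output folder.
        images_to_pdf(page_images, os.path.join(output_path, os.path.basename(pdf_file)), 50)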
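
The “img_” folder naming rule can also be expressed as a small standalone helper, which makes it easy to check before running a large batch. The sketch below is an illustration and not part of the original script: the name build_output_path is hypothetical, and it assumes the output folder should sit next to the input folder. It uses pathlib so that the prefix is applied to the folder name rather than to the whole path string, which keeps absolute paths valid.

```python
from pathlib import PurePath


def build_output_path(input_dir, pdf_file, prefix="img_"):
    """Return the output path for pdf_file in a sibling folder of input_dir.

    Hypothetical helper: the sibling folder is named prefix + input folder name,
    and the PDF keeps its original file name.
    """
    input_dir = PurePath(input_dir)
    # with_name() swaps only the last path component, leaving the parent intact.
    output_dir = input_dir.with_name(prefix + input_dir.name)
    return output_dir / PurePath(pdf_file).name


# Example with made-up paths: prints the img_docs location for report.pdf.
print(build_output_path("/data/docs", "/data/docs/report.pdf"))
```

Note that this helper only computes the path; the folder itself still has to be created, for example with os.makedirs, before saving into it.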
