Convert PDF to Image with AWS Python Lambda Function and Docker Deployment


Learn how to use Python and Docker to create a Lambda function for converting PDF files to images and deploy it in a container.




PDF documents are a staple in many businesses and organizations today. However, sometimes it is necessary to convert these PDFs into images for various reasons, such as displaying them on a website or for further processing. In this article, we will explore how to convert PDFs to images using an AWS Lambda function and Docker deployment.

Prerequisites

To get started with the process, you must have the following prerequisites:

  • AWS account
  • Docker installed on your machine
  • Basic understanding of Python programming language

Introduction to AWS Lambda

AWS Lambda is a serverless computing service offered by Amazon Web Services (AWS). It allows you to run code without having to manage any infrastructure, making it an ideal platform for running small, independent applications. This makes it an excellent choice for running the code to convert PDFs to images.

Python and its Libraries for PDF Conversion

Python is a popular programming language and is widely used in many industries, including finance, retail, and healthcare. It has a vast ecosystem of packages that make it a good fit for many tasks, including PDF conversion. Popular libraries for rendering PDF pages as images include PyMuPDF and pdf2image (a wrapper around Poppler). PyPDF2, by contrast, works with PDF structure and text but does not render pages to images. This article uses PyMuPDF, which is imported in code as fitz.

Setting up the Environment for Deployment

In order to deploy the code for PDF to image conversion on AWS Lambda, you will need to have a few things set up. Firstly, you will need to create an AWS account if you don't already have one. Next, you will need to install the AWS CLI and configure it with your account information. Once you have done this, you can proceed with creating the Lambda function and deploying the code.

Step 1: Create Input and Output S3 Bucket

  1. Log in to the AWS Management Console and navigate to the S3 service.
  2. Click on the "Create bucket" button.
  3. Choose a name for your bucket and select a region.
  4. Click on the "Create" button.
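Repeat these steps so that you have two buckets: one for PDF input and one for image output. Keep them separate. If the function wrote its images into the same bucket that triggers it, every uploaded image would invoke the function again.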
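If you prefer to script this step, here is a minimal sketch using boto3. The bucket names and region passed in are placeholders you choose yourself, not values from this article, and the helper names are made up for illustration. It assumes your AWS credentials are already configured.

```python
def bucket_kwargs(name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_buckets(names, region):
    """Create every bucket in `names` (hypothetical names) in one region."""
    import boto3  # imported lazily so bucket_kwargs works without boto3

    s3_client = boto3.client("s3", region_name=region)
    for name in names:
        s3_client.create_bucket(**bucket_kwargs(name, region))
```

Calling `create_buckets(["my-pdf-input-bucket", "my-image-output-bucket"], "eu-west-1")` would create both buckets, provided the names are still available, since bucket names are globally unique.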

Step 2: Create a Lambda Function

  1. Go to the AWS Lambda page.
  2. Click on the "Create function" button.
  3. Choose "Container image" as the way to build the function.
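  4. Enter a name for your function and provide the URI of the container image in Amazon ECR (you push this image in Step 4). With a container image there is no runtime to choose, because the Python runtime comes from the image itself.
  5. Choose the appropriate execution role for your function, and click on the "Create function" button.
The execution role must allow the function to read objects from the input bucket and write objects to the output bucket.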
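As a rough illustration of the S3 permissions the role needs, the sketch below builds a least-privilege policy document. The function name and bucket names are assumptions for this example. You would attach the resulting JSON to the execution role as an inline policy, alongside basic logging permissions such as the AWSLambdaBasicExecutionRole managed policy.

```python
import json


def build_s3_policy(input_bucket, output_bucket):
    """Return an IAM policy allowing reads from one bucket and writes to another."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{input_bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{output_bucket}/*",
            },
        ],
    }


# Hypothetical bucket names; print the JSON to paste into the IAM console.
print(json.dumps(build_s3_policy("my-pdf-input-bucket", "output-bucket"), indent=2))
```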

Step 3: Write Python code to Convert PDF to Image and Build Docker Image for the Lambda Function


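# app.py
import os
from urllib.parse import unquote_plus

import boto3
import fitz  # PyMuPDF

# Bucket that receives the converted images; replace with your output bucket name
OUTPUT_BUCKET = 'output-bucket'


def handle_s3_event(event, context):
    # Get the S3 client
    s3_client = boto3.client('s3')

    # Read the bucket name and object key from the first event record.
    # S3 URL-encodes keys in event notifications, so decode before use.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = unquote_plus(record['object']['key'])

    # Get the PDF object from S3
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    pdf_content = obj['Body'].read()

    # Open the PDF document from memory using PyMuPDF
    pdf_doc = fitz.open(stream=pdf_content, filetype='pdf')

    # Strip only the file extension, so dots elsewhere in the key are kept
    base_name = os.path.splitext(key)[0]

    # Convert each page of the PDF to a JPG image
    for i, page in enumerate(pdf_doc):
        pixmap = page.get_pixmap(dpi=300)
        # tobytes() defaults to PNG, so request JPEG explicitly
        # (JPEG output needs a recent PyMuPDF release)
        img = pixmap.tobytes('jpeg')

        # Upload the JPG image to S3
        s3_client.put_object(Bucket=OUTPUT_BUCKET,
                             Key=f"{base_name}-page-{i + 1}.jpg",
                             Body=img,
                             ContentType='image/jpeg')


def handler(event, context):
    try:
        handle_s3_event(event, context)
        print("Successfully converted PDF to JPG and uploaded to S3")
    except Exception as e:
        # Log the error, then re-raise so Lambda marks the invocation as failed
        print("An error occurred: {}".format(e))
        raise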
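The handler above reads only the first record of the event. An S3 notification carries a list of records, and the keys in it are URL-encoded, so a slightly more defensive variant could walk all records and skip anything that is not a PDF. The helper below is a sketch of that idea; `iter_pdf_objects` is a hypothetical name, not part of the original code.

```python
from urllib.parse import unquote_plus


def iter_pdf_objects(event):
    """Yield (bucket, key) for every PDF object referenced in an S3 event."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in notifications (a space becomes "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        # Ignore uploads that are not PDFs instead of failing on them.
        if key.lower().endswith(".pdf"):
            yield bucket, key
```

The conversion code could then loop over `iter_pdf_objects(event)` rather than indexing `event['Records'][0]` directly.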

To run the function as a container, you need to package your code and its dependencies into a Docker image. This is done with a Dockerfile, a script that automates building the image. The Dockerfile should include the dependencies and environment settings your code needs to run correctly. The requirements.txt file it copies should list PyMuPDF; boto3 is already included in the AWS Lambda Python base image.


# Dockerfile

FROM public.ecr.aws/lambda/python:3.9

# Install the function's dependencies using file requirements.txt

COPY ./requirements.txt ./
RUN  pip3 install -r requirements.txt

# Copy function code
COPY app.py ./

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.handler" ]
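Before pushing the image anywhere, you can try it locally. The AWS Lambda base images ship with the Runtime Interface Emulator, so after `docker build -t pdf-to-image .` and `docker run -p 9000:8080 pdf-to-image` (the image name is an assumption), the function listens for invocations over HTTP. The sketch below posts an event to that local endpoint. Keep in mind that the container still needs AWS credentials, for example passed as environment variables, to reach S3.

```python
import json
import urllib.request

# Invocation path exposed by the Lambda Runtime Interface Emulator.
RIE_PATH = "/2015-03-31/functions/function/invocations"


def build_invoke_request(event, host="localhost", port=9000):
    """Build the HTTP POST request that invokes the locally running container."""
    url = f"http://{host}:{port}{RIE_PATH}"
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def invoke_local(event):
    """Send the event to the local container and return the raw response text."""
    with urllib.request.urlopen(build_invoke_request(event)) as response:
        return response.read().decode("utf-8")
```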

Step 4: Push the Docker Image to AWS ECR, Then Deploy to Lambda

Once you have built the Docker image, you need to upload it to a container registry that Lambda can pull from, which is Amazon Elastic Container Registry (ECR). To create an ECR repository in the AWS Management Console, follow these steps:

  1. Navigate to the ECR service page.
  2. Click on the "Create repository" button.
  3. Choose a name for your repository and click on the "Create repository" button.
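  4. Make sure the IAM identity you push with has permission to push images to the repository; otherwise the push can fail with an "EOF" error message.
  5. Click "View push commands" and follow the pushing process.
  6. After pushing the Docker image to the ECR repository, go to the Lambda page and deploy the image you just pushed by selecting its URI for the function. Consider raising the function's memory and timeout in its configuration as well, since the defaults (128 MB and 3 seconds) are often too low for rendering pages at 300 DPI.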
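The deployment part of the last step can also be scripted. The sketch below assembles the image URI in the format ECR uses and points an existing function at it with boto3's `update_function_code`. The account ID, repository name, and function name shown in the usage note are placeholders, and the helper names are made up for this example.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of an image stored in a private ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def deploy_image(function_name, image_uri, region):
    """Point an existing container-based Lambda function at a new image."""
    import boto3  # imported lazily so ecr_image_uri works without boto3

    lambda_client = boto3.client("lambda", region_name=region)
    return lambda_client.update_function_code(
        FunctionName=function_name,
        ImageUri=image_uri,
    )
```

For example, `deploy_image("pdf-to-image", ecr_image_uri("123456789012", "us-east-1", "pdf-to-image"), "us-east-1")` would redeploy a function named pdf-to-image from the latest tag.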

Step 5: Trigger the Lambda Function and Test It

Now that you have successfully deployed the Docker container in AWS Lambda, it's time to trigger it. The simplest way to trigger a Lambda function is to set up an event-driven trigger, such as a new file being uploaded to Amazon S3.

  1. Go to the AWS Lambda console and select the function you created.
  2. Select the "Add trigger" button in the Designer section.
  3. In the "Add trigger" page, select "S3" from the "Trigger" drop-down menu.
  4. In the "Configure triggers" section, select the input S3 bucket you created in Step 1 from the "Bucket" drop-down menu. Do not select the output bucket, or the function will be invoked by its own uploads.
  5. Select "PUT" from the "Event type" drop-down menu.
  6. Finally, click on the "Add" button.
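The console steps above correspond to a bucket notification configuration. As a sketch, the same trigger could be set up with boto3 as shown below. The function ARN and bucket name are placeholders, and the `.pdf` suffix filter is an addition of this example rather than something the console steps configure. Note that the console grants S3 permission to invoke the function for you; when scripting, that permission has to be added separately, and this call replaces any notification configuration already on the bucket.

```python
def build_notification_config(function_arn, suffix=".pdf"):
    """Describe a trigger that fires the function for PUT uploads ending in `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def attach_trigger(bucket, function_arn):
    """Apply the notification configuration to the input bucket."""
    import boto3  # imported lazily so the builder works without boto3

    s3_client = boto3.client("s3")
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(function_arn),
    )
```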


Every time a new PDF is added to the input S3 bucket, the Lambda function starts the conversion automatically and saves one image per page to the output S3 bucket. You can follow each run in the function's CloudWatch logs, which are linked from the Lambda console, or check the output S3 bucket directly.


To check if the conversion process is working, we'll upload a PDF file to the S3 bucket and let the Lambda function convert it. After the process is complete, we'll inspect the output to make sure it meets our expectations.
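One way to automate that inspection is to compute the keys the function should have written and ask S3 whether each one exists. The sketch below assumes the naming scheme used in app.py (`<name>-page-<n>.jpg`), a key whose only dot is the one before the extension, and a page count you already know. Both helper names are hypothetical.

```python
def expected_output_keys(pdf_key, page_count):
    """List the image keys the function should produce for one PDF."""
    base = pdf_key.rsplit(".", 1)[0]  # drop the ".pdf" extension
    return [f"{base}-page-{n}.jpg" for n in range(1, page_count + 1)]


def missing_outputs(bucket, pdf_key, page_count):
    """Return the expected image keys that are not in the output bucket."""
    import boto3  # imported lazily so the key helper works without boto3
    from botocore.exceptions import ClientError

    s3_client = boto3.client("s3")
    missing = []
    for key in expected_output_keys(pdf_key, page_count):
        try:
            s3_client.head_object(Bucket=bucket, Key=key)
        except ClientError:
            # Treats any client error (not found, access denied) as missing.
            missing.append(key)
    return missing
```

An empty list from `missing_outputs` suggests every page was converted and uploaded.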


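Alternatively, you can test the function with a test event in the Lambda console. Paste the JSON below into a new test event, and replace the bucket name and object key with those of a PDF that actually exists in your input bucket, because the function downloads that object from S3.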


Template of a test event JSON:
{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "1970-01-01T00:00:00.000Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "EXAMPLE"
      },
      "requestParameters": {
        "sourceIPAddress": "127.0.0.1"
      },
      "responseElements": {
        "x-amz-request-id": "EXAMPLE123456789",
        "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH"
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "testConfigRule",
        "bucket": {
          "name": "example-bucket",
          "ownerIdentity": {
            "principalId": "EXAMPLE"
          },
          "arn": "arn:aws:s3:::example-bucket"
        },
        "object": {
          "key": "example.pdf",
          "size": 1024,
          "eTag": "0123456789abcdef0123456789abcdef",
          "sequencer": "0A1B2C3D4E5F678901"
        }
      }
    }
  ]
}
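If you want to generate this event for different files, or invoke the deployed function from a script instead of the console, something like the following could work. It builds a trimmed-down event containing only the fields app.py reads plus a few identifying ones, and URL-encodes the key the way S3 notifications do. The function name passed to `invoke_with_event` is whatever you chose in Step 2.

```python
import json
from urllib.parse import quote_plus


def make_s3_put_event(bucket, key, region="us-east-1"):
    """Build a minimal S3 'ObjectCreated:Put' test event for one object."""
    return {
        "Records": [
            {
                "eventSource": "aws:s3",
                "awsRegion": region,
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket, "arn": f"arn:aws:s3:::{bucket}"},
                    # Encode the key as S3 does, leaving "/" untouched.
                    "object": {"key": quote_plus(key, safe="/")},
                },
            }
        ]
    }


def invoke_with_event(function_name, event):
    """Invoke the deployed function synchronously and return the HTTP status code."""
    import boto3  # imported lazily so the event builder works without boto3

    lambda_client = boto3.client("lambda")
    response = lambda_client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(event).encode("utf-8"),
    )
    return response["StatusCode"]
```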

Conclusion

Converting PDFs to images can be a valuable tool in various industries and businesses. By utilizing AWS Lambda and Docker deployment, it is possible to efficiently convert PDFs to images. This can provide an easy way to display PDFs on websites or for further processing, without the need for managing any infrastructure. While there are many other online tools available for converting PDFs to images, using AWS Lambda and Docker can offer a robust and scalable solution.


