How to read PDF from S3 on Lambda trigger

Written by: Chirag (Srce Cde)


AWS Lambda is a popular event-driven service that lets you run code without provisioning or managing servers. In this tutorial, we are going to use AWS Lambda to read a PDF file from S3 when an upload triggers the function. To read the PDF we are going to use a third-party package, PyMuPDF. Of course, there are many packages you can use to deal with PDFs, depending on the requirement. One of the reasons for picking PyMuPDF is that it is actively maintained. Now, let’s start with the steps that we are going to follow throughout the tutorial.

[Figure: Steps we will follow throughout this tutorial]

We will start with the configuration & creation of the lambda layer for the PyMuPDF package, which we will use to read the PDFs within the lambda function. To create the layer, we will use an EC2 instance. So, let’s get started with Lambda layers. Please follow the below steps.

Create an IAM role for EC2 instance

The reason we will require the IAM role for the EC2 instance is to provide access to the S3 bucket so that once we configure the package we can upload it to S3.

  • Navigate to IAM Management Console
  • Click on Roles from the left panel
  • Click on Create role -> Select EC2 as service -> Click on Next:Permissions
  • Here we need to attach a policy. Ideally, the role needs only put access, so you can create a custom policy for that. Here I have attached AmazonS3FullAccess. Click on Next and follow the steps.
  • Enter the role name (In my case I have named it ec2_s3_pdf_role) and Create role
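
If you prefer the custom put-only policy mentioned above over AmazonS3FullAccess, the policy document could look roughly like the one this sketch prints. The helper name put_only_policy and the bucket name your-bucket-name are placeholders of mine, not part of the original setup, and your use case may need additional actions.

```python
import json


def put_only_policy(bucket_name):
    """Build an IAM policy document that allows only uploads to one bucket.

    bucket_name is a placeholder; replace it with the bucket that will
    hold the layer zip file.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level actions apply to the objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the JSON tab of the policy editor.
    print(json.dumps(put_only_policy("your-bucket-name"), indent=2))
```

The printed JSON can be pasted into the policy editor when creating the custom policy, which is then attached to the role instead of the managed policy.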

Spin EC2 instance, create & configure the package, upload it to S3

We will start with the creation of the EC2 instance; you can refer to the steps from my tutorial on creating an EC2 instance. Here, we will select ami-0885b1f6bd170450c/Ubuntu Server 20.04 LTS as the AMI. In Step 3: Configure Instance Details, while launching the instance, select the IAM role that we have just created.

Once the instance is up & running, go ahead and SSH into it. After logging in to the instance, execute the below commands.

  • sudo apt-get update (Fetching updates)
  • python3 -V (Check python version, in my case it’s Python3.8 and it is fine since I will be using Runtime Python3.8 within lambda function)
  • sudo apt install python3-pip (Installing python package manager to install PyMuPDF)
  • sudo apt install zip (Installing zip, since we need to zip the package before uploading it to S3)
  • sudo apt install awscli (Installing AWS CLI to upload file to S3)
  • mkdir -p build/python/lib/python3.8/site-packages/ (Creating the directory structure)
  • pip3 install PyMuPDF -t build/python/lib/python3.8/site-packages/ (Installing PyMuPDF library/package into the site-packages directory)
  • cd build
  • zip -r pypdf.zip . (Creating the zip file)
  • aws s3 cp pypdf.zip s3://rekognit (Uploading pypdf.zip to S3. Please replace rekognit with your bucket name)
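
The zip and upload steps can also be scripted in Python instead of using the zip and aws commands. The following is only a sketch: build_layer_zip and upload_layer are helper names I made up, and upload_layer assumes boto3 is installed on the instance (for example via pip3 install boto3), which the steps above do not cover.

```python
import os
import zipfile


def build_layer_zip(build_dir, zip_path):
    """Zip everything under build_dir, storing paths relative to build_dir.

    Lambda expects the archive to start with the python/ directory, so
    build_dir should be the folder that contains python/.
    """
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(build_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Skip the archive itself if it lives inside build_dir.
                if os.path.abspath(full_path) == zip_abs:
                    continue
                archive.write(full_path, os.path.relpath(full_path, build_dir))
    return zip_path


def upload_layer(zip_path, bucket, key="pypdf.zip"):
    """Upload the archive to S3 using the instance's IAM role credentials."""
    import boto3  # assumed to be installed separately; imported only when uploading

    boto3.client("s3").upload_file(zip_path, bucket, key)
```

For example, build_layer_zip("build", "pypdf.zip") followed by upload_layer("pypdf.zip", "rekognit") would mirror the last three commands, with rekognit again replaced by your bucket name.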

Create lambda layers

To create lambda layers, navigate to Lambda Management console -> Layers.

  • Click on create layer
  • Enter an appropriate name (In my case it’s pypdf_demo)
  • Select Upload a file from Amazon S3 and paste the object URL (pypdf.zip). It will look like this https://bucket-name.s3.amazonaws.com/pypdf.zip
  • Within Compatible runtimes select Python 3.8 -> Create
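
The same layer can also be published programmatically. This is a rough sketch using boto3's publish_layer_version call; the helper names layer_request and publish_layer are illustrative, and it assumes boto3 is installed and that the credentials in use are allowed to publish layers.

```python
def layer_request(bucket, key, layer_name="pypdf_demo", runtime="python3.8"):
    """Build the arguments for publishing a layer from a zip stored in S3."""
    return {
        "LayerName": layer_name,
        "Content": {"S3Bucket": bucket, "S3Key": key},
        "CompatibleRuntimes": [runtime],
    }


def publish_layer(bucket, key):
    """Publish a new layer version and return its ARN."""
    import boto3  # assumed available; imported only when actually publishing

    client = boto3.client("lambda")
    response = client.publish_layer_version(**layer_request(bucket, key))
    return response["LayerVersionArn"]
```

Calling publish_layer("bucket-name", "pypdf.zip") would correspond to the console steps above, with bucket-name replaced by the bucket that holds the zip file.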

We have successfully created the Lambda Layer, now we will move on to step 2 which is to create an S3 bucket.

Create S3 Bucket

  • Navigate to S3 Management Console -> Create bucket
  • Give it an appropriate name (In my case it’s pypdf-demo)

Create an IAM role for the lambda function

Please refer to the same steps as mentioned in Create IAM role for EC2 instance except for the service, role name, and the policy.

  • Please select Lambda as a service instead of EC2.
  • Within the policy, please attach AWSLambdaExecute. This policy provides Put and Get access to S3 and full access to CloudWatch Logs.
  • Finally, fill in the role name (In my case it’s lambda_pdf_role)

Once the IAM role is created, we will go ahead and create the lambda function.

Create & configure Lambda function

  • Navigate to Lambda Management Console -> Functions (From left panel) -> Create function (Top-right corner)
  • Configure the lambda function. Give it a name, select runtime as Python 3.8 and within permissions please select Use an existing role and select the role that we have created in the above step -> Create function

Add S3 Trigger

  • From the Designer pane, click on Add Trigger -> Select S3
  • Select the Bucket name that we have created as a part of Create S3 Bucket
  • Configure suffix by adding .pdf since we only want to trigger this lambda function when the file with .pdf extension is uploaded -> Add
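
For reference, the trigger configured above corresponds to an S3 bucket notification with a suffix filter. The sketch below builds that configuration as a Python dictionary; pdf_trigger_config and apply_trigger are hypothetical helper names, and the function ARN is something you would copy from your own Lambda function. Note that the console also grants S3 permission to invoke the function, which this sketch does not do.

```python
def pdf_trigger_config(function_arn):
    """Build a notification configuration that fires only for .pdf uploads."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                # Fire on every kind of object creation (put, post, copy, ...).
                "Events": ["s3:ObjectCreated:*"],
                # Only keys ending in .pdf invoke the function.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".pdf"}]}
                },
            }
        ]
    }


def apply_trigger(bucket, function_arn):
    """Replace the bucket's notification configuration with the PDF trigger.

    This overwrites any existing notification configuration on the bucket,
    and S3 must already be allowed to invoke the function.
    """
    import boto3  # assumed available; imported only when applying the config

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=pdf_trigger_config(function_arn),
    )
```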

Add Layers

  • From the Designer pane click on Layers -> Add a layer (Under Layers)
  • Select Custom layers under Choose a layer
  • From the Custom layers dropdown select pypdf_demo -> Select the Version -> Add

Write/Modify the lambda function code

Navigate to this GitHub reference and copy/paste it under Function code within your Lambda function. I have included the comments.
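
The referenced code is not reproduced here. As a rough idea of what such a handler might look like, here is a minimal sketch of my own, not the referenced code. It assumes the PyMuPDF layer created earlier is attached (PyMuPDF is imported as fitz) and relies on boto3 being available in the Lambda Python runtime. The helper names extract_s3_objects and read_pdf_text are illustrative.

```python
import urllib.parse


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (for example, spaces become "+").
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def read_pdf_text(pdf_bytes):
    """Extract the text of every page from an in-memory PDF."""
    import fitz  # PyMuPDF, provided by the Lambda layer

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def lambda_handler(event, context):
    import boto3  # available in the Lambda Python runtime

    s3 = boto3.client("s3")
    for bucket, key in extract_s3_objects(event):
        # Read the uploaded PDF into memory and print its text,
        # which ends up in the function's CloudWatch logs.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        print(read_pdf_text(body))
    return {"statusCode": 200}
```

The third-party imports sit inside the functions only so that the event-parsing helper can be tried locally without the layer; in a deployed function they can just as well go at the top of the file.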


If you want to learn more, please refer to this video.


Once the code is updated, Deploy the lambda function and we are good to test the functionality.

Test

Navigate to the S3 bucket and upload a PDF file. Once the file is uploaded, it will trigger the lambda function. Now, go ahead and check the CloudWatch logs of the lambda function and you should be able to see the text content of the PDF you have uploaded.

Please refer to this video for end-to-end practical implementation. I will post the CloudFormation template for the same soon. Please stay tuned.