AWS Lambda PDF processing - amazon-web-services

I have created an AWS Lambda (function URL) that synchronously creates a PDF or XLS in Python, using WeasyPrint for the PDF and XlsxWriter for the XLS. The PDF takes around 2 minutes to generate, whereas the XLS takes 10-20 seconds for the same amount of data. How can I make the PDF generate faster?

Related

Should batch_writer in DynamoDB be much faster?

I got some weird results uploading 17,956,000,000 records into DynamoDB. Those records are spread across 134,000 files (each file has 134,000 records). If you're wondering about the records, it's a full city distance matrix (the city has 134,000 nodes). To do this, I uploaded all the files to an S3 bucket. I have also tried a Lambda Python function, planning to add an SQS queue later. When measuring the upload time of a single file (from S3 to DynamoDB using a Lambda Python function), I couldn't upload a full file within 15 minutes (the max execution time on Lambda). I did some tests comparing the batch write against plain put_item, each with a 3-minute execution:
with table.batch_writer() as bw:
    for k, v in obj.items():
        bw.put_item(Item={"_from": src_node, "_to": dst_node, "travel_time": v})
which managed to upload a total of 2,475 records, and the plain put_item version:
for k, v in obj.items():
    dynamodb.put_item(TableName='xxx', Item={"_from": {"N": src_node}, "_to": {"N": dst_node}, "travel_time": {"N": str(v)}})
which managed to upload a total of 2,404 records.
I want to validate whether those results are logical or whether something is wrong.

Is there any limit on the number of PDF pages that can be OCRed using AWS Textract?

I am OCRing image-based PDFs using AWS Textract.
Each PDF I have has 60+ pages,
but when I try to OCR a PDF file it only processes the first 4 pages of each file.
Is there any limit on the number of pages in a PDF file for AWS Textract?
I found this https://docs.aws.amazon.com/textract/latest/dg/limits.html
but it does not mention any limit on the number of pages!
Does anyone know if there is a limit on the number of PDF pages?
And if so, how can I OCR the whole 60+ page file?
The hard limits for Textract are 1,000 pages or 500 MB for PDFs.
I think your problem is related to the paginated response from Textract. You have to check whether the key "NextToken" in the JSON output is populated, and if so, make another request with that token.
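For example, with boto3 a minimal pagination loop might look like the sketch below. It assumes the asynchronous text-detection API, that the job has already finished, and that job_id comes from an earlier start_document_text_detection call.

import boto3

textract = boto3.client("textract")

def get_all_blocks(job_id):
    # Collect every page's blocks by following NextToken until it is absent.
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        response = textract.get_document_text_detection(**kwargs)
        blocks.extend(response["Blocks"])
        next_token = response.get("NextToken")
        if not next_token:
            break
        kwargs["NextToken"] = next_token
    return blocks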

How to trigger AWS Lambda function when multiple files in S3 are ready

I am trying to build a service with AWS Lambda/S3 that takes as input a user's email and outputs a response email with a PDF attachment. The final PDF I send to the user is generated by merging together two types of PDFs I generate earlier in the process based on the input email. A full diagram of the architecture is shown below.
Diagram of Architecture
The issue I am encountering is with the Merge PDFs Lambda function, which takes in the type 1 and type 2 PDFs and produces a type 3 PDF. I need it to trigger once a complete set of type 1 and type 2 PDFs is ready and waiting in S3. For example, a user sends an email and the Parse Email function kicks off the production of one type 2 PDF and fifty type 1 PDFs; as soon as these 51 PDFs are generated, I want the Merge PDFs function to run. How do I get an AWS Lambda function to trigger once a set of multiple files in S3 is ready?
There is no trigger that I am aware of that waits for several things to be put into S3 in one or more buckets before raising an event.
I originally thought about using an S3 trigger when a file with the suffix '50.pdf' was created, but that leaves a lot of issues around what finishes first and what happens if something50.pdf fails to generate. But if you do want to go down that route, there is some good documentation from AWS here.
An alternative would be to have the Lambdas that generate the type 1 and type 2 PDFs invoke the Merge PDF Lambda once they have finished their processing.
You would need some sort of external state held somewhere (like a DB) which recorded an id (which could be included in the naming of the type 1 and type 2 PDFs), whether type 1 PDF generation was complete, and whether type 2 PDF generation was complete.
So the Parse Email Lambda would need to seed the DB with a reference before doing its work. Then the URL to PDF Lambda would record in the DB that it had finished and check whether the HTML to PDF Lambda had finished; if so, invoke the Merge PDF Lambda (probably via SNS), otherwise just finish. The HTML to PDF Lambda would do the same thing, except it would check whether the URL to PDF Lambda had finished before starting the merge or finishing.
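As a rough illustration of that shared state, one option is a DynamoDB item per request with a flag per generator; the table name "pdf_jobs", the flag names, and the SNS topic below are all assumptions, not part of the original design.

import boto3

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")
jobs = dynamodb.Table("pdf_jobs")  # assumed table name, seeded by the Parse Email Lambda
MERGE_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:merge-pdfs"  # assumed topic

def mark_done(request_id, flag):
    # Each generator Lambda calls this with its own flag, e.g. "url_pdf_done"
    # or "html_pdf_done", then reads back the item to check the other flag.
    item = jobs.update_item(
        Key={"request_id": request_id},
        UpdateExpression="SET #f = :done",
        ExpressionAttributeNames={"#f": flag},
        ExpressionAttributeValues={":done": True},
        ReturnValues="ALL_NEW",
    )["Attributes"]

    if item.get("url_pdf_done") and item.get("html_pdf_done"):
        # Both generators have finished, so kick off the Merge PDF Lambda via SNS
        sns.publish(TopicArn=MERGE_TOPIC_ARN, Message=request_id)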
On a slightly separate note, I'd probably trigger the Clean Buckets Lambda at the end of the Merge PDF Lambda. That way you could have a Check For Unprocessed Work Lambda that triggered every hour and sent some form of notification if it found anything in the buckets older than x.

Upload/Download LARGE files to/from Lambda function using API Gateway without making any use of S3 Bucket

I'm implementing a serverless API, using:
API Gateway
Lambda
S3 Bucket (if needed)
My flow is to:
Call a POST or PUT method with a binary "zip" file and upload it to Lambda.
In Lambda: unzip the file.
In Lambda: run a determined script over the extracted files.
In Lambda: generate a new zip.
Return it to my desktop.
This flow is already implemented and it's working well with small files: 10 MB for uploading and 6 MB for downloading.
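For reference, a minimal sketch of what that handler might look like, assuming API Gateway proxy integration with binary media types enabled; run_script is a placeholder for the actual processing step.

import base64
import io
import os
import zipfile

def run_script(workdir):
    # Placeholder for the "determined script" run over the extracted files
    pass

def handler(event, context):
    # API Gateway delivers the uploaded zip as a base64-encoded body
    payload = base64.b64decode(event["body"])

    workdir = "/tmp/job"
    os.makedirs(workdir, exist_ok=True)

    # Unzip into Lambda's /tmp scratch space
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        zf.extractall(workdir)

    run_script(workdir)

    # Re-zip the results and return them as a binary response
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(workdir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, workdir))

    return {
        "statusCode": 200,
        "isBase64Encoded": True,
        "headers": {"Content-Type": "application/zip"},
        "body": base64.b64encode(out.getvalue()).decode(),
    }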
But I'm running into issues when dealing with large files, as will be the case on many occasions. To solve this I'm thinking about the following flow:
Target file gets uploaded to an S3 bucket.
A new event is generated and Lambda gets triggered.
Lambda internal tasks:
3.1 Lambda downloads the file from the S3 bucket.
3.2 Lambda generates the corresponding WPK package.
3.3 Lambda uploads the generated WPK package to S3.
3.4 Lambda returns a signed URL for the uploaded file as the response.
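A minimal sketch of steps 3.1-3.4 as an S3-triggered handler; build_wpk is a placeholder for the packaging logic and OUTPUT_BUCKET is an assumed environment variable.

import os
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = os.environ["OUTPUT_BUCKET"]  # assumed configuration

def build_wpk(src_path, dst_path):
    # Placeholder for the actual WPK packaging logic
    pass

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    src = "/tmp/input.zip"
    dst = "/tmp/output.wpk"

    # 3.1 Download the uploaded file from S3
    s3.download_file(bucket, key, src)

    # 3.2 Generate the corresponding WPK package
    build_wpk(src, dst)

    # 3.3 Upload the generated WPK package back to S3
    out_key = key.rsplit(".", 1)[0] + ".wpk"
    s3.upload_file(dst, OUTPUT_BUCKET, out_key)

    # 3.4 Return a signed URL for the result, valid for one hour
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": OUTPUT_BUCKET, "Key": out_key},
        ExpiresIn=3600,
    )
    return {"url": url}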
But my problem with such a design is that it requires more than one request to complete. I want to do this whole process in a single request, passing the target zip file in it and getting the new zip as the response.
Any ideas, please?
My Components and Flow Diagram would be:
Component and Flow Diagram
There are a couple of things you could do if you'd like to unzip large files while keeping a serverless approach:
Use Node.js to stream the zip file, unzipping it in a pipe and putting the content into a write stream that pipes back to S3.
Deploy your code as an AWS Glue job.
Upload the file to S3; AWS Lambda gets triggered, passes the file name as the key to the Glue job, and the rest is done there.
This way you have a serverless approach and code that does not cause memory issues while unzipping large files.
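If you go the Glue route, the S3-triggered Lambda only has to forward the object location to the job. A rough sketch, assuming a Glue job named "unzip-job" and argument names of your choosing:

import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Triggered by the S3 put event; hand the bucket/key to the Glue job,
    # which performs the actual memory-safe unzipping.
    record = event["Records"][0]["s3"]
    glue.start_job_run(
        JobName="unzip-job",  # assumed job name
        Arguments={
            "--source_bucket": record["bucket"]["name"],
            "--source_key": record["object"]["key"],
        },
    )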

How can we efficiently push data from a CSV file to DynamoDB without using AWS Data Pipeline?

Considering the fact that AWS Data Pipeline is not available in the Singapore region, are there any alternatives available to efficiently push CSV data to DynamoDB?
If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification would let Lambda know that a new file is available, and the Lambda function would be responsible for loading the data into DynamoDB.
This works better (because of Lambda's limits) if the CSV files are not huge, so they can be processed in a reasonable amount of time. The bonus is that, once it is working, the only work needed is to simply drop new files into the right bucket - no server required.
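A minimal sketch of such a loader in Python, assuming a table named "my_table" whose attributes map one-to-one to the CSV columns:

import csv
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my_table")  # assumed table name

def handler(event, context):
    # Fired by the S3 event notification for each newly dropped CSV file
    record = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
    rows = csv.DictReader(obj["Body"].read().decode("utf-8").splitlines())

    # batch_writer buffers items into BatchWriteItem calls and retries
    # unprocessed items automatically
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=row)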
Here is a GitHub repository that has a CSV-to-DynamoDB loader written in Java - it might help get you started.