How to make sure S3 doesn't send parallel events to Lambda?

My architecture allows files to be put in S3, upon which a Lambda function runs concurrently. However, the files being put in S3 are sometimes overwritten by another process within a gap of milliseconds. Those multiple put events for the same file cause the Lambda to trigger multiple times for the same object.
Is there a threshold I can set on S3 events (something that doesn't trigger the Lambda multiple times for the same file event)?
Or is there a kind of S3 event that only occurs when a file is created, not updated?
There is already code in place which checks if the trigger file is present, and creates it if not. But that is also of no use, since the other process is very fast to put files in S3.
Something like this:

    from botocore.exceptions import ClientError

    try:
        # Check whether the trigger file already exists
        s3_client.head_object(Bucket=trigger_bucket, Key=trigger_file)
    except ClientError:
        # Trigger file is missing (or inaccessible), so create it
        create_trigger_file(s3_client, trigger_bucket, trigger_file)

You could configure Amazon S3 to send events to an Amazon SQS FIFO (first-in-first-out) queue. The queue could then trigger the Lambda function.
The benefit of using a FIFO queue is that each message has a Message Group ID. A FIFO queue will only provide one message to the AWS Lambda function per Message Group ID. It will not send another message with the same Message Group ID until the earlier one has been fully processed. If you set the Message Group ID to be the Key of the S3 object, you would effectively have a separate queue for each object created in S3.
This method would allow Lambda functions to run in parallel for different objects, but for each particular Key there would only be a maximum of one Lambda function executing.
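As a rough sketch, the key idea is sending each S3 event into the FIFO queue with the object key as the Message Group ID. The queue URL and function name below are placeholders; and if S3 notifications cannot target your FIFO queue directly in your setup, a small relay Lambda like this one could forward events into it:

    import json
    import boto3

    sqs = boto3.client("sqs")

    # Hypothetical FIFO queue URL
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events.fifo"

    def relay_handler(event, context):
        # Forward each S3 record into the FIFO queue, grouped by object key,
        # so events for the same object are processed strictly one at a time.
        for record in event["Records"]:
            obj = record["s3"]["object"]
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps(record),
                MessageGroupId=obj["key"],
                # S3 event records carry a "sequencer" value that differs per
                # change to a key; here it stands in as a deduplication ID.
                MessageDeduplicationId=obj.get("sequencer", obj["key"]),
            )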

It appears your problem is that multiple invocations of the AWS Lambda function are attempting to access the same files at the same time.
To avoid this, you could modify the Lambda function's settings (see Managing Lambda reserved concurrency in the AWS Lambda documentation) to set the reserved concurrency to 1. This allows only a single invocation of the Lambda function to run at any time.
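For example, a one-off sketch with boto3 (the function name is a placeholder):

    import boto3

    lambda_client = boto3.client("lambda")

    # Cap the function at a single concurrent execution
    # ("my-processor" is a placeholder name).
    lambda_client.put_function_concurrency(
        FunctionName="my-processor",
        ReservedConcurrentExecutions=1,
    )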

I guess the problem is that your architecture needs to write to the same file. This is not scalable. From the documentation:
Amazon S3 does not support object locking for concurrent writers. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you must build an object-locking mechanism into your application.
So, think about your architecture. Why do you have processes that want to write to the same file at the same time? Do the Lambdas that create these S3 files really need to write to the same file? If I understand your use case correctly, every Lambda could create a unique file, for example based on the name of the PDF you want to create, or with a timestamp added to it. That ensures you don't have write collisions. You could create lifecycle rules on the S3 bucket to delete the files after a day or so, so that you don't increase your storage costs too much. Or have a Lambda delete the file when it is finished with it.
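A minimal sketch of such a naming scheme (the helper and its naming convention are just an illustration):

    import time
    import uuid
    import boto3

    s3 = boto3.client("s3")

    def put_with_unique_key(bucket, base_name, body):
        # Turn "report.pdf" into e.g. "report-1700000000-3f2a9c1d.pdf"
        # so concurrent writers never collide on the same key.
        stem, dot, ext = base_name.rpartition(".")
        suffix = f"{int(time.time())}-{uuid.uuid4().hex[:8]}"
        key = f"{stem}-{suffix}.{ext}" if dot else f"{base_name}-{suffix}"
        s3.put_object(Bucket=bucket, Key=key, Body=body)
        return key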

Related

Is it possible to achieve parallel processing in AWS Lambda

I have Python code in AWS Lambda which is triggered by an SQS event.
The SQS message is generated when a new file arrives in a particular S3 location; the message in turn invokes the Lambda.
Right now, Lambda is processing the files one after the other, in serial. But I would like to know if we can process multiple files at the same time.
Example: if 5 files arrive in the S3 location, all 5 files should be processed in parallel at the same time.
I think you might have misread the behavior of your system. If you are using a native SQS standard queue with the Lambda integration, Lambda will consume the queue in batches; you can see a detailed explanation here:
https://aws.amazon.com/blogs/compute/understanding-how-aws-lambda-scales-when-subscribed-to-amazon-sqs-queues/
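As an illustration of what that batching looks like inside the handler (process_file is a hypothetical function standing in for your own logic):

    import json

    def handler(event, context):
        # Lambda delivers SQS messages in batches; each record is one
        # SQS message whose body wraps the original S3 notification.
        for record in event["Records"]:
            body = json.loads(record["body"])
            for s3_record in body.get("Records", []):
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                process_file(bucket, key)  # hypothetical processing function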
No need to add the SQS.
Enable triggering from the S3 PutObject action to the Lambda. With this, you can ensure one invocation per object, and also parallelism.
Also, check your reserved concurrency value.
Doc: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Check the concurrency dashboard in the Monitoring subsection; it will show that the function is already running in parallel.
Hope this helps!!!

AWS Lambda - best practice when reading from a long list/S3

I have a scheduled error-handling Lambda; I would like to use serverless technology here, as opposed to a Spring Boot service or something similar.
The Lambda will read from an S3 bucket and process accordingly. The problem is that at times the S3 bucket may have a high volume of data to be processed, and long-running operations aren't suited to Lambdas.
One solution I can think of is to have the Lambda read and process one item from the bucket and, on success, trigger another instance of the same Lambda, unless the bucket is empty/fully processed. The thing I don't like is that this is synchronous and quite slow. I also need to be conscious of running too many Lambdas at the same time, as we hit a REST endpoint as part of the error flow and don't want to overload it with too many requests.
I am thinking it would be nice to have maybe 3 instances of the Lambda running at the same time until the bucket is empty, but I'm not really sure. I am wondering if anyone has any nice patterns that could be used here, or suggestions on best practices?
Thanks
Create an S3 bucket for processing your files.
Enable an S3 -> Lambda trigger: on every new file in the bucket, Lambda will be invoked to process the file; every file is processed separately. https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Once the file is processed, you can either delete it or move it elsewhere.
Regarding concurrency, please have a look at provisioned concurrency: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Update:
As you still plan to use a scheduler Lambda and S3:
One Lambda reads/lists only the filenames and puts messages into SQS to process the files.
A second Lambda consumes the SQS messages and processes each file.
Note: I would recommend using SQS if the files/messages are not too big. It has built-in recovery mechanics (DLQs, delays, visibility timeouts, etc.) which benefit you more than plain S3 storage. Alternatively, just create a message with a file reference and still use SQS.
I'd separate the Lambda that is called by the scheduler from the Lambda that does the actual processing. When the scheduler calls the first Lambda, it can look at the contents of the bucket, then spawn the worker Lambdas to process the objects. This way you have control over how many objects you want per worker; a sketch follows.
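A minimal sketch of that fan-out, assuming an asynchronous invoke of a hypothetical worker function ("error-worker") over a hypothetical bucket:

    import json
    import boto3

    s3 = boto3.client("s3")
    lambda_client = boto3.client("lambda")

    CHUNK = 25  # objects per worker; tune to your workload

    def dispatcher(event, context):
        # List pending objects and fan out fixed-size chunks to workers.
        resp = s3.list_objects_v2(Bucket="error-bucket")  # hypothetical bucket
        keys = [obj["Key"] for obj in resp.get("Contents", [])]
        for i in range(0, len(keys), CHUNK):
            lambda_client.invoke(
                FunctionName="error-worker",   # hypothetical worker function
                InvocationType="Event",        # async, fire-and-forget
                Payload=json.dumps({"keys": keys[i:i + CHUNK]}),
            )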
Given your requirements, I would recommend:
Configure an Amazon S3 event so that a message is pushed to an Amazon SQS queue when objects are created in the S3 bucket
Schedule an AWS Lambda function at regular intervals that will:
  Check that the external service is working
  Invoke a Lambda function to process one message from the queue, and keep looping
The hard part would be throttling the second Lambda function so that it doesn't try to send all the requests at once (which might impact the external service).
You could probably do this by using a Step Function to trigger Lambda and then, if it was successful, trigger another Lambda function. This could even be done in parallel, such as allowing up to three parallel Lambda executions. The benefit of using Step Functions is that there is no cost for "waiting" for each Lambda to finish executing.
So, the Step Function flow would be something like:
Invoke a "check external service" Lambda function
If it fails, then quit the flow
Invoke the "processing" Lambda function, which will:
  Get one message
  Process the message
  If successful, remove the message from the queue
  Return success/fail
If it was successful, keep looping until the queue is empty
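A minimal sketch of that "processing" Lambda, assuming a hypothetical queue URL and a hypothetical handle() function for the real work; the return value lets the Step Function decide whether to loop:

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/error-queue"  # hypothetical

    def process_one(event, context):
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
        messages = resp.get("Messages", [])
        if not messages:
            return {"queueEmpty": True}
        msg = messages[0]
        try:
            handle(msg["Body"])  # hypothetical processing function
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
            return {"queueEmpty": False, "success": True}
        except Exception:
            # Leave the message on the queue; it becomes visible again
            # after the visibility timeout and can be retried.
            return {"queueEmpty": False, "success": False}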

Is there a way to add delay to trigger a lambda from S3 upload?

I have a Lambda function which is triggered by put/post events on an S3 bucket. This works fine if there is only one file uploaded to the S3 bucket.
However, at times multiple files could be uploaded, which can take up to 7 minutes to complete. This triggers my Lambda function multiple times, which adds the overhead of handling this in code.
Is there any way to either trigger the Lambda only once for the complete upload, or add a delay in the function to avoid multiple executions of the Lambda function?
There is no specific interval when the files will be uploaded to S3, hence I could not use a scheduler.
A delay feature was recently added for Lambda with Kinesis or DynamoDB event sources, but it's not supported for S3 events.
You can send events from S3 to SQS and have your Lambda consume the SQS events instead. It consumes them in batches by default.
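Along those lines, SQS also supports per-queue delivery delays, so a sketch like the following (queue name hypothetical) would hold each S3 event back for a few minutes before Lambda sees it:

    import boto3

    sqs = boto3.client("sqs")

    # DelaySeconds postpones delivery of every message on the queue,
    # giving a slow multi-file upload time to finish first.
    sqs.create_queue(
        QueueName="s3-upload-events",        # hypothetical queue name
        Attributes={"DelaySeconds": "300"},  # up to 900 seconds
    )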
It seems multipart upload is being used here by the client.
Maybe a duplicate of this? - AWS Lambda and Multipart Upload to/from S3
An alternative might be to have your Lambda function check for existence of all required files before moving on to the action you need to take. The Lambda function would still fire each time, but would exit quickly if not all files have been received yet.
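A minimal sketch of that check, assuming a hypothetical fixed list of expected keys:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    REQUIRED = ["data.csv", "manifest.json", "done.flag"]  # hypothetical set

    def all_files_present(bucket):
        # Return False as soon as any expected object is still missing,
        # so the function can exit quickly and wait for the next event.
        for key in REQUIRED:
            try:
                s3.head_object(Bucket=bucket, Key=key)
            except ClientError:
                return False
        return True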

SQS Trigger Lambda, with FileName in S3 for text extraction

I have a use case: I have a list of PDF files stored in an S3 bucket. I have listed them and pushed them to SQS for text extraction, and created one Lambda for processing those files by providing the bucket information and AWS Textract details.
The issue is that the Lambda times out, as SQS triggers multiple Lambda instances for all the files, and all of them wait for the Textract service.
I want the Lambda to trigger one by one for each SQS message (file name), so that timeouts do not occur, as we have a limit on accessing AWS Textract.
Processing 100+ files is a time-consuming task; I would suggest taking no more than 10 files per Lambda execution.
Use SQS with Lambda as an event source.
https://dzone.com/articles/amazon-sqs-as-an-event-source-to-aws-lambda-a-deep
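As a rough sketch of wiring that up with boto3 (the ARN and function name are placeholders), you could cap the batch size and the function's concurrency so only a few files hit Textract at once:

    import boto3

    lambda_client = boto3.client("lambda")

    # Deliver at most 10 SQS messages (files) per invocation.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:sqs:us-east-1:123456789012:pdf-files",  # placeholder
        FunctionName="text-extractor",                                  # placeholder
        BatchSize=10,
    )

    # Cap concurrent executions to stay under the Textract rate limit.
    lambda_client.put_function_concurrency(
        FunctionName="text-extractor",
        ReservedConcurrentExecutions=2,
    )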

How to determine how many times my Lambda executed over a certain period of time

I have one Lambda that is executed on an S3 put-object trigger.
Now any object uploaded to S3 triggers the Lambda.
Say someone uploads 5 files to S3; it will execute the Lambda once for each of the 5 files.
Is there any way for the Lambda to trigger only one time for all those 5 files?
And after those 5 triggers/executions complete, can I trace how many minutes the Lambda was not executing because no files were uploaded?
Any help would be really helpful.
If you have the bucket notification configured for object creation (s3:ObjectCreated), and you haven't specified a filter (or the filter matches the uploaded objects), your Lambda will get triggered once per uploaded object.
To see the number of invocations, you can look at the Invocations metric for your Lambda function in CloudWatch Metrics.
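For instance, a sketch that sums invocations over the last hour ("my-function" is a placeholder name):

    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,               # one datapoint per 5 minutes
        Statistics=["Sum"],
    )
    # Five-minute windows with no datapoint are periods with no executions.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], int(point["Sum"]))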
You may want to add a queue that handles the requests to process new files on S3.
A relevant choice could be a Kinesis data stream or SQS. If batching is important to you, Kinesis will probably be better.
The requests can be sent by a Lambda triggered by S3, as you described, but it will only send the request to the queue, and another Lambda will then process it. A simpler way would be to send the request from the same code that puts the object in S3 (if possible); a sketch follows below.
This way you can have statistics on how many requests were sent, processed, waiting, etc.
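A minimal sketch of that simpler path, combining the S3 put and the queue message in one step (the queue URL is hypothetical):

    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/process-files"  # hypothetical

    def upload_and_enqueue(bucket, key, body):
        # Upload the object and enqueue the processing request together,
        # instead of relying on an intermediate S3-triggered Lambda.
        s3.put_object(Bucket=bucket, Key=key, Body=body)
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )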