AWS Lambda hangs between invocations - amazon-web-services

I am using the following 3 services: Amazon S3, Amazon SQS and AWS Lambda.
The same configuration is used for processing both CSV and Excel files (the Lambda function that processes Excel files just converts them to CSV and re-uploads them to S3 so that the other Lambda function can process them).
AWS Lambda configurations:
Memory: 1024 MB
Timeout: 6 minutes
Reserved concurrency: 1 (for the current testing I don't need multiple parallel functions)
Retry attempts: 0
DLQ: none configured at the moment (will be added later)
For Amazon S3:
On a 's3:ObjectCreated:*' event, S3 sends a message to a configured SQS queue.
The SQS queue has a Lambda trigger attached to it.
I have an external process that is uploading files to my S3 bucket.
This is the start of the entire workflow (S3 --> SQS --> Lambda).
This process uploaded around 40 files in a very short period of time (some CSV and some Excel files).
I was watching the SQS queues and CloudWatch to see how the processing was going. I could see about 15 messages in flight for the SQS queue handling the CSV files and about 17 messages in flight for the SQS queue handling the Excel files, the CloudWatch logs were being updated, and everything looked fine.
After about 15 seconds of processing, everything stopped. Both Lambda functions were just hanging. I still saw around 15 and 13 messages in flight for the two SQS queues, but absolutely nothing was happening in AWS Lambda.
It looked like something went wrong.
After about 5 minutes of doing nothing, both functions suddenly started to process files. Both functions processed a couple of files for about 15 seconds, and then silence once again.
After another 5 minutes of doing nothing, both functions started processing files again.
This happened a couple of times, with 5-minute breaks.
The Lambda functions are not making any external calls or doing anything else that could make them hang. The waiting happened between AWS Lambda invocations, so it wasn't within my code.
For example:
2021-01-22T17:23:56.426+02:00 REPORT RequestId: d0a01831-ff93-5a71-83d6-40b50fd0affa Duration: 453.19 ms Billed Duration: 454 ms Memory Size: 1024 MB Max Memory Used: 319 MB
2021-01-22T17:29:41.860+02:00 START RequestId: 752f0eef-6738-5c24-ad52-566b96983c92 Version: $LATEST
What is making AWS Lambda hang?
PS: If it would be helpful, I can attach the CloudWatch logs.

I think the culprit is "Reserved concurrency: 1". The SQS --> Lambda part involves an invisible middleman called the event source mapping, which polls the queue and invokes the Lambda for you.
When messages are available, [The event source mapping] Lambda reads up to 5 batches and sends them to your function.
Since you have configured the reserved concurrency to be 1, when the event source mapping tries to do 5 invocations at once, 4 of them get a "Rate exceeded" error. Those messages are put back into the queue and only become visible again after the visibility timeout. The queue then triggers the Lambda again, and this trial-and-error process continues. Each time, the Lambda is only able to process 1 message, and the "hanging" behaviour you are seeing is actually the remaining 4 messages waiting out the visibility timeout.
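One way to check this theory (a rough sketch, not taken from the question; the function name and time window are placeholders) is to look at the Throttles metric for the function in CloudWatch, e.g. with boto3:

# Sketch: count how often the function was throttled ("Rate exceeded")
# during the affected window. Function name and time range are placeholders.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Throttles",
    Dimensions=[{"Name": "FunctionName", "Value": "csv-processor"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])

If the throttle count spikes right before each quiet period, the reserved concurrency of 1 is the likely cause.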

Your use case may be a good fit for a workflow built with AWS Step Functions. You can break down each step into a separate Lambda function and track each step in AWS Step Functions. It is a really nice tool: you can step through the workflow and inspect each step. For details, see:
Create AWS serverless workflows by using the AWS SDK for Java

Related

AWS Lambda read from SQS without concurrency

My requirement is like this.
Read from an SQS queue every 2 hours, take all the available messages and then process them.
Processing includes creating a file with details from the SQS messages and sending it to an SFTP server.
I implemented an AWS Lambda to achieve the first point. The Lambda has an SQS trigger. I have set the batch size to 50 and the batch window to 2 hours. My assumption was that the Lambda would be triggered every 2 hours, 50 messages would be delivered to the function in one go, and I would create a file for every 50 records.
But I observed that my Lambda function is triggered with a varying number of messages (sometimes 50, sometimes 20, sometimes 5, etc.) even though I have configured the batch size as 50.
After reading some documentation I got the impression (I am not sure) that Lambda spawns 5 long-polling connections to read from SQS, and that this is causing the function to be triggered with a varying number of messages.
My questions are:
Is my assumption about 5 parallel connections being established correct? If yes, is there a way I can control it? I want this to happen in a single thread / connection.
If 1 is not possible, what other alternative do I have? I do not want one file created for every few records. I want one file generated every two hours containing all the messages in SQS.
A "SQS Trigger" for Lambda is implemented with the so-called Event Source Mapping integration, which polls, batches and deletes messages from the queue on your behalf. It's designed for continuous polling, although you can disable it. You can set a maximum batch size of up to 10,000 records a function receives (BatchSize) and a maximum of 300s long polling time (MaximumBatchingWindowInSeconds). That doesn't meet your once-every-two-hours requirement.
Two alternatives:
Remove the Event Source Mapping. Instead, trigger the Lambda every two hours on a schedule with an EventBridge rule. Your Lambda is responsible for the SQS ReceiveMessage and DeleteMessageBatch operations. This approach ensures your Lambda will be invoked only once per cron event (a rough sketch of such a handler follows below the list).
Keep the Event Source Mapping. Process messages as they arrive, accumulating the partial results in S3. Once every two hours, run a second, EventBridge-triggered Lambda, which bundles the partial results from S3 and sends them to the SFTP server. You don't control the number of Lambda invocations.
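For the first alternative, a minimal sketch of the scheduled handler might look like this (the queue URL is a placeholder, file creation and SFTP upload are stubbed out, and error handling is omitted):

# Sketch of an EventBridge-scheduled Lambda that drains the queue itself.
# Queue URL is a placeholder; processing must finish within the visibility timeout.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

def handler(event, context):
    messages = []
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # 10 is the per-call maximum
            WaitTimeSeconds=1,
        )
        batch = resp.get("Messages", [])
        if not batch:
            break  # note: an empty short-poll response is not a strict guarantee the queue is empty
        messages.extend(batch)

    # Build one file from all message bodies and send it to the SFTP server here.
    # bodies = [m["Body"] for m in messages]

    # Delete the messages only after the file was created successfully.
    for i in range(0, len(messages), 10):  # DeleteMessageBatch accepts up to 10 entries
        chunk = messages[i:i + 10]
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "ReceiptHandle": m["ReceiptHandle"]}
                for n, m in enumerate(chunk)
            ],
        )
    return {"processed": len(messages)}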
Note on scaling:
<Edit (mid-Jan 2023): AWS Lambda now supports SQS Maximum Concurrency>
AWS Lambda now supports setting Maximum Concurrency to the Amazon SQS event source, a more direct and less fiddly way to control concurrency than with reserved concurrency. The Maximum Concurrency setting limits the number of concurrent instances of the function that an Amazon SQS event source can invoke. The valid range is 2-1000 concurrent instances.
The create and update Event Source Mapping APIs now have a ScalingConfig option for SQS:
aws lambda update-event-source-mapping \
--uuid "a1b2c3d4-5678-90ab-cdef-11111EXAMPLE" \
--scaling-config '{"MaximumConcurrency":2}' # valid range is 2-1000
</Edit>
With the SQS Event Source Mapping integration you can tweak the batch settings, but ultimately the Lambda service is in charge of Lambda scaling. As the AWS Blog Understanding how AWS Lambda scales with Amazon SQS standard queues says:
Lambda consumes messages in batches, starting at five concurrent batches with five functions at a time. If there are more messages in the queue, Lambda adds up to 60 functions per minute, up to 1,000 functions, to consume those messages.
You could theoretically restrict the number of concurrent Lambda executions with reserved concurrency, but you would risk dropped messages due to throttling errors.
You could try to set the ReservedConcurrency of the function to 1. That may help. See the docs for reference.
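For reference, reserved concurrency can be set programmatically as well; a minimal sketch with boto3 (the function name is a placeholder):

import boto3

lambda_client = boto3.client("lambda")

# Reserve a concurrency of 1 for the function (function name is a placeholder).
lambda_client.put_function_concurrency(
    FunctionName="sqs-file-builder",
    ReservedConcurrentExecutions=1,
)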
A simple solution would be to create a CloudWatch Events trigger (similar to a cron job) that invokes your Lambda function every two hours. In the Lambda function, you call ReceiveMessage on the queue until you have fetched all the messages, process them, and afterwards delete them from the queue. The drawback is that there may be too many messages to process within the 15-minute Lambda limit, so that's something you'd have to manage.

SQS invoke multiple lambda at same time

I am new to AWS SQS. As of now, my understanding is that SQS has a queue which stores request messages (parameters), and the attached Lambda fetches a number of messages based on the batch size which we set on the Lambda.
So if the SQS queue has 10,000 messages and the Lambda batch size is set to 100, then on each poll the Lambda fetches 100 messages from the queue and executes until all of them are processed, then it pulls another 100 messages, and so on?
So, as of now, I understand that Lambda will wait for the next poll until the previous batch has finished processing.
I hope I am correct; if not, please correct me.
Now my requirement is that Lambda should not wait for the previous batch to finish; instead it should pull the next 100 messages and execute them in parallel. For example, Lambda should create separate instances (something like that) and each instance should pull 100 messages and execute them in parallel.
In the situation you describe, the AWS Lambda service will automatically run multiple AWS Lambda functions based upon the concurrency settings of your function.
See: Lambda function scaling - AWS Lambda
The default is to permit up to 1000 concurrent executions of an AWS Lambda function.
Therefore, you do not need to change anything. It will automatically create multiple instances of the Lambda function in parallel and pass (up to) 100 messages to each execution.
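For reference, an SQS event source mapping with a batch size of 100 could be created roughly like this (queue ARN and function name are placeholders; batch sizes above 10 require a batching window):

import boto3

lambda_client = boto3.client("lambda")

# Attach the queue to the function with a batch size of 100.
# Queue ARN and function name are placeholders.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:my-queue",
    FunctionName="my-processing-function",
    BatchSize=100,
    MaximumBatchingWindowInSeconds=10,  # wait up to 10s to fill the batch
)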
For a really good series of articles to understand how AWS Lambda operates, see: Operating Lambda: Performance optimization – Part 1 | AWS Compute Blog

AWS lambda not processing S3 file fast enough

My requirement is to process files that get created in S3 and stream the content of each file to an SQS queue, which will be consumed by other processes.
When a new file gets created in the S3 bucket, a notification is published to an SQS queue, which triggers the Lambda. The Lambda, written in Python, processes the file and publishes the content to an SQS queue. File size is at most 100 MB, so a file might produce 300K messages, but it is being processed very slowly. I am not sure where the problem is; I have set the Lambda memory limit to 10 GB and the runtime to 15 minutes, and I have also set the concurrency limit to 100.
S3 ---> SQS ---> Lambda ---> SQS
I have set the visibility timeout to 30 minutes for the message; the processing is so slow that the file-creation message gets moved to the dead-letter queue.
It will take somewhere between 10 and 50 milliseconds to write a single message to SQS. If you have 300,000 messages that you're trying to write in a single Lambda invocation, then that's 3,000 seconds in the best case, which is larger than the Lambda timeout.
Once the Lambda times out, any SQS messages that it was processing will go back on the queue and be delivered again once their visibility timeout expires.
You can try multi-threading the code that writes messages to SQS. Since it's mostly network IO, you should be able to scale linearly up to a dozen or so threads. That may, however, just mean that your downstream message handlers get overloaded.
Also, reduce your batch size to 1. SQS will invoke multiple Lambdas if there are more messages in the queues.
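A rough sketch of the multi-threaded write suggested above (queue URL is a placeholder; 10 messages per SendMessageBatch call to cut down on round trips; retries of failed entries omitted):

# Sketch: write messages to SQS from several threads, 10 per SendMessageBatch call.
# Queue URL is a placeholder; in real code, retry any entries reported in "Failed".
from concurrent.futures import ThreadPoolExecutor
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/out-queue"  # placeholder

def send_chunk(chunk):
    sqs.send_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[{"Id": str(i), "MessageBody": body} for i, body in enumerate(chunk)],
    )

def publish_all(bodies, workers=10):
    chunks = [bodies[i:i + 10] for i in range(0, len(bodies), 10)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send_chunk, chunks))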

Why is DynamoDBStream not triggering lambda function in parallel?

I have this setup
ApiGateway -> Lambda1 -> DynamoDB -> Lambda2 -> SNS -> SQS
Here is what I am trying to do:
Make an http request to ApiGateway.
ApiGateway is integrated to Lambda1, so Lambda1 gets executed.
Lambda1 inserts an object to DynamoDB.
DynamoDBStream triggers Lambda2. Batch size is 100.
Lambda2 publishes a message to SNS for every inserted record.
SQS is subscribed to SNS.
Basically, if I make an http request to Api Gateway I expect to see a message ending up in SQS. Actually, for a single request everything works as expected.
I made this test:
Make 10 http requests to warm up the Lambda functions and wait for 30 seconds.
Create 100 threads. Each thread makes an http request until the total number of requests is 10000.
The 2nd step of the test completes in 110 seconds. My DynamoDB table is configured for 100 writes per second, so the 110 seconds makes perfect sense. After 110 seconds I see these 10000 records in my DynamoDB table.
The problem is that it takes too much time for the messages to end up in SQS. I checked the logs of Lambda2 and I see that it is still being triggered 30 minutes after the test completes. Also, in the logs of Lambda2 I see this pattern:
Start Request
Message published to SNS...
Message published to SNS...
[98 more "Message published to SNS..."]
End Request
The logs consist of repetitions of these lines. The 100 "Message published to SNS..." lines make sense because the DynamoDB stream is configured with a batch size of 100. Each request to Lambda2 takes 50-60 seconds, which means it will take ~90 minutes for all messages to end up in SQS.
What bothers me is that every "Start Request" comes after an "End Request". So the root cause seems to be that the DynamoDB stream is not triggering Lambda2 in parallel.
Question
Why is the DynamoDB stream not triggering the Lambda function in parallel? Am I missing a configuration?
Solution
After taking the advice from the answer and the comments, here is my solution:
I was re-creating the SNS client before publishing each message. I made it a static variable in my class, and Lambda2 started executing in ~15 seconds.
Then I increased the batch size of the DynamoDB trigger to 1000.
Inside Lambda2 I processed (published to SNS) the DynamoDB records using 10 threads in parallel.
I increased Lambda2's memory allocation from 192 MB to 512 MB.
With these optimizations I can see all 10000 messages in SQS 10-15 seconds after all the http requests were sent.
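For illustration only (the original Lambda may be in a different runtime), the two key changes look roughly like this in a Python handler: create the SNS client once outside the handler so it is reused across warm invocations, and publish the records with a small thread pool. The topic ARN is a placeholder.

# Sketch of the two key changes, assuming a Python handler for illustration:
# reuse the SNS client across invocations and publish with 10 threads.
from concurrent.futures import ThreadPoolExecutor
import json
import boto3

sns = boto3.client("sns")  # created once, reused on warm invocations
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:my-topic"  # placeholder

def publish_record(record):
    sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(record["dynamodb"]["NewImage"]))

def handler(event, context):
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(publish_record, event["Records"]))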
Conclusion :)
In order to find the optimum (cheap & acceptable latency) solution, we need to make several tests with different batch size, number of threads, allocated memory etc.
There is currently no way to make a DynamoDB stream trigger your Lambda in parallel. Delivery is sequential, in batches of the size you configured.
There is also no partial delivery. If a batch is delivered to your Lambda, you need to process all the elements in the batch; otherwise the same batch (possibly with more records) will be delivered again later.
The Lambda also needs to complete successfully before the next batch is delivered. If it errors out, the same batch is retried repeatedly until it is processed successfully or the data expires from the stream.

Controlling Lambda + Kinesis Costs

We have a .NET client application that uploads files to S3. There is an event notification registered on the bucket which triggers a Lambda to process the file. If we need to do maintenance, then we suspend our processing by removing the event notification and adding it back later when we're ready to resume processing.
To process the backlog of files that have queued up in S3 during the period the event notification was disabled, we write a record to a Kinesis stream with the S3 key of each file, and we have an event source mapping that lets Lambda consume each Kinesis record. This works great for us because it allows us to control our concurrency when we are processing a large backlog by controlling the number of shards in the stream. We were originally using SNS, but when we had thousands of files that needed to be reprocessed, SNS would keep starting Lambdas until we hit our concurrent executions threshold, which is why we switched to Kinesis.
The problem we're facing right now is that the cost of kinesis is killing us, even though we barely use it. We get 150 - 200 files uploaded per minute, and our lambda takes about 15 seconds to process each one. If we suspend processing for a few hours we end up with thousands of files to process. We could easily reprocess them with a 128 shard stream, however that would cost us $1,400 / month. The current cost for running our Lambda each month is less than $300. It seems terrible that we have to increase our COGS by 400% just to be able to control our concurrency level during a recovery scenario.
I could attempt to keep the stream size small by default and then resize it on the fly before we re-process a large backlog, however resizing a stream from 1 shard up to 128 takes an incredibly long time. If we're trying to recover from an unplanned outage then we can't afford to sit around waiting for the stream to resize before we can use it. So my questions are:
Can anyone recommend an alternative pattern to using kinesis shards for being able to control the upper bound on the number of concurrent lambdas draining a queue?
Is there something I am missing which would allow us to use Kinesis more cost efficiently?
You can use SQS with Lambda or with worker EC2 instances.
Here is how it can be achieved (2 approaches):
1. Serverless Approach
S3 -> SNS -> SQS -> Lambda Scheduler -> Lambda
Use SQS instead of Kinesis for storing S3 Paths
Use a Lambda Scheduler to keep polling messages (S3 paths) from SQS
Invoke the Lambda function from the Lambda scheduler to process the files (a rough sketch follows below)
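A rough sketch of such a scheduler Lambda (queue URL, function name and the per-run cap are all placeholders; the processing Lambda is invoked asynchronously, so the scheduler controls the upper bound by limiting how many invocations it starts per run):

# Sketch: pull a capped number of S3 paths from SQS and fan them out
# to the processing function asynchronously. All names are placeholders.
import json
import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-paths"  # placeholder
PROCESSOR = "file-processor"                                             # placeholder
MAX_PER_RUN = 50  # upper bound on processing Lambdas started per scheduler run

def handler(event, context):
    started = 0
    while started < MAX_PER_RUN:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            if started >= MAX_PER_RUN:
                break  # leave the rest in the queue for the next run
            lambda_client.invoke(
                FunctionName=PROCESSOR,
                InvocationType="Event",  # asynchronous fan-out
                Payload=json.dumps({"s3_key": m["Body"]}),
            )
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
            started += 1
    return {"started": started}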
2. EC2 Approach
S3 -> SNS -> SQS -> Beanstalk Worker
Use SQS instead of Kinesis for storing S3 Paths
Use Beanstalk Worker environment which polls SQS automatically
Implement the application (processing logic) in the Beanstalk worker, hosted on a local HTTP server on the same EC2 instance