Why is DynamoDBStream not triggering lambda function in parallel?

I have this setup
ApiGateway -> Lambda1 -> DynamoDB -> Lambda2 -> SNS -> SQS
Here is what I am trying to do:
Make an http request to ApiGateway.
ApiGateway is integrated to Lambda1, so Lambda1 gets executed.
Lambda1 inserts an object to DynamoDB.
DynamoDBStream triggers Lambda2. Batch size is 100.
Lambda2 publishes a message to SNS for every inserted record.
SQS is subscribed to SNS.
Basically, if I make an http request to Api Gateway I expect to see a message ending up in SQS. Actually, for a single request everything works as expected.
I made this test:
Make 10 HTTP requests to warm up the Lambda functions, then wait 30 seconds.
Create 100 threads. Each thread makes HTTP requests until the total number of requests reaches 10000.
The 2nd step of the test completes in 110 seconds. My DynamoDB table is configured for 100 writes per second, so 110 seconds makes perfect sense. After 110 seconds I see these 10000 records in my DynamoDB table.
The problem is that it takes too long for the messages to end up in SQS. I checked the logs of Lambda2 and I see that it still gets triggered 30 mins after the test completes. I also see this pattern in the logs of Lambda2:
Start Request
Message published to SNS...
Message published to SNS...
[98 more "Message published to SNS..."]
End Request
The logs are a repetition of these lines. 100 lines of "Message published" makes sense because the DynamoDBStream trigger is configured with a batch size of 100. Each invocation of Lambda2 takes 50-60 seconds, which means it will take ~90 mins for all messages to end up in SQS.
What bothers me is that every "Start Request" comes after an "End Request". So the root cause seems to be that DynamoDBStream is not triggering Lambda2 in parallel.
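The ~90 minute figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes batches are processed strictly one after another and takes 55 seconds as the midpoint of the observed 50-60 seconds.

```python
# Numbers taken from the test described above.
total_records = 10_000
batch_size = 100
seconds_per_batch = 55  # assumed midpoint of the observed 50-60 s

# Sequential delivery: one batch must finish before the next one starts.
batches = total_records // batch_size
total_minutes = batches * seconds_per_batch / 60

print(f"{batches} sequential batches, roughly {total_minutes:.0f} minutes in total")
```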
Question
Why is DynamoDBStream not triggering lambda function in parallel? Am I missing a configuration?
Solution
After taking the advice from the answer and comment here is my solution.
I was re-creating SNS client before publishing each message. I made it a static variable in my class and Lambda2 started executing in ~15 seconds.
Then, I increased batch size of DynamoDB trigger to 1000.
Inside Lambda2 I processed (publish to SNS) DynamoDB records using 10 threads in parallel.
Increased Lambda2 memory allocation from 192MB to 512MB.
With these optimizations I can see all 10000 messages in SQS, 10-15 seconds after all http requests were sent.
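The first and third optimizations can be sketched in Python as follows. This is not the code from the question (which appears to be written in a different language); it is a minimal illustration that assumes boto3, a stream view type that includes the new image, and a hypothetical `TOPIC_ARN` environment variable. The client is created once per execution environment and shared by 10 worker threads.

```python
import functools
import json
import os
from concurrent.futures import ThreadPoolExecutor


@functools.lru_cache(maxsize=1)
def get_sns_client():
    """Create the SNS client once and reuse it across invocations."""
    import boto3  # imported lazily so this module loads without boto3 installed
    return boto3.client("sns")


def publish_inserts(records, topic_arn, sns=None, max_workers=10):
    """Publish one SNS message per INSERT record, using a thread pool."""
    if sns is None:
        sns = get_sns_client()
    inserts = [r for r in records if r.get("eventName") == "INSERT"]

    def publish(record):
        # NewImage holds the inserted item in DynamoDB JSON form.
        body = json.dumps(record["dynamodb"]["NewImage"])
        return sns.publish(TopicArn=topic_arn, Message=body)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every publish and re-raises the first error,
        # so a failed publish fails the invocation and the batch is retried.
        return list(pool.map(publish, inserts))


def handler(event, context):
    # TOPIC_ARN is a hypothetical environment variable used for illustration.
    publish_inserts(event["Records"], os.environ["TOPIC_ARN"])
```

Because the whole batch is retried on failure, downstream consumers should tolerate duplicate messages.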
Conclusion :)
In order to find the optimum (cheap & acceptable latency) solution, we need to make several tests with different batch size, number of threads, allocated memory etc.

As of now, there is no way to make DynamoDBStream trigger the lambda in parallel. Delivery is sequential only, in batches of the configured size.
There is no partial delivery either. If a batch is delivered to your lambda, you need to process all the elements in that batch. Otherwise the same batch will be delivered again later, possibly with more records.
Also, the lambda needs to complete successfully before the next batch is delivered. If it errors out, the stream will invoke the lambda with the same batch repeatedly until it succeeds or the data in the stream expires.

Related

AWS Lambda read from SQS without concurrency

My requirement is like this.
Read from a SQS every 2 hours, take all the messages available and then process it.
Processing includes creating a file with details from SQS messages and sending it to an sftp server.
I implemented an AWS Lambda to achieve point 1. The Lambda has an SQS trigger. I set the batch size to 50 and the batch window to 2 hours. My assumption was that the Lambda would get triggered every 2 hours, 50 messages would be delivered to the function in one go, and I would create a file for every 50 records.
But I observed that my Lambda function is triggered with a varying number of messages (sometimes 50, sometimes 20, sometimes 5, etc.) even though I have configured the batch size as 50.
After reading some documentation I learned (I am not sure) that Lambda spawns 5 long polling connections to read from SQS, and that this causes the function to be triggered with a varying number of messages.
My question is
Is my assumption about 5 parallel connections being established correct? If yes, is there a way I can control it? I want this to happen in a single thread / connection.
If 1 is not possible, what other alternatives do I have here? I do not want one file created for every few records. I want one file to be generated every two hours with all the messages in SQS.
A "SQS Trigger" for Lambda is implemented with the so-called Event Source Mapping integration, which polls, batches and deletes messages from the queue on your behalf. It's designed for continuous polling, although you can disable it. You can set a maximum batch size of up to 10,000 records a function receives (BatchSize) and a maximum of 300s long polling time (MaximumBatchingWindowInSeconds). That doesn't meet your once-every-two-hours requirement.
Two alternatives:
Remove the Event Source Mapping. Instead, trigger the Lambda every two hours on a schedule with an EventBridge rule. Your Lambda is responsible for the SQS ReceiveMessage and DeleteMessageBatch operations. This approach ensures your Lambda will be invoked only once per cron event.
Keep the Event Source Mapping. Process messages as they arrive, accumulating the partial results in S3. Once every two hours, run a second, EventBridge-triggered Lambda, which bundles the partial results from S3 and sends them to the SFTP server. You don't control the number of Lambda invocations.
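The first alternative could look roughly like the sketch below. It assumes a boto3 SQS client is passed in (for example `boto3.client("sqs")`) and that `process` is your own function that builds the file and uploads it; both names are illustrative. A single empty long-poll response is treated as "queue drained", which is a simplification, and the queue's visibility timeout must be longer than the whole run.

```python
def drain_queue(sqs, queue_url, max_messages=10_000):
    """Receive messages until a long poll comes back empty."""
    collected = []
    while len(collected) < max_messages:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # upper limit per ReceiveMessage call
            WaitTimeSeconds=5,       # long polling
        )
        messages = response.get("Messages", [])
        if not messages:
            break
        collected.extend(messages)
    return collected


def delete_messages(sqs, queue_url, messages):
    """Delete in chunks of 10, the DeleteMessageBatch limit."""
    for start in range(0, len(messages), 10):
        chunk = messages[start:start + 10]
        sqs.delete_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                for i, m in enumerate(chunk)
            ],
        )


def run_once(sqs, queue_url, process):
    """Body of the scheduled Lambda: drain, process, then delete."""
    messages = drain_queue(sqs, queue_url)
    if messages:
        process(messages)  # e.g. write one file and send it to the SFTP server
        delete_messages(sqs, queue_url, messages)
    return len(messages)
```

Deleting only after `process` returns means a failed run leaves the messages in the queue for the next scheduled invocation. A production version would also inspect the `Failed` list returned by `delete_message_batch`.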
Note on scaling:
<Edit (mid-Jan 2023): AWS Lambda now supports SQS Maximum Concurrency>
AWS Lambda now supports setting Maximum Concurrency to the Amazon SQS event source, a more direct and less fiddly way to control concurrency than with reserved concurrency. The Maximum Concurrency setting limits the number of concurrent instances of the function that an Amazon SQS event source can invoke. The valid range is 2-1000 concurrent instances.
The create and update Event Source Mapping APIs now have a ScalingConfig option for SQS:
aws lambda update-event-source-mapping \
--uuid "a1b2c3d4-5678-90ab-cdef-11111EXAMPLE" \
--scaling-config '{"MaximumConcurrency":2}' # valid range is 2-1000
</Edit>
With the SQS Event Source Mapping integration you can tweak the batch settings, but ultimately the Lambda service is in charge of Lambda scaling. As the AWS Blog Understanding how AWS Lambda scales with Amazon SQS standard queues says:
Lambda consumes messages in batches, starting at five concurrent batches with five functions at a time. If there are more messages in the queue, Lambda adds up to 60 functions per minute, up to 1,000 functions, to consume those messages.
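Read literally, the quoted figures give a simple linear ramp. The function below is only a toy model of that sentence, not the actual scaling algorithm, and the numbers are the ones quoted from the blog post, which may have changed since.

```python
def max_concurrency_after(minutes, initial=5, per_minute=60, ceiling=1000):
    """Upper bound on concurrent functions after `minutes`, per the quoted figures."""
    return min(initial + per_minute * minutes, ceiling)


for minute in (0, 1, 5, 17):
    print(minute, max_concurrency_after(minute))
```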
You could theoretically restrict the number of concurrent Lambda executions with reserved concurrency, but you would risk dropped messages due to throttling errors.
You could try to set the ReservedConcurrency of the function to 1. That may help. See the docs for reference.
A simple solution would be to create a CloudWatch Event Trigger (similar to a Cronjob) that triggers your Lambda function every two hours. In the Lambda function, you call ReceiveMessage on the Queue until you get all messages, process them and afterward delete them from the Queue. The drawback is that there may be too many messages to process within 15 minutes so that's something you'd have to manage.

How to trigger Lambda only if there are CloudWatch logs, and only once in 5 minutes

I have attached a CloudWatch Logs trigger to my lambda (the lambda has concurrency=1). The lambda runs an Athena query, which costs us money.
The problem is that if 10 (CloudWatch log) files are dumped within 2 seconds, the lambda is invoked 10 times. This is costly for me because the lambda runs a costly Athena query.
What I want is to trigger the lambda once every 5 minutes (like a DynamoDB trigger, exactly as #MrOverflow said in the comments), but only if a CloudWatch log was generated in the last 5 minutes. How do I do this (preferably without writing code)?
Edit 1
I can't have a fixed 5-minute trigger, as this would trigger the Athena query even when there is no activity.
Edit 2
This is the solution I think will work. But I am not sure how to implement it.
Have a trigger on the CloudWatch Timestamp column with an expression like: Timestamp in second/5*60 % 0. The advantage here is that my Athena query will not run all night or on holidays when there is no traffic. Also, my current lambda will get triggered every 5 minutes.
However, the downside of this solution is that if the upstream does not generate a log exactly at the 5th-minute second, the lambda is not triggered. Also, if you have 10 logs in the same second, the lambda gets triggered 10 times.
The other approach I have in mind is to trigger the lambda every 5 minutes from CloudWatch, but maintain state in DynamoDB. If the lambda was not triggered in the last 5 minutes, ignore the call from CloudWatch.
This involves coding, which I hate to do.
Another option: Scheduled Lambda with a CloudWatch FilterLogEvents API call:
A Scheduled Event triggers a Lambda every 5 minutes, a la #MrOverflow.
The Lambda calls FilterLogEvents, setting the startTime param to 5 minutes ago, limit to 1, and optionally setting a filter pattern.
If the response events array is not empty, at least 1 file was received in the past 5 minutes. Run the Athena query.
If the response events array is empty, exit the lambda.*
The Athena job will run 0 or 1 times every 5 minutes.
* You can imagine edge cases where latencies in event triggering and logging might cause this approach to miss an eligible log event. If false negatives are a concern, consider having the lambda trigger the Athena run periodically during peak periods even if the events array is empty.
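The check described in the steps above might be sketched like this. It assumes a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) is passed in; the function and parameter names are illustrative. Because a FilterLogEvents call can return an empty page while more results remain, the sketch follows `nextToken` until it finds an event or runs out of pages.

```python
import time


def has_recent_log_events(logs, log_group, window_seconds=300,
                          filter_pattern=None, now_ms=None):
    """Return True if the log group received an event within the window."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    params = {
        "logGroupName": log_group,
        "startTime": now_ms - window_seconds * 1000,  # epoch milliseconds
        "limit": 1,
    }
    if filter_pattern:
        params["filterPattern"] = filter_pattern

    while True:
        response = logs.filter_log_events(**params)
        if response.get("events"):
            return True
        token = response.get("nextToken")
        if not token:
            return False
        params["nextToken"] = token  # keep paging through empty pages


def maybe_run_query(logs, log_group, run_query):
    """Scheduled Lambda body: run the costly query only if there was activity."""
    if has_recent_log_events(logs, log_group):
        run_query()  # e.g. start the Athena query
        return True
    return False
```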

AWS lambda not processing S3 file fast enough

My requirement is to process files that get created in S3 and stream the content of each file to an SQS queue, which will be consumed by other processes.
When a new file is created in the S3 bucket, a notification is published to an SQS queue, which triggers the lambda. The lambda, written in Python, processes the file and publishes its content to an SQS queue. The maximum file size is 100 MB, so a file might contain 300K messages, but processing is very slow. I am not sure where the problem is. I have set the lambda memory limit to 10 GB and the timeout to 15 mins. I have also set the concurrency limit to 100.
S3---->SQS--->lambda-->SQS
I have set the visibility timeout to 30 mins for the message; the processing is so slow that the file creation message gets moved to the dead letter queue.
It will take somewhere between 10 and 50 milliseconds to write a single message to SQS. If you have 300,000 messages that you're trying to write in a single Lambda invocation, then that's 3,000 seconds in the best case, which is larger than the Lambda timeout.
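The arithmetic, using the figures above and the 15-minute timeout from the question, written out as a small sketch:

```python
def total_send_seconds(messages, latency_ms):
    """Time to send messages one at a time at a fixed per-message latency."""
    return messages * latency_ms / 1000


LAMBDA_TIMEOUT_S = 15 * 60  # 900 seconds, the timeout set in the question

for latency_ms in (10, 50):
    seconds = total_send_seconds(300_000, latency_ms)
    print(f"{latency_ms} ms per message: {seconds:.0f} s "
          f"(timeout is {LAMBDA_TIMEOUT_S} s)")
```

Even the best case is more than three times the timeout, so a single sequential sender cannot finish one file within one invocation.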
Once the Lambda times out, any SQS messages that it was processing will go back on the queue and be delivered again once their visibility timeout expires.
You can try multi-threading the code that writes messages to SQS. Since it's mostly network IO, you should be able to scale linearly up to a dozen or so threads. That may, however, just mean that your downstream message handlers get overloaded.
Also, reduce your batch size to 1. SQS will invoke multiple Lambdas if there are more messages in the queues.
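A minimal sketch of the multi-threaded writer, assuming a boto3 SQS client is passed in. Grouping messages into SendMessageBatch calls of 10 is an addition that the answer does not mention; it reduces the number of round trips, and the thread pool then overlaps the network waits.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_all(sqs, queue_url, bodies, max_workers=12):
    """Send every body to the queue; return the number of failed entries."""

    def send(batch):
        response = sqs.send_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {"Id": str(i), "MessageBody": body}
                for i, body in enumerate(batch)
            ],
        )
        return len(response.get("Failed", []))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # SendMessageBatch accepts at most 10 entries per call.
        return sum(pool.map(send, chunked(bodies, 10)))
```

A non-zero return value means some entries were rejected and should be retried before the source file is considered done.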

AWS SQS triggered lambda suddenly stalling and not deleting messages

I have a Python lambda function that is connected to an SQS queue trigger with batch size 1. The SQS messages contain a file location on S3, along with a few metadata values.
When a message becomes available, the function reads some metadata from the file on S3 referenced in the message, creates a YAML file with more metadata which is then dumped to S3 and references the metadata file in an RDS database.
After I submit a load of messages to the queue (~1.7k) , all seems to go well initially, with the number of messages available dropping and the lambda executions ramping up.
But after some time, the execution time increases significantly to the point where the functions time out (time out is set at 90 secs). I don't see any errors in the logs, and the executions are still successful (if they don't time out).
All of this can be seen in the monitoring:
Here in the lambda monitoring, you can see the sudden increase in duration, coinciding with a drop in concurrent executions and the sudden appearance of errors (at worst there are two errors, a 60% success rate). The gap you see is me disabling and enabling the trigger hoping for a change.
Here's the SQS monitoring for the same period:
You can see the number of messages visible leveling out at 192, and the number of messages received at 5. More puzzling for me, even though there are successful executions, the number of messages deleted drops to 0.
I really can't figure out why this issue is appearing now, I've been using this configuration w/o issues and changes.
Can it be that the SQS trigger configuration blocks the queue when there's a timeout reading from S3? Any clues?
Thanks!
Edit:
The RDS cluster metrics:
If the lambda successfully processes messages but no messages get deleted from the SQS queue, that most likely indicates a mismatch between the queue visibility timeout and the lambda timeout. You should make sure that the lambda service that picks up the message has enough time to finish processing it and to tell SQS to delete it. If the lambda takes 70 seconds but the queue only has a visibility timeout of 60s, the DeleteMessage request by the lambda service will be silently rejected, and the message will remain in the queue and be processed again at a later time, potentially with the exact same outcome.
First note: If you have a concurrency limit set for the lambda, the visibility timeout for the queue should not just equal the lambda timeout but be a multiple of it, 5 or 6 times the lambda timeout. The reason is that the lambda service may pick up the message and try to invoke the lambda, but the invocation gets throttled; the lambda service then waits (for the lambda timeout) before retrying the message. During all that, the lambda service does not return the message to the queue; it keeps it in memory and does not extend the visibility timeout or anything like that. It retries a few times (5 or 6) before the message is actually discarded / returned to SQS. You should be able to try this out by creating a lambda with a timeout of e.g. 10 seconds, having it simply sleep / wait for 9 seconds, setting a concurrency limit of 1, and then putting 1000 messages into the queue.
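As a rough illustration of the rule of thumb above, the sketch below computes a visibility timeout from the function timeout and applies it with a boto3 SQS client passed in by the caller. The helper names and the default multiplier of 6 are assumptions made for the example.

```python
SQS_MAX_VISIBILITY_S = 12 * 60 * 60  # SQS caps the visibility timeout at 12 hours


def recommended_visibility_timeout(function_timeout_s, multiplier=6):
    """Visibility timeout as a multiple of the Lambda timeout, capped at the SQS maximum."""
    return min(function_timeout_s * multiplier, SQS_MAX_VISIBILITY_S)


def apply_visibility_timeout(sqs, queue_url, function_timeout_s, multiplier=6):
    """Set the queue's VisibilityTimeout attribute and return the value used."""
    seconds = recommended_visibility_timeout(function_timeout_s, multiplier)
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"VisibilityTimeout": str(seconds)},  # attribute values are strings
    )
    return seconds
```

For the 90-second function timeout in the question, this gives a visibility timeout of 540 seconds.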
Second note: these kinds of sudden bulk operations can cause all sorts of throttling issues that don't occur normally, either in your own downstream services or even in AWS services. E.g. if your lambda performs an assume-role call or retrieves some config object from S3, making 500 such requests the instant the messages land in the queue will often get you into trouble. The underlying database may become slow / unresponsive while buffering all the incoming requests, etc.
An easy solution to that problem is to throttle the lambda by setting its concurrency limit. At that point, make sure the queue has a proper visibility timeout as detailed in the previous section. And to make sure you are alerted of an actual increase in requests, watch the ApproximateAgeOfOldestMessage metric of the queue so that you are alerted if there is a growing backlog.
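One way to watch that metric is a CloudWatch alarm. The sketch below only builds the arguments for `put_metric_alarm`; the alarm name, the 10-minute threshold and the 5-minute period are arbitrary choices for illustration.

```python
def oldest_message_alarm(queue_name, max_age_seconds=600):
    """Build put_metric_alarm arguments for a backlog alarm on one queue."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # illustrative name
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Usage, assuming boto3 and suitable permissions:
#   boto3.client("cloudwatch").put_metric_alarm(**oldest_message_alarm("my-queue"))
```

Without an alarm action (for example an SNS topic in `AlarmActions`), the alarm only changes state, so add one if you want to be notified.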
Third note: if the lambda only misbehaves when a lot of requests are coming in, one potential reason is a memory leak in the lambda. Since the execution contexts of a lambda are reused between invocations, the memory leak persists across invocations as well. If there are few requests coming in, you may always get a new execution context, meaning the lambda starts with fresh memory each time. But if a lot of requests are coming in, the execution contexts are certainly getting reused, which might cause the leak to grow so big that the lambda basically freezes up due to garbage collection kicking in. The same goes for the /tmp directory in the lambda.

AWS Lambda Polling from SQS: in-flight messages count

I have 20K messages in an SQS queue. I also have a lambda that processes the SQS messages and puts the data into an ElasticSearch server.
I have configured SQS as the lambda's trigger and limited the Lambda's SQS batch size to 10. I also made sure that only one instance of the lambda can run at a given time.
However, sometimes I see over 10K in-flight messages in the AWS console. Shouldn't it max out at 10 in-flight messages?
Because of this, the lambda is only able to process 9K of the SQS messages properly.
Below is a screen capture showing that I have limited the lambda to only 1 running instance at a given time.
I've been doing some testing and contacting AWS tech support at the same time.
What I do believe at the moment is that:
Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute. Doc
1/ The thing that does the polling is a separate entity. It is most likely a lambda function that long-polls SQS and then invokes our lambda functions.
2/ That polling Lambda does not take our Receiver-Lambda into account at all. It does not care whether the function is running at max capacity or not, or how much concurrency is available for the Receiver-Lambda.
3/ Due to that combination, the behavior is not what we expected from the Lambda-SQS integration. Worse, if a burst of millions of messages suddenly arrives in your queue, the Receiver-Lambda concurrency can never catch up with the number of messages that the polling Lambda is sending, resulting in lost work.
The test:
Create one Lambda function that takes 30 seconds to return true;
Set that function's concurrency to 50;
Push 300 messages into the queue ( Visibility timeout : 10 Minutes, batch message count: 1, no re-drive )
The result:
The number of messages available just increases gradually
At first, there are few enough messages to be processed by the Receiver-Lambda
After half a minute, there are more messages available than the Receiver-Lambda can handle
These messages would be discarded to the dead queue, because the polling Lambda is unable to invoke the Receiver-Lambda
I will update this answer as soon as I get confirmation from AWS support
Support answer. As of Q1 2019, TL;DR version
1/ The assumption was correct, there was a "Poller"
2/ That Poller does not take reserved concurrency into consideration as part of its algorithm
3/ That Poller has a hard limit of 1000
Q2-2019 :
The above information needs to be updated. Support said that the poller correctly considers reserved concurrency, but it should be at least 5. The SQS-Lambda integration is still being updated and this answer will not be. So please consult AWS if you run into weird issues