AWS SQS Lambda Trigger and Concurrency

I've seen a number of SO questions on limiting Lambda concurrent execution but none on the inverse issue.
I need to increase my concurrent execution but am having issues. I've got a Lambda triggered off an SQS queue. I've published a version of the function and assigned it 3,000 concurrent executions (my limit has been increased to 5,000 from the default of 1,000).
Despite this, when I run my process I see hundreds of thousands of messages waiting in the queue while the Monitoring tab of my Lambda function shows my "Concurrent executions" never going above 1,250 and my "ProvisionedConcurrencyUtilization" never going above 50%. Moreover, the chart seems to imply a hard limit of 1,250.
I'd be inclined to suspect that there is some sort of limit preventing any single Lambda from using more than 25% of total provisioned capacity (1,250 is 25% of 5,000), but the AWS documentation states otherwise. I did see this SO question (AWS Lambda Triggered by SQS increases SQS request count) which discusses Lambda/SQS polling, but it and the documentation it links to indicate my process should use 100% of the provisioned capacity. But perhaps it's the polling that's causing the issue.
In any event, these messages sit in the queue for over an hour to process ... with never more than 1,250 processing at the same time ... while the rest of that provisioned concurrency sits idle.
Any suggestions/ideas are greatly appreciated.

Jelly's suggestion was a good one.
Unfortunately, AWS says there is a hard limit of 1,250 Lambda concurrent executions when using an Amazon SQS trigger.
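For anyone who wants to inspect, or deliberately cap, how much concurrency an SQS trigger may consume, newer Lambda releases expose a per-event-source-mapping setting. A minimal boto3 sketch (the function name is hypothetical); note that this sets an upper bound on the mapping, it does not raise the service-side ceiling described above:

    import boto3

    lambda_client = boto3.client("lambda")

    # Find the SQS event source mapping for the function (name is hypothetical).
    mappings = lambda_client.list_event_source_mappings(FunctionName="my-processor")
    for m in mappings["EventSourceMappings"]:
        print(m["UUID"], m["EventSourceArn"], m.get("ScalingConfig"))

    # Cap (not raise) how many concurrent executions this trigger may use.
    lambda_client.update_event_source_mapping(
        UUID=mappings["EventSourceMappings"][0]["UUID"],
        ScalingConfig={"MaximumConcurrency": 1000},
    )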

Related

Autoscale AWS Lambda concurrency based off throttling errors

I have an AWS Lambda function using an AWS SQS trigger to pull messages, process them with an AWS Comprehend endpoint, and put the output in AWS S3. The AWS Comprehend endpoint has a rate limit which goes up and down throughout the day based on something I can control. The fastest way to process my data, which also optimizes the costs I am paying for the AWS Comprehend endpoint to be up, is to set concurrency high enough that I get throttling errors returned from the API. This, however, comes with a caveat: I am paying for more AWS Lambda invocations. The flip side is that to optimize the costs I am paying for AWS Lambda, I want zero throttling errors.
Is it possible to set up autoscaling for the concurrency limit of the lambda such that it will increase if it isn't getting any throttling errors, but decrease if it is getting too many?
Very interesting use case.
Let me start by pointing out something that I found out the hard way in an almost four-hour-long call with AWS Tech Support, after being puzzled for a couple of days.
With SQS acting as a trigger for AWS Lambda, the concurrency cannot go beyond 1K, even if the Lambda's concurrency is set to a higher limit.
There is now a detailed post on this over at Knowledge Center.
With that out of the way, and assuming you are under the 1K limit at any given point in time and so only have to use one SQS queue, here is what I feel can be explored:
Either use an existing CloudWatch metric (via Comprehend) or publish a new metric that is indicative of the load you can handle at any given point in time. You can then use this to set an appropriate concurrency limit for the Lambda function (a sketch follows below). This would ensure that even if you have the SQS queue flooded with messages to be processed, Lambda picks them up at the rate at which they can actually be processed.
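A minimal sketch of that idea, assuming a custom metric and a separately scheduled adjuster process; the namespace, metric name, function name, and sizing rule are all hypothetical:

    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    lambda_client = boto3.client("lambda")

    # 1) Publish a custom metric for the load Comprehend can currently absorb
    #    (hypothetical namespace/metric; published by whatever knows the limit).
    cloudwatch.put_metric_data(
        Namespace="Custom/ComprehendPipeline",
        MetricData=[{
            "MetricName": "SustainableRequestRate",
            "Value": 120.0,  # e.g. 120 requests/second sustainable right now
            "Unit": "Count/Second",
        }],
    )

    # 2) Elsewhere (e.g. a scheduled Lambda), read the latest value and size
    #    the worker's reserved concurrency from it.
    stats = cloudwatch.get_metric_statistics(
        Namespace="Custom/ComprehendPipeline",
        MetricName="SustainableRequestRate",
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    rate = stats["Datapoints"][0]["Average"] if stats["Datapoints"] else 10.0

    # Hypothetical sizing rule: each invocation runs ~2 s and makes one call.
    lambda_client.put_function_concurrency(
        FunctionName="comprehend-worker",  # hypothetical
        ReservedConcurrentExecutions=max(2, int(rate * 2)),
    )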
Please note: this comes out of my own philosophy of being proactive vs. being reactive. I would not wait for something to fail, e.g. invocation errors in this case, to trigger other processes that adjust concurrency. System failures should be rare and should actually raise an alarm (if not panic!) rather than be a normal occurrence a couple of times a day!
To build on that, if possible I would suggest that you approach this the other way around, i.e. scale the Comprehend processing limit and the AWS Lambda concurrency based on the messages in the SQS queue (the backlog), or a combination of this backlog and the time of day, etc. This way, if every part of your pipeline is a function of the amount of backlog in the queue, you can rest assured that you are not spending more than you have to at any given point in time.
More importantly, you always have capacity in place should the need arise or something out of the ordinary happen.
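A backlog-driven variant of the same idea might look like the following; the queue URL, function name, and the one-worker-per-100-messages rule are all assumptions for illustration:

    import boto3

    sqs = boto3.client("sqs")
    lambda_client = boto3.client("lambda")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # hypothetical

    # Approximate backlog: messages waiting plus messages already in flight.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    backlog = int(attrs["ApproximateNumberOfMessages"]) \
        + int(attrs["ApproximateNumberOfMessagesNotVisible"])

    # Hypothetical rule: one concurrent execution per 100 backlogged messages,
    # clamped so the pipeline never scales to zero or past 900.
    lambda_client.put_function_concurrency(
        FunctionName="comprehend-worker",  # hypothetical
        ReservedConcurrentExecutions=min(900, max(5, backlog // 100)),
    )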

Scaling AWS Lambda with SQS

I want to use SQS to call Lambda.
The execution time of the Lambda function is 3 minutes.
I want to execute 1,000 Lambda functions at once, so I send 1,000 messages to the SQS queue.
But according to the AWS documentation:
Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute.
https://docs.aws.amazon.com/en_us/lambda/latest/dg/scaling.html
So I should wait a few minutes until all messages are processed. Is there any workaround to call 1,000 concurrent Lambdas and avoid "cold start"?
UPD: I got an answer from AWS support:
You are correct that SQS will start at an initial burst of 5 and increase by a concurrency of 60 per minute. Scaling rates can't be increased.
The Automatic Scaling section of that documentation page describes the autoscaling behaviour under sudden load. I don't think cold start would be a problem: the first batch of concurrent Lambda executions would likely see the cold start, and all the subsequent invocations would be fine.
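To put the quoted scaling rate into numbers, a quick back-of-the-envelope check (assuming the documented ramp of an initial burst of 5 plus 60 per minute):

    # Minutes until the SQS poller reaches a target concurrency, given the
    # documented ramp: initial burst of 5, then +60 concurrent invocations/minute.
    def minutes_to_reach(target, initial=5, per_minute=60):
        if target <= initial:
            return 0
        return -(-(target - initial) // per_minute)  # ceiling division

    print(minutes_to_reach(1000))  # -> 17: roughly 17 minutes to reach 1,000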

AWS Lambda is seemingly not highly available when invoked from SNS

I am invoking a data processing Lambda in bulk fashion by submitting ~5k SNS requests asynchronously. This causes all the requests to hit SNS in a very short time. What I am noticing is that my Lambda seems to have exactly 5k errors, and then seems to "wake up" and handle the load.
Am I doing something largely out of the ordinary use case here?
Is there any way to combat this?
I suspect it's a combination of concurrency, and the way lambda connects to SNS.
Lambda is only so good at automatically scaling up to deal with spikes in load.
Full details are here (https://docs.aws.amazon.com/lambda/latest/dg/scaling.html), but the key points to note are:
There's an account-wide concurrency limit, which you can ask to be raised. By default it's much less than 5k, so that will limit how concurrent your Lambda could ever become.
There's a hard scaling limit (+1000 instances/minute), which means even if you've managed to convince AWS to let you have a concurrency limit of 30k, you'll have to be under sustained load for 30 minutes before you'll have that many lambdas going at once.
SNS is a non-stream-based asynchronous invocation (https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-function.html#supported-event-source-sns), so what you see is a lot of errors as SNS attempts to invoke 5k Lambdas but only the first X (say 1k) get through; however, they keep retrying. The queue then clears concurrently at your initial burst (typically 1k, depending on your region), plus 1k a minute until you reach maximum capacity.
Note that SNS only retries three times at intervals (AWS is a bit sketchy about the intervals, but they are probably based on the retry delay the service returns, so should be approximately intelligent); I suggest you set up a DLQ to make sure you're not dropping messages because of the time it takes the queue to clear.
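A minimal sketch of wiring up such a DLQ for the function's asynchronous invocations (the function name and queue ARN are hypothetical; the execution role would also need sqs:SendMessage on that queue):

    import boto3

    lambda_client = boto3.client("lambda")

    # Route events that still fail after Lambda's async retries to an SQS DLQ,
    # so throttled SNS deliveries are not silently dropped.
    lambda_client.update_function_configuration(
        FunctionName="data-processor",  # hypothetical
        DeadLetterConfig={
            "TargetArn": "arn:aws:sqs:us-east-1:123456789012:data-processor-dlq"
        },
    )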
While your pattern is not a bad one, it seems like you're very exposed to the concurrency issues that surround lambda.
An alternative is to use a stream-based event source (like Kinesis), which processes in batches at a set concurrency (e.g. 500 records per Lambda, concurrent by shard count rather than 1:1 with SNS), and waits for each batch to finish before processing the next.
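For illustration, a sketch of such a stream-based mapping via boto3 (the stream ARN and function name are hypothetical):

    import boto3

    lambda_client = boto3.client("lambda")

    # Lambda polls the stream and hands records to the function in batches;
    # concurrency is bounded by the shard count, not one invocation per message.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/data-stream",
        FunctionName="data-processor",
        BatchSize=500,  # up to 500 records handed to each invocation
        StartingPosition="LATEST",
    )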

AWS Lambda Polling from SQS: in-flight messages count

I have 20K messages in an SQS queue. I also have a Lambda that will process the SQS messages and put data into an ElasticSearch server.
I have configured SQS as the Lambda's trigger and limited the Lambda's SQS batch size to 10. I also limited the Lambda so that only one instance can run at a given time.
However, sometimes I see over 10K in-flight messages in the AWS console. Shouldn't it max out at 10 in-flight messages?
Because of this, the Lambda will only be able to process 9K of the SQS messages properly.
Below is a screen capture to show that I have limited the Lambda to only 1 instance running at a given time.
I've been doing some testing and contacting AWS tech support at the same time.
What I do believe at the moment is that:
Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute. Doc
1/ The thing that does the polling is a separate entity. It is most likely a Lambda function that will long-poll the SQS queue and then invoke our Lambda functions.
2/ That polling Lambda does not take our Receiver-Lambda into account at all. It does not care whether the function is running at max capacity or not, or how much concurrency is available for the Receiver-Lambda.
3/ Due to that combination, the behavior is not what we expect from the Lambda-SQS integration. And worse, if you suddenly have millions of messages burst into your queue, the Receiver-Lambda's concurrency can never catch up with the amount of messages the polling Lambda is sending, resulting in loss of work.
The test (a minimal reproduction sketch follows the steps):
Create one Lambda function that takes 30 seconds to return true;
Set that function's concurrency to 50;
Push 300 messages into the queue (visibility timeout: 10 minutes, batch message count: 1, no redrive).
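Sketched in code, assuming a hypothetical function name, queue URL, and region, the test could look like this:

    import json
    import time

    import boto3

    def handler(event, context):
        # Receiver-Lambda: simulate 30 seconds of work, then succeed.
        time.sleep(30)
        return True

    def run_test():
        lambda_client = boto3.client("lambda")
        sqs = boto3.client("sqs")
        queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/test-queue"  # hypothetical

        # Cap the Receiver-Lambda at 50 concurrent executions.
        lambda_client.put_function_concurrency(
            FunctionName="receiver-lambda",  # hypothetical
            ReservedConcurrentExecutions=50,
        )

        # Push 300 messages, 10 per batch (the SQS batch API maximum).
        for i in range(0, 300, 10):
            sqs.send_message_batch(
                QueueUrl=queue_url,
                Entries=[
                    {"Id": str(i + j), "MessageBody": json.dumps({"n": i + j})}
                    for j in range(10)
                ],
            )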
The result:
The amount of messages available just increases gradually
At first, there are few enough messages to be processed by Receiver-Lambda
After half a minute, there are more messages available than what Receiver-Lambda can handle
These messages would be discarded to the dead-letter queue, because the polling Lambda was unable to invoke the Receiver-Lambda
I will update this answer as soon as I got the confirmation from AWS support
Support answer, as of Q1 2019 (TL;DR version):
1/ The assumption was correct; there is a "Poller"
2/ That Poller does not take reserved concurrency into consideration as part of its algorithm
3/ That Poller has a hard limit of 1,000
Q2 2019:
The above information needs to be updated. Support said that the poller does correctly consider reserved concurrency, but it should be at least 5. The SQS-Lambda integration is still being updated, while this answer will not be, so please consult AWS if you run into some weird issues.

AWS Lambda > what happens when you reach the concurrency limit

Lambda has a 100-function limit.
What happens when you submit a 101st function when 100 are already running?
Will it:
fail with an error
queue up
If you are talking about concurrent executions, there isn't a limit of 100. The limit depends on the region, but by default it's 1,000 concurrent executions.
To answer your question: As soon as the Concurrent executions limit is reached the next execution gets throttled. Each throttled invocation increases the Amazon CloudWatch Throttles metric for the function.
If your AWS Lambda is invoked asynchronously, AWS Lambda automatically retries the throttled event for up to six hours, with delays between retries. If you didn't set up a Dead Letter Queue (DLQ) for your AWS Lambda, your event is lost as soon as all retries fail.
For more information, please check AWS Lambda - Throttling Behavior:
If the function doesn't have enough concurrency available to process all events, additional requests are throttled. For throttling errors (429) and system errors (500-series), Lambda returns the event to the queue and attempts to run the function again for up to 6 hours. The retry interval increases exponentially from 1 second after the first attempt to a maximum of 5 minutes. However, it might be longer if the queue is backed up. Lambda also reduces the rate at which it reads events from the queue.
As mentioned here.
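Synchronous callers, by contrast, see the throttle directly as a 429 and must retry themselves. A minimal backoff sketch, with a hypothetical function name (the exception class is the boto3 Lambda client's TooManyRequestsException):

    import time

    import boto3

    lambda_client = boto3.client("lambda")

    def invoke_with_backoff(payload, max_attempts=6):
        # Retry throttled synchronous invocations with exponential backoff.
        delay = 1.0
        for attempt in range(max_attempts):
            try:
                return lambda_client.invoke(
                    FunctionName="my-function",  # hypothetical
                    Payload=payload,
                )
            except lambda_client.exceptions.TooManyRequestsException:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(delay)
                delay = min(delay * 2, 300)  # cap the wait at 5 minutes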