I have an AWS Lambda function set up with a trigger from an SQS queue. Currently the queue has about 1.3 million messages available. According to CloudWatch, the Lambda function has only ever reached 431 invocations in a given minute. I have read that Lambda supports 1000 concurrent executions at a time, so I'm not sure why it would be maxing out at 431 in a given minute. My function also only runs for about 5.55 seconds on average, so each of those 1000 available concurrent slots should be turning over multiple times per minute, therefore giving a much higher rate of invocations.
How can I figure out what is going on here and get my Lambda function to process through that SQS queue in a more timely manner?
The 1000 concurrent execution limit you mention assumes that you have provisioned enough capacity.
Take a look at this, particularly the last bit.
https://docs.aws.amazon.com/lambda/latest/dg/vpc.html
If your Lambda function accesses a VPC, you must make sure that your VPC has sufficient ENI capacity to support the scale requirements of your Lambda function. You can use the following formula to approximately determine the ENI capacity.

Projected peak concurrent executions * (Memory in GB / 3 GB)

Where:

Projected peak concurrent executions – Use the information in Managing Concurrency to determine this value.
Memory – The amount of memory you configured for your Lambda function.

The subnets you specify should have sufficient available IP addresses to match the number of ENIs.

We also recommend that you specify at least one subnet in each Availability Zone in your Lambda function configuration. By specifying subnets in each of the Availability Zones, your Lambda function can run in another Availability Zone if one goes down or runs out of IP addresses.
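As a quick worked example of that formula, with assumed numbers (1,536 MB of memory and a projected peak of 1,000 concurrent executions; your real values will differ):

```python
# Rough ENI capacity estimate using the docs' formula.
# The numbers below are assumptions for illustration only.
projected_peak_concurrency = 1000   # from Managing Concurrency
memory_gb = 1536 / 1024             # function memory in GB (1,536 MB here)

eni_capacity = projected_peak_concurrency * (memory_gb / 3.0)
print(f"Approximate ENIs required: {eni_capacity:.0f}")  # -> 500
```

So in this case the subnets would need roughly 500 free IP addresses between them.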
Also read this article which points out many things that might be affecting you: https://read.iopipe.com/5-things-to-know-about-lambda-the-hidden-concerns-of-network-resources-6f863888f656
As a last note, make sure your SQS Lambda trigger has a batchSize of 10 (the maximum available).
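If you'd rather check and set the batch size programmatically than in the console, a sketch with boto3 (the function name "my-function" is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# Find the SQS event source mappings for the function.
mappings = lambda_client.list_event_source_mappings(
    FunctionName="my-function"
)["EventSourceMappings"]

for mapping in mappings:
    if mapping["BatchSize"] < 10:
        # Raise the batch size to 10 messages per invocation.
        lambda_client.update_event_source_mapping(
            UUID=mapping["UUID"],
            BatchSize=10,
        )
```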
Related
I have both provisioned and reserved concurrency set on my Lambda. Some of the connections are time consuming, so I make them in the Lambda init phase, and I have set provisioned concurrency to avoid cold-start related issues. I also need reserved concurrency, as I want a way to limit the connections made to the EC2-hosted DB read-only nodes and avoid opening too many connections to them.
I noticed that the ConcurrentExecutions metric always reports ReservedConcurrency - ProvisionedConcurrency, and I wonder why it doesn't scale up to the full ReservedConcurrency target configured on the Lambda. For example, if ProvisionedConcurrency is 10 and ReservedConcurrency is 30, I would have expected 10 instances provisioned during deployment, with the function then able to scale to a maximum of 30 concurrent executions, whereas I see the ConcurrentExecutions metric peak at 20 in CloudWatch.
I have a few AWS Lambda functions, but this troubleshooting is for one of them. This Lambda function is triggered by a message queue, reads from DynamoDB, processes the data, and writes back to DynamoDB. It is invoked up to 10 times per second, and I have set up Lambda provisioned concurrency. The average duration is 60 ms, which I am very happy with. But every day there are around 10 invocations whose duration is more than 1 second, up to the 3 second timeout.
I put logging in my Lambda; during the duration spikes, the DynamoDB reads/writes (GetItem/PutItem) took more than 1 second. DynamoDB is set to on-demand. It is a very simple table with two columns: ID (auto number) and a JSON string (about 1 KB). I have tried Redis, but weirdly enough I still had spikes. The Lambda is not in a VPC. The DynamoDB connection is configured with an HTTP timeout of 500 ms and max retries set to 2.
Code to read DynamoDB:
Log for Duration:
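The original snippet and log aren't reproduced here. For reference, a minimal sketch of a read path with the timeout and retry settings described above, assuming boto3; the table and key names are placeholders and the actual code may differ:

```python
import boto3
from botocore.config import Config

# 500 ms HTTP timeouts and at most 2 retries, mirroring the settings
# described above. Table and key names are placeholders.
dynamodb = boto3.resource(
    "dynamodb",
    config=Config(
        connect_timeout=0.5,
        read_timeout=0.5,
        retries={"max_attempts": 2},
    ),
)
table = dynamodb.Table("my-table")

def read_item(item_id):
    # Single-item read; on-demand capacity mode needs no throughput settings.
    response = table.get_item(Key={"ID": item_id})
    return response.get("Item")
```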
When using provisioned concurrency, the Lambda service keeps a set number of the underlying execution environments "warm" to minimize start-up time. Since you mention that you intermittently see higher execution durations, try the debugging steps below:
Check the "Concurrent Executions" metric for the Lambda function against the "Duration" metric: If the number of instances of the function executing at a particular time is higher than the set provisioned concurrency, then that would imply that s few of these instances had cold starts causing the higher duration.
Enable X-Ray tracing for the Lambda function and add X-Ray instrumentation to your code: this gives a complete picture of which network call takes too much time and also shows the cold start "Init" duration (if any).
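For the first step, a minimal sketch of pulling both metrics with boto3 (the function name "my-function" is a placeholder):

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def max_metric(metric_name, function_name="my-function", hours=24):
    # Highest one-minute datapoint for the metric over the last day.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric_name,
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=datetime.utcnow() - timedelta(hours=hours),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Maximum"],
    )
    return max((p["Maximum"] for p in stats["Datapoints"]), default=0)

# If peak concurrency exceeds the provisioned concurrency setting,
# the overflow instances ran with cold starts.
print("Peak ConcurrentExecutions:", max_metric("ConcurrentExecutions"))
print("Peak Duration (ms):", max_metric("Duration"))
```

For the second step, the X-Ray SDK for Python exposes `patch_all()` in `aws_xray_sdk.core`, which instruments the AWS SDK clients so each DynamoDB call shows up as its own subsegment alongside any "Init" duration.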
From the docs:
Lambda automatically scales up the number of instances of your function to handle high numbers of events.
What I understood is: if there are 10 concurrent incoming requests for a particular Lambda function, then 10 instances of that runtime (let's say Node.js) will be launched.
Now, my questions:
What is the maximum number of instances that Lambda allows? (I looked in the docs but didn't find this.)
Since there would be some maximum cap, what is the fallback if that number is reached?
The default account limit is 1000 concurrent executions per region, but this is a soft limit and can be increased.
Concurrency in Lambda actually works similarly to the magical pizza model. Each AWS Account has an overall AccountLimit value that is fixed at any point in time, but can be easily increased as needed, just like the count of slices in the pizza. As of May 2017, the default limit is 1000 "slices" of concurrency per AWS Region.
You can check this limit under Concurrency in your Lambda function's configuration in the console.
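You can also read the limit programmatically; a small sketch, assuming boto3:

```python
import boto3

lambda_client = boto3.client("lambda")

# AccountLimit.ConcurrentExecutions is the regional soft limit
# (1000 by default); UnreservedConcurrentExecutions is what remains
# after per-function reserved concurrency is subtracted.
settings = lambda_client.get_account_settings()
print(settings["AccountLimit"]["ConcurrentExecutions"])
print(settings["AccountLimit"]["UnreservedConcurrentExecutions"])
```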
You can use services with some retry logic already built in to decouple your applications (think of SQS, SNS, Kinesis, etc.). If the Lambda requests are all synchronous HTTP(S), though, then you will get 429 (Too Many Requests) responses and the requests will be lost.
You can see Lambda's default retry behaviour here.
I understand that AWS Lambda is a serverless compute service where a piece of code can be triggered by some event.
I want to understand how Lambda handles scaling.
For example, suppose my Lambda function sits inside a VPC subnet because it needs to access VPC resources, and that subnet has a CIDR of 192.168.1.0/24, which gives 251 available IPs after subtracting the 5 AWS-reserved addresses.
Does that mean that if my Lambda function gets 252 invocations at the exact same time, only 251 of the requests would be served, and 1 would either time out or get executed once one of the 251 running functions completes execution?
Does the subnet size matter for AWS Lambda scaling?
I am following this reference doc, which mentions concurrent execution limits per region.
Can I assume that, irrespective of whether an AWS Lambda function is in No VPC or inside a VPC subnet, it will scale as per the limits mentioned in the doc?
Vladyslav's answer is still technically correct (subnet size does matter), but things have changed significantly since it was written, and subnet size is much less of a consideration. See AWS's announcement:
Because the network interfaces are shared across execution environments, typically only a handful of network interfaces are required per function. Every unique security group:subnet combination across functions in your account requires a distinct network interface. If a combination is shared across multiple functions in your account, we reuse the same network interface across functions.
Your function scaling is no longer directly tied to the number of network interfaces, and Hyperplane ENIs can scale to support large numbers of concurrent function executions.
Yes, you are right: subnet size definitely does matter, and you have to be careful with your CIDR blocks. As for that one last invocation (the 252nd), it depends on the way your Lambda is invoked: synchronously (e.g. API Gateway) or asynchronously (e.g. SQS). If it is invoked synchronously, it'll just be throttled and your API will respond with a 429 HTTP status, which stands for "Too Many Requests". If it is asynchronous, it'll be throttled and retried within a six-hour window. You can find a more detailed description on this page.
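To make the synchronous case concrete, here is a sketch of what a throttled synchronous invoke looks like from a boto3 caller ("my-function" is a placeholder); the 429 surfaces as a `TooManyRequestsException`:

```python
import boto3

lambda_client = boto3.client("lambda")

try:
    # RequestResponse = synchronous invocation.
    lambda_client.invoke(
        FunctionName="my-function",
        InvocationType="RequestResponse",
        Payload=b"{}",
    )
except lambda_client.exceptions.TooManyRequestsException:
    # Lambda does not retry synchronous invocations itself; once the
    # SDK's built-in retries are exhausted, the caller must back off
    # and retry on its own.
    print("Throttled: concurrency limit reached")
```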
Also I recently published a post in my blog, which is related to your question. You may find it useful.
I have a lambda that is subscribed to an SQS queue to process messages. The message volume is very high.
Problem: the queue grows very quickly, and the Lambda function does not scale out to process the messages fast enough. Concurrent Lambda executions go up to only 20 to 25, even though I have a remaining quota of 950 or more unused concurrent executions. Why is it not spinning up more Lambdas to process my queue faster? Is this configurable?
This is an issue in my application because I am using a standard SQS queue, which provides no ordering guarantee. So, sometimes I see unlucky messages get stuck in the queue for hours, whereas some messages are processed in less than a minute. (As an aside, I'm quite shocked that the queue can be processed in such a random order. Even though there is no ordering guarantee, I would not have expected it to be this bad.)
The problem was the memory allocation of the Lambda function. I had naively left it at the default of 128 MB. Changing this to 2048 MB completely resolved the issue. The Lambda now has no issue keeping up with high volumes of SQS messages.
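If you want to make the same change from code rather than the console, a sketch with boto3 ("my-function" is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="my-function",
    MemorySize=2048,  # MB; the default is 128
)
```

More memory also buys proportionally more CPU and network bandwidth, which is usually what lets each invocation finish faster and the SQS poller scale the function out.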
Regarding SQS: you didn't say which region you are using, but SQS does have a FIFO option in some regions.
FIFO queues are available in the US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), Asia Pacific (Sydney), and Asia Pacific (Tokyo) regions. FIFO queues have all the capabilities of the standard queue.
Regarding Lambda concurrency: it sounds like you are running out of IP addresses in the subnet you're using. This only applies if your Lambda is in a VPC.
If your function connects to VPC based resources, you must make sure your subnets have adequate address capacity to support the ENI scaling requirements of your function. You can estimate the approximate ENI capacity with the following formula:
Concurrent executions * (Memory in GB / 3 GB)
Where:
Concurrent execution – This is the projected concurrency of your workload. Use the information in Understanding Scaling Behavior to determine this value.
Memory in GB – The amount of memory you configured for your Lambda function.
You can set the concurrent execution limit for a function to match the subnet size limits you have.
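For example, a sketch with boto3 (the function name and the cap are placeholders; size the cap from the ENI formula above):

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve (and cap) concurrency at a level the subnet's free IP
# addresses can actually support.
lambda_client.put_function_concurrency(
    FunctionName="my-function",
    ReservedConcurrentExecutions=200,
)
```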
References
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html