AWS Lambda execution duration randomly spikes and causes time-outs

AWS Lambda execution duration randomly spikes and causes time-outs - amazon-web-services

I'm building a server-less web-tracking system which serves its tracking pixel using AWS API Gateway, which calls a Lambda function whenever a tracking request arrives to write the tracking event into a Kinesis stream.
The Lambda function itself does not do anything fancy. It just a takes the incoming event (its own argument) and writes it to the stream. Essentially, it's just:
import boto3
kinesis_client = boto3.client("kinesis")
kinesis_stream = "my_stream_name"
def return_tracking_pixel(event, context):
...
new_record = ...(event)
kinesis_client.put_record(
StreamName=kinesis_stream,
Data=new_record,
PartitionKey=...
)
return ...
Sometimes I experience a weird spike in the Lambda execution duration that causes some of my Lambda function invocations to time-out and the tracking requests to be lost.
This is the graph of 1-minute invocation counts of the Lambda function in the in affected time period:
Between 20:50 and 23:10 I suddenly see many invocation errors (1-minute error counts):
which are obviously caused by the Lambda execution time-out (maximum duration in 1-minute intervals):
There is nothing weird going on neither with my Kinesis stream (data-in, number of put records, put_record success count etc., all looks normal), nor with my API GW (number of invocations corresponds to number of API GW calls, well within the limits of the API GW).
What could be causing the sudden (and seemingly randomly occurring) spike in the Lambda function execution duration?
EDIT: neither the lambda functions are being throttled, which was my first idea.

Just to add my 2 cents, because there's not much investigative work without extra logging or some X-Ray analysis.
AWS Lambda sometimes will force recycle containers which will feel like cold starts even though your function is being reasonably exercised and warmed up. This might bring all cold start related issues, like extra delays for ENIs if your Lambda has an attached VPC and so on... but even for a simple function like yours, 1 second timeout is sometimes too optimistic for a cold start.
I don't know of any documentation on those forced recycles, other than some people having evidence for it.
"We see a forced recycle about 7 times a day." source
"It also appears that even once warmed, high concurrency functions get recycled much faster than those with just a few in memory." source
I wonder how you could confirm this is the case. Perhaps you could check those errors appearing in Cloud Watch log streams to be from containers that never appeared before.

Related

AWS Lambda Function has occassional huge 10 minute gaps between initialization and invocation

We have a lambda setup which will occasionally take ten minutes between initialization and invocation, leading to severe performance degradation of the apps that depend on it. The lambda request is handled by API gateway, sent to the Lambda Context for our handler, and then sent to the lambda function itself. At this point it looks like the lambda is initialized, but will take anywhere from 1 to 10 minutes to invoke for the slow performing requests.
Provisioned concurrency seems to address the cold start problem, but we cant seem to find any indication that concurrency is the problem. Furthermore, we have no reason to believe this is a cold start problem, given that the request takes 10 minutes, versus 10 seconds. I have no idea where to start to address this problem. Can someone give me some tips?

Request takes too long to Django app deployed in AWS Lambda via Zappa

I have recently deployed a Django backend application to AWS Lambda using Zappa.
After the lambda function has not been invoked for some time, the first request to be made takes from 10 to 15 seconds to be processed. At first I thought it would be because of the cold start but even for a cold start this time is unacceptable. Then, reading through Zappa's documentation I saw that it enables by default the keep_warm feature that sends a dummy request to the lambda function every 4 minutes to keep it warm; so this excessive delay in the response to the first request to the lambda is not due to a cold start.
Then, I started using tools such as AWS X-Ray and Cloudwatch Insights to try to find the explanation for the delay. Here is what I found out:
The invokation that takes a very long time to be processed is the following:
Crossed out in red are the names of the environment variables the application uses. They are all defined and assigned a value directly in the AWS Console. What I don't understand is, first of all, why it takes so long, and secondly, why it says the environment variables are casted as None. The application works perfectly (apart from the massive delay in the first request) so the environment variables are correctly set somewhere.
This request is made every two hours religiously and the first time someone invokes the lambda function in some time, as seen in the following chart:
The dots in the x axis correspond to Zappa's dummy requests to keep the server warm. The elevated dots correspond to the invocation shown in the previous image. Finally, the spike corresponds to a user invocation. The time it took to process is the sum of the time it takes to process the long invocation (the one shown in the first image) and the time it takes to process the longest http request the client makes to the server. This request was the following:
It was a regular login request that should be resolved much faster. Other requests that are probably more demanding than this one were resolved in less than 100ms.
So, to sum up:
There is one lambda invocation that takes more than 10 seconds to be resolved. This corresponds to the first image shown. It is done every 2 hours and when a user makes a request to the server after it has been idle for some time.
Some requests take more than 2 seconds to be resolved and I have no idea as to why this could be.
Apart from these previous function invocations, all other requests are resolved in a reasonable time frame.
Any ideas as to why these invocations could be taking so much time is very much appreciated as I have spent quite some time trying to figure it out on my own and I have ran out of ideas. Thank you in advance!
Edit 1 (28/07/21): to further support my suspicion that this delay is not due to a cold start here is the "Segments Timeline" of the function in Cloudwatch/Application monitoring/Traces:
If it were a cold start, the delay should appear in the "Initialization" segment and not in the "Invocation" one.
Edit 2 (30/07/21): I forgot to mention that I had previously deployed the application using Elastic Beanstalk and didn't face this problem whatsoever so my code's performance is probably not the problem here.
Edit 3 (30/07/21): I found this thread in an AWS forum from 2016 regarding this exact issue. An AWS engineer mentioned that this behaviour is not by any means expected for a Lambda function outside of a VPC (like mine). Nevertheless, no answer was provided that explained the cause of the 10-15 seconds delay.
Edit 4 (03/08/21): I tried doubling the function's assigned memory (from 512 MB to 1024 MB) but it did not help.

I have also added some comments to the question to explain that this is probably not due to a cold start. As you rightly stated, cold starts are explicitly indicated and seem to only take about 500 ms in your case.
Cold starts this long usually only manifested themselves when lambdas were run in a VPC. And AWS has since changed the way lambdas get their network interface which has dramatically sped up that process.
That being said, a quick Google search led me to some interesting discussions on other sites about Django applications and lazy loading. I'll share some links here (even though they are not related to Lambda) in the hope they can help you find a solution:
https://community.webfaction.com/questions/11560/django-app-seems-very-slow-to-start-up-10-seconds
https://ses4j.github.io/2015/11/23/optimizing-slow-django-rest-framework-performance/
As a last note about the keep_warm. Sending those requests is quite an old trick in the book. However, be aware that there are no guarantees as to how long a lambda is kept warm by AWS. If an Init duration is indicated in the logs, however, you can be sure that it was a cold start.
If you need to ensure that a lambda function is warm and quick to respond to incoming requests, you'll have to use provisioned concurrency, which of course has its own price tag.

I can see some suggestions here on trying to increase the memory for your lambda (and I also saw that you tried from 512 to 1024). Have you tried increasing it further, say to about 3072? It's a significant increase, but this is just to prove that the problem is not due to resource limitations first.
The keep_warm feature isn't guaranteed as far as I've seen, and bulk of the (cold) start time is due to initialisation. Since the vcpu allocated to the lambda is proportional to the memory you assign to it, your lambda may initialise quicker and somehow mitigate these cold starts.

Rate Exceeded on AWS Lambda Using API Gateway and serverless framework

When I try to invoke a method that has a HTTP event it results in 500 Internal server error.
On CloudWatch logs it shows Recoverable error occurred (Rate Exceeded.)
When I try invoke a function without lambda it executes with response.
Here is my serverless config:

You have set your Lambda's reservedConcurrency to 0. This will prevent your Lambda from ever being invoked. Setting it to 0 is usually useful when your functions are getting invoked but you're not sure why and you want to stop it right away.
If you want to have it invoked, change reservedConcurrency to a positive integer (by default, it can be a positive integer <= 1000, but you can increase this limit by contacting AWS) or simply remove the reservedConcurrency attribute from your .yml file as it will use the default values.
Why would one ever use reservedConcurrency anyways? Well, let's say your Lambda functions are triggered by requests from API Gateway. Let's say you get 400 (peak hours) requests/second and, upon every request, two other Lambda functions are triggered, one to generate a thumbnail for a given image and one to insert some metadata in DynamoDB. You'd have, in theory, 1200 Lambda functions running at the same time (given all of your Lambda functions finish their execution in less than a second). This would lead to throttling as the default concurrent execution for Lambda functions is 1000. But is the thumbnail generation as important as the requests coming from API Gateway? Very likely not as it's naturally an eventually consistent task, so you could set reservedConcurrency on the thumbnail Lambda to only 200, so you wouldn't use up your concurrency, meaning other functions would be able to spin up to do something more useful at a given point in time (in our example, receiving HTTP requests is more important than generating thumbnails). The other 800 left concurrency could then be split between the function triggered from API Gateway and the one that inserts data into DynamoDB, thus preventing throttling for the important stuff and keeping the not-so-important-stuff eventually consistent.

How can I keep warm an AWS Lambda invoked from API Gateway with proxy integration

I have defined a lambda function that is invoked from API Gateway with proxy integration. Thus, I have defined an eager resource path for it:
And referenced my lambda function:
My lambda is able to process request like GET /myresource, POST /myresource.
I have tried this strategy to keep it warm, described in acloudguru. It consists of setting up a CloudWatch event rule that invokes the lambda every 5 minutes to keep it warm. Unfortunately it isn't working.
This is the behaviour I have seen:
After some period, let's say 20 minutes, I call GET /myresource from API Gateway and it takes around 15 seconds. Subsequent requests last ~30ms. The CloudWatch event is making no difference...
Let's suppose another long period without calling the gateway. If I go to the Lambda console and invoke it directly (test button) it answers right away (less than 1ms) with a 404 (that's normal because my lambda expects GET /myresource or POST /myresource).
Immediately after this lambda console execution I call GET /myresource from API Gateway and it still takes ~20 seconds. That is to say, the function was still cold despite having being invoked from the Lambda console. This might explain why the CloudWatch event doesn't work since it calls the lambda without setting the method/resource-url.
So, how can I make this particular case with API Gateway with proxy integration + Lambda stay warm to prevent those slow first request?

As of now (2019-02-27) [1], A periodic CloudWatch event rule does not deterministically solve the cold start issue. But a periodic CloudWatch event rule will reduce the probability of cold starts.
The reason is it's upto the Lambda server to decide whether to use a new Lambda container instead of an existing container to process an incoming request. Some of the related details regarding how Lambda containers are reused is explained in [1]
In order to reduce the cold start time (not to reduce the number cold starts), can you try followings? 1. increasing the memory allocated to the function, 2. reduce the deployment package size (eg- remove unnecessary dependencies), and 3. use a language like NodeJS, Python instead of Java, .Net
[1]According to reinvent session, (39:50 at https://www.youtube.com/watch?v=QdzV04T_kec), the Lambda team expects to improve the VPC cold start latency in Lambda.
[2] https://aws.amazon.com/blogs/compute/container-reuse-in-lambda/

Denis is quite right about the non deterministic lambda behaviour regarding the number of containers hit by CloudWatch events. I'll follow his advice to improve the startup time.
On the other hand I have managed to make my CloudWatch events hit the lambda function properly, reducing (in many cases) the number of cold starts.
I just had to add an additional controller mapped to "/" with a hardcoded response:
#Controller("/")
class WarmUpController {
private val logger = LoggerFactory.getLogger(javaClass)
#Get
fun warmUp(): String {
logger.info("Warming up")
return """{"message" : "warming up"}"""
}
}
With this in place the default (/) invocation from CloudWatch does keep the container warm most of the time.

AWS Lambda Hot and Cold Start

Hello I am new to AWS Lambda.I want to know what do we mean by Hot Lambda function (Hot Start) and Cold Lambda function (Cold Start) ?? Can anyone please explain me in detail & what is the difference between Hot Lambda and Cold Lambda

After uploading your code or after periods of inactivity your Lambda is shut down or "cold". When a new event comes in there is a brief moment where Lambda spins up a new instance of your code - this includes whatever initializing AWS does to start up the "container" as well as initializing the code that you uploaded.
So an event that is able to hit an initialized("hot") Lambda will in theory be processed faster than hitting a cold one. There isn't a guarantee on how long a Lambda will stay hot after the last event but it could be as long as 5 minutes.

It's a common belief that when people refer to "warm starts" they mean that the same container/sandbox is ready to receive a new connection - but that's not accurate.
Warm Start - invoking a function using a warm container with a prebaked unused sandbox resources from previous invocations are not recycled.
Cold Start - invoking a function when no container/sandbox is ready to receive the request. A new container must be created and the runtime and user code loaded. Cold start's latency is mostly an internal metric, externally, cold starts are only a part of the total overhead that can affect the end-user experience. In some scenarios, we can encounter a portion of the full cold start, think about scale prediction and statistical algorithms
However, using the terms "cold start" and "warm start" might be misleading, as a developer you should care about "Invocation Overheads" - The time it takes to call the user's function and return the response.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js