AWS Lambda inside VPC. 504 Gateway Timeout (ENI?) - amazon-web-services

I have a Serverless .net core web api lambda application deployed on AWS.
I have this sitting inside a VPC as I access ElasticSearch service inside that same VPC.
I have two API microservices that connect to the Elasticsearch service.
After a period of non use (4 hours, 6 hours, 18 hours - I'm not sure exactly but seems random), the function becomes unresponsive and I get a 504 gateway timeout error, "Endpoint cannot be found"
I read somewhere that if "idle" for too long, the ENI is released back into the AWS system and that triggering the Lambda again should start it up.
I can't seem to "wake" up the function by calling it as it keeps timing out with the above error (I have also increased the timeouts from default).
Here's the kicker - If I make any changes to the specific lambda function, and save those changes (this includes something as simple as changing the timeout value) - My API calls (BOTH of them, even though different lambdas) start working again like it has "kicked" back in. Obviously the changes do this, but why?
Obviously I don't want timeouts in a production environment regardless of how much, OR how little the lambda or API call is used.
I need a bulletproof solution to this. Surely it's a config issue of some description but I'm just not sure where to look.
I have altered Route tables, public/private subnets, CIDR blocks, created internet gateways, NAT etc. for the VPC. This all works, but these two lambdas, that require VPC access, keeps falling "asleep" somehow.

The is because of Cold Start of Lambda.
There is a new feature which was release in reInvent 2019, where in there is a provisioned concurrency for lambda (don't get confused with reserved concurrency).
Ensure the provisioned concurrency to minimum 1 (or the amount of requests to be served in parallel) to have lambda warm always and serve requests
Ref: https://aws.amazon.com/blogs/aws/new-provisioned-concurrency-for-lambda-functions/
To get more context, Lambda in VPC uses hyperplane ENI and functions in the same account that share the same security group:subnet pairing use the same network interfaces.
If Lambda functions in an account go idle for sometime (typically no usage for 40 mins across all functions using that ENI, as I got this time info from AWS support), the service will reclaim the unused Hyperplane resources and so very infrequently invoked functions may still see longer cold-start times.
Ref: https://aws.amazon.com/blogs/compute/announcing-improved-vpc-networking-for-aws-lambda-functions/

Related

lambda and fargate errors/timeouts

i have a python api that i have tried on vms, fargate, and lambda.
vms - less errors when capacity is large enough
fargate - second less errors when capacity is large enough, but when autoscaling, i get some 500 errors. looks like it doesn't autoscale quick enough.
lambda - less consistent. when there are a lot of api calls, less errors. but from cold start, it may periodically fail. i do not pre-provision. when i do, i get less errors too.
i read on the below post, cold start for lambda is less than 1 sec? seems like it's more. one caveat is that each lambda function will check for an existing "env" file. if it does not exist, it will download from s3. however this is done only when hitting the api. the lambda function is listening and responding. when you hit api, the lambda function will respond and connect, download the .env file, and process further the api call. fargate also does the same, but less errors again. any thoughts?
i can pre-provision, but it gets kind of expensive. at that point, i might as will go back to VMs with autoscaling groups, but it's less cloud native. the vms provide the fastest response by far and harder to manage.
Can AWS Lambda coldout cause API Gateway timeout(30s)?
i'm using an ALB in front of lambda and fargate. the vms simply use round robin dns.
questions:
am i doing something wrong with fargate or lambda? are they alright for apis or should i just go back to vms?
what or who maintains api connection while lambda is starting up from a cold start? can i have it retry or hold on to the connection longer?
thanks!
am i doing something wrong with fargate or lambda? are they alright for apis or should i just go back to vms?
The one thing that strikes me is downloading env from s3. Wouldn't it be easier and faster to keep your env data in SSM Parameter Store? Or perhaps, passing them as env variables to the lambda function itself.
what or who maintains api connection while lambda is starting up from a cold start? can i have it retry or hold on to the connection longer?
API gateway. Sadly you can't extend 30 s time limit. Its hard limit.
i'm using an ALB in front of lambda and fargate.
It seems to me that you have API gateway->ALB->Lambda function. Why would you need ALB in that? Usually there is no such need.
i can pre-provision, but it gets kind of expensive.
Sadly, this is the only way to minimize cold-starts.

Lambda execution time out after 15 mins what I can do?

I have a script running on Lambda, I've set the timeout to maximum 15 mins but it's still giving me time out error, there is not much infomation in the lofs, how I can solve this issue and spot what is taking soo much time? I tested the script locally and it's fairly quick.
Here's the error:
{
"errorMessage": "2020-09-10T18:26:53.180Z xxxxxxxxxxxxxxx Task timed out after 900.10 seconds"
}
If you're exceeding the 15 minutes period there are a few things you should check to identify:
Is the Lambda connecting to resources in a VPC? If so is it attached via VPC config, and do the target resources allow inbound access from the Lambda.
Is the Lambda connecting to a public IP but using VPC configuration? If so it will need a NAT attached to allow outbound access.
Are there any long running processes as part of your Lambda?
Once you've ruled these out consider increasing the available resources of your Lambda, perhaps its hitting a cap and is therefore performing slow. Increasing the memory will also increase the available CPU for you.
Adding comments in the code will log to CloudWatch logs, these can help you identify where in the code the slowness starts. This is done by simply calling the general output/debug function of your language i.e. print() in Python or console.log() in NodeJS.
If the function is still expected to last longer than 15 minutes after this you will need to break it down into smaller functions performing logical segments of the operation
A suggested orchestrator for this would be to use a step function to handle the workflow for each stage. If you need shared storage between each Lambda you can make use of EFS to be attached to all of your Lambdas so that they do not need to upload/download between the operations.
Your comment about it connecting to a SQL DB is likely the key. I assume that DB is in AWS in your VPC. This requires particular setup. Check out
https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
https://docs.aws.amazon.com/lambda/latest/dg/services-rds-tutorial.html
Another thing you can do is enable debug level logging and then look at the details in CloudWatch after trying to run it. You didn't mention which language your lambda uses, so how to do this could be different for the language you use. Here's how it would be done in python:
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.getLevelName('DEBUG'))

How does an AWS Lambda function scale inside a VPC subnet?

I understand the AWS Lambda is a serverless concept wherein a piece of code can be triggered on some event.
I want to understand how does the Lambda handle scaling?
For eg. if my Lambda function sits inside a VPC subnet as it wants to access VPC resources, and that the subnet has a CIDR of 192.168.1.0/24, which would result in 251 available IPs after subtracting the AWS reserved 5 IPs
Would that mean if my AWS Lambda function gets 252 invocations at the exact same time,Only 251 of the requests would be served and 1 would either timeout or will get executed once one of the 252 functions completes execution?
Does the Subnet size matter for the AWS Lambda scaling?
I am following this reference doc which mentions concurrent execution limits per region,
Can I assume that irrespective of whether an AWS Lambda function is No VPC or if it's inside a VPC subnet, it will scale as per mentioned limits in the doc?
Vladyslav's answer is still technically correct (Subnet size does matter), but things have changed significantly since it was written and subnet size is much less of a consideration. See aws' announcement:
Because the network interfaces are shared across execution environments, typically only a handful of network interfaces are required per function. Every unique security group:subnet combination across functions in your account requires a distinct network interface. If a combination is shared across multiple functions in your account, we reuse the same network interface across functions.
Your function scaling is no longer directly tied to the number of network interfaces and Hyperplane ENIs can scale to support large numbers of concurrent function executions
Yes, you are right. Subnet size definitely does matter, you have to be careful with your CIDR blocks. With that one last invocation (252nd), it depends on the way your lambda is invoked: synchronously (e.g. API Gateway) or asynchronously (e.g. SQS). If it is called synchronously, it'll be just throttled and your API will respond with 429 HTTP status, which stands for "too many requests". If it is asynchronous, it'll be throttled and will be retried within a six hour period window. More detailed description you can find on this page.
Also I recently published a post in my blog, which is related to your question. You may find it useful.

how to reduce AWS Api gateway response time

response from an api from aws api gateway integrated to a lambda
function takes a lot more time than compared to a regular node project
on aws elasticbeanstalk
is there any way to reduce response time for aws api gateway
There's definitely more information needed to answer this question but from what you've said, your problems may be caused by the cold start time of Lambda functions. An Elastic Beanstalk stack will spin up EC2 instances (which are ready once they're spun up and until they're removed). Lambda will create instances of your handler as needed to address incoming traffic. The first time you call a Lambda, it needs to provision an environment for the function for the first time. Depending on the language used, this can take some time. Successive requests should be faster unless you wait a while (in which case the lambda needs to re-initialize).
So here's more information that would be useful in case this answer is not helpful:
How much slower is Lambda than your Elastic Beanstalk stack?
Is it slower only on the first couple requests or does it continue being slow when you keep requesting?
Is it slow every day or only occasionally?

API Gateway+Lambda+VPC timeout issue

Good morning, Could you please help us with next problem:
I have an API Gateway + Java Lambda Handler. this Lambda uses httpconnection to get some Internet REST API.
when we use this Lambda without VPC it works fine. but when we are using VPC with configured internet access - sometimes Lambda fails with timeout errors. it fails in 20% of all requests (80% requests works fine) with next errors at log.
REPORT RequestId: 16214561-b09a-11e6-a762-7546f12e61bd Duration: 15000.26 ms Billed Duration: 15000 ms Memory Size: 512 MB Max Memory Used: 47 MB
09:57:49
2016-11-22T09:57:49.245Z 16214561-b09a-11e6-a762-7546f12e61bd Task timed out after 15.00 seconds
According to my logs lambda cannot send GET request. I'm not sure where the problem at. Is this Lambda issue, VPC issue or some cofiguration issue.
Also I did try many different REST Api endpoints, so it's definetly not an endpoint issue.
Appreciate any help.
When you place a Lambda function inside your VPC it will not have access to anything outside the VPC. To enable your Lambda function to access resources outside the VPC you have to add a NAT Gateway to your VPC.
The problem is solved.
Lambda VPC configuration had public subnet attached.
Thanks to #Michael-sqlbot
I had pretty much the same issue a few months ago, and here is my solution:
Assuming you set up your Lambda manually, in the Configuration -> Advanced settings you will find the VPC and then choose subnet and security groups.
The Subnet you selected should be in the same subnet with other services the lambda function invokes. In your case, your lambda service uses httpconnection to Internet rest API, that's fine, but you may need DB connection with RDS or triggered by SQS or SNS. So make sure the subnet is correct.
The Security Groups is more important. Again, in your case, you need the access to Internet, so ensure the security group's outbound rules has external connections. Normally, I give all ports and all destination available for simplicity, and of course, you can limit to use port 80 and the API's IP address you need.
Since the executor is "locked" behind a VPC - all internet
communications are blocked.
That results in any http(s) calls to be timed out as they request
packet never gets to the destination.
That is why all actions done by aws-sdk result in a timeout.
Please refer to https://stackoverflow.com/a/39206646
From your log,
Billed Duration: 15000 ms
Memory Size: 512 MB
Max Memory Used: 47 MB
Solution:
It is timeout issue. You need to increase the execution time 15 seconds to 30 seconds or more if necessary.
In some cases, you also need to increase the memory size. It may also make effect. But I think time is the main fact for you, not memory size.
For timing issue and testing issue, you can go through the followings:
Q: How long can an AWS Lambda function execute?
Soluiton: All calls made to AWS Lambda must complete execution within 300 seconds. The default timeout is 3 seconds, but you can set the timeout to any value between 1 and 300 seconds.
To determine why your Lambda function is not working as expected:
You can test your code locally as you would any other Node.js function, or you can test it within the Lambda console using the console's test invoke functionality, or you can use the AWS CLI Invoke command. Each time the code is executed in response to an event, it writes a log entry into the log group associated with a Lambda function, which is /aws/lambda/.
If you see a timeout exceeded error in the log, your timeout setting
exceeds the run time of your function code. This may be because the
timeout is too low, or the code is taking too long to execute.
For solution:
Test your code with different memory settings.
If your code is taking too long to execute, it could be that it does not have enough compute resources to execute its logic. Try increasing the memory allocated to your function and testing the code again, using the Lambda console's test invoke functionality. You can see the memory used, code execution time, and memory allocated in the function log entries. Changing the memory setting can change how you are charged for duration. For information about pricing, see AWS Lambda.
Resource Link:
Troubleshooting and Monitoring AWS Lambda Functions with Amazon
CloudWatch
For testing, a full code example is given here: http://qiita.com/c9katayama/items/b9a30cdfaaa91cba23ad