API Gateway+Lambda+VPC timeout issue - amazon-web-services

Good morning. Could you please help us with the following problem?
I have an API Gateway + Java Lambda handler. This Lambda uses httpconnection to call an Internet REST API.
When we use this Lambda without a VPC it works fine, but when we use a VPC with configured internet access, the Lambda sometimes fails with timeout errors. It fails on about 20% of requests (the other 80% work fine) with the following errors in the log:
REPORT RequestId: 16214561-b09a-11e6-a762-7546f12e61bd Duration: 15000.26 ms Billed Duration: 15000 ms Memory Size: 512 MB Max Memory Used: 47 MB
09:57:49
2016-11-22T09:57:49.245Z 16214561-b09a-11e6-a762-7546f12e61bd Task timed out after 15.00 seconds
According to my logs, the Lambda cannot send the GET request. I'm not sure where the problem is: is this a Lambda issue, a VPC issue, or some configuration issue?
I also tried many different REST API endpoints, so it's definitely not an endpoint issue.
I'd appreciate any help.

When you place a Lambda function inside your VPC, it will not have access to anything outside the VPC. To enable your Lambda function to access resources outside the VPC, you have to add a NAT Gateway to your VPC and attach the function to private subnets whose default route points to that NAT Gateway.

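The problem is solved.
The Lambda VPC configuration had a public subnet attached. A Lambda function in a VPC does not get a public IP address, so invocations placed in the public subnet could not reach the internet, which would explain the intermittent timeouts.
Thanks to @Michael-sqlbot.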
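To illustrate the difference between the subnets, here is a minimal Python sketch that classifies a subnet from the routes in its route table. It assumes the routes are dicts shaped like the "Routes" entries returned by EC2 DescribeRouteTables; the function name, the labels it returns, and the sample route tables are hypothetical.

```python
def classify_subnet(routes):
    """Classify a subnet by where its default route (0.0.0.0/0) points.

    `routes` is assumed to look like the "Routes" list of an EC2
    DescribeRouteTables response (DestinationCidrBlock, GatewayId,
    NatGatewayId). The returned labels are illustrative only.
    """
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("NatGatewayId", "").startswith("nat-"):
            return "private-with-nat"  # a VPC Lambda can reach the internet
        if route.get("GatewayId", "").startswith("igw-"):
            return "public"  # a VPC Lambda has no public IP, so calls time out
    return "isolated"  # no default route at all


# Hypothetical route tables for two subnets attached to the same function.
subnet_a = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"},
]
subnet_b = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123456789abcdef0"},
]

for name, routes in {"subnet_a": subnet_a, "subnet_b": subnet_b}.items():
    print(name, classify_subnet(routes))
```

If any subnet attached to the function comes out as "public" or "isolated", invocations that land there would be expected to time out on outbound internet calls.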

I had pretty much the same issue a few months ago, and here is my solution:
Assuming you set up your Lambda manually, under Configuration -> Advanced settings you will find the VPC setting, where you choose the subnets and security groups.
The subnet you select should be the same subnet as the other services the Lambda function invokes. In your case, the Lambda uses httpconnection to reach an Internet REST API, which is fine, but you may also need a DB connection to RDS, or to be triggered by SQS or SNS. So make sure the subnet is correct.
The security groups are more important. Again, in your case you need access to the Internet, so ensure the security group's outbound rules allow external connections. Normally I allow all ports and all destinations for simplicity; of course, you can limit this to port 80 and the IP address of the API you need.

Since the executor is "locked" behind a VPC, all internet communications are blocked.
As a result, any http(s) calls time out, because the request packet never gets to the destination.
That is why all actions done by aws-sdk result in a timeout.
Please refer to https://stackoverflow.com/a/39206646
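When outbound calls are blocked like this, the only log line is often the Lambda's own "Task timed out" message. A minimal Python sketch of setting an explicit client timeout, so the network failure is reported before the function timeout is reached, might look like the following. The function name `fetch`, the 5-second default, and the injectable `opener` parameter are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout_seconds=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail), failing fast instead of waiting for the Lambda timeout.

    `opener` defaults to urllib.request.urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return True, response.status
    except (socket.timeout, urllib.error.URLError) as exc:
        # Printing or logging this detail names the real network failure.
        return False, f"request to {url} failed: {exc!r}"
```

Keeping the client timeout well below the function timeout means a blocked route shows up as a logged connection error rather than a silent 15-second run.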

From your log,
Billed Duration: 15000 ms
Memory Size: 512 MB
Max Memory Used: 47 MB
Solution:
It is a timeout issue. You need to increase the execution timeout from 15 seconds to 30 seconds, or more if necessary.
In some cases you also need to increase the memory size, which may also have an effect. But I think time is the main factor for you, not memory size.
For timing and testing issues, you can go through the following:
Q: How long can an AWS Lambda function execute?
Solution: All calls made to AWS Lambda must complete execution within 300 seconds (this limit has since been raised to 900 seconds). The default timeout is 3 seconds, but you can set the timeout to any value between 1 second and the maximum.
To determine why your Lambda function is not working as expected:
You can test your code locally as you would any other Node.js function, or you can test it within the Lambda console using the console's test invoke functionality, or you can use the AWS CLI Invoke command. Each time the code is executed in response to an event, it writes a log entry into the log group associated with a Lambda function, which is /aws/lambda/.
If you see a timeout exceeded error in the log, the run time of your function code exceeds your timeout setting. This may be because the timeout is too low, or because the code is taking too long to execute.
For solution:
Test your code with different memory settings.
If your code is taking too long to execute, it could be that it does not have enough compute resources to execute its logic. Try increasing the memory allocated to your function and testing the code again, using the Lambda console's test invoke functionality. You can see the memory used, code execution time, and memory allocated in the function log entries. Changing the memory setting can change how you are charged for duration. For information about pricing, see AWS Lambda.
Resource Link:
Troubleshooting and Monitoring AWS Lambda Functions with Amazon CloudWatch
For testing, a full code example is given here: http://qiita.com/c9katayama/items/b9a30cdfaaa91cba23ad

Related

Resolve Performance Issues with NodeJS AWS Lambda API

I am new to AWS and having some difficulty tracking down and resolving some latency we are seeing on our API. Looking for some help diagnosing and resolving the issue.
Here is what we are seeing:
If an endpoint hasn't been hit recently, then on the first request we see a 1.5-2s delay marked as "Initialization" in the CloudWatch Trace.
I do not believe this is a cold start, because each endpoint is configured to have 1 provisioned concurrency, so we should not get a cold start unless there are 2 simultaneous requests. Also, the billed duration includes this initialization period.
A cold start means that when the first request hits AWS Lambda, it has to prepare a container to run your code; this takes some time, so the request is delayed.
When a second request hits Lambda and the container is already up and running, it is processed quickly.
https://aws.amazon.com/blogs/compute/operating-lambda-performance-optimization-part-1/
This is the default behavior of cold start, but since you said that you were using provisioned concurrency, that shouldn't happen.
Provisioned concurrency takes some time to activate in the account; you can follow these steps to verify whether this Lambda used on-demand or provisioned concurrency.
AWS Lambda sets an environment variable called AWS_LAMBDA_INITIALIZATION_TYPE that contains the value on-demand or provisioned-concurrency.
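A minimal sketch of that check, written in Python for brevity (the question concerns a Node.js API, where the same variable can be read from process.env). The handler name, the returned dict, and the "unknown" fallback are assumptions for illustration.

```python
import os


def initialization_type():
    """Read the initialization type that Lambda exposes to the function."""
    # Falls back to "unknown" when run outside Lambda, e.g. on a laptop.
    return os.environ.get("AWS_LAMBDA_INITIALIZATION_TYPE", "unknown")


def handler(event, context):
    init_type = initialization_type()
    # The printed line lands in the function's CloudWatch log stream.
    print(f"initialization type: {init_type}")
    return {"initialization_type": init_type}
```

Comparing the logged value on a slow first request against a fast one shows whether the slow request was served by a provisioned environment at all.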

AWS Lambda function: Timeout after 900 secs

I am invoking an AWS Lambda function locally using the aws-sam CLI, and I have set the Timeout property to 900 seconds, but it still shows a function timeout error. However, when I invoked this function through the Lambda handler in the AWS Console, 900 seconds was enough for the inference.
Please help me figure out a solution for this issue and what is the maximum limit I can go for Timeout?
AWS Lambda functions (as at July 2021) can only run for a maximum of 15 minutes (which is 900 seconds).
Some people do 'interesting' things like:
Call another Lambda function to continue the work, or
Use AWS Step Functions to orchestrate multiple AWS Lambda functions
However, it would appear that your use-case is Machine Learning, which does not like to have operations stopped in the middle of processing. Therefore, AWS Lambda is not suitable for your use-case.
Instead, I would recommend using Amazon EC2 spot instances, which will likely be lower-cost for your use-case. While spot instances might occasionally be terminated, your use-case can probably handle the need to re-run some processing if this happens.

Lambda execution time out after 15 mins what I can do?

I have a script running on Lambda. I've set the timeout to the maximum of 15 minutes, but it still gives me a timeout error. There is not much information in the logs. How can I solve this issue and spot what is taking so much time? I tested the script locally and it's fairly quick.
Here's the error:
{
"errorMessage": "2020-09-10T18:26:53.180Z xxxxxxxxxxxxxxx Task timed out after 900.10 seconds"
}
If you're exceeding the 15-minute limit, there are a few things you should check to identify the cause:
Is the Lambda connecting to resources in a VPC? If so, is it attached via VPC config, and do the target resources allow inbound access from the Lambda?
Is the Lambda connecting to a public IP but using VPC configuration? If so it will need a NAT attached to allow outbound access.
Are there any long running processes as part of your Lambda?
Once you've ruled these out, consider increasing the available resources of your Lambda; perhaps it's hitting a cap and is therefore performing slowly. Increasing the memory will also increase the available CPU.
Adding log statements to the code will write to CloudWatch Logs; these can help you identify where in the code the slowness starts. This is done by simply calling the general output/debug function of your language, e.g. print() in Python or console.log() in Node.js.
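As a minimal sketch of that kind of logging in Python, a small context manager can log the start and the elapsed time of each step, so the slow step stands out in CloudWatch. The helper name `timed`, the step names, and the trivial handler body are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)


@contextmanager
def timed(step_name):
    """Log when a step starts and how long it took, even if it raises."""
    start = time.perf_counter()
    LOGGER.info("start %s", step_name)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        LOGGER.info("end %s after %.3f s", step_name, elapsed)


def handler(event, context):
    # Each block stands in for a real step such as a DB query or an HTTP call.
    with timed("load input"):
        items = list(event.get("items", []))
    with timed("process"):
        total = sum(items)
    return {"total": total}
```

A step that logs "start" but never logs "end" before the timeout is the one to investigate first.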
If the function is still expected to run longer than 15 minutes after this, you will need to break it down into smaller functions that each perform a logical segment of the operation.
A suggested orchestrator for this is a Step Functions state machine that handles the workflow for each stage. If you need shared storage between the Lambdas, you can attach EFS to all of them so that they do not need to upload/download between operations.
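One way to split the work, sketched in Python under stated assumptions: process items until the invocation is close to its timeout, then return the unprocessed remainder so the next invocation (or the next state in the workflow) can continue. `get_remaining_time_in_millis()` is the method on the real Lambda context object; `process_batch`, the safety margin, and `FakeContext` are hypothetical names used for illustration.

```python
def process_batch(items, context, work, safety_margin_ms=10_000):
    """Process items until the invocation is close to its timeout.

    Returns (results, remaining). A non-empty `remaining` list is the input
    for the next invocation or the next workflow state.
    """
    results = []
    for index, item in enumerate(items):
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return results, items[index:]
        results.append(work(item))
    return results, []


class FakeContext:
    """Stand-in for the Lambda context object, for local runs only."""

    def __init__(self, budget_ms, cost_per_call_ms):
        self.remaining = budget_ms
        self.cost = cost_per_call_ms

    def get_remaining_time_in_millis(self):
        # Pretend each check consumes a fixed slice of the time budget.
        self.remaining -= self.cost
        return self.remaining


done, todo = process_batch([1, 2, 3, 4, 5], FakeContext(14_000, 1_000), lambda x: x * 2)
print(done, todo)
```

The handler would return `todo` in its response, and the orchestrator would loop until that list is empty.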
Your comment about it connecting to a SQL DB is likely the key. I assume that DB is in AWS in your VPC. This requires particular setup. Check out
https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
https://docs.aws.amazon.com/lambda/latest/dg/services-rds-tutorial.html
Another thing you can do is enable debug-level logging and then look at the details in CloudWatch after trying to run it. You didn't mention which language your Lambda uses, so how to do this could differ depending on the language. Here's how it would be done in Python:
import logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.DEBUG)

AWS Lambda inside VPC. 504 Gateway Timeout (ENI?)

I have a Serverless .NET Core Web API Lambda application deployed on AWS.
I have this sitting inside a VPC as I access ElasticSearch service inside that same VPC.
I have two API microservices that connect to the Elasticsearch service.
After a period of non-use (4 hours, 6 hours, 18 hours; I'm not sure exactly, but it seems random), the function becomes unresponsive and I get a 504 gateway timeout error, "Endpoint cannot be found".
I read somewhere that if "idle" for too long, the ENI is released back into the AWS system and that triggering the Lambda again should start it up.
I can't seem to "wake" up the function by calling it as it keeps timing out with the above error (I have also increased the timeouts from default).
Here's the kicker - If I make any changes to the specific lambda function, and save those changes (this includes something as simple as changing the timeout value) - My API calls (BOTH of them, even though different lambdas) start working again like it has "kicked" back in. Obviously the changes do this, but why?
Obviously I don't want timeouts in a production environment regardless of how much, OR how little the lambda or API call is used.
I need a bulletproof solution to this. Surely it's a config issue of some description but I'm just not sure where to look.
I have altered route tables, public/private subnets, and CIDR blocks, and created internet gateways, NAT, etc. for the VPC. This all works, but these two Lambdas, which require VPC access, keep falling "asleep" somehow.
This is because of Lambda cold starts.
There is a new feature, released at re:Invent 2019, that provides provisioned concurrency for Lambda (not to be confused with reserved concurrency).
Set the provisioned concurrency to at least 1 (or to the number of requests to be served in parallel) to keep the Lambda warm and ready to serve requests.
Ref: https://aws.amazon.com/blogs/aws/new-provisioned-concurrency-for-lambda-functions/
For more context: Lambda in a VPC uses Hyperplane ENIs, and functions in the same account that share the same security group:subnet pairing use the same network interfaces.
If the Lambda functions in an account go idle for some time (typically no usage for 40 minutes across all functions using that ENI, according to AWS support), the service will reclaim the unused Hyperplane resources, so very infrequently invoked functions may still see longer cold-start times.
Ref: https://aws.amazon.com/blogs/compute/announcing-improved-vpc-networking-for-aws-lambda-functions/

How does an AWS Lambda function scale inside a VPC subnet?

I understand that AWS Lambda is a serverless concept wherein a piece of code can be triggered by some event.
I want to understand how Lambda handles scaling.
For example, suppose my Lambda function sits inside a VPC subnet because it needs to access VPC resources, and the subnet has a CIDR of 192.168.1.0/24, which results in 251 available IPs after subtracting the 5 AWS-reserved IPs.
Would that mean that if my AWS Lambda function gets 252 invocations at the exact same time, only 251 of the requests would be served, and 1 would either time out or be executed once one of the other invocations completes?
Does the Subnet size matter for the AWS Lambda scaling?
I am following this reference doc which mentions concurrent execution limits per region,
Can I assume that irrespective of whether an AWS Lambda function is No VPC or if it's inside a VPC subnet, it will scale as per mentioned limits in the doc?
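The subnet arithmetic in the question can be checked with a short Python sketch using the standard ipaddress module. AWS reserves the first four addresses and the last address of every subnet; the function and constant names are hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last one in each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr):
    """Number of addresses AWS leaves available in a subnet with this CIDR."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


print(usable_addresses("192.168.1.0/24"))  # 256 - 5 = 251
```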
Vladyslav's answer is still technically correct (subnet size does matter), but things have changed significantly since it was written, and subnet size is now much less of a consideration. See AWS's announcement:
Because the network interfaces are shared across execution environments, typically only a handful of network interfaces are required per function. Every unique security group:subnet combination across functions in your account requires a distinct network interface. If a combination is shared across multiple functions in your account, we reuse the same network interface across functions.
Your function scaling is no longer directly tied to the number of network interfaces, and Hyperplane ENIs can scale to support large numbers of concurrent function executions.
Yes, you are right. Subnet size definitely does matter, so you have to be careful with your CIDR blocks. As for that one last invocation (the 252nd), it depends on how your Lambda is invoked: synchronously (e.g. API Gateway) or asynchronously (e.g. SQS). If it is invoked synchronously, it will simply be throttled and your API will respond with HTTP status 429, which stands for "too many requests". If it is invoked asynchronously, it will be throttled and retried within a six-hour window. You can find a more detailed description on this page.
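For the synchronous case, a caller can treat a 429 as a signal to retry later. A minimal Python sketch of client-side retry with exponential backoff and jitter follows; `send` is a hypothetical zero-argument callable that performs the request and returns (status, body), and the attempt count and delays are assumed values, not recommendations from AWS.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call `send()` until it returns a status other than 429.

    `send` is a hypothetical callable returning (status, body). `sleep` is a
    parameter only so the delay can be stubbed out when running locally.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter, so clients do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    # Still throttled after the last attempt: hand the 429 back to the caller.
    return status, body
```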
Also I recently published a post in my blog, which is related to your question. You may find it useful.