Intermittent Internal Server Error - StatusCode 500 on API Gateway calling Lambda

I have a REST API in AWS API Gateway that invokes a Python Lambda function and returns some result.
Most of the time this workflow works fine: the Lambda function is executed and passes its result back to the API, which in turn returns a 200 OK response.
However, there are a few times when I get a 500 error code from the API and the Lambda doesn't even seem to be executed. The response.reason says "Internal Server Error" and no additional information is given.
There is no difference between the failing requests and the successful ones in terms of method or parameter format.
One more note: the API has caching enabled.
I've seen similar posts; some of the answers mention the format of the JSON object returned by the Lambda function, others point to IAM permission issues, but none of those seem to be the cause here. In fact, as this post's title says, the behavior is intermittent: most of the time it works fine, but occasionally I get this error.
Any hint would be highly appreciated.

I had the same problem, and in my case I had to enable "Log full requests/responses data" together with INFO-level logs on the API Gateway stage to see the following log entry:
(xxx) Endpoint response body before transformations:
{
"Type": "Service",
"message": "INFO: Lambda is initializing your function. It will be ready to invoke shortly."
}
In my case the issue was that the Lambda was in the Inactive state, which happens if a function remains idle for several weeks.
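A minimal boto3 sketch for checking this proactively (the function name is a placeholder); note that any invocation also transitions an Inactive function back to Active after a brief initialization:
import boto3

client = boto3.client("lambda")
# An idle function can report State == "Inactive" with
# StateReasonCode == "Idle".
cfg = client.get_function_configuration(FunctionName="my-function")
print(cfg["State"], cfg.get("StateReasonCode"))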

I have the same problem, and I suspect a timeout, possibly due to the Lambda reaching its memory limit.
I raised the memory limit (128 MB -> 512 MB) and increased the timeout to 10 s (the default is 3 s), and now I'm able to see the timeout in action.
I still have the problem for the moment, but now I'll be able to investigate.
I hope this helps you.
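For reference, a small boto3 sketch of the same change made outside the console (the function name is a placeholder):
import boto3

client = boto3.client("lambda")
# Raise memory and timeout; the values mirror the ones above.
client.update_function_configuration(
    FunctionName="my-function",
    MemorySize=512,  # MB
    Timeout=10,      # seconds (the default is 3)
)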

I see this with an HTTP API integration. It's intermittent, and it appears to improve when adding provisioned concurrency to the Lambda. For example, on a Lambda that has between 4 and 10 concurrent instances, but usually hovers in the 4-8 range, configuring 5 or 6 provisioned concurrent instances helped reduce, and possibly eliminate, these 500 errors.
I am still monitoring to see whether they are gone for good. The frequency of these errors has gone down drastically with the provisioned instances.
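For anyone wanting to script this, a minimal boto3 sketch (function name and alias are placeholders; provisioned concurrency must target a published version or alias, not $LATEST):
import boto3

client = boto3.client("lambda")
client.put_provisioned_concurrency_config(
    FunctionName="my-function",
    Qualifier="prod",  # alias or version number
    ProvisionedConcurrentExecutions=5,
)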

Related

How to change failure message for Alexa?

I want to change the default failure message in Alexa: "Sorry, I'm having trouble accessing your {} skill right now."
You cannot change that prompt, but you can code defensively to avoid it as much as possible. The error happens when Alexa is not able to get a valid response from your skill endpoint. There can be multiple reasons for that, as mentioned here:
1. Your endpoint is giving an invalid response
This can be due to errors/exceptions in your endpoint code. Make sure that errors/exceptions don't occur, and if they do, that there is code to catch them and return a valid response back to Alexa, with an error message of your choice.
2. Your endpoint availability
Make sure that your endpoints are available all the time if you have configured them as the skill endpoint. This is pretty much guaranteed if you are using Lambda endpoints. But if you are using your own hosted web service endpoint, then you must put in all the measures needed to keep it available for Alexa to communicate with.
3. Your endpoint response time
Make sure that your endpoint sends back its response within the time period that Alexa expects (around 8 seconds). Also make sure that, if you are using Lambda functions, you have configured them with a reasonable execution time to avoid timeout errors.
If you cover the exception/error/availability scenarios well then you can avoid the default error message as much as possible.
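A minimal sketch of point 1 for a Python Lambda endpoint (handle_request is a hypothetical stand-in for your actual skill logic):
def lambda_handler(event, context):
    try:
        return handle_request(event)  # your skill logic (hypothetical)
    except Exception:
        # Return a well-formed Alexa response so the user hears your
        # message instead of the default failure prompt.
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {
                    "type": "PlainText",
                    "text": "Sorry, something went wrong. Please try again.",
                },
                "shouldEndSession": True,
            },
        }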

Rate Exceeded on AWS Lambda Using API Gateway and serverless framework

When I try to invoke a method that has an HTTP event, it results in a 500 Internal Server Error.
The CloudWatch logs show: Recoverable error occurred (Rate Exceeded.)
When I invoke a function without an HTTP event, it executes and returns a response.
Here is my serverless config:
You have set your Lambda's reservedConcurrency to 0. This prevents your Lambda from ever being invoked. Setting it to 0 is usually useful when your functions are being invoked but you're not sure why, and you want to stop them right away.
If you want it to be invoked, change reservedConcurrency to a positive integer (by default it can be a positive integer <= 1000, but you can increase this limit by contacting AWS) or simply remove the reservedConcurrency attribute from your .yml file, in which case the default will be used.
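For illustration, a minimal serverless.yml sketch (function, handler, and path names are hypothetical):
functions:
  hello:
    handler: handler.hello
    # 0 blocks every invocation; use a positive integer,
    # or omit the attribute to fall back to the defaults.
    reservedConcurrency: 5
    events:
      - http:
          path: hello
          method: get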
Why would one ever use reservedConcurrency anyway? Let's say your Lambda functions are triggered by requests from API Gateway. At peak hours you get 400 requests/second, and upon every request two other Lambda functions are triggered: one to generate a thumbnail for a given image and one to insert some metadata into DynamoDB. In theory you'd have 1200 Lambda functions running at the same time (given that all of your functions finish in less than a second). This would lead to throttling, as the default concurrent execution limit for Lambda functions is 1000.
But is the thumbnail generation as important as the requests coming from API Gateway? Very likely not, as it's naturally an eventually consistent task, so you could set reservedConcurrency on the thumbnail Lambda to only 200. That way you wouldn't use up your concurrency, and other functions would be able to spin up to do something more useful at a given point in time (in our example, receiving HTTP requests is more important than generating thumbnails). The remaining 800 concurrency could then be split between the function triggered by API Gateway and the one that inserts data into DynamoDB, preventing throttling for the important stuff and keeping the not-so-important stuff eventually consistent.

AWS lambda execution fails only first time I run it with 'customer function error'

I trigger a Lambda function via API Gateway, and everything works perfectly, with one exception: the very first time I trigger it on a given day, it fails.
Strangely, the Lambda function's logs don't show any errors. I get my usual START log statement, then the request and context of the trigger, and then, after 5 s, it ends unexpectedly.
When I look into the API Gateway logs, this is the error it returns:
Lambda execution failed with status 200 due to customer function error: 2018-12-10T11:00:31.208Z cc233168-fc9n-11fc-a05a-577bb4sd2b2ccc Task timed out after 5.01 seconds.
Has anyone encountered a similar problem? What is a customer function error, and how can I resolve it?
Without knowing much about the background code you are using, I would term this a cold start. A cold start happens on the first request after your function has not been called for a long time. Notice that the error message says "Task timed out after 5.01 seconds", which matches the function's configured timeout; you can increase that timeout.
Alternatively, you could reduce the impact of cold starts by reducing their length. For reference:
1. author your Lambda functions in a language that doesn't incur a high cold-start time, i.e. Node.js, Python, or Go
2. choose a higher memory setting for functions on the critical path of handling user requests (i.e. anything that the user would have to wait for a response from, including intermediate APIs)
3. optimize your function's dependencies and package size
You can also explore keeping the function warm: set up a CloudWatch cron rule that pings your API at a specific interval, as sketched below.
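A hedged sketch of the receiving side, assuming the ping arrives as a CloudWatch Events scheduled event (handle_request is a hypothetical stand-in for the real handler logic):
def lambda_handler(event, context):
    # Scheduled CloudWatch events carry this detail-type; short-circuit
    # them so warm-up pings skip the real work but still keep the
    # container warm.
    if event.get("detail-type") == "Scheduled Event":
        return {"statusCode": 200, "body": "warm"}
    return handle_request(event)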
Adding to Yash's answer:
I've only seen Lambda execution failed with status 200 in the API Gateway execution logs, though in case it manifests in other ways: ensure you have execution logging enabled for the endpoint. If you didn't already have it enabled, you'll need to wait for the problem to manifest again.
You can verify it's a cold start problem as follows:
1. In the log entry with the error, grab the @logStream value and the timestamp for the event; it'll be a long string of alphanumerics like a4f8115980dc83a511eeedc493a78741
2. Open the log group for that endpoint's execution log and find the log stream with the identifier you just grabbed
3. Narrow the date/time range to a window around the time the event occurred
4. If you chose a narrow window and it's a cold-start problem, I would expect the offending request to be the first one in the list. Click "There are older events to load. Load more." at the top of the list.
You should now see a gap in time between the last request received and the offending request.
In my case the error says connection reset by peer, which leads me to think it's behaving as though a virtual machine were put to sleep and then awoken, in the sense that it believes TCP connections it previously had open are still valid.
In the short term, the solution we're going with is to implement a retry strategy, along the lines of the sketch below.
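A minimal sketch, assuming the requests library and a plain GET endpoint (tune attempts and backoff to taste):
import time
import requests

def call_with_retries(url, attempts=3, backoff=0.5):
    # Retry transient 5XX responses with exponential backoff.
    for i in range(attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code < 500:
            return resp
        time.sleep(backoff * (2 ** i))
    return resp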
Besides the cold-start problem, there's another potential aspect of this problem: your API Gateway access log format.
Do the following:
Find the access log entries that correspond to the offending request in the execution log.
Is the HTTP status == 502?
502s in the API Gateway access log usually (always?) indicate the Lambda responded with malformed JSON.
The most obvious reason for it returning malformed JSON is a bug in your code. One of the less obvious reasons: a mistake in the access log format.
If you suspect that's the case, look for the following:
Quoted fields that shouldn't be, e.g. $context.error.messageString
Un-quoted fields that should be. A common idiom is to leave numeric fields un-quoted because it makes Insights queries like this work: | filter #status >= 500. As convenient as that is, if the field isn't guaranteed to produce a numeric result, the logged JSON will be malformed.
Trailing commas in {} bodies
Here's the documentation for many of the context variables, though one thing to keep in mind: the context variables that are available differ between the API Gateway endpoint types (REST, HTTP, WebSocket).
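For illustration, a format sketch that avoids all three pitfalls; it assumes $context.status is always numeric and that $context.error.messageString arrives pre-quoted (the field selection is an example, not a prescription):
{
  "requestId": "$context.requestId",
  "ip": "$context.identity.sourceIp",
  "status": $context.status,
  "errorMessage": $context.error.messageString
}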

API Gateway occasionally spikes 5XX errors in Production

Our API Gateway and Lambdas are regularly used and work just fine most of the time; however, we see spikes of 5XX errors now and then, which cause a spike in customer complaints and other issues. When I look at the logs during this time I see a flood of the following error:
Execution failed due to configuration error: Malformed Lambda proxy response
There are no other details beyond this. After 10 or 15 minutes it goes away, along with the customer complaints. I've read that it may happen if you exceed your concurrency limit, but looking at the dashboard, it doesn't look like we ever break above 150 concurrent executions.
The calls themselves work consistently as well, aside from these random 5XX spikes.
What else might be causing this inconsistency?
I've been looking through the logs to try to get this figured out. I have made the logs as verbose as possible and there is nothing there. We'll have a normal call with a success response, then a few minutes later this error comes up with no other logging, just the error alone, and then a few minutes after that the logs for the next successful call start:
10:25:42 Successfully completed execution
10:25:42 Method completed with status: 200
10:42:01 Execution failed due to configuration error: Malformed Lambda proxy response
12:21:21 Successfully completed execution
12:21:21 Method completed with status: 200
Logging can't go further because the Lambdas are never even executed, so we have no details on the payload sent to them, no internal logging for the call, etc. It just immediately fails at the API Gateway level.
Edit: We still get these spikes, but we are working on splitting the Lambdas out more. We have an ExpressJS app that handles the lion's share of all requests, so we are breaking more routes off, especially high-traffic ones, into their own Lambdas to see if this helps, in case there is an issue where a container gets too backlogged or times out because it is handling long-running requests (which take upwards of 20 s) while also being hammered by requests that finish in under 500 ms.
Another theory is that an error gets triggered somewhere that kills the process, or something else, leaving that container bad until it gets destroyed and respawned; that would explain why these errors spike and then go away within a few minutes. Breaking the Lambdas up more should reduce the odds of an error in one cascading into all the other requests.
We also increased the resources of the Lambda to see if that would help it handle so many requests.
This usually happens when there is a timeout in your call or a delay in your Lambda execution.
If you are accessing an external resource such as RDS or making an external network call, wrap it with a promise and handle it with your own timeout. This way you can identify which resource has a bottleneck or takes a long time to execute:
exports.handler = function (event, context, callback) {
    var response = {}; // response object to return on timeout
    var err = "An error occurred";
    // Fire our own error callback if the work below hasn't finished
    // within 3000 ms, before Lambda's hard timeout kicks in.
    var timer = setTimeout(function () {
        callback(err, response);
    }, 3000);
    // Actual code here; call clearTimeout(timer) and then
    // callback(null, result) when the real work completes.
};
Also, check for any missing callbacks. That will also cause this issue.
Hope this helps.

Catching timeout errors in AWS Api Gateway

Since the API Gateway time limit to execute any request is 10 seconds, I'm trying to deal with timeout errors, but I haven't found a way to catch them and respond with a custom message.
Context of the problem: I have a function that takes less than 2 seconds to execute, but when the function performs a cold start it sometimes takes more than 10 seconds to create a connection with DynamoDB in Java. I've already optimized my function using threads, but I still cannot stay within the 10-second limit for the initial call.
I need to find a way to deliver a response model like this:
{
"error": "timeout"
}
To find a solution, I created a Lambda function that intentionally responds after more than 10 seconds of execution. Doing the integration with API Gateway, I'm getting this response:
Request: /example/lazy
Status:
Latency: ms
Response Body
{
"logref": "********-****-****-****-1d49e75b73de",
"message": "Timeout waiting for endpoint response"
}
In the documentation I found that you can catch these errors using an HTTP status regex in the Integration Response. But I haven't found a way to do so, and it seems that nobody else on the Internet is having the same problem, as I haven't found this specific message in any forum.
I have tried these regexes:
.*"message".*
Timeout.*
.*"status":400.*
.*"status":404.*
.*"status":504.*
.*"status":500.*
Does anybody know which regex I should use to capture this "message": "Timeout... ?
You are using the Test Invoke feature from the console, which has a timeout limit of 10 seconds. The deployed API's timeout is 30 seconds, as mentioned here. That should be good enough to handle the Lambda cold-start case. Please deploy and then test using the API link. If it still times out because your endpoint takes more than 30 seconds, the response will be:
{"message": "Endpoint request timed out"}
To clarify: you can configure your method response based on the HTTP status code of the integration response. But in the case of a timeout there is no integration response, so you cannot use that feature to configure the method response for timeouts.
You can improve the cold-start time by allocating more memory to your Lambda function. With 512 MB, I am seeing cold-start times of 8-9 seconds for functions written in Java. This improves to 2-3 seconds with 1536 MB of memory.
Amazon says that it is the CPU allocation that is really important, but there is no way to increase it directly; CPU allocation increases proportionally to memory.
And if you want close to zero cold start times, keeping the function warm is the way to go, as described here.