API Gateway occasionally spikes 5XX errors in Production

API Gateway occasionally spikes 5XX errors in Production - amazon-web-services

Our API Gateway and Lambdas are regularly used and work just fine most of the time, however we see spikes in 5XX errors now and then which causes a spike in customer complaints and other issues. When I look at the logs during this time I see a flood of the following error:
Execution failed due to configuration error: Malformed Lambda proxy response
There are no other details beyond this. After 10 or 15 minutes it will go away along with customer complaints. I've read that it may happen if you exceed your concurrency limit, but looking at the dashboard and it doesn't look like we ever break above 150 concurrent executions.
The calls themselves being hit work consistently as well, aside from these random spikes in 5XXs.
What else might be causing this inconsistency?
Looking through logs to try and get this figured out. I have made the logs as verbose as possible and there is nothing there. We'll have a normal call with a success response then a few minutes later this error comes up with no other logging, just the error alone. Then a few minutes after that we have logs starting for the next successful call.
10:25:42 Successfully completed execution
10:25:42 Method completed with status: 200
10:42:01 Execution failed due to configuration error: Malformed Lambda
proxy response
12:21:21 Successfully completed execution
12:21:21 Method completed with status: 200
Logging can't go further because the lambdas are never even executed. So we have no details on the payload sent to it, or any internal logging for the call, etc. It just immediately fails at the API Gateway level.
Edit: We still get these spikes but we are working on splitting the lambdas out more. We have an ExpressJS app that handles the lion's share of all requests. So we are breaking more off, especially high traffic requests, into their own lambdas to see if this helps. In-case there is an issue where a container gets too backlogged or times-out because it was handling long running requests (that takes upwards of 20s) as well as being hammered by requests that finish < 500ms.
Other theory is that maybe there is an error that gets triggered somewhere that kills the process or something else and that container is bad until it gets destroyed and respawned. As these spike and then go away in a few minutes. So breaking the lambdas up more should reduce the odds of errors from one cascading and impacting all other requests.
We also increase the resources of the lambda to see if that would help with it handling so many requests.

This usually happens when there is a timeout with your call and if there is a delay with your lambda execution.
If you are accessing an external resource such as RDS or an external network call, wrap that with a promise and handle with a timeout. This way you can identify which resource is having a bottleneck or taking a long time to execute.
exports.handler = function(event, context, callback) {
var response = {}; // set the response object
var err = "An error occured";
setTimeout(function () {
callback(err, response);
}, 3000); // 3000 ms is the timeout
}
// Actual code here
};
Also, check for any missing callbacks. That will also cause this issue.
Hope this helps.

Related

Intermittent Internal Server Error - StatusCode 500 on API Gateway calling Lambda

I have a REST API in AWS API Gateway that invokes a Python Lambda function and returns some result.
Most of the times this workflow works fine, meaning that the Lambda function is executed and passes the result back to the API, which in turn returns a 200 OK response.
However, there are few times in which I get a 500 error code from the API and the Lambda seems not to be even executed. The response.reason says: "Internal Server Error" and no additional information is given.
There is no difference between the failing requests and the successful ones to the API in terms of the method or parameters format.
One more comment is that the API has the cache setting enabled.
I've seen similar posts and some of the answers mention the format of the JSON object returned by the Lambda function, others point to IAM permissions issues, but none of those seem to be the cause here. In fact, as this post's title says this is an intermittent behavior: most of the times it works fine, but occasionally I get this error.
Any hint would be highly appreciated.

I have the same problem and in my case I had to enable Log full requests/responses data together with INFO logs on the API Gateway stage to see the following logs:
(xxx) Endpoint response body before transformations:
{
"Type": "Service",
"message": "INFO: Lambda is initializing your function. It will be ready to invoke shortly."
}
In my case the issue was related to the fact that the lambda was in Inactive state, which happens If a function remains idle for several weeks.

I have the same problem and I suspect a timeout maybe due to lambda reaching its memory limit.
I have set the memory limit to the next notch (128 -> 512) and augmented the timeout to 10s (default is 3), and now I'm able to see the timeout in action.
I still have the problem for the moment but now I'll be able to investigate.
I hope that this helps you.

I see this with a HTTP API integration. It's intermittent, and it appears to improve when adding provisioned concurrency to the Lambda. For example, on a Lambda that has between 4 and 10 concurrent instances, but usually hovers in the 4 to 8 range, purchasing between 5 and 6 provisioned concurrent instances helped reduce, possibly eliminate, these 500 errors.
I am still monitoring to see whether they are gone for good. The frequency of these errors has gone down drastically with the provisioned instances.

AWS lambda execution fails only first time I run it with 'customer function error'

I trigger a lambda function via API gateway and everything works perfectly with the one exception that the very first time I trigger it on a given day it fails.
Strangely, the lambda function logs don't show any errors. I get my usual START log statement and then the request and context of the trigger, then after 5s, it ends unexpectedly.
When I look into the API gateway logs this is the error it returns:
Lambda execution failed with status 200 due to customer function error: 2018-12-10T11:00:31.208Z cc233168-fc9n-11fc-a05a-577bb4sd2b2ccc Task timed out after 5.01 seconds.
Has anyone encountered a similar problem? What is customer function error and how may I resolve this?

without knowing much of the background code you are using, i would termed this a Cold Start. Cold start happens for the first request where your function has not be called for a very long time. If you notice error message says "Time Out after 5.01 seconds. which is default set. you can increase a time out.
Alternatively, you could consider reducing the impact of cold starts by reducing the length of cold starts reference :
by authoring your Lambda functions in a language that doesn’t incur a high cold start time — i.e. Node.js, Python, or Go
choose a higher memory setting for functions on the critical path of handling user requests (i.e. anything that the user would have to wait for a response from, including intermediate APIs)
optimizing your function’s dependencies, and package size
You can also explore by putting a cron job through Cloud Watch after every specific interval to call your API through PING

Adding to Yash's answer:
I've only seen Lambda execution failed with status 200 in API Gateway execution logs, though in case it can manifest in other ways: ensure you have execution logging enabled for the endpoint. If you didn't already have it enabled you'll need to wait for the problem to manifest again.
You can verify it's a cold start problem as follows:
In the log entry with the error grab the #logStream value and the timestamp for the event; it'll be a long string of alphanumerics like a4f8115980dc83a511eeedc493a78741
Open the log group for that endpoint's execution log -> find the log stream with the identifier you just grabbed
Narrow the date/time range to a window around the time where the event occurred
If you chose a narrow window and if it's a cold start problem: I would expect the offending request to be the first one in the list. Click the There are older events to load. Load more. at the top of the list.
You should now see a gap of time between the last request received and the offending request.
In my case the error says connection reset by peer which leads me to think it's behaving as though a virtual machine were put to sleep then awoken in the sense that it believes TCP connections it previously had open are still valid.
In the short term the solution we're going with is to implement a retry strategy.
Besides the cold-start problem, there's another potential aspect of this problem: your API Gateway access log format.
Do the following:
Find the access log entries that correspond to the offending request in the execution log.
Is the HTTP status == 502?
502s in the API Gateway access log usually (always?) indicate the Lambda responded with malformed JSON.
The most obvious reason for it returning malformed JSON is a bug in your code. One of the less obvious reasons: a mistake in the access log format.
If you suspect that's the case, look for the following:
Quoted fields that shouldn't be; eg $context.error.messageString
Un-quoted fields that should be. A common idiom is to leave numeric fields un-quoted because it makes insights queries like this work: | filter #status >= 500. As convenient as that is, if the field isn't guaranteed to produce a numeric result then the JSON response will be malformed.
Trailing commas in {} bodies
Here's the documentation for many of the the context variables, though one thing to keep in mind: the context variables that are available differ between the different API Gateway endpoint types (lambda, websocket, etc).

AWS Lambda execution duration randomly spikes and causes time-outs

I'm building a server-less web-tracking system which serves its tracking pixel using AWS API Gateway, which calls a Lambda function whenever a tracking request arrives to write the tracking event into a Kinesis stream.
The Lambda function itself does not do anything fancy. It just a takes the incoming event (its own argument) and writes it to the stream. Essentially, it's just:
import boto3
kinesis_client = boto3.client("kinesis")
kinesis_stream = "my_stream_name"
def return_tracking_pixel(event, context):
...
new_record = ...(event)
kinesis_client.put_record(
StreamName=kinesis_stream,
Data=new_record,
PartitionKey=...
)
return ...
Sometimes I experience a weird spike in the Lambda execution duration that causes some of my Lambda function invocations to time-out and the tracking requests to be lost.
This is the graph of 1-minute invocation counts of the Lambda function in the in affected time period:
Between 20:50 and 23:10 I suddenly see many invocation errors (1-minute error counts):
which are obviously caused by the Lambda execution time-out (maximum duration in 1-minute intervals):
There is nothing weird going on neither with my Kinesis stream (data-in, number of put records, put_record success count etc., all looks normal), nor with my API GW (number of invocations corresponds to number of API GW calls, well within the limits of the API GW).
What could be causing the sudden (and seemingly randomly occurring) spike in the Lambda function execution duration?
EDIT: neither the lambda functions are being throttled, which was my first idea.

Just to add my 2 cents, because there's not much investigative work without extra logging or some X-Ray analysis.
AWS Lambda sometimes will force recycle containers which will feel like cold starts even though your function is being reasonably exercised and warmed up. This might bring all cold start related issues, like extra delays for ENIs if your Lambda has an attached VPC and so on... but even for a simple function like yours, 1 second timeout is sometimes too optimistic for a cold start.
I don't know of any documentation on those forced recycles, other than some people having evidence for it.
"We see a forced recycle about 7 times a day." source
"It also appears that even once warmed, high concurrency functions get recycled much faster than those with just a few in memory." source
I wonder how you could confirm this is the case. Perhaps you could check those errors appearing in Cloud Watch log streams to be from containers that never appeared before.

Catching timeout errors in AWS Api Gateway

Since Api Gateway time limit is 10 seconds to execute any request I'm trying to deal with timeout errors, but a haven't found a way to catch and respond a custom message.
Context of the problem: I have a function that takes less than 2 seconds to execute, but when the function performs a cold start sometimes it takes more than 10 seconds creating a connection with DynamoDB in Java. I've already optimize my function using threads but I still cannot keep between the 10-seconds limit for the initial call.
I need to find a way to deliver a response model like this:
{
"error": "timeout"
}
To find a solution I created a function in Lambda that intentionally responds something after 10 seconds of execution. Doing the integration with Api Gateway I'm getting this response:
Request: /example/lazy
Status:
Latency: ms
Response Body
{
"logref": "********-****-****-****-1d49e75b73de",
"message": "Timeout waiting for endpoint response"
}
In documentation I found that you can catch this errors using HTTP status regex in Integration Response. But I haven't find a way to do so, and it seems that nobody on the Internet is having my same problem, as I haven't find this specific message in any forum.
I have tried with these regex:
.*"message".*
Timeout.*
.*"status":400.*
.*"status":404.*
.*"status":504.*
.*"status":500.*
Anybody knows witch regex I should use to capture this "message": "Timeout... ?

You are using Test Invoke feature from console which has a timeout limit of 10 seconds. But, the deployed API's timeout is 30 seconds as mentioned here. So, that should be good enough to handle Lambda cold start case. Please deploy and then test using the api link. If that times out because your endpoint takes more than 30 seconds, the response would be:
{"message": "Endpoint request timed out"}
To clarify, you can configure your method response based on the HTTP status code of integration response. But in case of timeout, there is no integration response. So, you cannot use that feature to configure the method response during timeout.

You can improve the cold start time by allocating more memory to your Lambda function. With the default 512MB, I am seeing cold start times of 8-9 seconds for functions written in Java. This improves to 2-3 seconds with 1536MB of memory.
Amazon says that it is the CPU allocation that is really important, but there is not way to directly increase it. CPU allocation increases proportionately to memory.
And if you want close to zero cold start times, keeping the function warm is the way to go, as described here.

Amazon API gateway timeout

I have some issue with API gateway. I made a few API methods, sometimes they work longer than 10 seconds and Amazon returns 504 error. Here is screenshot below:
Please help! How can I increase timeout?
Thanks!

Right now the default limit for Lambda invocation or HTTP integration is 30s according to http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html and this limit is not configurable.

As of Dec/2017, the maximum value is still 29 seconds, but should be able to customize the timeout value.
https://aws.amazon.com/about-aws/whats-new/2017/11/customize-integration-timeouts-in-amazon-api-gateway/
This can be set in "Integration Request" of each method in APIGateway.

Finally in 2022 we have a workaround. Unfortunately AWS did not change the API Gateway so that's still 29 seconds but, you can use a built-in HTTPS endpoint in the lambda itself: Built-in HTTPS Endpoints for Single-Function Microservices
which is confirmed to have no timeout-so essentially you can have the full 15 minute window of lambda timeout: https://twitter.com/alex_casalboni/status/1511973229740666883
For example this is how you define a function with the http endpoint using aws-cdk and typescript:
const backendApi = new lambda.Function(this, 'backend-api', {
memorySize: 512,
timeout: cdk.Duration.seconds(40),
runtime: lambda.Runtime.NODEJS_16_X,
architecture: Architecture.ARM_64,
handler: 'lambda.handler',
code: lambda.Code.fromAsset(path.join(__dirname, '../dist')),
environment: {
...parsedDotenv
}
})
backendApi.addFunctionUrl({
authType: lambda.FunctionUrlAuthType.NONE,
cors: {
// Allow this to be called from websites on https://example.com.
// Can also be ['*'] to allow all domain.
allowedOrigins: ['*']
}
})

You can't increase the timeout, at least not now. Your endpoints must complete in 10 seconds or less. You need to work on improving the speed of your endpoints.
http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html

Lambda functions will timeout after a max. of 5 min; API Gateway requests will timeout after 29 sec. You can't change that, but you can workaround it with asynchronous execution pattern, I wrote I blog post about:
https://joarleymoraes.com/serverless-long-running-http-requests/

I wanted to comment on "joarleymoraes" post but don't have enough reputation. The only thing to add to that is that you don't HAVE to refactor to use async, it just depends on your backend and how you can split it up + your client side retries.
If you aren't seeing a high percentage of 504's and you aren't ready for async processing, you can implement client side retries with exponential backoff on them so they aren't permanent failures.
The AWS SDK automatically implements retries with backoff, so it can help to make it easier, especially since Lambda Layers will allow you to maintain the SDK for your functions without having to constantly update your deployment packages.
Once you do that it will result in less visibility into those timeouts, since they are no longer permanent failures. This can buy you some time to deal with the core problem, which is that you are seeing 504's in the first place. That certainly can mean refactoring your code to be more response, splitting up large functions into more "micro service" type concepts and reducing external network calls.
The other benefit to retries is that if you retry all 5xx responses from an application, it can cover a lot of different issues which you might see during normal execution. It is generally considered in all applications that these issues are never 100% avoidable so it's best practice to go ahead and plan for the worst!
All of that being said, you should still work on reducing the lambda execution time or going async. This will allow you to set your timeout values to a much smaller number, which allows you to fail faster. This helps a lot for reducing the impact on the front end, since it doesn't have to wait 29 seconds to retry a failed request.

Timeouts can be decreased but cannot be increased more than 29 seconds. The backend on your method should return a response before 29 seconds else API gateway will throw 504 timeout error.
Alternatively, as suggested in some answers above, you can change the backend to send status code 202 (Accepted) meaning the request has been received successfully and the backend then continues further processing. Of course, we need to consider the use case and it's requirements before implementing the workaround

Lambda functions have 15 mins of max execution time, but since APIGateway has strict 29 second timeout policy, you can do following things to over come this.
For an immediate fix, try increasing your lambda function size. Eg.: If your lambda function has 128 MB memory, you can increase it to 256 MB. More memory helps function to execute faster.
OR
You can use lambdaInvoke() function which is part of the "aws-sdk". With lambdaInvoke() instead of going through APIGateway you can directly call that function. But this is useful on server side only.
OR
The best method to tackle this is -> Make request to APIGateway -> Inside the function push the received data into an SQS Queue -> Immediately return the response -> Have a lambda function ready which triggers when data available in this SQS Queue -> Inside this triggered function do your actual time complex executions -> Save the data to a data store -> If call is comes from client side(browser/mobile app) then implement long-polling to get the final processed result from the same data store.
Now since api is immediately returning the response after pushing data to SQS, your main function execution time will be much less now, and will resolve the APIGateway timeout issue.
There are other methods like using WebSockets, Writing event driven code etc. But above methods are much simpler to implement and manage.

While you cannot increase the timeout, you can link lambda's together if the work is something that could be split up.
Using the aws sdk:
var aws = require('aws-sdk');
var lambda = new aws.Lambda({
region: 'us-west-2' //change to your region
});
lambda.invoke({
FunctionName: 'name_of_your_lambda_function',
Payload: JSON.stringify(event, null, 2) // pass params
}, function(error, data) {
if (error) {
context.done('error', error);
}
if(data.Payload){
context.succeed(data.Payload)
}
});
Source: Can an AWS Lambda function call another
AWS Documentation: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html

As of May 21, 2021 This is still the same. The hard limit for the maximum time is 30 seconds. Below is the official document on quotas for API gateway.
https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html#http-api-quotas

The timeout limits cannot be increased so a response should be returned within 30 seconds. The workaround I usually do :
Send the result in an async way. The lambda function should trigger
another process and sends a response to the client saying
Successfully started process X and the other process should notify
the client in async way once it finishes (Hit an endpoint, Send a
slack notification or an email..). You can found a lot of interesting resources concerning this topic
Utilize the full potential of the multiprocessing in your lambda
function and increase the memory for a faster computing time
Eventually, if you need to return the result in a sync way and one
lambda function cannot do the job, you could integrate API gateway
directly with step function so you would have multiple lambda
function working in parallel. It may seem complicated but in fact it
is quite simple

Custom timeout between 50 and 29,000 milliseconds for WebSocket APIs and between 50 and 30,000 milliseconds for HTTP APIs. The default timeout is 29 seconds for WebSocket APIs and 30 seconds for HTTP APIs

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js