AWS Lambda execution fails only the first time I run it with 'customer function error' - amazon-web-services

I trigger a Lambda function via API Gateway, and everything works perfectly, with one exception: the very first time I trigger it on a given day, it fails.
Strangely, the Lambda function's logs don't show any errors. I get my usual START log statement, then the request and context of the trigger; then, after 5 s, it ends unexpectedly.
When I look into the API Gateway logs, this is the error it returns:
Lambda execution failed with status 200 due to customer function error: 2018-12-10T11:00:31.208Z cc233168-fc9n-11fc-a05a-577bb4sd2b2ccc Task timed out after 5.01 seconds.
Has anyone encountered a similar problem? What is a 'customer function error', and how can I resolve it?

Without knowing much about the code you are using, I would term this a cold start. A cold start happens on the first request after your function has not been called for a long time. Notice the error message says "Task timed out after 5.01 seconds": the function is hitting its configured 5-second timeout, which you can increase.
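For example, a minimal sketch of raising the timeout with boto3 (the function name is a placeholder; the same change can be made in the Lambda console):

```python
import boto3

lambda_client = boto3.client("lambda")

# Raise the function's timeout (in seconds); the hard upper limit is 900.
lambda_client.update_function_configuration(
    FunctionName="my-function",  # placeholder name
    Timeout=30,
)
```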
Alternatively, you can reduce the impact of cold starts by reducing their duration:
author your Lambda functions in a language that doesn't incur a high cold-start time, e.g. Node.js, Python, or Go;
choose a higher memory setting for functions on the critical path of handling user requests (i.e. anything the user has to wait on for a response, including intermediate APIs);
optimize your function's dependencies and package size.
You can also set up a scheduled CloudWatch Events rule to ping your API at a fixed interval so the function stays warm.
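If you go the keep-warm route, the handler can short-circuit the pings so they stay cheap. A minimal sketch, assuming the pings come from a CloudWatch Events/EventBridge scheduled rule (which sets "source": "aws.events" on the event):

```python
import json

def handler(event, context):
    # Scheduled CloudWatch Events / EventBridge pings carry
    # "source": "aws.events"; answer them immediately and skip real work.
    if event.get("source") == "aws.events":
        return {"statusCode": 200, "body": "warm-up ping"}

    # ... normal handling of API Gateway requests goes here ...
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "ok"}),
    }
```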

Adding to Yash's answer:
I've only seen Lambda execution failed with status 200 in API Gateway execution logs, though it may manifest in other ways too. Either way, ensure you have execution logging enabled for the endpoint; if you didn't already have it enabled, you'll need to wait for the problem to manifest again.
You can verify it's a cold start problem as follows:
In the log entry with the error, grab the #logStream value and the timestamp of the event; the stream identifier is a long alphanumeric string like a4f8115980dc83a511eeedc493a78741
Open the log group for that endpoint's execution log -> find the log stream with the identifier you just grabbed
Narrow the date/time range to a window around the time where the event occurred
If you chose a narrow window and it's a cold-start problem, I would expect the offending request to be the first one in the list. Click "There are older events to load. Load more." at the top of the list.
You should now see a gap of time between the last request received and the offending request.
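If you'd rather script this check, the same lookup can be done with boto3 (the group/stream names and timestamps below are placeholders):

```python
import boto3

logs = boto3.client("logs")

# Fetch events around the error from the execution-log stream whose
# identifier you grabbed above (all names and times are placeholders).
events = logs.get_log_events(
    logGroupName="API-Gateway-Execution-Logs_abc123/prod",
    logStreamName="a4f8115980dc83a511eeedc493a78741",
    startTime=1544439600000,  # window start, in ms since the epoch
    endTime=1544443200000,    # window end, in ms since the epoch
)
for event in events["events"]:
    print(event["timestamp"], event["message"][:120])
```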
In my case the error says connection reset by peer, which leads me to think it's behaving as though a virtual machine was put to sleep and then awoken: it believes TCP connections it previously had open are still valid.
In the short term the solution we're going with is to implement a retry strategy.
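A rough sketch of such a retry strategy on the calling side (assuming the Python requests library; names and limits are illustrative):

```python
import time

import requests

def call_with_retry(url, attempts=3, backoff=0.5):
    """Retry transient failures such as 'connection reset by peer'."""
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code < 500:
                return resp
        except requests.exceptions.ConnectionError:
            pass  # connection resets surface here
        time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"{url} still failing after {attempts} attempts")
```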
Besides the cold-start problem, there's another potential aspect of this problem: your API Gateway access log format.
Do the following:
Find the access log entries that correspond to the offending request in the execution log.
Is the HTTP status == 502?
502s in the API Gateway access log usually (always?) indicate the Lambda responded with malformed JSON.
The most obvious reason for it returning malformed JSON is a bug in your code. One of the less obvious reasons: a mistake in the access log format.
If you suspect that's the case, look for the following:
Quoted fields that shouldn't be, e.g. $context.error.messageString
Un-quoted fields that should be quoted. A common idiom is to leave numeric fields un-quoted because it makes Insights queries like this work: | filter #status >= 500. As convenient as that is, if the field isn't guaranteed to produce a numeric result, the JSON will be malformed.
Trailing commas in {} bodies
Here's the documentation for many of the context variables, though one thing to keep in mind: the context variables that are available differ between the API Gateway endpoint types (Lambda, WebSocket, etc.).
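Relatedly, since a 502 usually means API Gateway couldn't parse what the Lambda returned, here's a sketch of the response shape the Lambda proxy integration expects (a minimal example, not the asker's code):

```python
import json

def handler(event, context):
    # The Lambda proxy integration expects exactly this shape; a missing
    # statusCode or a non-string body yields a 502 from API Gateway.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"ok": True}),
    }
```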

Related

GCP Alert Filters Don't Affect Open Incidents

I have an alert configured to send an email when the sum of executions of Cloud Functions that finished in a status other than 'error' or 'ok' is above 0 (grouped by function name).
The way I defined the alert is shown in the attached screenshot; the secondary aggregator is delta.
The problem is that once the alert is open, the filters no longer seem to matter: the alert stays open because it sees the Cloud Function being triggered and finishing in any status (even 'ok' keeps it open, as long as the function is triggered often enough).
At the moment, the only solution I can think of is to define a log-based metric that does the counting itself, and then base the alert on that custom metric instead of the built-in one.
Is there something that I'm missing?
Edit:
Adding another image to show what I think might be the problem:
From the image above we see that the graph won't go down to 0 but stays at 1, which is not how other, normal incidents behave.
According to the official documentation:
"Monitoring automatically closes an incident when it observes that the condition is no longer met or when 7 days have passed without an observation that the condition is still being met."
That made me think there are cases where the condition isn't re-evaluated in a way that would close the incident, which is confirmed here:
"If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions."
The lack of HTTP requests isn't a reason to close the incident, because the policy keeps using the last recorded value (the one that triggered the alert).
So, using alerts on HTTP requests is fine, but you need to close the incidents yourself. Although I think it would be better to use a custom metric instead if you want them to close automatically.
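Following the asker's log-based metric idea, here's a rough sketch with the google-cloud-logging Python client; the filter is hypothetical and would need to match what your functions actually log:

```python
from google.cloud import logging

client = logging.Client()

# Hypothetical filter: count Cloud Functions executions that finished
# in a status other than 'ok' or 'error' -- adjust to your actual logs.
metric = client.metric(
    "function-unexpected-status",
    filter_=(
        'resource.type="cloud_function" '
        'AND textPayload:"finished with status" '
        'AND NOT (textPayload:"ok" OR textPayload:"error")'
    ),
    description="Executions finishing in a status other than ok/error",
)
metric.create()
```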

Intermittent Internal Server Error - StatusCode 500 on API Gateway calling Lambda

I have a REST API in AWS API Gateway that invokes a Python Lambda function and returns some result.
Most of the times this workflow works fine, meaning that the Lambda function is executed and passes the result back to the API, which in turn returns a 200 OK response.
However, there are a few times when I get a 500 error code from the API, and the Lambda doesn't even seem to be executed. The response.reason says "Internal Server Error", and no additional information is given.
There is no difference between the failing requests and the successful ones to the API in terms of the method or parameters format.
One more comment is that the API has the cache setting enabled.
I've seen similar posts; some of the answers mention the format of the JSON object returned by the Lambda function, others point to IAM permission issues, but none of those seem to be the cause here. In fact, as this post's title says, this is intermittent behavior: most of the time it works fine, but occasionally I get this error.
Any hint would be highly appreciated.
I have the same problem, and in my case I had to enable "Log full requests/responses data" together with INFO-level logging on the API Gateway stage to see the following log entry:
(xxx) Endpoint response body before transformations:
{
"Type": "Service",
"message": "INFO: Lambda is initializing your function. It will be ready to invoke shortly."
}
In my case the issue was that the Lambda was in the Inactive state, which happens if a function remains idle for several weeks.
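You can check for this condition with boto3; a small sketch (the function name is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# State is one of Pending, Active, Inactive, or Failed; an Inactive
# function is re-initialized on the next invoke, which can surface as
# a 500 at API Gateway while initialization is still in progress.
config = lambda_client.get_function_configuration(FunctionName="my-function")
print(config["State"], config.get("StateReason"))
```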
I have the same problem, and I suspect a timeout, maybe due to the Lambda reaching its memory limit.
I raised the memory limit (128 -> 512 MB) and increased the timeout to 10 s (the default is 3 s), and now I'm able to see the timeout in action.
I still have the problem for the moment but now I'll be able to investigate.
I hope that this helps you.
I see this with an HTTP API integration. It's intermittent, and it appears to improve when adding provisioned concurrency to the Lambda. For example, on a Lambda that has between 4 and 10 concurrent instances, but usually hovers in the 4-8 range, purchasing 5-6 provisioned concurrent instances helped reduce, and possibly eliminate, these 500 errors.
I am still monitoring to see whether they are gone for good. The frequency of these errors has gone down drastically with the provisioned instances.
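For reference, a sketch of configuring provisioned concurrency with boto3 (the function name and alias are placeholders; note it must target a published version or alias, not $LATEST):

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep five execution environments initialized for the 'live' alias
# (placeholder names), avoiding cold starts for those invocations.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="my-function",
    Qualifier="live",
    ProvisionedConcurrentExecutions=5,
)
```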

How to change failure message for Alexa?

I want to change the default failure message in Alexa: "Sorry, I'm having trouble accessing your {} skill right now."
You cannot change that prompt, but you can write your code to avoid it as much as possible. The error happens when Alexa is not able to get a valid response from your skill's endpoint. There can be multiple reasons for that, as mentioned here:
1. Your endpoint is returning an invalid response
This can be due to errors/exceptions happening in your endpoint code. Make sure errors/exceptions don't occur, and if they do, that there is code to catch them and return a valid response back to Alexa, with an error message of your choice.
2. Your endpoint's availability
Make sure your endpoint is available whenever Alexa calls it. This is pretty much guaranteed if you use a Lambda endpoint. But if you host your own web service endpoint, you must put in all the measures needed to keep it available for Alexa to communicate with.
3. Your endpoint's response time
Make sure your endpoint returns its response within the time period Alexa expects (about 8 seconds). Also, if you are using Lambda functions, make sure they are configured with a reasonable execution time to avoid timeout errors.
If you cover the exception/error/availability scenarios well, you can avoid the default error message as much as possible.
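For the first point, a minimal catch-all sketch using the Python ASK SDK (assuming your skill is built on ask-sdk-core; the speech text is up to you):

```python
from ask_sdk_core.dispatch_components import AbstractExceptionHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_model import Response


class CatchAllExceptionHandler(AbstractExceptionHandler):
    """Return a valid response for any unhandled exception, so Alexa
    speaks our message instead of the default failure prompt."""

    def can_handle(self, handler_input: HandlerInput, exception: Exception) -> bool:
        return True  # catch everything the other handlers missed

    def handle(self, handler_input: HandlerInput, exception: Exception) -> Response:
        speech = "Sorry, something went wrong. Please try again."
        return handler_input.response_builder.speak(speech).response

# Register it last on your SkillBuilder:
#   sb.add_exception_handler(CatchAllExceptionHandler())
```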

Rate Exceeded on AWS Lambda Using API Gateway and serverless framework

When I try to invoke a method that has an HTTP event, it results in a 500 Internal Server Error.
On CloudWatch logs it shows Recoverable error occurred (Rate Exceeded.)
When I invoke a function directly, without the HTTP event, it executes and returns a response.
Here is my serverless config:
You have set your Lambda's reservedConcurrency to 0. This prevents your Lambda from ever being invoked. Setting it to 0 is usually useful when your functions are being invoked, you're not sure why, and you want to stop them right away.
If you want it to be invoked, change reservedConcurrency to a positive integer (by default it can be any positive integer <= 1000, and you can raise this limit by contacting AWS), or simply remove the reservedConcurrency attribute from your .yml file so the default is used.
Why would one ever use reservedConcurrency anyway? Let's say your Lambda functions are triggered by requests from API Gateway. At peak hours you get 400 requests/second, and upon every request two other Lambda functions are triggered: one to generate a thumbnail for a given image, and one to insert some metadata into DynamoDB. In theory you'd have 1,200 Lambda functions running at the same time (assuming every function finishes in under a second). This would lead to throttling, because the default concurrent-execution limit for Lambda is 1000.

But is the thumbnail generation as important as the requests coming from API Gateway? Very likely not, as it's naturally an eventually consistent task, so you could set reservedConcurrency on the thumbnail Lambda to only 200. That way it can't use up your concurrency, and other functions are able to spin up to do something more useful at a given point in time (in this example, receiving HTTP requests is more important than generating thumbnails). The remaining 800 concurrent executions can then be split between the function triggered from API Gateway and the one that inserts data into DynamoDB, preventing throttling for the important stuff and keeping the not-so-important stuff eventually consistent.
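In serverless.yml terms, the example above could look roughly like this (handler and function names are illustrative; reservedConcurrency is the Serverless Framework attribute in question):

```yaml
functions:
  api:
    handler: handler.api
    events:
      - http:
          path: /images
          method: post
  thumbnail:
    handler: handler.thumbnail
    # Cap the not-so-important work so it can't starve the API functions.
    reservedConcurrency: 200
```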

SWF Activity is not completing even though the computation has finished

I'm testing a new SWF workflow, and I have an activity that makes a RESTful call out to another service. The problem is, I can see from logging that the actual call takes less than a second to complete, but the activity always times out in SWF (START_TO_CLOSE of 5 min). More specifically: the RESTful call is a list call, and when I limit the batch size to a small number, the activity completes and moves on very quickly. But at some seemingly arbitrary threshold, it chokes completely.
Does anyone have any insight into this? I've read that SWF calls have a size limitation of 1 MB; does anyone know how to find the size of the data my workers are trying to pass to SWF?
After some remote debugging, it turns out the response from the task is too big and the activity is failing silently. The failure occurs when the framework tries to report the response back to SWF and the SDK calls RespondActivityTaskCompleted. That API has a length restriction on its result parameter:
Length Constraints: Maximum length of 32768.
This violation throws an uncaught validation exception that is swallowed internally until the activity times out.
I wouldn't recommend using activity input and output parameters for passing large data sets. SWF is an orchestration technology, not a data-passing one. The standard workarounds are:
Store the result in a separate store (S3, for example) and pass a reference to it; see the sketch after this list.
Cache the result locally on a machine and route all subsequent activities to the same host so they have access to the cached result. See the fileprocessing sample for details of the routing approach.
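A rough sketch of the first workaround (bucket and key names are illustrative; boto3 assumed): the activity uploads its large result to S3 and returns only the object key, which stays far below the 32,768-character limit:

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-workflow-results"  # illustrative bucket name

def store_result(large_result: dict) -> str:
    """Upload an activity's large result to S3; return a small reference."""
    key = f"results/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(large_result))
    return key  # pass this key through SWF instead of the payload itself

def load_result(key: str) -> dict:
    """Downstream activities resolve the reference back into the data."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return json.loads(body)
```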
BTW, have you checked out Cadence? It's an open-source version of SWF with much better client-side libraries.