AWS Lambda retry behavior when triggered by CloudWatch event - amazon-web-services

I have created a Lambda function which is triggered through a CloudWatch Events cron rule.
While testing, I found that the Lambda retry is not happening in case of a timeout.
I want to understand the expected behaviour. Should a retry happen in case of a timeout?
P.S. I have gone through the documentation on the AWS site but still can't figure it out:
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html

Found the AWS documentation on this:
"Error handling for a given event source depends on how Lambda is invoked. Amazon CloudWatch Events is configured to invoke a Lambda function asynchronously."
"Asynchronous invocation – Asynchronous events are queued before being used to invoke the Lambda function. If AWS Lambda is unable to fully process the event, it will automatically retry the invocation twice, with delays between retries."
So the retry should happen in this case. I'm not sure what was wrong with my Lambda function; I just deleted and recreated it, and the retry worked this time.

Judging from the docs you linked to, it seems that the Lambda function is called again if it has timed out and the timeout occurred because it was waiting for another resource (i.e. it was blocked by the network):
The function times out while trying to reach an endpoint.
As a cron event is not stream-based (whether it is synchronous or asynchronous does not seem to be clear from the docs), it will be retried.

CloudWatch Events invokes a Lambda function asynchronously.
For asynchronous invocation, Lambda manages the function's asynchronous event queue and attempts to retry up to two more times on errors, including timeouts.
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
So with the default configuration, your function should be retried on timeout errors. If it isn't, there might be other reasons, such as the following:
The function doesn't have enough concurrency to run and events are throttled. Check the function's reserved concurrency setting; it should be at least 1.
When the above happens, events might also be deleted from the queue without being sent to the function. Check the function's asynchronous invocation settings: make sure the maximum event age is long enough to keep the events in the queue and the retry attempts setting is not zero.
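A minimal boto3 sketch for checking these two settings, assuming a hypothetical function name my-cron-function; the values shown simply restore the defaults (2 retry attempts, events kept for up to 6 hours):

    import boto3

    lambda_client = boto3.client("lambda")
    FUNCTION_NAME = "my-cron-function"  # hypothetical name used for illustration

    # Restore the default asynchronous invocation settings: 2 retry attempts and
    # a maximum event age of 6 hours, so queued events are not dropped early.
    lambda_client.put_function_event_invoke_config(
        FunctionName=FUNCTION_NAME,
        MaximumRetryAttempts=2,
        MaximumEventAgeInSeconds=21600,
    )

    # Reserved concurrency of 0 would throttle every invocation; the key is
    # absent from the response if no reserved concurrency is configured.
    concurrency = lambda_client.get_function_concurrency(FunctionName=FUNCTION_NAME)
    print(concurrency.get("ReservedConcurrentExecutions"))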

Related

Why can't an SQS FIFO queue with a Lambda trigger guarantee only-once delivery?

I came across an AWS article which mentions that only-once delivery of a message is not guaranteed when a FIFO queue is used with a Lambda trigger.
Amazon SQS FIFO queues ensure that the order of processing follows the message order within a message group. However, it does not guarantee only once delivery when used as a Lambda trigger. If only once delivery is important in your serverless application, it’s recommended to make your function idempotent. You could achieve this by tracking a unique attribute of the message using a scalable, low-latency control database like Amazon DynamoDB.
I am more interested in knowing the reason behind this behaviour when it comes to the Lambda trigger. I assume that with standard queues only-once delivery is not guaranteed, since SQS stores messages on multiple servers for redundancy and high availability and there is a chance of the same message getting delivered again while multiple Lambdas poll the queue.
Can someone please explain the reason for the same behaviour in a FIFO queue with a Lambda trigger, or how it works internally?
By default, Lambda polls SQS synchronously. When Lambda processes messages from the queue they become invisible, i.e. the visibility timeout kicks in, until the Lambda either finishes processing and eventually deletes them from the queue, or fails and they become visible again for a retry.
That's why Lambda cannot guarantee exactly-once delivery, since there can be a retry because of a timeout (15 min max) or other code or dependency errors.
To prevent this, you can either make your process idempotent or use a partial batch response to delete the messages even in case of failure, as in the sketch below.
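A minimal sketch of the idempotent-handler approach, assuming a hypothetical DynamoDB table processed_messages keyed on message_id, a placeholder process() function for the real business logic, and an event source mapping with ReportBatchItemFailures enabled so the partial batch response is honored; a production version might additionally guard concurrent deliveries with a conditional write:

    import boto3

    # Hypothetical DynamoDB table keyed on "message_id", created separately.
    table = boto3.resource("dynamodb").Table("processed_messages")

    def process(body):
        # Placeholder for the real business logic.
        print("processing", body)

    def handler(event, context):
        failures = []
        for record in event["Records"]:
            msg_id = record["messageId"]
            # Skip messages that were already fully processed on an earlier delivery.
            if "Item" in table.get_item(Key={"message_id": msg_id}):
                continue
            try:
                process(record["body"])
                # Remember the message ID so a redelivery becomes a no-op.
                table.put_item(Item={"message_id": msg_id})
            except Exception:
                # Partial batch response: only this message becomes visible again;
                # the rest of the batch is deleted from the queue.
                failures.append({"itemIdentifier": msg_id})
        return {"batchItemFailures": failures}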

How to monitor+report failures of async invoked concurrent lambdas?

I have a Lambda in which the first instance invokes itself 30-40 times to process data concurrently. Invocation happens using the asynchronous, fire-and-forget Event invocation type. The very first instance obviously dies after the invocations complete.
I want the first Lambda to stay alive after the invocations and report, through SNS notifications, on the number of instances triggered and whether any Lambdas failed. So I switched to the RequestResponse invocation type, but the problem now is that my Lambda invokes one instance, waits for the response from that instance (which can take minutes), then invokes the next one.
How can I invoke the Lambdas asynchronously but still get the reporting and tracking from the first instance?
You can report on the number of instances triggered from the first Lambda instance.
For failures, it is not possible to use the same Lambda instance, as the error may happen much further down the line (as you mentioned, it may take minutes to complete). However, you can configure the concurrently running instances to report on the invocation status.
In fact, there is a built-in feature for this: asynchronous invocation destinations. That is, you can send the invocation records (request and response information) to SQS. By consuming the queue, you should be able to get a report of successful and failed instances.
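A hedged boto3 sketch of that setup, with a hypothetical worker function name and queue ARN: the destination configuration sends an invocation record to SQS for every asynchronous invocation, while the first instance only fires the invocations and can report how many it triggered.

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    # Hypothetical names/ARNs used for illustration.
    WORKER_FUNCTION = "data-processing-worker"
    RECORDS_QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:invocation-records"

    # Send an invocation record (request and response) to SQS for every
    # asynchronous invocation, successful or failed.
    lambda_client.put_function_event_invoke_config(
        FunctionName=WORKER_FUNCTION,
        DestinationConfig={
            "OnSuccess": {"Destination": RECORDS_QUEUE_ARN},
            "OnFailure": {"Destination": RECORDS_QUEUE_ARN},
        },
    )

    # The first instance fires and forgets, then reports the count (e.g. via SNS).
    for chunk_id in range(40):
        lambda_client.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",  # asynchronous; returns HTTP 202 immediately
            Payload=json.dumps({"chunk": chunk_id}),
        )

A separate consumer (or another Lambda) can read the queue and aggregate the success and failure records; note that the worker function's execution role needs sqs:SendMessage permission on the destination queue.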

How can I ensure that a downstream API called by a Lambda integrated with SQS is called at least 2 times before the message goes to the DLQ?

I have a Lambda using SQS events as inputs. The SQS queue also has a DLQ.
The Lambda function invokes a downstream RESTful API (call this operation DoPostToAPI()).
I need to guarantee that the Lambda function attempts to call DoPostToAPI() at least 2 times before the message goes to the DLQ.
What configuration of Lambda retries and the SQS redrive policy would I need to set in order to accomplish the above requirement?
I need to be 100% certain that messages that arrive on the DLQ arrive only because an attempt has been made to send them to the downstream API DoPostToAPI() 2 times, and that messages don't arrive in the DLQ for any other reason, if possible.
To me, it makes sense that messages should only arrive on the DLQ if the operation was attempted, and not for other reasons (i.e. I don't want messages to arrive on the DLQ purely because of throttling, since DoPostToAPI() should be attempted before sending to the DLQ). Why would I want messages on the DLQ if the Lambda function operation wasn't even attempted? In other words, I need the Lambda operation to be guaranteed to be invoked before the item moves to the DLQ.
Can I get some help with this? Is it possible to guarantee that messages on the DLQ have arrived because of failed DoPostToAPI() API calls? Or is it (more unfortunately) possible that messages arrive on the DLQ for reasons other than failed calls to the downstream API?
From what I have read online so far, it's possible that Lambda, after receiving an SQS message and making it invisible on the queue, could run into throttling issues and re-attempt the Lambda invocation. But if it runs into Lambda throttling again, the message could end up back on the main queue and, if it reaches its max receive count, could be placed on the DLQ without the Lambda having been attempted at all. Is this correct?
For simplicity, let's imagine the following inputs:
SQSQueue1
SQSQueue1DLQ
LambdaFunction1 --> ServiceClient1.DoPostToAPI()
What is the interplay between the Lambda "maximum_retry_attempts" and the SQS redrive policy's "maxReceiveCount"?
In order to ensure your Lambda attempts retries when using SQS, you only need to set the SQS property
maxReceiveCount
This value controls how many Lambda invocations will be attempted for a given batch before a message goes to the dead-letter queue.
Unfortunately, the Lambda property
maximum_retry_attempts
does not apply to Lambda functions using SQS as the function event trigger.
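A boto3 sketch of the corresponding redrive policy, using the example queue names above and a hypothetical DLQ ARN; with maxReceiveCount set to 2, a message has to be received (delivered towards the function) twice without being deleted before SQS moves it to the DLQ:

    import json
    import boto3

    sqs = boto3.client("sqs")

    queue_url = sqs.get_queue_url(QueueName="SQSQueue1")["QueueUrl"]
    dlq_arn = "arn:aws:sqs:us-east-1:123456789012:SQSQueue1DLQ"  # hypothetical ARN

    # A message moves to the DLQ only after it has been received maxReceiveCount
    # times without being deleted, i.e. after two delivery attempts here.
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={
            "RedrivePolicy": json.dumps(
                {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "2"}
            )
        },
    )

Regarding the throttling concern in the question: as far as I understand, the receive count increments when the poller receives the message, not when your function code actually runs, so a somewhat higher maxReceiveCount gives headroom against messages reaching the DLQ purely because of throttled deliveries.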

Why is Lambda throttled when invoked from SQS?

I have an SQS queue that is used as an event source for a Lambda function. Due to DB connection limitations, I have set a maximum concurrency of 5 for the Lambda function.
Under normal circumstances, everything works fine, but when we need to make changes, we deliberately disable the SQS trigger. Messages start to back up in the SQS queue as expected.
When the trigger is re-enabled, 5 Lambda functions are instantiated and start to process the messages in the queue; however, I also see CloudWatch telling me that the Lambda is being throttled.
Please could somebody explain why this is happening? I expect the available Lambda functions to simply work through the backlog as fast as they can, and wouldn't expect to see throttling due to the queue.
This is expected behaviour.
"On reaching the concurrency limit associated with a function, any further invocation requests to that function are throttled, i.e. the invocation doesn't execute your function. Each throttled invocation increases the Amazon CloudWatch Throttles metric for the function"
https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html
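For reference, a hedged boto3 sketch of the two ways to cap concurrency here (function name and queue ARN are hypothetical): reserved concurrency on the function, which produces the Throttles metric described above, versus the SQS event source mapping's maximum concurrency setting, which limits how many batches the poller invokes concurrently instead of rejecting invocations.

    import boto3

    lambda_client = boto3.client("lambda")

    FUNCTION_NAME = "queue-consumer"                             # hypothetical
    QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:work-queue"  # hypothetical

    # The setup in the question: the function is capped at 5 concurrent executions,
    # so any extra invocation attempts from the poller are rejected and show up
    # in the CloudWatch Throttles metric.
    lambda_client.put_function_concurrency(
        FunctionName=FUNCTION_NAME,
        ReservedConcurrentExecutions=5,
    )

    # Alternative: cap concurrency at the event source mapping, so the poller
    # itself stops at 5 concurrent batches rather than being throttled.
    lambda_client.create_event_source_mapping(
        EventSourceArn=QUEUE_ARN,
        FunctionName=FUNCTION_NAME,
        ScalingConfig={"MaximumConcurrency": 5},
    )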

Can an AWS Lambda be set up to retry if it fails?

I have an AWS Lambda function that will sometimes fail because some other part of the system is not ready yet. In such cases, I want to retry the Lambda in a couple of seconds (preferably with exponential backoff). How do I implement that?
It seems like feeding the Lambda from an SQS queue or an SNS topic might do the trick, but I can't figure out how to make a failed invocation go back to the queue and get retried.
From the Docs:
Any Lambda function invoked asynchronously is retried twice
(You can configure that retry number, between 0 and 2.) You can set a dead-letter queue to keep all failed events for inspection, notification, etc. You can implement further logic to resend those events or drop them, but IMHO you should have a dedicated Lambda for that.
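A hedged boto3 sketch of that configuration, with hypothetical names and ARNs; note that the built-in asynchronous retries use delays on the order of one to two minutes rather than a custom backoff of a couple of seconds.

    import boto3

    lambda_client = boto3.client("lambda")

    FUNCTION_NAME = "sometimes-fails"                              # hypothetical
    DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:failed-events"   # hypothetical

    # Built-in retries for asynchronous invocations: up to 2 extra attempts.
    lambda_client.put_function_event_invoke_config(
        FunctionName=FUNCTION_NAME,
        MaximumRetryAttempts=2,
    )

    # Events that still fail after the retries land in a dead-letter queue,
    # where a dedicated Lambda (or a human) can inspect or resend them.
    lambda_client.update_function_configuration(
        FunctionName=FUNCTION_NAME,
        DeadLetterConfig={"TargetArn": DLQ_ARN},
    )

The function's execution role also needs sqs:SendMessage permission on the DLQ for the dead-letter configuration to work.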