I want to design an AWS Lambda function so that if it fails, it retries a given number of times, and only if it still fails after all those attempts is the user alerted.
I tried to configure a CloudWatch alarm for failures, but it looks like the user is alerted on the first failure of the Lambda.
How about using SQS as the DLQ (Dead Letter Queue)?
You can build a fault-tolerant architecture using SQS and Lambda together.
Briefly, you can create two Lambda functions.
Function1: triggered by the initial event, it does its job. If it fails, the message goes to SQS.
Function2: triggered by SQS polling, meaning it runs whenever there is a message in the SQS queue. Since it reads SQS messages, its event handler has to look a little different.
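A minimal sketch of both handlers in Python with boto3 (the queue URL and do_work are placeholder assumptions):

```python
import json
import boto3

# Hypothetical retry queue URL; substitute your own.
RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/retry-queue"

sqs = boto3.client("sqs")

def do_work(payload):
    """Placeholder for the real business logic."""
    ...

# Function1: triggered by the initial event (e.g. S3, API Gateway).
def function1_handler(event, context):
    try:
        do_work(event)
    except Exception:
        # On failure, park the original event in the queue for Function2.
        sqs.send_message(QueueUrl=RETRY_QUEUE_URL,
                         MessageBody=json.dumps(event))

# Function2: triggered by the SQS queue. Note the different event shape:
# SQS delivers a batch of records with the payload in each record's "body".
def function2_handler(event, context):
    for record in event["Records"]:
        do_work(json.loads(record["body"]))
```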
When a file is added to my S3 bucket, an S3 PUT event is triggered which puts a message into SQS. I've configured a Lambda function to be triggered as soon as a message is available.
In the Lambda function I send an API request to run a task on an ECS Fargate container, with environment variables containing the message received from SQS. In the container I use the message to download the file from S3 and process it, and on successful processing I wish to delete the message from SQS.
However, the message gets deleted from SQS automatically after my Lambda executes.
Is there any way to configure the Lambda not to delete the SQS message automatically (other than raising an exception and purposely failing the Lambda), so that I can delete the message programmatically from my container?
Update:
Consider the scenario I wish to achieve:
1. A message enters the SQS queue.
2. Lambda takes the message, calls the ECS API, and finishes without deleting the message from the queue.
3. The message is now in flight.
4. The ECS container runs the task and deletes the message from the queue on successful processing.
5. If the container fails, the message re-enters the queue after the visibility timeout and the Lambda is triggered again; the cycle repeats from step 1.
6. Only if the container fails more than a certain number of times does the message go from in flight to the DLQ.
All of this currently works only if I purposely raise an exception in the Lambda, and I'm looking for a similar solution that doesn't require doing that.
The behaviour is intended: as long as SQS is configured as a Lambda trigger, once the function returns successfully (i.e. completes without error) the message is automatically deleted.
The way I see it, to achieve the behaviour you're describing you have 4 options:
Remove SQS as the Lambda trigger, execute the Lambda function on a schedule instead, and poll the queue yourself. The Lambda will read the available messages, but unless you delete them explicitly they will become available again once their visibility timeout expires. You can achieve this with a CloudWatch schedule (see the sketch after this list).
Remove SQS as the Lambda trigger and instead invoke the Lambda function explicitly. Similar to the above, but instead of executing on a schedule all the time, the Lambda function is triggered by the producer of the message itself.
Keep the SQS Lambda trigger and store the message in an alternative SQS queue (as suggested by @jarmod in a comment above).
Configure the producer of the message to publish to an SNS topic and subscribe 2 SQS queues to this topic. One of the two queues triggers a Lambda function; the other one is used by your ECS tasks.
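For the first option, here is a minimal sketch of a scheduled Lambda that polls the queue itself and leaves deletion to the ECS task; the queue URL and start_ecs_task are placeholder assumptions:

```python
import boto3

# Hypothetical queue URL; substitute your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

sqs = boto3.client("sqs")

def start_ecs_task(body, receipt_handle):
    """Placeholder: call the ECS RunTask API, passing the receipt handle
    so the container can delete the message itself on success."""
    ...

# Invoked on a CloudWatch/EventBridge schedule, not by an SQS trigger.
def handler(event, context):
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=5)
    for msg in resp.get("Messages", []):
        start_ecs_task(msg["Body"], msg["ReceiptHandle"])
        # Deliberately no sqs.delete_message here: the message stays in
        # flight and becomes visible again after the visibility timeout
        # unless the ECS task deletes it.
```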
Update
Based on the new info provided, you have another option:
Leave the event flow as it is and let Lambda delete the message from SQS. Then, in your ECS task, handle the failure case and put a new message with the same payload/body back into the queue. This allows you to retry indefinitely.
There's no reason why the SQS message has to be exactly the same; what you're interested in is the body/payload.
You might want to add a mechanism that limits these retries and posts the message to a DLQ once the limit is reached.
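A minimal sketch of that retry-limit mechanism, assuming the ECS task tracks the attempt count in a message attribute (queue URLs, attribute name, and MAX_RETRIES are placeholder assumptions):

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical URLs and limit; substitute your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq"
MAX_RETRIES = 3

def handle_failure(body, retry_count):
    """Called by the ECS task when processing fails: re-enqueue the same
    payload, tracking the attempt count in a message attribute."""
    if retry_count >= MAX_RETRIES:
        sqs.send_message(QueueUrl=DLQ_URL, MessageBody=body)
        return
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=body,
        MessageAttributes={
            "retryCount": {"DataType": "Number",
                           "StringValue": str(retry_count + 1)}
        },
    )
```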
One solution I can think of: remove the Lambda triggered by the SQS queue and create a CloudWatch alarm on the queue instead. When the alarm triggers, scale out the ECS task; when there are no items in the queue, scale it back down. Let the ECS task itself poll the queue and handle all the messages.
I have a Lambda using SQS events as input. The SQS queue also has a DLQ.
The Lambda function invokes a downstream RESTful API (call this operation DoPostToAPI()).
I need to guarantee that the Lambda function attempts to call DoPostToAPI() at least 2 times (before the message goes to the DLQ).
What configuration of Lambda retries and the SQS redrive policy do I need in order to accomplish the above requirement?
I need to be 100% certain that messages arrive on the DLQ only because DoPostToAPI() was attempted 2 times, and that messages don't arrive in the DLQ for any other reason, if possible.
To me it makes sense that messages should only arrive on the DLQ if the operation was attempted, and not for other reasons (i.e. I don't want messages to arrive on the DLQ purely because of throttling, since DoPostToAPI() should be attempted before the message is sent to the DLQ). Why would I want messages on the DLQ if the Lambda operation wasn't even attempted? In other words, I need the Lambda operation to be guaranteed to be invoked before the item moves to the DLQ.
Can I get some help on this? Is it possible to guarantee that messages on the DLQ arrived because of failed DoPostToAPI() calls? Or is it (more unfortunately) possible that messages arrive on the DLQ for reasons other than failed calls to the downstream API?
From what I have read online so far, it's possible that Lambda, after receiving an SQS message and making it invisible on the queue, runs into throttling and re-attempts the invocation. If it keeps hitting Lambda throttling, the message could end up back on the main queue, and once it reaches its max receive count the message could be placed on the DLQ without the Lambda ever having been attempted. Is this correct?
For simplicity, let's imagine the following inputs:
SQSQueue1
SQSQueue1DLQ
LambdaFunction1 --> ServiceClient1.DoPostToAPI()
What is the interplay between the Lambda maximum_retry_attempts and the SQS redrive policy maxReceiveCount?
In order to ensure your Lambda attempts retries when using SQS, you only need to set the SQS property
maxReceiveCount
This value controls how many Lambda invocations will be attempted for a given batch before a message goes to the dead-letter queue (a configuration sketch follows).
Unfortunately, the Lambda property
maximum_retry_attempts
does not apply to Lambda functions that use SQS as the event trigger; it only applies to asynchronous invocations.
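For example, a redrive policy with a maxReceiveCount of 2 can be applied with boto3 like this; the queue URL and DLQ ARN are placeholders matching the names above:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical URL/ARN; substitute your own queue and DLQ.
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/SQSQueue1"
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:SQSQueue1DLQ"

# maxReceiveCount = 2: a message is received (and the Lambda invoked)
# up to 2 times before SQS moves it to the DLQ.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "2",
        })
    },
)
```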
I have a Lambda function that is responsible for checking server status. It needs to be called when SQS receives new messages, and I am not allowed to change anything in SQS. I tried using an SQS Lambda trigger, but that pulls messages into the Lambda function, which changes the SQS queue.
I am looking for a way to handle this problem. I tried to use CloudWatch, but I don't know whether this is possible or not. Can CloudWatch trigger Lambda functions when SQS receives new messages?
Thanks in advance.
This will be difficult because, if the message is consumed quickly, it might not have an impact on Amazon CloudWatch metrics. You'll need to experiment to see whether this is the case. For example, set a metric for the maximum number of messages received in a 1-minute time period and try to trigger a CloudWatch Alarm when it is greater than zero.
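A sketch of such an alarm with boto3, assuming the AWS/SQS NumberOfMessagesSent metric and placeholder names; the alarm action would be an SNS topic that in turn invokes the Lambda:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when any message arrives on the queue within a 1-minute window.
# All names/ARNs are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="messages-arrived-on-queue",
    Namespace="AWS/SQS",
    MetricName="NumberOfMessagesSent",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # SNS topic whose subscription invokes the status-check Lambda.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:server-status-topic"],
)
```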
Alternatively, have the system that sends the SQS message send it to Amazon SNS instead. Then, both the SQS queue and the Lambda function can subscribe to the SNS topic and both can be activated.
In fact, I know somebody who always uses SNS in front of SQS "just in case" this type of thing is necessary.
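A sketch of that fan-out with boto3; the ARNs are placeholders, and the SQS queue policy and Lambda invoke permission that SNS needs are omitted for brevity:

```python
import boto3

sns = boto3.client("sns")

# Hypothetical ARNs; substitute your own topic, queue, and function.
topic_arn = "arn:aws:sns:us-east-1:123456789012:ingest-topic"

# Fan out: the existing queue keeps receiving the messages...
sns.subscribe(TopicArn=topic_arn, Protocol="sqs",
              Endpoint="arn:aws:sqs:us-east-1:123456789012:my-queue")

# ...and the status-check Lambda is invoked in parallel.
sns.subscribe(TopicArn=topic_arn, Protocol="lambda",
              Endpoint="arn:aws:lambda:us-east-1:123456789012:function:check-status")
```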
I have an AWS Lambda that runs every minute and will either succeed or fail. I have a CloudWatch alarm monitoring this Lambda and going into the ALARM or OK state based on the Lambda's executions. When the alarm state changes, an SNS message is fired off and another Lambda is triggered. This Lambda sends out a message via a webhook.
Is there a way to get the error message from the 1st Lambda viewed by the 2nd Lambda (and ultimately passed to the webhook)? I can see the error message in the CloudWatch logs.
Any ideas on this would be great.
The reason I have an alarm in between the 1st Lambda and the SNS is that I only want a message when the state changes, not every time the 1st Lambda runs.
I have the following pipeline in place to move events:
Service -> SNS -> AWS Lambda -> DynamoDB.
So, basically, the service publishes data to an SNS topic to which an AWS Lambda function is subscribed. The Lambda function then pushes the data to DynamoDB. Now I am adding a DLQ to the Lambda to store messages whose processing failed.
Errors can be due to a bug in either the publisher or the consumer application. E.g. the publisher changed the format of the published data, I don't support the new format in the Lambda, and it raises an error.
I wanted to know: after such messages are pushed to the DLQ, what do we normally do?
Do we try to push the data again after fixing the AWS Lambda function? Is this step done manually, or do we build a job that periodically pushes the data from the DLQ back to the Lambda function?
Do we normally just put an alarm on the DLQ and then handle it manually?
Sometimes the issue is transient, e.g. a DynamoDB connection error on the first attempt that would succeed if we pushed the message again. Handling those cases manually would be a problem.
In addition to Lambda DLQs, you should consider adding SNS DLQs:
https://aws.amazon.com/blogs/compute/designing-durable-serverless-apps-with-dlqs-for-amazon-sns-amazon-sqs-aws-lambda/
I can comment here for SQS -> DLQ.
You don't need to move the message back yourself, because doing so brings many other challenges: duplicate messages, recovery scenarios, lost messages, de-duplication checks, etc.
Here is the solution we implemented. We use the DLQ for transient errors, not for permanent errors, so we took the approach below (a sketch of the consumer loop follows the list):
1. Read the message from the DLQ as if it were a regular queue.
Benefits:
- Avoids duplicate message processing.
- Better control over the DLQ; e.g. we added a check to process the DLQ only once the regular queue has been completely processed.
- The process can be scaled up based on the number of messages in the DLQ.
2. Then run the same code that the regular queue's consumer runs. This is more reliable if the job is aborted or the process is terminated mid-processing (e.g. the instance is killed).
Benefits:
- Code reusability.
- Error handling.
- Recovery and message replay.
3. Extend the message visibility so that no other thread processes the message.
Benefit:
- Avoids the same record being processed by multiple threads.
4. Delete the message only on success or on a permanent error.
Benefit:
- Processing keeps being retried as long as the error is transient.
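A minimal sketch of that consumer loop, assuming hypothetical process and is_permanent_error helpers and a placeholder queue URL:

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical URL; substitute your own DLQ.
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq"

def process(body):
    """Placeholder: the same processing code the regular queue uses."""
    ...

def is_permanent_error(exc):
    """Placeholder: classify a failure as permanent vs. transient."""
    ...

def drain_dlq():
    while True:
        resp = sqs.receive_message(QueueUrl=DLQ_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        msg = messages[0]
        # Step 3: extend visibility so no other worker picks it up.
        sqs.change_message_visibility(QueueUrl=DLQ_URL,
                                      ReceiptHandle=msg["ReceiptHandle"],
                                      VisibilityTimeout=300)
        try:
            process(msg["Body"])        # Step 2: reuse the regular code.
        except Exception as exc:
            if not is_permanent_error(exc):
                continue                # Transient: let it reappear later.
        # Step 4: delete only on success or a permanent error.
        sqs.delete_message(QueueUrl=DLQ_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```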
AWS Lambda Dead Letter Queues direct events that cannot be processed to the Amazon SNS topic or Amazon SQS queue that you've configured for the Lambda function.
So handling the error for a given payload, via a service subscribed to the SNS topic or by reading messages from the SQS queue, is up to the developer. Addressing the questions listed:
You can use another Lambda function subscribed to an SNS topic to process the message.
Yes, it's more common to set up an alarm and handle it manually.
By default, a failed Lambda function invoked asynchronously is retried twice, and then the event is discarded unless a DLQ is set up. So if it's a DynamoDB connection problem, it will probably succeed on the second invocation.
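For reference, a Lambda DLQ target for asynchronous invocations can be attached with boto3 like this; the function name and queue ARN are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Failed asynchronous invocations land here after the two retries.
lambda_client.update_function_configuration(
    FunctionName="my-sns-consumer",
    DeadLetterConfig={
        "TargetArn": "arn:aws:sqs:us-east-1:123456789012:my-lambda-dlq"
    },
)
```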