AWS SQS: Delay making available a message that failed to process - amazon-web-services

Here's my scenario:
I have an SQS queue which processes a number of tasks. Those tasks can, and often times do, fail. Their failure is common and somewhat expected.
When a task fails, I want to retry it after a certain amount of time, and fail the item into a DLQ after a certain amount of retries. I do not want to retry immediately.
I have a worker EB app which processes these tasks. When it succeeds, I return 200 (and the task is successfully removed from the queue). When it fails I return 404, and the task is immediately returned to the queue (and, thus, immediately retried). This is not desired, I'd like to delay this failed item before it is retried.
Is it possible to do this with a combination of visibility timeouts and delay queues?

You can do this natively with SQS by calling ChangeMessageVisibility on a message you just failed to process and setting the VisibilityTimeout to whatever you want: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ChangeMessageVisibility.html

Answered my own question, turns out I was looking in the wrong place (SQS config options, not EB config options). The magic settings I was looking for is "error visibility timeout" in the EB config options, which allows you to control the amount of time a failed item has before returning to its queue.

Related

AWS Lambda triggered twice for a sigle SQS Message

I have a system where a Lambda is triggered with event source as an SQS Queue.Each message gets our own internal unique id to differentiate between two requests .
Now lambda deletes the message from the queue automatically after sqs invocation and keeps the message in inflight while processing it so duplicate processing of a unique message should never occur ideally.
But when I checked my logs a message with the same unique id was processed within 100 milliseconds of the time frame of each other.
So This seems like two lambdas were triggered for one message and something failed at the end of aws it was either visibility timeout or something else.I have read online that few others have gone through the same situation.
Can anyone who has gone through the same situation explain how did they solve it or people with current scalable systems who don't have this kind of issue can help me out with the reasons why I could be having it ?
Note:- One single message was successfully executed Twice this wasn't the case of retry on failure.
I faced a similar issue, where a lambda (let's call it lambda-1) is triggered through a queue, and lambda-1 further invokes lambda-2 'synchronously' (https://docs.aws.amazon.com/lambda/latest/dg/invocation-sync.html) and the message basically goes to inflight and return back after visibility timeout expiry and triggers lambda-1 again. This goes on in a loop.
As per the link above:
"For functions with a long timeout, your client might be disconnected
during synchronous invocation while it waits for a response. Configure
your HTTP client, SDK, firewall, proxy, or operating system to allow
for long connections with timeout or keep-alive settings."
Making async calls in lambda-1 can resolve this issue. In the case above, invoking lambda-2 with InvocationType='Event' returns back, which in-turn deletes the item from queue.

How does sqs message deletion flow works in celery-sqs

I was looking for how celery and sqs deletion works?
When does celery delete message from sqs?
Does it delete when message is picked from sqs or after completion of tasks?
What happens if there is some error in task or it raises error?
Will the message (task) will be there if the tasks is taking too long like and 20 mins.
When does celery delete message from sqs?
Message will be deleted after completing the task.
What happens if there is some error in task or it raises error?
Message is still with broker, and deleted after max_retries reaches.
Will the message (task) will be there if the tasks is taking too long like and 20 mins?
This depends on visibility timeout. Message goes to "Not Visible" state, till your visibility timeout, after that it is available to worker.
(if visibility timeout is less than retry time, worker will consume same message many times).
Best Practice is (visibility timeout) > (max_retries * retry_time)
The selected answer is (unfortunately) incorrect for SQS, as this open issue indicates.
There was an attempt at fixing the issue, as evidenced by this merged PR.
However, there are bugs with the above implementation.
Long story short, messages will be deleted from an SQS queue 100% of the time, regardless of any exception that occurs within the task.
edit: this may have been resolved, per this PR
I'll update this answer after I've confirmed via personal testing that this functions correctly

Does SQS kill long running jobs that have Thread.sleep?

I have some java code that calls Thread.sleep(100_000) inside a job running in SQS. In production, during the sleep the job is often killed and re-submitted as failed. On dev I can never re-create that. Does SQS in production kill long running jobs?
SQS doesn't kill jobs - and I am not sure what you mean by you have code 'running in SQS' - what SQS does do, is to assume your job (which is running someplace other than SQS), has failed if you don't mark it completed within the timeout (Default Visibility Timeout) you set when you setup the queue.
Your job asks SQS for an item to work on (a message to process) - your job is supposed to do that work and then tell SQS that the job is now done (deletemessage). If you don't tell it it is done, SQS makes an assumption for you that the job has failed and puts the message back in the queue for another task to process.
If you need more time to complete the tasks, you can change the default visibility timeout to a higher value if you want.

On Demand Scheduler

I have a daemon which constantly pools an AWS SQS queue for messages, once it does receive a message, I need to keep increasing the visibility timeout until the message is processed.
I would like to set up an "on demand scheduler" which increases the visibility timeout of the message every X minutes or so and then stops the scheduler once the message is processed.
I have tried using the Spring Scheduler (https://spring.io/guides/gs/scheduling-tasks/) but that doesn't meet my needs since it's not on demand and runs no matter what.
This is done on a distributed system with a large fleet.
A message can take up to 10 hours to completely process.
We cannot set the default visibility timeout for the queue to be a high number (due to other reasons).
I would just like to know if there is a good library out there that I can leverage for doing this? Thanks for the help!
The maximum visibility timeout for an SQS message is 12 hours. You are nearing that limit. Perhaps you should consider removing the message from the queue while it is being processed and if an error occurs or the need arises you can re-queue the message.
You can set a trigger for Spring Scheduler allowing you to manually set the next execution time. Refer to this answer. This gives you more control over when the scheduled task runs.
Given the scenario, pulling a message (thus having the visibility timeout timer start) and then trying to acquire a lock was not the most feasible way to go about doing this (especially since messages can take so long to process).
Since the messages could potentially take a very long time to process and thus delete, its not feasible to keep having to increase the timeout for messages that you've pulled. Thus, we went a different way.
We first acquire a lock and then pull the message and then increase the visibility timeout to 11 hours, after we've gotten a lock.

Elastic Beanstalk Workers and the VisibilityTimeout period

According to http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html:
If the application returns any response other than 200 OK, then Elastic
Beanstalk waits to put the message back in the queue after the configured
VisibilityTimeout period.
I have set the VisibilityTimeout to 1 minute. My app is returning a 400 error when processing the request. I see from the logs that the request is being re-tried every 2 seconds! I was expecting, based on the above, for it to retry every 60 seconds.
What am I missing?
This might not be the issue of the SQS queue at all. It is true that the message is returned to the queue only after the specified VisibilityTimeout, but it depends on how you are polling the messages.
If you do not access the queue directly (but use some kind of service to do it for you), you have another layer of complexity there.
There's a worker process in Elastic Beanstalk called sqsd that's doing the polling, (processing the messages and deleting them from the queue once you respond with 200).
The sqsd uses similar concept called InactivityTimeout - this specifies the time for this worker to wait for the 200 response and it resends the message after this time if such response is not delivered.
My guess is that the cause of your problem is in this InactivtyTimeout.
If this is not the cause, try looking into the WaitTimeSeconds parameter of your SQS. This specifies that the call to the SQS should be returned immediately if there are messages in the queue (otherwise, it waits for the specified time).
I had a similar issue with an EC2 instance and I specified all the timeouts. In the end - it turned it was caused by a bug in Tomcat - see this: https://forums.aws.amazon.com/thread.jspa?threadID=183473