How does the SQS message deletion flow work in Celery with SQS (Django)?

I was looking into how message deletion works with Celery and SQS.
When does Celery delete a message from SQS?
Does it delete the message when it is picked up from SQS, or after the task completes?
What happens if the task hits an error or raises an exception?
Will the message (task) still be there if the task takes a long time, say 20 minutes?

When does Celery delete a message from SQS?
The message is deleted after the task completes.
What happens if the task hits an error or raises an exception?
The message stays with the broker and is deleted once max_retries is reached.
Will the message (task) still be there if the task takes a long time, say 20 minutes?
This depends on the visibility timeout. The message goes into the "Not Visible" state until the visibility timeout expires; after that it becomes available to workers again.
(If the visibility timeout is shorter than the retry time, a worker will consume the same message many times.)
Best practice is (visibility timeout) > (max_retries * retry_time).
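The rule of thumb above can be turned into a small calculation. This is only a sketch: `min_visibility_timeout` is a hypothetical helper, and adding the task's own runtime to the retry window is my assumption, not part of the answer. `broker_transport_options` with a `visibility_timeout` key is the Celery setting that controls this for SQS (in a Django settings module using the `CELERY` namespace it would be `CELERY_BROKER_TRANSPORT_OPTIONS`).

```python
SQS_MAX_VISIBILITY_S = 12 * 60 * 60  # SQS caps the visibility timeout at 12 hours


def min_visibility_timeout(max_retries: int, retry_delay_s: int, task_runtime_s: int = 0) -> int:
    """Smallest timeout (seconds) strictly greater than the whole retry window."""
    window = max_retries * retry_delay_s + task_runtime_s
    timeout = window + 1  # "visibility timeout > max_retries * retry_time"
    if timeout > SQS_MAX_VISIBILITY_S:
        raise ValueError("retry window does not fit within the SQS 12-hour limit")
    return timeout


# Example: 3 retries, 5 minutes apart, for a task that may run 20 minutes.
timeout = min_visibility_timeout(max_retries=3, retry_delay_s=300, task_runtime_s=1200)

# Celery setting that passes the timeout to the SQS transport.
broker_transport_options = {"visibility_timeout": timeout}
```

With these example numbers the window is 3 * 300 + 1200 = 2100 seconds, so the timeout comes out at 2101 seconds.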

The selected answer is (unfortunately) incorrect for SQS, as this open issue indicates.
There was an attempt at fixing the issue, as evidenced by this merged PR.
However, there are bugs with the above implementation.
Long story short, messages will be deleted from an SQS queue 100% of the time, regardless of any exception that occurs within the task.
Edit: this may have been resolved, per this PR.
I'll update this answer once I've confirmed through my own testing that it works correctly.

Related

AWS Lambda triggered twice for a single SQS message

I have a system where a Lambda is triggered with an SQS queue as its event source. Each message gets its own internal unique ID to differentiate between requests.
Lambda deletes the message from the queue automatically after the SQS-triggered invocation and keeps the message in flight while processing it, so ideally a given message should never be processed twice.
But when I checked my logs, a message with the same unique ID had been processed twice, within 100 milliseconds of each other.
So it seems that two Lambdas were triggered for one message and something failed on the AWS side, either the visibility timeout or something else. I have read online that a few others have run into the same situation.
Can anyone who has been through this explain how they solved it, or can people running scalable systems without this issue help me understand why I might be having it?
Note: a single message was successfully executed twice; this was not a retry on failure.
I faced a similar issue, where a Lambda (let's call it lambda-1) is triggered through a queue, and lambda-1 in turn invokes lambda-2 synchronously (https://docs.aws.amazon.com/lambda/latest/dg/invocation-sync.html). The message goes in flight, returns to the queue once the visibility timeout expires, and triggers lambda-1 again. This goes on in a loop.
As per the link above:
"For functions with a long timeout, your client might be disconnected
during synchronous invocation while it waits for a response. Configure
your HTTP client, SDK, firewall, proxy, or operating system to allow
for long connections with timeout or keep-alive settings."
Making async calls in lambda-1 can resolve this issue. In the case above, invoking lambda-2 with InvocationType='Event' returns immediately, which in turn lets the message be deleted from the queue.
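A minimal sketch of that asynchronous call, assuming boto3. The helper name `invoke_async` and the payload shape are hypothetical; the client is passed in so the function can be exercised without AWS credentials.

```python
import json


def invoke_async(lambda_client, function_name: str, payload: dict) -> int:
    """Queue an asynchronous invocation and return the HTTP status code."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: do not wait for lambda-2 to finish
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # An accepted asynchronous invocation is reported with status 202.
    return response["StatusCode"]


# Inside lambda-1 this might be called as (names are placeholders):
# import boto3
# invoke_async(boto3.client("lambda"), "lambda-2", {"request_id": "abc"})
```

Because lambda-1 no longer waits on lambda-2, it can finish well inside the queue's visibility timeout. The trade-off is that lambda-1 no longer sees lambda-2's result or errors.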

Does the Spring SqsListener wait until the last message is processed (or completed) from the current poll before the next poll of messages happens?

I have an SQS listener with a max message count of 10. When my consumer receives a batch of 10 messages they all get processed, but sometimes (depending on the message) processing will take 5-6 hours and sometimes as little as 5 minutes. I have 3 consumers (3 different JVMs) polling from the queue with a maxMessageCount of 10. Here is my issue:
If one of those 10 messages takes 5 hours to process, it seems as though the listener waits to poll the next 10 messages until all of the previous messages are 100% complete. Is there a way to allow it to poll a new batch of messages even though another is still being processed?
I'm guessing that I am missing something small here. I am using the Spring Cloud library and the SqsListener annotation. Has anybody run across this before?
Also, I don't think this should matter, but the queue is AWS SQS and the JVMs are running on an ECS cluster.
If you run the task on the poller thread, the next poll won't happen until the current one completes.
You can use an ExecutorChannel or QueueChannel to hand the work off to another thread (or threads) but you risk message loss if you do that.
Your situation is rather unusual; 5 hours is a long time to process a message.
You should perhaps consider redesigning your application to persist these "long running" requests to a database or similar, instead of processing them directly from the message. Or, perhaps put them in a different queue so that they don't impact the shorter tasks.

On Demand Scheduler

I have a daemon which constantly polls an AWS SQS queue for messages. Once it receives a message, I need to keep increasing the visibility timeout until the message is processed.
I would like to set up an "on demand scheduler" which increases the visibility timeout of the message every X minutes or so and then stops the scheduler once the message is processed.
I have tried using the Spring Scheduler (https://spring.io/guides/gs/scheduling-tasks/) but that doesn't meet my needs since it's not on demand and runs no matter what.
This is done on a distributed system with a large fleet.
A message can take up to 10 hours to completely process.
We cannot set the default visibility timeout for the queue to be a high number (due to other reasons).
I would just like to know if there is a good library out there that I can leverage for doing this? Thanks for the help!
The maximum visibility timeout for an SQS message is 12 hours. You are nearing that limit. Perhaps you should consider removing the message from the queue while it is being processed and if an error occurs or the need arises you can re-queue the message.
You can set a trigger for Spring Scheduler allowing you to manually set the next execution time. Refer to this answer. This gives you more control over when the scheduled task runs.
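The "extend every X minutes, then stop" idea from the question can also be sketched without a scheduler library, using a background thread that is started per message and stopped when processing ends. This is an illustration in Python with boto3-style calls rather than Spring; the function names, the default interval, and the default extension are assumptions. Note that SQS still enforces the 12-hour maximum counted from when the message was first received, so the heartbeat cannot extend past it.

```python
import threading


def start_visibility_heartbeat(sqs_client, queue_url, receipt_handle,
                               interval_s=300, extend_to_s=600):
    """Reset the message's visibility timeout every interval_s seconds until stopped."""
    stop = threading.Event()

    def beat():
        # wait() returns False on timeout (keep beating) and True once stop is set.
        while not stop.wait(interval_s):
            sqs_client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extend_to_s,
            )

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return stop, thread


def process_with_heartbeat(sqs_client, queue_url, message, handler, **heartbeat_kwargs):
    """Run handler on the message body while the heartbeat keeps the message hidden."""
    stop, thread = start_visibility_heartbeat(
        sqs_client, queue_url, message["ReceiptHandle"], **heartbeat_kwargs
    )
    try:
        handler(message["Body"])
    finally:
        # Stop the heartbeat whether the handler succeeded or raised.
        stop.set()
        thread.join()
    # Reached only on success; on failure the message reappears after the timeout.
    sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```

The extension (`extend_to_s`) should be comfortably larger than the interval so that one slow or failed heartbeat call does not let the message become visible again.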
Given the scenario, pulling a message (thus having the visibility timeout timer start) and then trying to acquire a lock was not the most feasible way to go about doing this (especially since messages can take so long to process).
Since the messages could potentially take a very long time to process and therefore to delete, it's not feasible to keep increasing the timeout for messages that you've already pulled. So we went a different way.
We first acquire a lock, then pull the message, and only then increase its visibility timeout to 11 hours.

Azure Storage Queue and multiple WebJobs instances: will QueueTrigger set the message lease time when triggered?

Scenario: a producer sends a message into the Storage Queue and a WebJob processes the message on QueueTrigger. Each message must be processed only once, and there could be multiple WebJob instances.
I've been googling and from what I've read, I need to write the function that processes the message to be idempotent so a message isn't processed twice. I've also read that there is a default lease time of 10 minutes for a message.
My question is, when the QueueTrigger is triggered on one WebJob instance, does it set the lease time on the message so that another WebJob can't pick up the same message? If so why do I need to account for the possibility that the message can be processed twice? Or am I misunderstanding this?
If you are using the built-in queue trigger attributes, the SDK automatically ensures that any given message gets processed once, even when a site scales out to multiple instances. This is stated in the discussion section of the article: https://azure.microsoft.com/en-us/documentation/articles/websites-dotnet-webjobs-sdk-get-started/
In the same article you will find clarification regarding the 10 minute lease. In summary, the QueueTrigger attribute directs the WebJobs SDK to call a method when a new message is received in queue. The message is processed and when the method completes, the queue message is deleted. If the method fails before completing, the queue message is not deleted; after a 10-minute lease expires, the message is released to be picked up again and processed. This sequence won't be repeated indefinitely if a message always causes an exception. After 5 unsuccessful attempts to process a message, the message is moved to the poison queue. The maximum number of attempts is configurable.
Your process needs to be idempotent, for the following reasons.
Facts:
A WebJob leases a message (no other WebJob can get it).
A WebJob deletes a message when its job is done.
If a WebJob crashes while processing a message, its lease times out and another WebJob picks the message up and starts processing it. (The default is 5 attempts per message; after that it goes to the poison queue.)
So if a WebJob crashes after its job is done but before it deletes the message, the message is released after a while and the same job is done again.
Therefore your process needs to be idempotent.
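To make the idempotency point concrete, here is a language-neutral sketch written in Python rather than with the WebJobs SDK. `make_idempotent` is a hypothetical helper, and the in-memory set stands in for a durable store shared by all instances (for example a database table keyed by message ID), which is an assumption.

```python
def make_idempotent(handler, processed_ids):
    """Wrap handler so that a given message ID triggers the work at most once."""

    def wrapped(message_id, body):
        if message_id in processed_ids:
            return False  # duplicate delivery: skip the side effects
        handler(body)
        processed_ids.add(message_id)  # record only after the work succeeded
        return True

    return wrapped


# Example: the second delivery of message "m-1" is ignored.
seen = set()
results = []
process = make_idempotent(results.append, seen)
process("m-1", "charge order 42")
process("m-1", "charge order 42")
```

This only narrows the window: a crash between the work and the record still allows a repeat, so in a real system the work and the "processed" marker should be committed together, or the work itself should be safe to repeat.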

RabbitMQ Visibility Timeout

Do RabbitMQ queues have something like the AWS SQS "message visibility timeout"?
From the AWS SQS documentation:
"The visibility timeout clock starts ticking once Amazon SQS returns the message. During that time, the component processes and deletes the message. But what happens if the component fails before deleting the message? If your system doesn't call DeleteMessage for that message before the visibility timeout expires, the message again becomes visible to the ReceiveMessage calls placed by the components in your system and it will be received again"
Thanks!
I believe you are looking for the RabbitMQ manual acknowledgment feature. This feature lets you get messages from the queue and ack them once you have received and handled them. If something happens in the middle of this process, the message will be available again in the queue after a certain amount of time. Also, from the moment you get the message until you ack it, the message is not available for other consumers to consume.
I think this is the same behavior as the message visibility timeout of SQS.
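The manual acknowledgment flow described above might look like this with the pika client. It is a sketch: `make_callback` and `process` are hypothetical names, and the queue name and broker host in the commented wiring are placeholders.

```python
def make_callback(process):
    """Build a pika-style consumer callback that acks only after process() succeeds."""

    def on_message(channel, method, properties, body):
        try:
            process(body)
        except Exception:
            # Hand the message back to the queue so it can be delivered again.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        else:
            # Acknowledge only after the work is done; until then the broker
            # keeps the message unacknowledged and does not give it to others.
            channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message


# Wiring it up with pika (assumes a broker on localhost and a queue named "tasks"):
# import pika
# connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
# channel = connection.channel()
# channel.basic_consume(queue="tasks", on_message_callback=make_callback(print), auto_ack=False)
# channel.start_consuming()
```

With `auto_ack=False`, a message whose consumer dies before acking is requeued by the broker, which is the closest analogue to an SQS message becoming visible again.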
There aren't any message timeouts; RabbitMQ will redeliver the message only when the worker connection dies. It's fine even if processing a message takes a very, very long time.
I believe the answer can be found in a discussion of MQ vs SQS. Generally this is considered a feature of MQ (that it can handle slow consumers), but using a destination policy of "slowConsumerStrategy" with "abortSlowConsumerStrategy" might solve your problem. A fuller explanation can be found in Red Hat's MQ documentation, and I suppose we have to hope that RabbitMQ and Amazon MQ both support that strategy.