How to detect stale workers (or auto-restart) - django

We recently experienced a nasty situation with the Celery framework: there were a lot of messages in the queue, but they weren't being processed. We restarted Celery and the messages started being processed again. However, we do not want this to happen again and are looking for a permanent solution.
It appears that Celery's workers had gone stale. The Celery documentation notes the following on stale workers:
This shows that there’s 2891 messages waiting to be processed in the task queue, and there are two consumers processing them.
One reason that the queue is never emptied could be that you have a stale worker process taking the messages hostage. This could happen if the worker wasn’t properly shut down.
When a message is received by a worker the broker waits for it to be acknowledged before marking the message as processed. The broker will not re-send that message to another consumer until the consumer is shut down properly.
If you hit this problem you have to kill all workers manually and restart them
See documentation
However, this relies on manually checking for stale workers, which leaves a lot of room for error and costs manual labor. What would be a good solution to keep Celery working?

You could use supervisor or supervisor-like tools to deploy the workers; refer to Running the worker as a daemon.
Moreover, assuming you are using RabbitMQ, you could monitor the queue status with rabbitmq-management to check whether the queue becomes too large; Celery's monitoring guide also provides some mechanisms for monitoring.
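To automate the detection, a cron job or monitoring script could ping the workers and alert you (or trigger a restart through your process manager) when none respond. A minimal sketch, assuming your Celery app is importable as proj.celery:

from proj.celery import app  # hypothetical module path

def workers_look_stale(timeout=5):
    # inspect().ping() returns a mapping of worker name -> reply,
    # or None/empty when no worker responded within the timeout.
    replies = app.control.inspect(timeout=timeout).ping()
    return not replies

if workers_look_stale():
    # Alert, or restart via your process manager, e.g. `supervisorctl restart celery`.
    print("No Celery workers responded to ping; they may be stale.")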

Related

How do confirm_publish and acks_late compare in Celery?

I've noticed the following in the docs:
Ideally task functions should be idempotent: meaning the function won’t cause unintended effects even if called multiple times with the same arguments. Since the worker cannot detect if your tasks are idempotent, the default behavior is to acknowledge the message in advance, just before it’s executed, so that a task invocation that already started is never executed again.
If your task is idempotent you can set the acks_late option to have the worker acknowledge the message after the task returns instead. See also the FAQ entry Should I use retry or acks_late?.
If set to True messages for this task will be acknowledged after the task has been executed, not just before (the default behavior).
Note: This means the task may be executed multiple times should the worker crash in the middle of execution. Make sure your tasks are idempotent.
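To make that concrete, this is how I understand acks_late would be enabled on a task (send_report and generate_and_email_report are just illustrative names):

from celery import Celery

app = Celery('proj', broker='amqp://localhost')

@app.task(bind=True, acks_late=True)
def send_report(self, report_id):
    # With acks_late the message is only acknowledged after this function
    # returns, so it may run again if the worker crashes mid-execution;
    # the task must therefore be idempotent.
    generate_and_email_report(report_id)  # hypothetical helper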
Then there is the BROKER_TRANSPORT_OPTIONS = {'confirm_publish': True} option found here. I could not find official documentation for that.
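As far as I can tell it is passed straight through to the broker transport (Kombu). This is roughly how I am setting it, using the old Django-style setting names (assuming RabbitMQ):

# settings.py
BROKER_URL = 'amqp://guest:guest@localhost:5672//'
BROKER_TRANSPORT_OPTIONS = {'confirm_publish': True}  # ask RabbitMQ for publisher confirms
CELERY_ACKS_LATE = True  # acknowledge only after the task has finished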
I want to be certain that tasks which are submitted to celery (1) arrive at celery and (2) eventually get executed.
Here is how I think it works:
Celery stores the information about which tasks should be executed in a broker (typically RabbitMQ or Redis).
The application (e.g. Django) submits a task to Celery, which immediately stores it in the broker. confirm_publish confirms that it was added (right?). If confirm_publish is set but the confirmation is missing, it retries (right?).
Celery takes messages from the broker. Now Celery behaves as a consumer for the broker. The consumer acknowledges (confirms) that it received a message and the broker stores this information. If the consumer doesn't send an acknowledgement, the broker will retry sending it.
Is that correct?

Celery/SQS task retry gone haywire - how to get rid of it?

We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a Celery task that has a self.retry() in it, with max_retries set to 15. The retries happen with exponential backoff and take 182 hours to complete all 15 attempts.
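For reference, the task is shaped roughly like this (a simplified, illustrative sketch; process_feed, fetch_and_store and ServiceUnavailable are placeholder names, not our real code):

@app.task(bind=True, max_retries=15)
def process_feed(self, feed_id):
    try:
        fetch_and_store(feed_id)  # talks to the external service that had the outage
    except ServiceUnavailable as exc:
        # Exponential backoff: 1 min, 2 min, 4 min, ... between attempts.
        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)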
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!

Does the Spring SqsListener wait until the last message is processed (or completed) from the current poll before the next poll of messages happens?

I have an SQS listener with a max message count of 10. When my consumer receives a batch of 10 messages they all get processed, but sometimes (depending on the message) the processing will take 5-6 hours and some take as little as 5 minutes. I have 3 consumers (3 different JVMs) polling from the queue with a maxMessageCount of 10. Here is my issue:
If one of those 10 messages takes 5 hours to process it seems as though the listener is waiting to do the next poll of 10 messages until all of the previous messages are 100% complete. Is there a way to allow it to poll a new batch of messages even though another is still being processed?
I'm guessing that I am missing something little here. I am using the Spring Cloud library and the SqsListener annotation. Has anybody run across this before?
Also, I don't think this should matter, but the queue is AWS SQS and the JVMs are running on an ECS cluster.
If you run the task on the poller thread, the next poll won't happen until the current one completes.
You can use an ExecutorChannel or QueueChannel to hand the work off to another thread (or threads) but you risk message loss if you do that.
Your situation is rather unusual; 5 hours is a long time to process a message.
You should perhaps consider redesigning your application to persist these "long running" requests to a database or similar, instead of processing them directly from the message. Or, perhaps put them in a different queue so that they don't impact the shorter tasks.

Workflow of celery

I am a beginner with Django, and I have Celery installed.
I am confused about how Celery works: are queued tasks handled synchronously or asynchronously? Can other tasks be queued while a queued task is already being processed?
Celery is a task queueing system backed by a message queueing system. Celery allows you to invoke tasks asynchronously, in a way that doesn't block your process while the task runs; you can wait for the task to finish using AsyncResult.get.
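A minimal sketch (add is an illustrative task name; the broker/backend URLs are assumptions):

from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0')

@app.task
def add(x, y):
    return x + y

result = add.delay(2, 3)       # returns immediately with an AsyncResult
print(result.get(timeout=10))  # blocks until a worker has finished -> 5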
Other tasks can be queued while a task is being processed, and if Celery is running more than one process/thread (which is the default case), tasks will be executed in parallel with each other.
It is your responsibility to make sure that related tasks are executed in the correct order, e.g. if the output of task A is an input to task B, then you should make sure you have the result of task A before you start task B.
Read Avoid launching synchronous subtasks from Celery documentation.
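One way to express such an ordering is a chain, which the linked documentation recommends over waiting for a subtask inside a task; a minimal sketch (fetch_data and generate_report are illustrative task names):

from celery import chain

# generate_report receives the return value of fetch_data as its first argument.
workflow = chain(fetch_data.s(source_id=42), generate_report.s())
workflow.apply_async()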
I think you're possibly a bit confused about what Celery does.
Celery isn't really responsible for queueing at all. That is taken care of by the queue itself - RabbitMQ, Redis, or whatever. The only way Celery gets involved at this end is as a library that you call inside your app to serialize the task into something suitable for putting onto the queue. Since that is done by your web app, it is exactly as synchronous or asynchronous as your app itself: usually, in production, you'd have multiple processes running your site; each of those could put things onto the queue simultaneously, but each queueing action is done in-process.
The main point of Celery is the separate worker processes. This is where the asynchronous bit comes from: the workers run completely separately from your web app, and pick tasks off the queue as necessary. They are not at all involved in the process of putting tasks onto the queue in the first place.

What's common practice for enabling a locking mechanism for multiple SQS consumers in Django so I can be idempotent

SQS expects your application to be idempotent, and I've got multiple consumers/producers where (even if SQS had a deliver-once mechanism) I would still have race conditions creating duplicates and race conditions consuming, because my consumers run via cron jobs.
My current plan is to use the Django 1.4 select_for_update which should block other consumers on the same row, doing something like:
# select_for_update() must run inside a transaction to actually hold the row lock.
reminder = EmailReminder.objects.select_for_update().get(id=some_id)
if not reminder.finished:
    reminder.send()
    reminder.finished = datetime.now()
    reminder.save()
    # Delete job.
Are there better ways of dealing with this?
Hook up django-celery to SQS and have it schedule a periodic job using celerybeat. Then have celeryd worker(s) running on the same queue anywhere you want. Only one will pick up a job at a time and execute it. No need to introduce DB locking on any level.
As long as your worker is guaranteed to finish its current task before celerybeat fires a new one, you will never need a lock. Now, if you think there is a chance they may overlap, you can introduce states for your notifications (sketched below), where:
Any reminder starts in "unsent" state.
Your celerybeat sends a request to process unsent emails to the queue.
Some worker picks it up and grabs all of them.
Immediately the worker transitions all of them to "sending" state.
Proceeds to send them one at a time (or in bulk).
If sending fails for any, revert their state back to unsent.
For all that succeeded, transition to "sent".
This way if celerybeat fires another job while your original job is not done with the initial batch, you won't have duplicate emails sent. As an added bonus you can scale the solution and distribute the load.
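A rough sketch of that state machine as a Celery task (EmailReminder comes from the question; the state field, its values, and send() are illustrative, and a modern transaction.atomic is used for brevity):

from celery import shared_task
from django.db import transaction

@shared_task
def send_unsent_reminders():
    # Scheduled periodically, e.g. via celerybeat (schedule not shown).
    # Atomically claim the batch so an overlapping run skips these rows.
    with transaction.atomic():
        batch = list(EmailReminder.objects.select_for_update().filter(state='unsent'))
        EmailReminder.objects.filter(id__in=[r.id for r in batch]).update(state='sending')

    for reminder in batch:
        try:
            reminder.send()
        except Exception:
            # Revert so the next run retries it.
            EmailReminder.objects.filter(id=reminder.id).update(state='unsent')
        else:
            EmailReminder.objects.filter(id=reminder.id).update(state='sent')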