I have a Celery server with a RabbitMQ broker. I use it to run background tasks in my Django project.
In one of my views a signal is triggered which then calls a celery task like this:
create_something.delay(pk)
The task is defined like this:
@task
def create_something(donation_pk):
    # do something
Everything works great, but:
If RabbitMQ is down when I call the task, no error is thrown during the create_something.delay(pk) call itself, but the view throws this error:
[Errno 111] Connection refused
(The stack trace is kind of useless, I think this is because of the signals used)
The question now is: How can I prevent this error? Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
Thanks in advance for any hints!
Celery tasks have a .run() method, which executes the task as if it were part of the normal code flow.
create_something.run(pk)
You could catch the exception and execute .run() if needed.
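A minimal sketch of that fallback, assuming connection failures surface as an OSError (the [Errno 111] above); the exact exception class depends on your Celery/kombu version:

try:
    create_something.delay(pk)
except OSError:  # assumption: broker connection errors surface as OSError ([Errno 111]); adjust for your setup
    # fall back to running the task synchronously in the current process
    create_something.run(pk)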
Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
The exception thrown when you call the .delay() method and cannot connect to the broker can be caught just like any other exception:
try:
    foo.delay()
except <whatever exception is actually thrown>:
    # recover
You could build a loop around this to retry, but you should take care not to keep the request open for very long. For instance, if it takes a whole second for your connectivity problem to resolve, you don't want to hold up the request for that second. An option here is to abort quickly but use the logging infrastructure so that an email is sent to the site administrators. A retry loop would be the last thing I'd reach for, and only once I've identified what causes the connectivity issue and determined it cannot be helped. In most cases it can be helped, and the retry loop is really a band-aid solution.
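A hedged sketch of that "fail fast and notify" option, assuming Django's logging is configured with the mail_admins handler (AdminEmailHandler) so that logger.exception() results in an email to the site administrators:

import logging

logger = logging.getLogger(__name__)

try:
    create_something.delay(pk)
except OSError:  # adjust to whatever exception your broker connection actually raises
    # abort quickly; the logging infrastructure notifies the admins instead of blocking the request
    logger.exception("Broker unreachable, could not queue create_something(%s)", pk)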
How can I prevent this error?
By making sure your broker does not go down. To get a more precise answer, you'd have to give more diagnostic information in your question.
By the way, Celery has a notion of retrying tasks, but that applies when the task is already known to the broker. It does not apply to the case where you cannot contact the broker.
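For completeness, task-level retries look roughly like this; a sketch assuming Celery's bind=True task option, and again it only helps once the broker has accepted the task:

from celery import shared_task

@shared_task(bind=True, max_retries=3, default_retry_delay=10)
def create_something(self, donation_pk):
    try:
        do_something(donation_pk)  # placeholder for the real work
    except SomeTransientError as exc:  # hypothetical transient failure inside the task itself
        raise self.retry(exc=exc)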
I have a bunch of tasks within a Cloud Composer Airflow DAG, one of which is a KubernetesPodOperator. This task seems to get stuck in the scheduled state forever and so the DAG runs continuously for 15 hours without finishing (it normally takes about an hour). I have to manually mark it failed for it to end.
I've set the DAG timeout to 2 hours but it does not make any difference.
The Cloud Composer logs show the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server:
Connection refused
Is the server running on host "airflow-sqlproxy-service.default.svc.cluster.local" (10.7.124.107)
and accepting TCP/IP connections on port 3306?
The error log also gives me a link to this documentation about that error type: https://docs.sqlalchemy.org/en/13/errors.html#operationalerror
When the DAG is next triggered on schedule, it works fine without any fix required. This issue happens intermittently; we've not been able to reproduce it.
Does anyone know the cause of this error and how to fix it?
The reason behind the issue is related to how SQLAlchemy is used by Airflow: a session is created per thread and can be reused later in the Airflow code. If there is a significant delay between queries in a session, MySQL may close the connection; the connection timeout is set to approximately 10 minutes.
Solutions:
Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and closes the session at the end of the function.
Do not use a single long-running function. Instead, move all database queries into separate functions, so that there are multiple functions decorated with airflow.utils.db.provide_session. In this case, sessions are automatically closed after retrieving query results. A minimal sketch of this pattern follows.
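A minimal sketch of the decorator-per-query pattern, assuming Airflow 1.x where provide_session lives in airflow.utils.db; the function name and query are illustrative only:

from airflow.models import TaskInstance
from airflow.utils.db import provide_session

@provide_session
def count_task_instances(dag_id, session=None):
    # provide_session injects a fresh session here and closes it when the function returns
    return session.query(TaskInstance).filter(TaskInstance.dag_id == dag_id).count()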
I am hitting an API using the requests library in Django inside a Celery task. To be very specific, it fetches some records from the database, prepares a JSON payload, and does a POST request. In certain scenarios, the call fails with a 500 error code. I want to retry the POST request. What's the best way to go about it, and why?
Retry the Celery task itself (See implementation)
Retry the request using urllib3.util.retry (See full implementation)
Each Celery job runs in a separate process. In your case, repeating a POST request that returned 500 has nothing to do with creating another process; one process is and should be enough to handle such a request. So you should retry the request using urllib3.util.retry within the same Celery job, and only finish the job once you get a response with code 200.
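A hedged sketch of the second option, mounting urllib3's Retry on a requests session; note that POST is not retried by default, so it has to be listed explicitly (the parameter is allowed_methods on recent urllib3, method_whitelist on older releases). The URL and payload are placeholders:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=5,
    backoff_factor=1,                          # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],     # retry on these response codes
    allowed_methods=["POST"],                  # method_whitelist=["POST"] on older urllib3
)
session.mount("https://", HTTPAdapter(max_retries=retries))

payload = {"example": "data"}  # stands in for the JSON prepared from your database records
response = session.post("https://api.example.com/records", json=payload)  # placeholder URL
response.raise_for_status()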
We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a celery task that has a self.retry() in it. Max_retries is set to 15. The retry is happening with an exponential backoff and takes 182 hours to complete all 15 retries.
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!
I am using django framework and ran into some performance problems.
There is a very heavy function (which takes about 2 seconds) in my views.py; let's call it heavy().
The client uses ajax to send a request, which is routed to heavy(), and waits for a json response.
The bad thing is that I think heavy() is not concurrent: if there are two requests routed to heavy() at the same time, one must wait for the other. In other words, heavy() is serial; it cannot take another request before returning from the current one. I have tested and confirmed this behaviour on my local machine.
I am trying to make the functions in views.py concurrent and asynchronous. Ideally, when there are two requests coming to heavy(), heavy() should hand the job off to some remote worker with a callback and return, so that heavy() can process another request. When the task is done, the callback can send the results back to the client.
However, there is a problem: if heavy() wants to process another request, it must return; but if it returns something, the Django framework will send a (fake) response to the client, and the client may not wait for another response. Moreover, the fake response doesn't contain the correct data. I have searched through Stack Overflow and found few useful tips. I wonder if anyone has tried this and knows a good way to solve this problem.
Thanks,
First make sure that the 'inconcurrency' is actually caused by your heavy task. If you're using only one worker for Django, you will be able to process only one request at a time, no matter what it is. Consider running more workers for some concurrency, because this will also affect short requests.
For returning some information when the task is done, you can do it in at least two ways:
sending AJAX requests periodically to fetch the status of your task
using SSE or a websocket to subscribe to the actual result
Both of them will require writing some more JavaScript code to handle it. The first one is easy to achieve; for the second one you can use uWSGI capabilities, as described here. It can be handled asynchronously that way, independently of your Django workers (Django will just create the connection and start the task in Celery; checking the status and sending it to the client will be handled by gevent).
To follow up on GwynBliedD's answer:
Celery is commonly used to process tasks, and it has very simple Django integration. @GwynBlieD's first suggestion is very commonly implemented using Celery and a Celery result backend.
https://www.reddit.com/r/django/comments/1wx587/how_do_i_return_the_result_of_a_celery_task_to/
A common workflow using Celery is (a rough sketch in Django views follows the list):
client hits heavy()
heavy() queues heavy() task asynchronously
heavy() returns future task ID to client (view returns very quickly because little work was actually performed)
client starts polling a status endpoint using the task ID
when the task completes, the status endpoint returns the result to the client
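A rough sketch of that workflow in Django views, assuming the heavy work has been moved into a Celery task called heavy_task; all names here are illustrative:

from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import heavy_task  # assumption: the ~2-second job now lives in a Celery task


def heavy(request):
    # queue the work and return the task ID immediately; the view itself stays fast
    result = heavy_task.delay(request.GET.get("pk"))
    return JsonResponse({"task_id": result.id})


def heavy_status(request, task_id):
    # the client polls this endpoint with the task ID until the task has finished
    result = AsyncResult(task_id)
    payload = {"state": result.state}
    if result.ready():
        payload["result"] = result.get()
    return JsonResponse(payload)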
We have Clojure code which reads from a Rabbit queue. We would like to tolerate the case where the RabbitMQ server is down briefly, e.g. in the case of a restart (sudo service rabbitmq-server restart).
There appears to be some provision for reconnecting in Langohr. We adapted the example clojurewerkz.langohr.examples.recovery.example1 (Gist here). Slight differences vs. the published example include the connection parameters and the removal of the lb/publish call (since the data comes from an external source).
We can successfully consume data from the queue and wait for more messages. However, when we restart RMQ (via the above sudo command on the VM hosting RabbitMQ), the following exception is thrown:
Caught an exception during connection recovery!
java.io.IOException
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:106)
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:102)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:378)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:516)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:545)
at com.novemberain.langohr.Connection.recoverConnection(Connection.java:166)
at com.novemberain.langohr.Connection.beginAutomaticRecovery(Connection.java:115)
at com.novemberain.langohr.Connection.access$000(Connection.java:18)
at com.novemberain.langohr.Connection$1.shutdownCompleted(Connection.java:93)
at com.rabbitmq.client.impl.ShutdownNotifierComponent.notifyListeners(ShutdownNotifierComponent.java:75)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:573)
Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; reason: java.io.EOFException
at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:67)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:343)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:321)
... 8 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:273)
at com.rabbitmq.client.impl.Frame.readFrom(Frame.java:95)
at com.rabbitmq.client.impl.SocketFrameHandler.readFrame(SocketFrameHandler.java:131)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:533)
It seems likely that the intended restart mechanism provided by Langohr is breaking when it kicks in. Is there an alternative pattern which is preferred in the case of these "hard" restarts? Alternatively, I suppose we have to implement connection monitoring and retries ourselves. Any suggestions would be most welcome.
We used to see such stack traces, but we no longer see them with Langohr 2.9.0. After a restart, our Clojure clients reconnect and messages start flowing again.
We are using the defaults, which have connection and topology recovery turned on, as shown by this:
(infof "Automatic recovery enabled? %s" (rmq/automatic-recovery-enabled? connection))
(infof "Topology recovery enabled? %s" (rmq/automatic-topology-recovery-enabled? connection))