RabbitMQ not closing old connections with Celery - django

I use Celery with Django to consume/publish tasks to RabbitMQ from ~20 workers across a few datacenters. After about a month, I'm at 8000 open socket descriptors, and the number keeps increasing until I restart RabbitMQ. I often "kill -9" the Celery worker processes instead of shutting them down cleanly, since I do not want to wait for jobs to finish. On the workers I do not see the connections that RabbitMQ is showing. Is there a way to purge the old connections from RabbitMQ?
I'm using Celery 3.1.13 and RabbitMQ 3.2.4, all on Ubuntu 14.04. I'm not using librabbitmq, but pyamqp.

I was getting the same issue with the following 3-machine setup:
Worker (Ubuntu 14.04)
amqp==1.4.6
celery==3.1.13
kombu==3.0.21
Django App Server (Ubuntu 14.04)
amqp==1.4.2
celery==3.1.8
kombu==3.0.10
RabbitMQ Server (Ubuntu 14.04 | rabbitmq-server 3.2.4)
Each task the worker received opened one connection that never closed (according to the RabbitMQ log) and consumed ~2-3 MB of memory.
I have since upgraded Celery to the latest version on my Django server and the socket descriptors and memory usage are holding steady.
I also see the connections close in the RabbitMQ log after the task completes, like so:
closing AMQP connection <0.12345.0> (192.168.1.100:54321 -> 192.168.1.100:5672):
connection_closed_abruptly

Use BROKER_HEARTBEAT in Django's settings.py file.
RabbitMQ expects this value from the client (Celery, in this case).
Refer to http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-broker_heartbeat for more details.
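For example, in settings.py (the value 10 below is an illustrative choice, not a recommendation; tune it to your network):

```python
# settings.py -- enable AMQP heartbeats so RabbitMQ can detect and drop
# dead client connections (e.g. workers killed with "kill -9").
# Value is in seconds; 10 here is an assumed example value.
BROKER_HEARTBEAT = 10
```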

Related

Scheduled Celery Task Lost From Redis

I'm using Celery in Django with Redis as the Broker.
Tasks are being scheduled for the future using the eta argument in apply_async.
After scheduling the task, I can run celery -A MyApp inspect scheduled and I see the task with the proper eta for the future (24 hours in the future).
Before the scheduled time, if I restart Redis (with service redis restart) or the server reboots, running celery -A MyApp inspect scheduled again shows "- empty -".
All scheduled tasks are lost after Redis restarts.
Redis is set up with AOF, so it shouldn't be losing DB state after restarting.
EDIT
After some more research, I found that running redis-cli -n 0 hgetall unacked both before and after the Redis restart shows the task in the queue. So Redis still has knowledge of the task, but for some reason when Redis restarts, the task is removed from the worker? It is then never sent again and just stays indefinitely in the unacked queue.

Cannot connect Celery to RabbitMQ on Windows Server

I am trying to set up RabbitMQ as a message broker for Celery on Windows Server 2012 R2. After starting the RabbitMQ server via the RabbitMQ start service in the applications menu, I try to start the Celery app with the following command:
celery -A proj worker -l info
I get the following error after the above command.
[2018-01-09 10:03:02,515: ERROR/MainProcess] consumer: Cannot connect to amqp://
guest:**@127.0.0.1:5672//: [WinError 10042] An unknown, invalid, or unsupported
option or level was specified in a getsockopt or setsockopt call.
Trying again in 2.00 seconds...
So I tried debugging by checking the status of the RabbitMQ server: I went into the RabbitMQ command prompt and ran rabbitmqctl status.
Here are my Django settings for Celery. I tried putting ports and usernames before and after the hosts, but got the same error:
CELERY_BROKER_URL = 'amqp://localhost//'
CELERY_RESULT_BACKEND = 'amqp://localhost//'
What is the issue here? How do I check whether the RabbitMQ service started or not? What settings do I need in the Django settings file?
I was fighting the same issue. Ended up downgrading amqp to 2.1.3 based on the open issue in py-amqp:
https://github.com/celery/py-amqp/issues/130
Uninstall amqp:
pip uninstall amqp
Then install the pinned version:
pip install -Iv amqp==2.1.3

uWSGI downtime when restart

I have a problem with uWSGI every time I restart the server after a code update.
When I restart uWSGI using "sudo restart accounting", there's a small gap between stopping and starting the instance that results in downtime and kills all in-flight requests.
When I try "sudo reload accounting", it works, but my memory usage doubles. When I run "ps aux | grep accounting", it shows 10 running processes (accounting.ini) instead of 5, and it freezes up my server when the memory hits the limit.
accounting.ini
I am running
Ubuntu 14.04
Django 1.9
nginx 1.4.6
uwsgi 2.0.12
This is how uWSGI does a graceful reload: it keeps the old processes around until their requests are served and creates new ones that take over incoming requests.
Read "Things that could go wrong":
Do not forget, your workers/threads that are still running requests
could block the reload (for various reasons) for more seconds than
your proxy server could tolerate.
And this:
Another important step of graceful reload is to avoid destroying
workers/threads that are still managing requests. Obviously requests
could be stuck, so you should have a timeout for running workers (in
uWSGI it is called the “worker’s mercy” and it has a default value of
60 seconds).
So I would recommend trying worker-reload-mercy.
The default is to wait 60 seconds; lower it to something your server can handle.
Tell me if it worked.
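A minimal sketch of that setting in your accounting.ini (the value 15 is an assumed example; pick what your proxy can tolerate):

```ini
# accounting.ini -- cap how long a worker that is still serving
# requests may delay the reload before being killed (default: 60s)
worker-reload-mercy = 15
```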
Uwsgi chain reload
Here is another attempt at fixing your issue. As you mentioned, your uWSGI workers restart in the manner described below:
send SIGHUP signal to the master
Wait for running workers.
Close all of the file descriptors except the ones mapped to sockets.
Call exec() on itself.
One of the cons of this kind of reload might be stuck workers.
Additionally, you report that your server crashes when uWSGI maintains 10 processes (5 old and 5 new).
I propose trying a chain reload. A direct quote from the documentation explains this kind of reload best:
When triggered, it will restart one worker at time, and the following worker is not reloaded until the previous one is ready to accept new requests.
It means that you will not have 10 processes on your server but only 5.
Config that should work:
# your .ini file
lazy-apps = true
touch-chain-reload = /path/to/reloadFile
Some resources on chain reload and other kinds are in links below:
Chain reloading uwsgi docs
uWSGI graceful Python code deploy

Django celery tasks in separate server

We have two servers, Server A and Server B. Server A is dedicated to running the Django web app. Due to the large amount of data, we decided to run the Celery tasks on Server B. Servers A and B share a common database. Tasks are initiated after post_save in models from Server A's web app. How do I implement this using RabbitMQ in my Django project?
You have 2 servers, 1 project, and 2 settings files (1 per server).
server A (web server + rabbit)
server B (only celery for workers)
Then you set the broker URL in both settings files, something like this:
BROKER_URL = 'amqp://user:password@IP_SERVER_A:5672//' (in server B's settings, IP_SERVER_A is the IP of server A).
Now any task will be sent to RabbitMQ on server A, to the virtual host /.
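As a concrete sketch, the broker setting on both servers might look like this (the credentials and the IP are placeholders):

```python
# settings.py on BOTH server A and server B -- point Celery at the
# RabbitMQ instance running on server A (placeholder user/password/IP)
BROKER_URL = "amqp://user:password@192.0.2.10:5672//"
```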
On server B, you just start a Celery worker, something like this:
python manage.py celery worker -Q queue_name -l info
and that's it.
Explanation: Django sends messages to RabbitMQ to queue a task, then the Celery workers fetch the messages and execute the tasks.
Note: RabbitMQ does not have to be installed on server A; you can install it on a server C and reference it in BROKER_URL in both settings files (A and B), like this: BROKER_URL = 'amqp://user:password@IP_SERVER_C:5672//'.

Celery and RabbitMQ timeouts and connection resets

I'm using RabbitMQ 3.6.0 and Celery 3.1.20 on a Windows 10 machine in a Django application. Everything is running on the same computer. I've configured Celery to Acknowledge Late (CELERY_ACKS_LATE=True) and now I'm getting connection problems.
I start the Celery worker, and after 50-60 seconds of handling tasks each worker thread fails with the following message:
Couldn't ack ###, reason:ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)
(### is the number of the task)
When I look at the RabbitMQ logs I see this:
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672)
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.254.0> (127.0.0.1:55373 -> 127.0.0.1:5672)
=ERROR REPORT==== 10-Feb-2016::22:17:14 ===
closing AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}
The error occurs exactly when the Celery workers are getting their connection reset.
I thought this was an AMQP Heartbeat issue, so I've added BROKER_HEARTBEAT = 15 to my Celery settings, but it did not make any difference.
I was having a similar issue with Celery on Windows with long-running tasks and concurrency=1. The following configuration finally worked for me:
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
I also started the celery worker daemon with the -Ofair option:
celery -A test worker -l info -Ofair
In my limited understanding, CELERYD_PREFETCH_MULTIPLIER sets the number of messages that sit in the queue of a specific Celery worker. By default it is set to 4. If you set it to 1, each worker will only consume one message and complete that task before it consumes another message. I was having issues with long-running tasks because the connection to RabbitMQ was consistently lost in the middle of a long task, but the task was then re-attempted if any other messages/tasks were waiting in the celery queue.
The following option was also specific to my situation:
CELERYD_CONCURRENCY = 1
Setting concurrency to 1 made sense for me because I had long-running tasks that needed a large amount of RAM, so they each needed to run solo.
@bbaker's solution with CELERY_ACKS_LATE (which is task_acks_late in Celery 4.x) did not work for me by itself. My workers run in Kubernetes pods, must be run with --pool solo, and each task takes 30-60s.
I solved it by also including broker_heartbeat=0:
broker_pool_limit = None
task_acks_late = True
broker_heartbeat = 0
worker_prefetch_multiplier = 1
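Collected as a minimal sketch of a Celery 4.x config module (same values as above, with comments on what each one does; this worked for my setup and is not a general prescription):

```python
# celeryconfig.py -- settings that stopped the connection resets for
# solo-pool workers running long tasks (sketch of the options above)
broker_pool_limit = None        # no upper bound on broker connections
task_acks_late = True           # ack only after the task completes
broker_heartbeat = 0            # disable AMQP heartbeats entirely
worker_prefetch_multiplier = 1  # fetch one message at a time
```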