Celery and RabbitMQ timeouts and connection resets - django

I'm using RabbitMQ 3.6.0 and Celery 3.1.20 on a Windows 10 machine in a Django application. Everything is running on the same computer. I've configured Celery to Acknowledge Late (CELERY_ACKS_LATE=True) and now I'm getting connection problems.
I start the Celery worker, and after 50-60 seconds of handling tasks each worker thread fails with the following message:
Couldn't ack ###, reason:ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)
(### is the number of the task)
When I look at the RabbitMQ logs I see this:
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672)
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.254.0> (127.0.0.1:55373 -> 127.0.0.1:5672)
=ERROR REPORT==== 10-Feb-2016::22:17:14 ===
closing AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}
The error occurs exactly when the Celery workers are getting their connection reset.
I thought this was an AMQP Heartbeat issue, so I've added BROKER_HEARTBEAT = 15 to my Celery settings, but it did not make any difference.

I was having a similar issue with Celery on Windows with long running
tasks with concurrency=1. The following configuration finally worked for
me:
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
I also started the celery worker daemon with the -Ofair option:
celery -A test worker -l info -Ofair
In my limited understanding, CELERYD_PREFETCH_MULTIPLIER sets the number
of messages that sit in the queue of a specific Celery worker. By
default it is set to 4. If you set it to 1, each worker will only
consume one message and complete the task before it consumes another
message. I was having issues with long-running task because the
connection to RabbitMQ was consistently lost in the middle of the long task, but
then the task was re-attempted if any other message/tasks were waiting
in the celery queue.
The following option was also specific to my situation:
CELERYD_CONCURRENCY = 1
Setting concurrency to 1 made sense for me because I had long running
tasks that needed a large amount of RAM so they each needed to run solo.

#bbaker solution with CELERY_ACKS_LATE (which is task_acks_late in celery 4x) itself did not work for me. My workers are in Kubernetes pods and must be run with --pool solo and each task takes 30-60s.
I solved it by including broker_heartbeat=0
broker_pool_limit = None
task_acks_late = True
broker_heartbeat = 0
worker_prefetch_multiplier = 1

Related

Celery throws connection reset by peer error

I have a kombu consumer which sends a task periodically to my celery application(using rabbitmq backend) as such:
celery_app.send_task('fetch_data', args=[], kwargs={})
The celery app receives the task and is able to execute it. However after a while when it receives another task in a similar fashion, I get a
[Errno 104] Connection reset by peer
In the rabbitmq logs I see a lot of heartbeat missed warnings.
I have already tried setting
broker_pool_limit = None
in my celery app, but it did not solve the problem.
How do I rectify this?

Scheduled Celery Task Lost From Redis

I'm using Celery in Django with Redis as the Broker.
Tasks are being scheduled for the future using the eta argument in apply_async.
After scheduling the task, I can run celery -A MyApp inspect scheduled and I see the task with the proper eta for the future (24 hours in the future).
Before the scheduled time, if I restart Redis (with service redis restart) or the server reboots, running celery -A MyApp inspect scheduled again shows "- empty -".
All scheduled tasks are lost after Redis restarts.
Redis is setup with AOF, so it shouldn't be losing DB state after restarting.
EDIT
After some more research, I found out that running redis-cli -n 0 hgetall unacked both before and after the redis restart shows the tasked in the queue. So redis still has knowledge of the task, but for some reason when redis restarts, the task is removed from the worker? And then never sent again and it just stays indefinitely in the unakced queue.

uwsgi listen queue fills on reload

I'm running a Django app on uwsgi with an average of 110 concurrent users and 5 requests per second during peak hours. I'm finding that when I deploy with uwsgi reload during these peak hours I am starting to run into an issue where workers keep getting slowly killed and restarted, and then the uwsgi logs begin to throw an error:
Gracefully killing worker 1 (pid: 25145)...
Gracefully killing worker 2 (pid: 25147)...
... a few minutes go by ...
worker 2 killed successfully (pid: 25147)
Respawned uWSGI worker 2 (new pid: 727)
... a few minutes go by ...
worker 2 killed successfully (pid: 727)
Respawned uWSGI worker 2 (new pid: 896)
... this continues gradually for 25 minutes until:
*** listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100) ***
At this point my app rapidly slows to a crawl and I can only recover with a hard uwsgi stop followed by a uwsgi start. There are some relevant details which make this situation kind of peculiar:
This only occurs when I uwsgi reload, otherwise the listen queue never fills up on its own
The error messages and slowdown only start to occur about 25 minutes after the reload
Even during the moment of crisis, memory and CPU resources on the machine seem fine
If I deploy during lighter traffic times, this issue does not seem to pop up
I realize that I can increase the listen queue size, but that seems like a band-aid more than an actual solution. And the fact that it only fills up during reload (and takes 25 minutes to do so) leads me to believe that it will fill up eventually regardless of the size. I would like to figure out the mechanism that is causing the queue to fill up and address that at the source.
Relevant uwsgi config:
[uwsgi]
socket = 127.0.0.1:8001
processes = 4
threads = 2
max-requests = 300
reload-on-rss = 800
vacuum = True
touch-reload = foo/uwsgi/reload.txt
memory-report = true
Relevant software version numbers:
uwsgi 2.0.14
Ubuntu 14.04.1
Django 1.11.13
Python 2.7.6
It appears that our touch reload is not graceful when we have slight traffic, is this to be expected or do we have a more fundamental issue?
On uwsgi there is a harakiri mode that will periodically kill long running processes to prevent unreliable code from hanging (and effectively taking down the app). I would suggest looking there for why your processes are being killed.
As to why a hard stop works and a graceful stop does not -- it seems to further indicate your application code is hanging. A graceful stop will send SIGHUP, which allows resources to be cleaned up in the application. SIGINT and SIGTERM follow the harsher guidelines of "stop what you are doing right now and exit".
Anyway, it boils down to this not being a uwsgi issue, but an issue in your application code. Find what is hanging and why. Since you are not noticing CPU spikes; some probable places to look are...
blocking connections
locks
a long sleep
Good luck!
The key thing you need to look is "listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100)"
Default listen queue size is 100. Increase the queue size by adding the option "listen" in uwsgi.ini.
"listen = 4096"

RabbitMQ not closing old connections with Celery

I use Celery with Django to consume/publish tasks to RabbitMQ from ~20 workers across a few datacenters. After about a month or so, I'm at 8000 open socket descriptors and the number keeps increasing until I restart RabbitMQ. Often I "kill -9" the Celery worker process instead of shutting them down since I do not want to wait for jobs to finish. On the workers I do not see the connections that RabbitMQ is showing. Is there a way to purge the old connections from RabbitMQ?
I'm using Celery 3.1.13 and RabbitMQ 3.2.4, all on Ubuntu 14.04. I'm not using librabbitmq, but pyamqp.
I was getting the same issue with the following 3-machine setup:
Worker (Ubuntu 14.04)
amqp==1.4.6
celery==3.1.13
kombu==3.0.21
Django App Server (Ubuntu 14.04)
amqp==1.4.2
celery==3.1.8
kombu==3.0.10
Rabbit MQ Server (Ubuntu 14.04 | rabbitmq-server 3.2.4)
Each task the worker received opened one connection that never closed (according to the RabbitMQ log) and consumed ~2-3 MB of memory.
I have since upgraded Celery to the latest version on my Django server and the socket descriptors and memory usage are holding steady.
I also see the connections close in the RabbitMQ log after the task completes, like so:
closing AMQP connection <0.12345.0> (192.168.1.100:54321 -> 192.168.1.100:5672):
connection_closed_abruptly
Use BROKER_HEARTBEAT in Django's settings.py file.
RabbitMQ expects this value from the client(Celery in this case).
Refer to
http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-broker_heartbeat for more details.

Jobs firing to wrong celery

I am using django celery and rabbitmq as my broker (guest rabbit user has full access on local machine). I have a bunch of projects all in their own virtualenv but recently needed celery on 2 of them. I have one instance of rabbitmq running
(project1_env)python manage.py celery worker
normal celery stuff....
[Configuration]
broker: amqp://guest#localhost:5672//
app: default:0x101bd2250 (djcelery.loaders.DjangoLoader)
[Queues]
push_queue: exchange:push_queue(direct) binding:push_queue
In my other project
(project2_env)python manage.py celery worker
normal celery stuff....
[Configuration]
broker: amqp://guest#localhost:5672//
app: default:0x101dbf450 (djcelery.loaders.DjangoLoader)
[Queues]
job_queue: exchange:job_queue(direct) binding:job_queue
When I run a task in project1 code it fires to the project1 celery just fine in the push_queue. The problem is when I am working in project2 any task tries to fire in the project1 celery even if celery isn't running on project1.
If I fire back up project1_env and start celery I get
Received unregistered task of type 'update-jobs'.
If I run list_queues in rabbit, it shows all the queues
...
push_queue 0
job_queue 0
...
My env settings and CELERYD_CHDIR and CELERY_CONFIG_MODULE are both blank.
Some things I have tried:
purging celery
force_reset on rabbitmq
rabbitmq virtual hosts as outlined in this answer: Multi Celery projects with same RabbitMQ broker backend process
moving django celery setting out and setting CELERY_CONFIG_MODULE to the proper settings
setting the CELERYD_CHDIR in both projects to the proper directory
None of these thing have stopped project2 tasks trying to work in project1 celery.
I am on Mac, if that makes a difference or helps.
UPDATE
Setting up different virtual hosts made it all work. I just had it configured wrong.
If you're going to be using the same RabbitMQ instance for both Celery instances, you'll want to use virtual hosts. This is what we use and it works. You mention that you've tried it, but your broker URLs are both amqp://guest#localhost:5672//, with no virtual host specified. If both Celery instances are connected to the same host and virtual host, they will produce to and consume from the same set of queues.