Celery unexpectedly closes TCP connection - django

I'm using RabbitMQ 3.8.2 with Erlang 22.2.7 and having a problem while consuming tasks. My configuration is django-celery-rabbitmq. While publishing messages in a queue everything goes ok until the length of the queue reaches 1200 messages. After this point RabbitMQ starts to close AMQP connection with following errors:
...
2022-11-01 09:35:25.327 [info] <0.20608.9> accepting AMQP connection <0.20608.9> (185.121.83.107:60447 -> 185.121.83.116:5672)
2022-11-01 09:35:25.483 [info] <0.20608.9> connection <0.20608.9> (185.121.83.107:60447 -> 185.121.83.116:5672): user 'rabbit_admin' authenticated and granted access to vhost '/'
...
2022-11-01 09:36:59.129 [warning] <0.19994.9> closing AMQP connection <0.19994.9> (185.121.83.108:36149 -> 185.121.83.116:5672, vhost: '/', user: 'rabbit_admin'):
client unexpectedly closed TCP connection
...
[error] <0.11162.9> closing AMQP connection <0.11162.9> (185.121.83.108:57631 -> 185.121.83.116:5672):
{writer,send_failed,{error,enotconn}}
...
2022-11-01 09:35:48.256 [error] <0.20201.9> closing AMQP connection <0.20201.9> (185.121.83.108:50058 -> 185.121.83.116:5672):
{inet_error,enotconn}
...
Then the django-celery consumer disappears from queue list, messages become "ready" and celery pods are unable to ack the message after the job is finished with the following error:
ERROR: [2022-11-01 09:20:23] /usr/src/app/project/celery.py:114 handle_message Error while handling Rabbit task: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/amqp/connection.py", line 514, in channel
return self.channels[channel_id]
KeyError: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/project/celery.py", line 76, in handle_message
message.ack()
File "/usr/local/lib/python3.10/site-packages/kombu/message.py", line 125, in ack
self.channel.basic_ack(self.delivery_tag, multiple=multiple)
File "/usr/local/lib/python3.10/site-packages/amqp/channel.py", line 1407, in basic_ack
return self.send_method(
File "/usr/local/lib/python3.10/site-packages/amqp/abstract_channel.py", line 70, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.10/site-packages/amqp/method_framing.py", line 186, in write_frame
write(buffer_store.view[:offset])
File "/usr/local/lib/python3.10/site-packages/amqp/transport.py", line 347, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
I have noticed that the message size also affects this behavior. In the above case there are like 1000-1500 symbols in each message. If I decrease it to 50 symbols, then the threshold at which RabbitMQ starts to close AMQP connection shifts to 4000-5000 messages.
I suspect that the problem is with lack of resources for RabbitMQ, but I don't know how find what exactly is going wrong. If I run htop on the server, I see that 2 available CPU are not at high load at any time (loaded less than 20% each) and RAM is 400mb / 3840mb used. So nothing seems to be wrong. Is there any resource checking command for RabbitMQ? The tasks do not take long time to complete, about 10 seconds each.
Also maybe there are some missing heartbeats from the client (I had the problem earlier, but not now, there are currently no error messages about that).
Also if I run sudo journalctl --system | grep rabbitmq, I get the following output:
......
Мау 24 05:15:49 oms-git.omsystem sshd[809111]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=43.154.63.169 user=rabbitmq
Мау 24 05:15:51 oms-git.omsystem sshd[809111]: Failed password for rabbitmq from 43.154.63.169 port 37010 ssh2
Мау 24 05:15:51 oms-git.omsystem sshd[809111]: Disconnected from authenticating user rabbitmq 43.154.63.169 port 37010 [preauth]
Мау 24 16:12:32 oms-git.omsystem sudo[842182]: ad : TTY=pts/3 ; PWD=/var/log/rabbitmq ; USER=root ; COMMAND=/usr/bin/tail -f -n 1000 rabbit#XXX-git.log
......
Maybe here is another issue with firewall, but I don't see any error messages about that in /var/log/rabbitmq/rabbit#XXX.log.
My Celery configuration on client is like:
CELERY_TASK_IGNORE_RESULT = True
CELERY_RESULT_BACKEND = 'django-db'
CELERY_CACHE_BACKEND = 'django-cache'
CELERY_SEND_EVENTS = False
CELERY_BROKER_POOL_LIMIT = 30
CELERY_BROKER_HEARTBEAT = 30
CELERY_BROKER_CONNECTION_TIMEOUT = 600
CELERY_PREFETCH_MULTIPLIER = 1
CELERY_SEND_EVENTS = False
CELERY_WORKER_CONCURRENCY = 1
CELERY_TASK_ACKS_LATE = True
Currently I'm running the pod using following command:
celery -A project.celery worker -l info -f /var/log/celery/celery.log -Ofair
Also I have tried to use various arguments to limit prefetch or turn off heartbit but it didn't work:
celery -A project.celery worker -l info -f /var/log/celery/celery.log --without-heartbeat --without-gossip --without-mingle
celery -A project.celery worker -l info -f /var/log/celery/celery.log --prefetch-multiplier=1 --pool=solo --
I expect that there are no limitations on queue length and every celery pod in my kubernetes cluster consumes and acks messages without errors.

Related

Camunda 8 python client Issue

I tried the camunda community python client, from the repo (https://github.com/camunda-community-hub/camunda-8-code-studio/tree/main/src/PythonCloudWorker). I have set up caumnda 8 saas account to run my tasks from the repo.
I 'm getting error when i try to run the python file, posting the error. Any suggestions appriciated.
communda_connect.py:59: DeprecationWarning: There is no current event loop
loop = asyncio.get_event_loop()
E0118 00:29:19.302897000 6259650560 hpack_parser.cc:1218] Error parsing metadata: error=invalid value key=content-type value=text/plain; charset=utf-8
E0118 00:29:19.307140000 6259650560 hpack_parser.cc:1218] Error parsing metadata: error=invalid value key=content-type value=text/plain; charset=utf-8
E0118 00:29:19.310754000 6259650560 hpack_parser.cc:1218] Error parsing metadata: error=invalid value key=content-type value=text/plain; charset=utf-8
Traceback (most recent call last):
env/lib/python3.10/site-packages/grpc/aio/_call.py", line 236, in _raise_for_status
raise _create_rpc_error(await self.initial_metadata(), await
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = "Received http2 header with status: 404"
debug_error_string = "UNKNOWN:Error received from peer ipv4:32.12.17.224:443 {created_time:"2023-01-18T00:29:19.304994+05:30", grpc_status:12, grpc_message:"Received http2 header with status: 404"}"
>
During handling of the above exception, another exception occurred:
env/lib/python3.10/site-packages/pyzeebe/grpc_internals/zeebe_adapter_base.py", line 33, in _handle_grpc_error
raise pyzeebe_error
pyzeebe.errors.zeebe_errors.UnkownGrpcStatusCodeError
problem was i had not passed the region parameter which was defaulting to bru-2.
camunda_region = os.environ.get('CAMUNDA_CLUSTER_REGION')
channel = create_camunda_cloud_channel(client_id=zeebe_client_id, client_secret=zeebe_client_secret, cluster_id=camundacloud_cluster_id,region=camunda_region)

Django Login suddenly stopped working - timing out

My Django project used to work perfectly fine for the last 90 days.
There has been no new code deployment during this time.
Running supervisor -> gunicorn to serve the application and to the front nginx.
Unfortunately it just stopped serving the login page (standard framework login).
I wrote a small view that checks if the DB connection is working and it comes up within seconds.
def updown(request):
from django.shortcuts import HttpResponse
from django.db import connections
from django.db.utils import OperationalError
status = True
# Check database connection
if status is True:
db_conn = connections['default']
try:
c = db_conn.cursor()
except OperationalError:
status = False
error = 'No connection to database'
else:
status = True
if status is True:
message = 'OK'
elif status is False:
message = 'NOK' + ' \n' + error
return HttpResponse(message)
This delivers back an OK.
But the second I am trying to reach /admin or anything else requiring the login, it times out.
wget http://127.0.0.1:8000
--2022-07-20 22:54:58-- http://127.0.0.1:8000/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response... 302 Found
Location: /business/dashboard/ [following]
--2022-07-20 22:54:58-- http://127.0.0.1:8000/business/dashboard/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response... 302 Found
Location: /account/login/?next=/business/dashboard/ [following]
--2022-07-20 22:54:58-- http://127.0.0.1:8000/account/login/? next=/business/dashboard/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
--2022-07-20 22:55:30-- (try: 2) http://127.0.0.1:8000/account/login/?next=/business/dashboard/
Connecting to 127.0.0.1:8000... connected.
HTTP request sent, awaiting response...
Supervisor / Gunicorn Log is not helpful at all:
[2022-07-20 23:06:34 +0200] [980] [INFO] Starting gunicorn 20.1.0
[2022-07-20 23:06:34 +0200] [980] [INFO] Listening at: http://127.0.0.1:8000 (980)
[2022-07-20 23:06:34 +0200] [980] [INFO] Using worker: sync
[2022-07-20 23:06:34 +0200] [986] [INFO] Booting worker with pid: 986
[2022-07-20 23:08:01 +0200] [980] [CRITICAL] WORKER TIMEOUT (pid:986)
[2022-07-20 23:08:02 +0200] [980] [WARNING] Worker with pid 986 was terminated due to signal 9
[2022-07-20 23:08:02 +0200] [1249] [INFO] Booting worker with pid: 1249
[2022-07-20 23:12:26 +0200] [980] [CRITICAL] WORKER TIMEOUT (pid:1249)
[2022-07-20 23:12:27 +0200] [980] [WARNING] Worker with pid 1249 was terminated due to signal 9
[2022-07-20 23:12:27 +0200] [1515] [INFO] Booting worker with pid: 1515
Nginx is just giving:
502 Bad Gateway
I don't see anything in the logs, I don't see any error when running the dev server from Django, also Sentry is not showing anything. Totally lost.
I am running Django 4.0.x and all libraries are updated.
The check up script for the database is only checking the connection. Due to misconfiguration of the database replication, the db was connecting and also reading, but when writing it hang.
The login page tries to write a session to the tables, which failed in this case.

How to send an email in AWS MWAA (Apache Airflow) using EmailOperator

I am working with AWS MWAA (Apache Airflow). I want to send an email in MWAA upon completion of my pipeline.I have set the following configuration
Now when I run my dag using an Email Operator, it gives me an error.
File "/usr/lib64/python3.7/socket.py", line 707, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/usr/lib64/python3.7/socket.py", line 752, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
[2022-05-19, 11:11:21 UTC] {{local_task_job.py:154}} INFO - Task exited with return code 1
[2022-05-19, 11:11:21 UTC] {{local_task_job.py:264}} INFO - 0 downstream tasks scheduled from follow-on schedule check
Then I changed my configuration to
It now gives the following error
File "/usr/lib64/python3.7/smtplib.py", line 642, in auth
raise SMTPAuthenticationError(code, resp)
smtplib.SMTPAuthenticationError: (530, b'Must issue a STARTTLS command first')
[2022-05-19, 12:22:39 UTC] {{local_task_job.py:154}} INFO - Task exited with return code 1
[2022-05-19, 12:22:39 UTC] {{local_task_job.py:264}} INFO - 0 downstream tasks scheduled from follow-on schedule check
Can you please tell me where I am doing wrong or how should I configure this to send an email to a particular email address from any domain?
Your smtp host variable is an email address and not a host.
It should be smtp.gmail.com not smtp#gmail.com
You've hopefully also changed your password as you have shared it publicly in that screenshot and anyone could use it now.

Flask-sqlalchemy / uwsgi: DB connection problem when more than on process is used

I have a Flask app running on Heroku with uwsgi server in which each user connects to his own database. I have implemented the solution reported here for a very similar situation. In particular, I have implemented the connection registry as follows:
class DBSessionRegistry():
_registry = {}
def get(self, URI, **kwargs):
if URI not in self._registry:
current_app.logger.info(f'INFO - CREATING A NEW CONNECTION')
try:
engine = create_engine(URI,
echo=False,
pool_size=5,
max_overflow=5)
session_factory = sessionmaker(bind=engine)
Session = scoped_session(session_factory)
a_session = Session()
self._registry[URI] = a_session
except ArgumentError:
raise Exception('Error')
current_app.logger.info(f'SESSION ID: {id(self._registry[URI])}')
current_app.logger.info(f'REGISTRY ID: {id(self._registry)}')
current_app.logger.info(f'REGISTRY SIZE: {len(self._registry.keys())}')
current_app.logger.info(f'APP ID: {id(current_app)}')
return self._registry[URI]
In my create_app() I assign a registry to the app:
app.DBregistry = DBSessionRegistry()
and whenever I need to talk to the DB I call:
current_app.DBregistry.get(URI)
where the URI is dependent on the user. This works nicely if I use uwsgi with one single process. With more processes,
[uwsgi]
processes = 4
threads = 1
sometimes it gets stuck on some requests, returning a 503 error code. I have found that the problem appears when the requests are handled by different processes in uwsgi. This is an excerpt of the log, which I commented to illustrate the issue:
# ... EVERYTHING OK UP TO HERE.
# ALL PREVIOUS REQUESTS HANDLED BY PROCESS pid = 12
INFO in utils: SESSION ID: 139860361716304
INFO in utils: REGISTRY ID: 139860484608480
INFO in utils: REGISTRY SIZE: 1
INFO in utils: APP ID: 139860526857584
# NOTE THE pid IN THE NEXT LINE...
[pid: 12|app: 0|req: 1/1] POST /manager/_save_task =>
generated 154 bytes in 3457 msecs (HTTP/1.1 200) 4 headers in 601
bytes (1 switches on core 0)
# PREVIOUS REQUEST WAS MANAGED BY PROCESS pid = 12
# THE NEXT REQUEST IS FROM THE SAME USER AND TO THE SAME URL.
# SO THERE IS NO NEED FOR CREATING A NEW CONNECTION, BUT INSTEAD...
INFO - CREATING A NEW CONNECTION
# TO THIS POINT, I DON'T UNDERSTAND WHY IT CREATED A NEW CONNECTION.
# THE SESSION ID CHANGES, AS IT IS A NEW SESSION
INFO in utils: SESSION ID: 139860363793168 # <<--- CHANGED
INFO in utils: REGISTRY ID: 139860484608480
INFO in utils: REGISTRY SIZE: 1
# THE APP AND THE REGISTRY ARE UNIQUE
INFO in utils: APP ID: 139860526857584
# uwsgi GIVES UP...
*** HARAKIRI ON WORKER 4 (pid: 11, try: 1) ***
# THE FAILED REQUEST WAS MANAGED BY PROCESS pid = 11
# I ASSUME THIS IS WHY IT CREATED A NEW CONNECTION
HARAKIRI: -- syscall> 7 0x7fff4290c6d8 0x1 0xffffffff 0x4000 0x0 0x0
0x7fff4290c6b8 0x7f33d6e3cbc4
HARAKIRI: -- wchan> poll_schedule_timeout
HARAKIRI !!! worker 4 status !!!
HARAKIRI [core 0] - POST /manager/_save_task since 1587660997
HARAKIRI !!! end of worker 4 status !!!
heroku[router]: at=error code=H13 desc="Connection closed without
response" method=POST path="/manager/_save_task"
DAMN ! worker 4 (pid: 11) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 4 (new pid: 14)
# FROM HERE ON, NOTHINGS WORKS ANYMORE
This behavior is consistent over several attempts: when the pid changes, the request fails. Even with a pool_size = 1 in the create_engine function the issue persists. No issue instead is uwsgi is used with one process.
I am pretty sure it is my fault, there is something I don't know or I don't understand about how uwsgi and/or sqlalchemy work. Could you please help me?
Thanks
What is hapeening is that you are trying to share memory between processes.
There are some exaplanations in these posts.
(is it possible to share memory between uwsgi processes running flask app?).
(https://stackoverflow.com/a/45383617/11542053)
You can use an extra layer to store your sessions outsite of the app.
For that, you can use uWsgi's SharedArea(https://uwsgi-docs.readthedocs.io/en/latest/SharedArea.html) which is very low level or you can user other approaches like uWsgi's caching(https://uwsgi-docs.readthedocs.io/en/latest/Caching.html)
hope it helps.

Django/Celery Quickstart example not working (worker is not executing any tasks)

I'm using Django/Celery Quickstart... or, how I learned to stop using cron and love celery, and it seems the jobs are getting queued, but never run.
tasks.py:
from celery.task.schedules import crontab
from celery.decorators import periodic_task
# this will run every minute, see http://celeryproject.org/docs/reference/celery.task.schedules.html#celery.task.schedules.crontab
#periodic_task(run_every=crontab(hour="*", minute="*", day_of_week="*"))
def test():
print "firing test task"
So I run celery:
bash-3.2$ sudo manage.py celeryd -v 2 -B -s celery -E -l INFO
/scratch/software/python/lib/celery/apps/worker.py:166: RuntimeWarning: Running celeryd with superuser privileges is discouraged!
'Running celeryd with superuser privileges is discouraged!'))
-------------- celery#myserver v3.0.12 (Chiastic Slide)
---- **** -----
--- * *** * -- [Configuration]
-- * - **** --- . broker: django://localhost//
- ** ---------- . app: default:0x12120290 (djcelery.loaders.DjangoLoader)
- ** ---------- . concurrency: 2 (processes)
- ** ---------- . events: ON
- ** ----------
- *** --- * --- [Queues]
-- ******* ---- . celery: exchange:celery(direct) binding:celery
--- ***** -----
[Tasks]
. GotPatch.tasks.test
[2012-12-12 11:58:37,118: INFO/Beat] Celerybeat: Starting...
[2012-12-12 11:58:37,163: INFO/Beat] Scheduler: Sending due task GotPatch.tasks.test (GotPatch.tasks.test)
[2012-12-12 11:58:37,249: WARNING/MainProcess] /scratch/software/python/lib/djcelery/loaders.py:132: UserWarning: Using settings.DEBUG leads to a memory leak, never use this setting in production environments!
warnings.warn("Using settings.DEBUG leads to a memory leak, never "
[2012-12-12 11:58:37,348: WARNING/MainProcess] celery#myserver ready.
[2012-12-12 11:58:37,352: INFO/MainProcess] consumer: Connected to django://localhost//.
[2012-12-12 11:58:37,700: INFO/MainProcess] child process calling self.run()
[2012-12-12 11:58:37,857: INFO/MainProcess] child process calling self.run()
[2012-12-12 11:59:00,229: INFO/Beat] Scheduler: Sending due task GotPatch.tasks.test (GotPatch.tasks.test)
[2012-12-12 12:00:00,017: INFO/Beat] Scheduler: Sending due task GotPatch.tasks.test (GotPatch.tasks.test)
[2012-12-12 12:01:00,020: INFO/Beat] Scheduler: Sending due task GotPatch.tasks.test (GotPatch.tasks.test)
[2012-12-12 12:02:00,024: INFO/Beat] Scheduler: Sending due task GotPatch.tasks.test (GotPatch.tasks.test)
The tasks are indeed getting queued:
python manage.py shell
>>> from kombu.transport.django.models import Message
>>> Message.objects.count()
234
And the count increases over time:
>>> Message.objects.count()
477
There are no lines in the log file that seem to indicate the task is being executed. I'm expecting something like:
[... INFO/MainProcess] Task myapp.tasks.test[39d57f82-fdd2-406a-ad5f-50b0e30a6492] succeeded in 0.00423407554626s: None
Any suggestions how to diagnose / debug this?
I'm new to celery as well, but from the comments on the link you provided, it looks like there was an error in the tutorial. One of the comments points out:
At this command
sudo ./manage.py celeryd -v 2 -B -s celery -E -l INFO
You must add "-I tasks" to load tasks.py file ...
Did you try that?
You should check that you specify BROKER_URL parameter inside django's settyngs.py.
BROKER_URL = 'django://'
And you should check that your timezones in django, mysql and celery is equal.
It helped me.
P.s.:
[... INFO/MainProcess] Task myapp.tasks.test[39d57f82-fdd2-406a-ad5f-50b0e30a6492] succeeded in 0.00423407554626s: None
This line means that your task was scheduled (!not executed!)
Please check your config and i hope that it helps you.
I hope someone could learn from my experience in hacking this.
After setting everything up according to the tutorial I noticed that when I call
add.delay(4,5)
nothing happens. the worker did not receive the task (nothing was printed on stderr).
The problem was with the rabbitmq installation. It turns out the default free disk size requirements is 1GB which was way too much for my VM.
what put me on track was to read the rabbitmq log file.
to find it I had to stop and start the rabbitmq server
sudo rabbitmqctl stop
sudo rabbitmq-server
rabbitmq dumps the log file location to the screen. in the file I noticed this:
=WARNING REPORT==== 14-Mar-2017::13:57:41 ===
disk resource limit alarm set on node rabbit#supporttip.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
I then followed the instruction here in order to reduce the free disk limit
Rabbitmq ignores configuration on Ubuntu 12
As a baseline I used the config file from git
https://github.com/rabbitmq/rabbitmq-server/blob/stable/docs/rabbitmq.config.example
The change itself:
{disk_free_limit, "50MB"}