RuntimeError deadlock occurring randomly in Django

Running Django on Python 3.7.9. I am using channels, so daphne is used, but I get the same results even when I use gunicorn.
The error below is appearing randomly.
ERROR 2021-07-12 11:55:07,478 HTTP GET /static/customer/assets/js/jquery.min.js 500 [0.71, 127.0.0.1:55466]
ERROR 2021-07-12 11:55:07,479 Exception inside application: Single thread executor already being used, would deadlock
Traceback (most recent call last):
File "/home/x/.pyenv/versions/3.7.9/lib/python3.7/site-packages/channels/http.py", line 192, in __call__
await self.handle(body_stream)
File "/home/x/.pyenv/versions/3.7.9/lib/python3.7/site-packages/asgiref/sync.py", line 410, in __call__
"Single thread executor already being used, would deadlock"
RuntimeError: Single thread executor already being used, would deadlock
I don't think the error below has much to do with the deadlock, but at times they appear together:
ERROR 2021-07-12 11:55:07,478 HTTP GET /static/customer/assets/js/jquery.min.js 500 [0.71, 127.0.0.1:55466]
How can I resolve this?

I resolved this error by downgrading asgiref:
requirements.txt
asgiref==3.3.2

I solved this specific issue (it occurred while fetching static files) by running the python manage.py collectstatic command. I assume that this way the static files are not served by the same process. At least it resolved my error.
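For context, a minimal sketch of the settings this relies on (paths and names here are assumptions, not taken from the original project):

# settings.py (assumed layout; STATIC_ROOT is wherever collected files should land)
STATIC_URL = '/static/'
STATIC_ROOT = BASE_DIR / 'staticfiles'

# then collect the files once so they exist outside the ASGI process:
#   python manage.py collectstatic --noinput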

Related

channels slows app and creates many HTTP 500 errors

I use channels to tell the frontend of my app to force a page update. What I discovered is that it is now much slower in debug mode, and I also get tons of HTTP 500 errors in my web console.
Occasionally I end up with:
ERROR:daphne.server:Exception inside application: Single thread executor already being used, would deadlock
Traceback (most recent call last):
File "...\venv\lib\site-packages\channels\staticfiles.py", line 40, in __call__
return await self.staticfiles_handler_class()(
File "...\venv\lib\site-packages\channels\staticfiles.py", line 56, in __call__
return await super().__call__(scope, receive, send)
File "...\venv\lib\site-packages\channels\http.py", line 198, in __call__
await self.handle(scope, async_to_sync(send), body_stream)
File "...\venv\lib\site-packages\asgiref\sync.py", line 382, in __call__
raise RuntimeError(
RuntimeError: Single thread executor already being used, would deadlock
The HTTP 500 errors are usually resources that cannot be loaded - icons and other static files. Loading the page can take forever, although I remember it worked just fine for some time. I am using django-eventstream to create my channels.
How would I find out what is slowing me down, or how can I prevent it? Is my problem (probably) similar to this one: Django and Channels and ASGI Thread Problem?

Celery with Redis and Django giving WorkerLostError on long running tasks

I have a long-running Celery task that computes the PDP of a feature. Below is the shared task that's run:
import numpy as np
import xgboost as xgb
from celery import shared_task

@shared_task
def get_pdp_single(bst, train_df, feature, value, f_id=-1):
    x_temp = train_df.copy()
    x_temp.iloc[:, f_id] = value  # overwrite the feature column with the probe value
    data = xgb.DMatrix(x_temp, feature_names=x_temp.columns.tolist())
    predictions = bst.predict(data)
    avg_predictions = np.mean(predictions)
    result_dict = {
        "feature": feature,
        "avg_predictions": avg_predictions.item()
    }
    return result_dict
I'm computing H-statistics for all the features used in the XGBoost model, so a lot of such tasks get queued in the broker (Redis) - roughly 12k tasks in total.
I have an 8-core, 16 GB VM on which I run a single Celery worker for this. Each child task takes ~40 seconds to complete, because the XGBoost predict method takes its time.
On such long-running work I invariably get WorkerLostErrors, and it is quite unpredictable when and how they occur. However, I'm fairly sure it is related to the number of tasks queued on the broker, because ~4-5k tasks run fine on the same setup without any issues.
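For context, the fan-out that puts these tasks on the broker looks roughly like this (a sketch; feature_names and value_grids are hypothetical stand-ins for however the ~12k (feature, value) combinations are produced):

from celery import group

jobs = group(
    get_pdp_single.s(bst, train_df, feature, value, f_id=i)
    for i, feature in enumerate(feature_names)
    for value in value_grids[feature]
)
result = jobs.apply_async()  # ~12k task messages land in the Redis broker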
Below is the stack trace that I get on Celery.
Restarting celery worker (/~/anaconda3/envs/py35_clone_canary/bin/celery -A ba_tpe_python_service worker -Q staging_celery_queue --loglevel=info)
Traceback (most recent call last):
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/worker.py", line 205, in start
self.blueprint.start(self)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/loops.py", line 74, in asynloop
state.maybe_shutdown()
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/state.py", line 80, in maybe_shutdown
raise WorkerShutdown(should_stop)
celery.exceptions.WorkerShutdown: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/billiard-3.6.1.0-py3.5.egg/billiard/pool.py", line 1267, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 70.
I have also looked at multiple issues reported on the GitHub pages of Celery and Billiard. The suggested solution was to use the latest versions of Celery and Billiard. I built the latest master branches from their respective repositories in my environment, but I am still facing the same issue.
Celery version used: 4.4.0rc3
Billiard version used: 3.6.1.0
Please help me debug this issue.

What is the root cause of PyArrow HDFS IO error?

I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below - not on every run, only sometimes. I'm unable to determine the root cause of this issue; does anyone have any ideas?
File "/extractor.py", line 87, in __call__
json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
It turned out this was caused by duplicated computation of dask.get tasks on Delayed objects, which led to multiple processes attempting to write to the same file.
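To illustrate the failure mode (a local-file sketch with made-up names, not the original HDFS code): if the same Delayed write task ends up in two separate dask.compute calls, it executes twice, and two processes can race to write the same path.

import json
import dask
from dask import delayed

@delayed
def write_results(results_dict, path):
    with open(path, "w") as f:           # the real job wrote to HDFS via pyarrow
        json.dump(results_dict, f, indent=4)
    return path

@delayed
def summarize(path):
    return "wrote " + path

write_task = write_results({"feature": "x"}, "/tmp/results.json")

dask.compute(summarize(write_task))      # write_results runs here...
dask.compute(summarize(write_task))      # ...and runs again here (duplicate write)

# Computing everything in one call shares the task and writes only once:
# dask.compute(summarize(write_task), summarize(write_task))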

redis.exceptions.LockError: Cannot release an unlocked lock after restarting celerybeat

Sometimes after restarting celerybeat, I get the following error. I have set up celerybeat as a service with Redis:
sudo service celerybeat restart
Below is the exception trace:
Traceback (most recent call last):
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/celery/beat.py", line 484, in start
time.sleep(interval)
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/celery/apps/beat.py", line 148, in _sync
beat.sync()
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/celery/beat.py", line 493, in sync
self.scheduler.close()
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/redbeat/schedulers.py", line 272, in close
self.lock.release()
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/redis/lock.py", line 135, in release
self.do_release(expected_token)
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/redis/lock.py", line 264, in do_release
raise LockError("Cannot release a lock that's no longer owned")
redis.exceptions.LockError: Cannot release a lock that's no longer owned
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/celery/apps/beat.py", line 112, in start_scheduler
beat.start()
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/celery/beat.py", line 490, in start
self.sync()
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/celery/beat.py", line 493, in sync
self.scheduler.close()
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/redbeat/schedulers.py", line 272, in close
self.lock.release()
File "/home/ec2-user/pyenv/local/lib/python3.4/site-packages/redis/lock.py", line 133, in release
raise LockError("Cannot release an unlocked lock")
redis.exceptions.LockError: Cannot release an unlocked lock
The exception does not happen every time, and I have not noticed any issues caused by it; celerybeat works fine even after the exception. Since this is a production environment, I want to handle it safely.
I have noticed the same thing in my logs. For me, the reason was that the Redis lock timeout was shorter than the time the task took, so the lock had already expired when the code tried to release it. For example:
with redis_client.lock('some_key', timeout=5):
    time.sleep(10)  # holds the lock longer than its 5-second timeout
which gives:
redis.exceptions.LockError: Cannot release a lock that's no longer owned
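A sketch of one way to handle it safely (using the standard redis-py Lock API; not taken from the original post): make the timeout comfortably longer than the work, extend the lock while it is still owned, and treat a failed release as non-fatal.

import time
import redis

redis_client = redis.Redis()

lock = redis_client.lock('some_key', timeout=30)
if lock.acquire(blocking=True):
    try:
        for _ in range(3):
            time.sleep(5)       # the long-running work, done in chunks
            lock.extend(30)     # push the expiry out while we still own the lock
    finally:
        try:
            lock.release()
        except redis.exceptions.LockError:
            pass                # already expired or taken over; nothing to release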

Debugging celery WorkerLostError with exitcode zero (Django 1.5.5 + celery 3.1.8 + RabbitMQ 3.1.3 on Heroku)

My platform runs through a lot of tasks (several thousand per day). Some of the longer tasks keep failing with the following error:
Traceback (most recent call last):
File "/app/.heroku/python/lib/python2.7/site-packages/billiard/pool.py", line 1167, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: exitcode 0.
According to Celery's Flower, which doesn't provide anything more than the posted traceback, the task was received (2014-12-22 22:46:46.196814) four minutes before it was started (2014-12-22 22:50:03.469647), and failed in just ten seconds (epoch 1419288613.34, or 2014-12-22 22:50:13).
This has been a recurring problem on my platform. It happens mostly with tasks which run scrapy 0.24.2 but it may also happen with other tasks.
In other WorkerLostError cases (also with an exit code of zero), the task ran for three, five, or seven minutes before failing.
Any thoughts on what could be causing this? All tasks run perfectly fine locally. Thanks.
My recommendation is to check all of the modules you are using and your code for 'raise BaseException'. I ran into the issue with WorkerLostError exitcode 0.
After a lot of debugging and figuring out specifically where tasks were failing, I found that it was when BaseException was raised. Instead of providing the error message, WorkerLostError occurred.
By changing to 'raise Exception', the actual error message was provided when something went wrong inside the task. This might not be the same for your case, but it was what I found when dealing with the same error.
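A minimal sketch of that change (the task and exception names here are made up for illustration):

from celery import shared_task

# Deriving from BaseException made failures surface only as WorkerLostError;
# deriving from Exception lets the real error message reach the worker logs.
class DataFetchError(Exception):        # was: class DataFetchError(BaseException)
    pass

@shared_task
def fetch_report(report_id):
    raise DataFetchError("could not fetch report %s" % report_id)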
I have also noticed the same error:
[...ERROR/MainProcess] Task ... raised unexpected: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/billiard/pool.py", line 1175, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
not only with a BaseException, but also a custom exception subclassed from BaseException. Changing the base class to Exception allowed the actual exception to be raised along with the stack trace.
For me, it was a sys.exit(0) call inside the task.
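That fits the same pattern: sys.exit() raises SystemExit, which is a BaseException subclass, so the pool child exits and the parent only sees the lost worker. A hypothetical sketch:

import sys
from celery import shared_task

@shared_task
def cleanup_job():
    # ... task work ...
    sys.exit(0)   # raises SystemExit (a BaseException subclass): the child process
                  # exits and the parent reports WorkerLostError with exitcode 0.
                  # Returning normally instead avoids the error.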