Background:
I'm working on a project which uses Django with a Postgres database. We're also using mod_wsgi, in case that matters, since some of my web searches have mentioned it. On web form submit, the Django view kicks off a job that will take a substantial amount of time (more than the user would want to wait), so we kick off the job via a system call in the background. The job that is now running needs to be able to read and write to the database. Because this job takes so long, we use multiprocessing to run parts of it in parallel.
Problem:
The top level script has a database connection, and when it spawns off child processes, it seems that the parent's connection is available to the children. Then there's an exception about how SET TRANSACTION ISOLATION LEVEL must be called before a query. Research has indicated that this is due to trying to use the same database connection in multiple processes. One thread I found suggested calling connection.close() at the start of the child processes so that Django will automatically create a new connection when it needs one, and therefore each child process will have a unique connection - i.e. not shared. This didn't work for me, as calling connection.close() in the child process caused the parent process to complain that the connection was lost.
Other Findings:
Some stuff I read seemed to indicate you can't really do this, and that multiprocessing, mod_wsgi, and Django don't play well together. That just seems hard to believe I guess.
Some suggested using celery, which might be a long term solution, but I am unable to get celery installed at this time, pending some approval processes, so not an option right now.
Found several references on SO and elsewhere about persistent database connections, which I believe to be a different problem.
Also found references to psycopg2.pool and pgpool and something about bouncer. Admittedly, I didn't understand most of what I was reading on those, but it certainly didn't jump out at me as being what I was looking for.
Current "Work-Around":
For now, I've reverted to just running things serially, and it works, but is slower than I'd like.
Any suggestions as to how I can use multiprocessing to run in parallel? Seems like if I could have the parent and two children all have independent connections to the database, things would be ok, but I can't seem to get that behavior.
Thanks, and sorry for the length!
Multiprocessing copies connection objects between processes because it forks processes and therefore copies all the file descriptors of the parent process. That being said, a connection to the SQL server is just a file; you can see it on Linux under /proc/<pid>/fd/. Any open file will be shared between forked processes. You can find more about forking here.
My solution was simply to close the db connection just before launching the processes; each process recreates a connection itself when it needs one (tested in Django 1.4):
from multiprocessing import Process
from django import db

db.connections.close_all()

def db_worker():
    some_parallel_code()  # the parallel work goes here

Process(target=db_worker, args=()).start()
Pgbouncer/pgpool is not related to threads in the multiprocessing sense. It's rather a solution for not closing the connection on each request, i.e. speeding up connections to Postgres under high load.
Update:
To completely remove problems with the database connection, simply move all database-related logic into db_worker. I wanted to pass a QuerySet as an argument, but a better idea is to simply pass a list of ids: see values_list('id', flat=True), and do not forget to cast it to a list (list(qs)) before passing it to db_worker. That way we do not copy the model's database connection.
from multiprocessing import Process
from django import db

def db_worker(model_ids):
    obj = PartModelWorkerClass(model_ids)  # here you do Model.objects.filter(id__in=model_ids)
    obj.run()

model_ids = Model.objects.all().values_list('id', flat=True)
model_ids = list(model_ids)  # cast to list

process_count = 5
delta = (len(model_ids) // process_count) + 1

# do all the db stuff here ...

# here you can close the db connection
db.connections.close_all()

for it in range(process_count):
    Process(target=db_worker, args=(model_ids[it * delta:(it + 1) * delta],)).start()
When using multiple databases, you should close all connections.
from django import db

for connection_name in db.connections.databases:
    db.connections[connection_name].close()
EDIT
Please use the same approach as @lechup mentioned to close all connections (not sure since which Django version this method was added):
from django import db
db.connections.close_all()
For Python 3 and Django 1.9 this is what worked for me:
import multiprocessing
import django
import django.db

django.setup()  # Must call setup

def db_worker():
    for name in django.db.connections.databases:  # Close each inherited DB connection
        django.db.connections[name].close()
    # Execute parallel code here

if __name__ == '__main__':
    multiprocessing.Process(target=db_worker).start()
Note that without the django.setup() I could not get this to work. I am guessing something needs to be initialized again for multiprocessing.
I had "closed connection" issues when running Django test cases sequentially. In addition to the tests, there is also another process intentionally modifying the database during test execution. This process is started in each test case setUp().
A simple fix was to inherit my test classes from TransactionTestCase instead of TestCase. This makes sure that the database was actually written, and the other process has an up-to-date view on the data.
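For illustration, here is a minimal sketch of that change (the test class name and the external process are placeholders, not the original code):

from django.test import TransactionTestCase  # instead of TestCase

class JobIntegrationTests(TransactionTestCase):
    def setUp(self):
        # start the external process that modifies the database here
        ...

    def test_other_process_sees_data(self):
        # TransactionTestCase commits writes instead of holding them in a
        # rolled-back transaction, so the other process sees up-to-date data.
        ...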
Another way around your issue is to initialise a new connection to the database inside the forked process using:
from django.db import connection
connection.connect()
(not a great solution, but a possible workaround)
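For example, inside the forked child you might do something like this (db_worker is just a stand-in for your child-process entry point):

from django.db import connection

def db_worker():
    connection.close()    # drop the connection inherited from the parent
    connection.connect()  # open a fresh connection owned by this process
    # ... run queries here ...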
If you can't use Celery, maybe you could implement your own queueing system: basically, add tasks to a task table and have a regular cron job that picks them up and processes them (via a management command), as sketched below.
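A rough sketch of that idea, where the model, app, and command names are made up for illustration:

# models.py (hypothetical Task model)
from django.db import models

class Task(models.Model):
    name = models.CharField(max_length=100)
    payload = models.TextField(blank=True)  # e.g. JSON-encoded arguments
    done = models.BooleanField(default=False)


# myapp/management/commands/process_tasks.py (hypothetical command, run from cron)
from django.core.management.base import BaseCommand

from myapp.models import Task

class Command(BaseCommand):
    help = "Pick up pending tasks and process them."

    def handle(self, *args, **options):
        for task in Task.objects.filter(done=False):
            # ... do the long-running work for this task here ...
            task.done = True
            task.save()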
Hey I ran into this issue and was able to resolve it by performing the following (we are implementing a limited task system)
task.py
from django.db import connection

def as_task(fn):
    """ this is a decorator that handles task duties, like setting up loggers, reporting on status...etc """
    connection.close()  # this is where I kill the database connection, VERY IMPORTANT
    # This will force Django to open a new, unique connection, since on Linux at least
    # connections do not fare well when forked.
    # ...etc (the rest of the decorator, which wraps and returns fn, is elided)
ScheduledJob.py
import multiprocessing

from django.core.management import call_command
from django.db import connection

def run_task(request, job_id):
    """ Just a simple view that, when hit with a specific job id, kicks off said job """
    # your logic goes here
    # ...
    processor = multiprocessing.Queue()
    multiprocessing.Process(
        target=call_command,  # all of our tasks are set up as management commands in Django
        args=[
            job_info.management_command,
        ],
        kwargs=dict(web_processor=processor, **vars(options)),
    ).start()

    result = processor.get(timeout=10)  # wait to get a response on a successful init
    # Result is a tuple of [TRUE|FALSE, <ErrorMessage>]
    if not result[0]:
        raise Exception(result[1])
    else:
        # THE VERY VERY IMPORTANT PART HERE: notice that up to this point we haven't
        # touched the db again, but now we absolutely have to call connection.close()
        connection.close()
        # we do some database accessing here to get the most recently updated job id in the database
Honestly, to prevent race conditions (with multiple simultaneous users) it would be best to call connection.close() as quickly as possible after you fork the process. There may still be a chance that another user somewhere down the line makes a request to the db before you have a chance to close the connection, though.
In all honesty, it would likely be safer and smarter to have your fork not call the command directly, but instead call a script on the operating system so that the spawned task runs in its own Django shell!
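As a rough sketch of that last idea, assuming the job is exposed as a management command (the path, command name, and job_id variable here are illustrative, not from the original answer):

import subprocess
import sys

# Launch the job as a completely separate Django process, so it gets its own
# database connections instead of inheriting the web worker's.
subprocess.Popen([
    sys.executable, '/path/to/project/manage.py', 'run_scheduled_job', str(job_id),
])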
If all you need is I/O parallelism and not processing parallelism, you can avoid this problem by switching your processes to threads. Replace
from multiprocessing import Process
with
from threading import Thread
The Thread object has the same interface as Process.
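For example, the spawning code keeps the same shape (a small sketch; db_worker stands in for whatever worker function you already have):

from threading import Thread

t = Thread(target=db_worker, args=())
t.start()
t.join()  # optional: wait for the I/O-bound work to finish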
If you're also using connection pooling, the following worked for us: forcibly closing the connections after being forked. Closing them before the fork did not seem to help.
from django.db import connections
from django.db.utils import DEFAULT_DB_ALIAS
connections[DEFAULT_DB_ALIAS].dispose()
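For context, a sketch of where we call it, under the assumption that dispose() is provided by the pooling backend (e.g. an SQLAlchemy-based pool) rather than by stock Django:

import multiprocessing

from django.db import connections
from django.db.utils import DEFAULT_DB_ALIAS

def pooled_worker():
    # Discard the pooled connections inherited via fork before touching the DB.
    connections[DEFAULT_DB_ALIAS].dispose()
    # ... safe to run queries from this child process now ...

multiprocessing.Process(target=pooled_worker).start()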
One possibility is to use multiprocessing's spawn child-process creation method, which will not copy Django's DB connection details to the child processes. The child processes need to bootstrap from scratch, but are free to create/close their own Django DB connections.
In calling code:
import multiprocessing
from myworker import work_one_item  # <-- Your worker method

...

# Uses connection A
list_of_items = django_db_call_one()

# 'spawn' starts new python processes
with multiprocessing.get_context('spawn').Pool() as pool:
    # work_one_item will create its own DB connection
    parallel_results = pool.map(work_one_item, list_of_items)

# Continues to use connection A
another_db_call(parallel_results)
In myworker.py:
import django          # <--
django.setup()         # <-- needed if you'll make DB calls in worker

from myapp.models import MyDjangoModel  # import your model after setup (module path is illustrative)

def work_one_item(item):
    try:
        # This will create a new DB connection
        return len(MyDjangoModel.objects.all())
    except Exception as ex:
        return ex
Note that if you're running the calling code inside a TestCase, mocks will not be propagated to the child processes (you will need to re-apply them).
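For instance, if the worker relies on a patched function, one workaround is to re-apply the patch inside the worker itself (the patch target here is a made-up example):

from unittest import mock

def work_one_item(item):
    # Patches applied in the parent TestCase do not survive into a spawned child,
    # so re-apply whatever the worker needs here.
    with mock.patch('myapp.services.external_call', return_value=None):
        return len(MyDjangoModel.objects.all())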
You could give more resources to Postgres; on Debian/Ubuntu you can edit:
nano /etc/postgresql/9.4/main/postgresql.conf
replacing 9.4 with your Postgres version.
Here are some useful lines that should be updated with example values; the names speak for themselves:
max_connections = 100
shared_buffers = 3000MB
temp_buffers = 800MB
effective_io_concurrency = 300
max_worker_processes = 80
Be careful not to boost these parameters too much, as it might lead to errors with Postgres trying to take more resources than are available. The examples above run fine on a Debian machine with 8GB of RAM and 4 cores.
Override the Thread class and close all DB connections at the end of the thread. The code below works for me:
from threading import Thread
from django.db import connections

class MyThread(Thread):
    def run(self):
        super().run()
        connections.close_all()

def myasync(function):
    def decorator(*args, **kwargs):
        t = MyThread(target=function, args=args, kwargs=kwargs)
        t.daemon = True
        t.start()
    return decorator
When you need to call a function asynchronously:
@myasync
def async_function():
    ...
Related
Currently I have a Celery batch running with Django like so:
Celery.py:
from __future__ import absolute_import, unicode_literals
import os

import django
from celery import Celery
from celery.schedules import crontab
from dotenv import load_dotenv

load_dotenv(os.path.join(os.path.dirname(os.path.dirname(__file__)), '.env'))
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'base.settings')
django.setup()

app = Celery('base')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    app.control.purge()
    sender.add_periodic_task(30.0, check_loop.s())
    recursion_function.delay()  # need recursion because it has to wait for the loop to finish (the time can't be predicted)
    print("setup_periodic_tasks")

@app.task()
def check_loop():
    .....
    start = database start number
    end = database end number
    calling APIs in a list from id=start to id=end
    create objects
    update database (start number = end, end number = end + 3)
    ....

@app.task()
def recursion_function(default_retry_delay=10):
    .....
    do some looping
    ....
    # when finished, call itself again
    recursion_function.apply_async(countdown=30)
My aim is that whenever the Celery file gets edited, all the tasks are restarted and any queued tasks that have not yet executed are removed (I'm doing this because recursion_function will run itself again once it finishes its job of checking each record of a table in my database, so I'm not worried about it stopping midway).
The check_loop function calls an API that uses paging to return a list of objects, and I compare each one against the records in a table; if there is a match, I create a new custom record of another model.
My question is: when I purge all messages, will the currently running task be stopped midway, or will it keep running? Because if the check_loop function stops midway through looping over the API list, it will run the loop again and I will create new duplicate records, which I don't want.
EXAMPLE:
During a running check_loop() task it created objects midway (on the API list from element id=2 to id=5); the server restarts and runs again, so now check_loop() runs from the beginning (on the API list from element id=2 to id=5) and creates objects from that list again (which I 100% don't want).
Is this how it runs? I just need confirmation.
EDIT:
https://docs.celeryproject.org/en/4.4.1/faq.html#how-do-i-purge-all-waiting-tasks
I added app.control.purge() because when I restart, recursion_function gets called again in setup_periodic_tasks while the previous recursion_function from recursion_function.apply_async(countdown=30) executes too, so it multiplies itself.
Yes, the worker will continue executing the currently running task unless the worker is also restarted.
Also, the Celery way is to always expect tasks to run in a concurrent environment, with the following considerations:
there are many tasks running concurrently
there are many celery workers executing tasks
same task may run again
multiple instances of the same task may run at the same moment
any task may be terminated any time
Even if you are sure that in your environment there is only one worker, started/stopped manually, and these do not apply, tasks should still be written in a way that allows all of this to happen.
Some useful techniques:
use database transactions
use locking
split long-running tasks into faster ones
if a task has intermediate values to be saved, or they are important (i.e. non-reproducible, like some API calls) and their processing in the next step takes time, consider splitting it into several chained tasks
If you need to run only one instance of a task at a time, use some sort of locking: create/update a lock record in the database or in the cache so that other instances of the same task can check it, know the task is running, and just return or wait for the previous one to complete.
I.e. recursion_function can also be a periodic task. Being a periodic task ensures it is run every interval, even if the previous one fails for any reason (and thus fails to queue itself again, as a regular non-periodic task would). With locking you can make sure only one instance is running at a time; see the sketch below.
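For instance, a lock can be sketched with Django's cache, relying on cache.add() only succeeding when the key does not already exist (atomic on backends like Redis or Memcached); the key name and timeout here are arbitrary, and app is assumed to be the same Celery app as above:

from django.core.cache import cache

@app.task(bind=True)
def recursion_function(self, default_retry_delay=10):
    # cache.add() only succeeds if the key does not exist yet, so it doubles as a lock.
    if not cache.add('lock:recursion_function', self.request.id, timeout=30 * 60):
        return  # another instance is already running
    try:
        # ... do the looping work ...
        pass
    finally:
        cache.delete('lock:recursion_function')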
check_loop():
First, it is recommended to save results in a single database transaction, to make sure either all or nothing is saved/modified in the database.
You can also save a marker that indicates how many objects were saved / their status, so future tasks can just check this marker rather than each object.
Or somehow check, for each element before creating it, whether it already exists in the database. A sketch combining these ideas follows.
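A sketch combining the transaction, marker, and existence-check ideas (the model names and the API helper are illustrative, and app is again assumed to be your Celery app):

from django.db import transaction
from django.db.models import F

@app.task()
def check_loop():
    items = fetch_items_from_api()  # hypothetical helper wrapping the paged API
    with transaction.atomic():
        # Either every object and the updated start/end marker are saved, or none are,
        # so a crash or restart mid-loop cannot leave half-created duplicates behind.
        for item in items:
            MyRecord.objects.get_or_create(external_id=item['id'],
                                           defaults={'payload': item})
        Marker.objects.filter(pk=1).update(start=F('end'), end=F('end') + 3)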
I am not going to write an essay like Oleg's excellent post above. The answer is simply: all running tasks will continue running. purge is all about the tasks that are in the queue(s), waiting to be picked up by Celery workers.
I am using Celery with Django. Redis is my broker. I am serving my Django app via Apache and WSGI. I am running Celery under supervisor. I start a Celery task named run_forever from the wsgi.py file of my Django project. My intention was to start a Celery task when Django starts up and run it forever in the background (I don't know if this is the right way to achieve it; I searched but couldn't find an appropriate implementation. If you have any better idea, kindly share). It works as expected. Now, due to a certain issue, I have added the maximum-requests=250 parameter to the Apache virtual host. By doing so, when it gets 250 requests it restarts the WSGI process.
So every time it restarts, a Celery task 'run_forever' is created and runs in the background. Eventually, when the server gets 1000 requests, the WSGI process will have restarted 4 times and I end up with 4 copies of the 'run_forever' task. I only want one copy of the task to run at any point in time, so I would like to kill all the currently running 'run_forever' tasks every time Django starts.
I have tried
from project.celery import app
from project.tasks import run_forever
app.control.purge()
run_forever.delay()
in wsgi.py to kill all the running tasks before starting run_forever, but it didn't work.
I have to agree with Dave Smith here: why do you have a task that runs forever? That said, to the extent that you want to safeguard a task from running twice, there are multiple strategies you can use. The easiest to implement is using a database entry (since databases can be transactional, and if you're using Django, presumably you are using at least one database). N.b., in the code snippet below, I did not put my model in the right place to be picked up by a migration; I just put it in the same snippet for ease of discussion.
import time

from django.db import models
from myapp.celery import app

class CeleryGuard(models.Model):
    task_name = models.CharField(max_length=32)
    task_id = models.CharField(max_length=32)

@app.task(bind=True)
def run_forever(self):
    guard, created = CeleryGuard.objects.get_or_create(
        task_name='run_forever', defaults={
            'task_id': self.request.id
        })
    if not created:
        return
    # do whatever you want to here
    while True:
        print('I am doing nothing')
        time.sleep(1440)
    # make sure to clean up after you are done
    CeleryGuard.objects.filter(task_name='run_forever').delete()
When a Django test case runs, it creates an isolated test database so that database writes get rolled back when each test completes. I am trying to create an integration test with Celery, but I can't figure out how to connect Celery to this ephemeral test database. In the naive setup, objects saved in Django are invisible to Celery, and objects saved by Celery persist indefinitely.
Here is an example test case:
import json

from rest_framework.test import APITestCase
from myapp.models import MyModel
from myapp.util import get_result_from_response

class MyTestCase(APITestCase):
    @classmethod
    def setUpTestData(cls):
        # This object is not visible to Celery
        MyModel(id='test_object').save()

    def test_celery_integration(self):
        # This view spawns a Celery task
        # Task should see MyModel.objects.get(id='test_object'), but can't
        http_response = self.client.post('/', 'test_data', format='json')
        result = get_result_from_response(http_response)
        result.get()  # Wait for task to finish before ending test case
        # Objects saved by Celery task should be deleted, but persist
I have two questions:
How do I make it so that Celery can see the objects that the Django test case creates?
How do I ensure that all objects saved by Celery are automatically rolled back once the test completes?
I am willing to manually clean up the objects if doing this automatically is not possible, but a deletion of objects in tearDown even in APISimpleTestCase seems to be rolled back.
This is possible by starting a Celery worker within the Django test case.
Background
Django's in-memory database is sqlite3. As it says on the description page for SQLite in-memory databases, "[A]ll database connections sharing the in-memory database need to be in the same process." This means that, as long as Django uses an in-memory test database and Celery is started in a separate process, it is fundamentally impossible for Celery and Django to share a test database.
However, with celery.contrib.testing.worker.start_worker, it is possible to start a Celery worker in a separate thread within the same process. This worker can access the in-memory database.
This assumes that Celery is already setup in the usual way with the Django project.
Solution
Because Django-Celery involves some cross-thread communication, only test cases that don't run in isolated transactions will work. The test case must inherit directly from SimpleTestCase or its REST equivalent APISimpleTestCase and set databases to '__all__', or to just the database that the test interacts with.
The key is to start a Celery worker in the setUpClass method of the TestCase and close it in the tearDownClass method. The key function is celery.contrib.testing.worker.start_worker, which requires an instance of the current Celery app, presumably obtained from mysite.celery.app, and returns a Python ContextManager with __enter__ and __exit__ methods that must be called in setUpClass and tearDownClass, respectively. There is probably a way to avoid manually entering and exiting the ContextManager with a decorator or something, but I couldn't figure it out. Here is an example tests.py file:
from celery.contrib.testing.worker import start_worker
from django.test import SimpleTestCase

from mysite.celery import app

class BatchSimulationTestCase(SimpleTestCase):
    databases = '__all__'

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Start up celery worker
        cls.celery_worker = start_worker(app, perform_ping_check=False)
        cls.celery_worker.__enter__()

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()
        # Close worker
        cls.celery_worker.__exit__(None, None, None)

    def test_my_function(self):
        # my_task.delay() or something
        pass
For whatever reason, the testing worker tries to use a task called 'celery.ping', probably to provide better error messages in the case of worker failure. The task it is looking for is celery.contrib.testing.tasks.ping, which is not available at test time. Setting the perform_ping_check argument of start_worker to False skips the check for this and avoids the associated error.
Now, when the tests are run, there is no need to start a separate Celery process. A Celery worker will be started in the Django test process as a separate thread. This worker can see any in-memory databases, including the default in-memory test database. To control the number of workers, there are options available in start_worker, but it appears the default is a single worker.
For your unit tests I would recommend skipping the Celery dependency; the two following links will provide you with the necessary info to start your unit tests:
http://docs.celeryproject.org/projects/django-celery/en/2.4/cookbook/unit-testing.html
http://docs.celeryproject.org/en/latest/userguide/testing.html
If you really want to test the Celery function calls including a queue, I'd probably set up a docker-compose with the server, worker, and queue combination and extend the custom CeleryTestRunner from the django-celery docs. But I wouldn't see a benefit from it, because the test system is probably too far away from production to be representative.
I found another workaround for the solution based on @drhagen's one:
Call celery.contrib.testing.app.TestApp() before calling start_worker(app)
from celery.contrib.testing.worker import start_worker
from celery.contrib.testing.app import TestApp

from myapp.tasks import app, my_task

class TestTasks:
    def setup(self):
        TestApp()
        self.celery_worker = start_worker(app)
        self.celery_worker.__enter__()

    def teardown(self):
        self.celery_worker.__exit__(None, None, None)
We are trying to use PyMySQL (==0.7.11) in our Django (==1.11.4) environment, but we run into problems when multiple actions are performed at the same time (multiple requests sent to the same API function).
We get this error:
pymysql.err.InternalError: Packet sequence number wrong - got 6 expected 1
We are trying to delete records from the DB (sometimes massive requests come from multiple users).
Code:
def delete(self, delete_query):
    self.try_reconnect()
    return self._execute_query(delete_query)

def try_reconnect(self):
    if not self.is_connected:
        self.connection.ping(reconnect=True)

@property
def is_connected(self):
    try:
        self.connection.ping(reconnect=False)
        return True
    except Exception:
        return False

def _execute_query(self, query):
    with self.connection.cursor() as cursor:
        cursor.execute(query)
        self.connection.commit()
        last_row_id = cursor.lastrowid
        return last_row_id
I didn't think it necessary to point out that those functions are part of a DBHandler class, and that self.connection is initialized in its connect() method:
def connect(self):
    self.connection = pymysql.connect(...)
This connect function runs once at Django startup; we create a global instance (variable) of DBHandler for the whole project, and multiple files import it.
We use the delete function as our gateway for executing delete queries.
What are we doing wrong, and how can we fix it?
Found the problem:
PyMySQL is not thread safe for sharing connections the way we did (we shared the class instance between multiple files as a global instance, and the class holds only one connection). It is labelled as 1:
threadsafety = 1
According to PEP 249:
1 - Threads may share the module, but not connections.
One of the comments in a PyMySQL GitHub issue:
you need one pymysql.connect() for each process/thread. As far as I know that's the only way to fix it. PyMySQL is not thread safe, so the same connection can't be used across multiple threads.
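One way to follow that advice without a full pool is to keep one connection per thread via threading.local; a rough sketch (the connection parameters are placeholders):

import threading

import pymysql

_local = threading.local()

def get_connection():
    # Each thread lazily opens and then reuses its own connection,
    # so no connection object is ever shared between threads.
    if not hasattr(_local, 'connection'):
        _local.connection = pymysql.connect(host='localhost', user='user',
                                            password='secret', database='mydb')
    return _local.connection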
Anyway, if you were thinking of using another Python package, MySQLdb, for your threaded application, note MySQLdb's message:
Don't share connections between threads. It's really not worth your effort or mine, and in the end, will probably hurt performance, since the MySQL server runs a separate thread for each connection. You can certainly do things like cache connections in a pool, and give those connections to one thread at a time. If you let two threads use a connection simultaneously, the MySQL client library will probably upchuck and die. You have been warned.
For threaded applications, try using a connection pool. This can be
done using the Pool module.
Eventually we managed to use the Django ORM, writing only to our specific table, managed by using inspectdb.
I've noticed that, on occasions where I've run my Django project without the PostgreSQL server available, the errors produced are fairly cryptic and often appear to come from deep Django internals, as these are the functions actually connecting to the backend.
Is there a simple, clean (and DRY) way to test that the server is running?
Where is the best place to put project-level start-up checks?
You can register a signal on class_prepared.
https://docs.djangoproject.com/en/dev/ref/signals/#class-prepared
Then try executing custom SQL directly.
https://docs.djangoproject.com/en/dev/topics/db/sql/#executing-custom-sql-directly
If it fails, raise your custom exception. A sketch of this idea is below.
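A minimal sketch of that idea (the run-once guard, the SELECT 1 check, and the exception message are my own additions, not a canonical recipe):

from django.core.exceptions import ImproperlyConfigured
from django.db import connection
from django.db.models.signals import class_prepared
from django.dispatch import receiver

_checked = False

@receiver(class_prepared)
def check_database_available(sender, **kwargs):
    # class_prepared fires once per model, so only run the check the first time.
    global _checked
    if _checked:
        return
    _checked = True
    try:
        with connection.cursor() as cursor:
            cursor.execute('SELECT 1')  # trivial custom SQL, just to touch the backend
    except Exception as exc:
        raise ImproperlyConfigured('PostgreSQL server is not reachable') from exc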
import time

from django.core.management.base import BaseCommand
from django.db import connections
from django.db.utils import OperationalError

class Command(BaseCommand):
    """Management command that pauses until the database is available."""

    def handle(self, *args, **options):
        self.stdout.write('Waiting for database...')
        db_conn = None
        while not db_conn:
            try:
                db_conn = connections['default']
            except OperationalError:
                self.stdout.write('Database unavailable, waiting 1 second...')
                time.sleep(1)
        self.stdout.write(self.style.SUCCESS('Database available!'))
You can use this snippet wherever you need to.
Before accessing the database and making any queries, you must check whether the database is up.