We have a Django application that uses django-river for workflow management. To improve performance we had to use bulk_create; we need to insert data into a couple of tables, several rows in each.
Initially we were using the normal .save() method and the workflow worked as expected (the post_save signals were firing properly). Once we moved to bulk_create, performance improved from minutes to seconds, but django-river stopped working because no post_save signals are sent by default. We had to implement the signals ourselves based on the available documentation.
class CustomManager(models.Manager):
    def bulk_create(self, items, ....):
        super().bulk_create(...)
        for i in items:
            [......]  # code to send signal
And
class Task(models.Model):
    objects = CustomManager()
    ....
This got the workflow working again, but the generation of signals is taking time and this destroys all the performance improvement gained with bulk_create.
So is there a way to improve the signal creation?
More details
import threading

from django.db import models
from django.db.models.signals import post_save


def post_save_fn(obj):
    post_save.send(obj.__class__, instance=obj, created=True)


class CustomManager(models.Manager):
    def bulk_create(self, objs, **kwargs):
        data_obj = super(CustomManager, self).bulk_create(objs, **kwargs)
        for i in data_obj:
            # t1 = threading.Thread(target=post_save_fn, args=(i,))
            # t1.start()
            post_save.send(i.__class__, instance=i, created=True)
        return data_obj
class Test(Base):
    test_name = models.CharField(max_length=100)
    test_code = models.CharField(max_length=50)
    objects = CustomManager()

    class Meta:
        db_table = "test_db"
What is the problem?
As others have mentioned in the comments, the problem is that the functions called via post_save are taking a long time. (Remember that signals are not async! This is a common misconception.)
I'm not familiar with django-river but taking a quick look at the functions that will get called post-save (see here and here) we can see that they involve additional calls to the database.
Whilst you save a lot of individual db hits by using bulk_create, you are still calling the database multiple times for each post_save signal.
What can be done about it?
In short: not much! For the vast majority of Django requests, the slow part is calling the database. This is why we try to minimise the number of calls to the db (using things like bulk_create).
Reading through the first few paragraphs of the django-river docs, the whole idea is to move things that would normally live in code into the database. The big advantage is that you don't need to re-write code and re-deploy as often. The disadvantage is that you inevitably have to query the database more, which slows things down. This will be fine for some use-cases, but not all.
There are two things I can think of which might help:
Does all of this currently happen as part of the request/response cycle, and if it does, does it need to? If the answers to these two questions are 'yes' and 'no' respectively, then you could move this work to a separate task queue (see the sketch after these two points). It will still be slow, but at least it won't slow down your site.
Depending on exactly what your workflows are and the nature of the data you are creating, it might be the case that you can do everything that the post_save signals are doing in your own function, and do it more efficiently. But this will definitely depend upon your data, and your app, and will move away from the philosophy of django-river.
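If you go the task-queue route, a rough sketch with Celery could look like the following. The task, the module layout, and the idea of passing primary keys are my assumptions rather than anything from django-river; it also assumes your database returns primary keys from bulk_create (PostgreSQL does):

# tasks.py (hypothetical) - emit the post_save signals outside the
# request/response cycle so the workflow handlers run in a worker.
from celery import shared_task
from django.db.models.signals import post_save

from myapp.models import Task  # assumed app/model names


@shared_task
def send_post_save_signals(pks):
    # Re-fetch the freshly bulk-created rows and emit the signals that
    # would normally fire on .save().
    for obj in Task.objects.filter(pk__in=pks):
        post_save.send(obj.__class__, instance=obj, created=True)

The manager's bulk_create would then end with something like send_post_save_signals.delay([obj.pk for obj in data_obj]) instead of the in-request loop.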
Use a separate worker if the "signal" logic allows it to be executed after the bulk save.
You can create an additional queue table and put metadata there about what your future worker should do.
Create a separate worker (a Django module) with the needed logic, driven by the data from the queue table (a sketch follows below). You can implement it as a management command, which lets you run the worker in the main flow (management commands can be invoked from regular Django code) or run it by crontab on a schedule.
How to run such a worker?
If you need the work to be done as soon as possible after the records are created, run it in a separate thread using the threading module; your request-response lifecycle then finishes right after you start the new thread.
Otherwise, if the work can happen later, set up a schedule and run it by crontab using the management command framework.
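As a rough illustration (the TaskQueue model, its fields, and the command name are made up for this sketch), such a management command might look like:

# myapp/management/commands/process_task_queue.py (hypothetical)
from django.core.management.base import BaseCommand
from django.db.models.signals import post_save

from myapp.models import Task, TaskQueue  # assumed models


class Command(BaseCommand):
    help = "Process pending entries from the queue table"

    def handle(self, *args, **options):
        for entry in TaskQueue.objects.filter(processed=False):
            obj = Task.objects.get(pk=entry.object_pk)
            # Do whatever the post_save handlers would have done,
            # or simply re-emit the signal here.
            post_save.send(obj.__class__, instance=obj, created=True)
            entry.processed = True
            entry.save(update_fields=["processed"])

You can then call it from regular code with call_command("process_task_queue") (optionally in a thread) or schedule python manage.py process_task_queue via crontab.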
Related
I'm working on a web app developed with Django, which serves a REST API for booking tickets.
In order to do that, I've defined some views which interact with the database via the ORM.
My app has some critical functions, such as the booking function.
More or less, it has the following structure:
def book(params...):
    # check ticket availability
    # define some stuff which will be added to the new ticket entity
    # save new ticket entity
I don't know how Django manages concurrency, so I'm worried about the possibility of two bookings checking availability at the same time when there is only availability for one of them.
How likely is this to happen, and what is the best approach to solve it? I've thought about making that function atomic, but I don't know how badly that would affect system performance.
You should lock the resource exclusively until you are finished with it. If nobody else can acquire a lock on the object while you are working on it, you can be sure the object was not changed.
To acquire a lock on a resource, we use a database lock.
In Django we use select_for_update() for the database lock.
Check the documentation for select_for_update().
For example:
entries = Entry.objects.select_for_update().filter(author=request.user)
All matched entries will be locked until the end of the transaction block, meaning that other transactions will be prevented from changing or acquiring locks on them.
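Applied to the booking example, it could look roughly like this (the Ticket model and its fields are assumptions on my part):

from django.db import transaction

from myapp.models import Ticket  # assumed model


def book(user, ticket_id):
    with transaction.atomic():
        # The row is locked until the end of this transaction block,
        # so a concurrent book() call for the same ticket waits here.
        ticket = Ticket.objects.select_for_update().get(pk=ticket_id)
        if ticket.booked_by_id is not None:
            raise ValueError("Ticket is no longer available")
        ticket.booked_by = user
        ticket.save(update_fields=["booked_by"])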
You're right to be worried: you have a typical concurrency problem. Django can already help you here, because with ATOMIC_REQUESTS each view runs in a transaction and is therefore guaranteed to either run entirely with success or do nothing in case of an exception, but that alone does not solve your problem.
You must guarantee that each read > process > write cycle runs independently of the others. For example, if you have multiple Django threads serving the same application, when your "book" function is called you should prevent any other thread from entering it (or at least the concurrency-sensitive parts) at the same time. You can achieve this with concurrency controls such as semaphores or monitors. Take a look here and here.
Your code should be something like:
def book(params...):
    LOCK()
    # check ticket availability
    # define some stuff which will be added to the new ticket entity
    # save new ticket entity
    UNLOCK()
In a database, I have a field called date. Is there a way to delete a row when the date passes, so that it doesn't show up anymore? I've tried comparing it to today's date in the view, but this wouldn't happen every day, and people would still see it on the first page load. Any ideas?
Removing rows from your database is not safe, for many reasons ranging from permissions to on_delete logic. If you are not sure that deleting something is strictly required, just mark the row as active=False.
I would not recommend using cron, since it is hard to maintain: you have to set different tasks on different environments manually, copy these files somewhere into your VCS, and work with bash instead of Python.
Also, when it comes to events, I would not recommend storing something like this in the database itself (e.g. as database events), since it is not controlled by your VCS and is hard to maintain.
If your app is pretty simple, schedule is an option.
But if you are looking for some extra info, like:
What rows were deleted?
Were there any exceptions?
then you can move to the more complex Celery with Beat turned on. The extra dependencies (such as Redis or RabbitMQ) are its main disadvantage.
Docs:
celery beat
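For example, a periodic task along these lines could mark expired rows inactive rather than deleting them (the Event model and its field names are assumed):

from celery import shared_task
from django.utils import timezone

from myapp.models import Event  # assumed model with `date` and `active` fields


@shared_task
def deactivate_expired_events():
    # Run on a schedule by Celery Beat; hides rows whose date has passed
    # without physically deleting them.
    Event.objects.filter(date__lt=timezone.now(), active=True).update(active=False)

A beat_schedule entry (or django-celery-beat) can then run it, say, once an hour.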
Related:
How do I get a Cron like scheduler in Python?
I believe the best way would be to use a cron job, or to add an additional condition in the view so that only rows whose date has not yet passed are shown.
I would recommend using a MySQL event, since it runs continuously, unlike triggers, which are only fired on database operations. You want this to happen outside of anything going on in the application, purely based on time, so a MySQL event works for this scenario. See the full tutorial here: http://www.sitepoint.com/working-with-mysql-events/
I had an easier approach; I guess you could call it "hard-coded". I made a function called deleteevent with the following code:
from datetime import date, timedelta


def deleteevent():
    yesterday = date.today() - timedelta(days=1)
    if Events.objects.filter(event_date=yesterday).count():
        Events.objects.filter(event_date=yesterday).delete()
Then, in every other function I had, I called this at the beginning, so the event would be deleted before the page loaded.
I'm using Django 1.5.5.
Say I have an object as such:
class Encounter(models.Model):
    date = models.DateTimeField(blank=True, null=True)
How can I detect when a given Encounter has reached the current time? I don't see how signals can help me.
You can't detect it using just Django.
You need some scheduler that will check every Encounter's date (for example, with a corresponding filter query) and perform the needed actions.
It can be a simple cron script. You can write it as a Django custom management command and have cron call it every 5 minutes, for example.
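A minimal sketch of such a command (the processed flag and the actual handling are placeholders of mine):

# myapp/management/commands/check_encounters.py (hypothetical)
from django.core.management.base import BaseCommand
from django.utils import timezone

from myapp.models import Encounter  # assumed app layout


class Command(BaseCommand):
    help = "Handle encounters whose date has been reached"

    def handle(self, *args, **options):
        due = Encounter.objects.filter(date__lte=timezone.now(), processed=False)
        for encounter in due:
            # ... whatever should happen once the date is reached ...
            encounter.processed = True
            encounter.save(update_fields=["processed"])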
Or, you can use Celery. With it, you can see worker status from admin and do some other things.
What you could do is use Celery. When you save an Encounter object, a task would be put on the queue and execute only once the encounter's date has been reached.
There is one caveat though: it might execute a bit later, depending on how busy the Celery workers are.
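A rough sketch of that approach (the task and its arguments are assumptions):

from celery import shared_task

from myapp.models import Encounter  # assumed app layout


@shared_task
def handle_encounter(encounter_pk):
    # Executed by a worker at (roughly) the encounter's date.
    encounter = Encounter.objects.get(pk=encounter_pk)
    # ... do whatever should happen at that moment ...


# When saving an Encounter with a date in the future, schedule the task:
# handle_encounter.apply_async(args=[encounter.pk], eta=encounter.date)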
The problem: a signal receiver checks to see if a model entry exists for certain conditions, and if not, it creates a new entry. In some rare circumstances, the entry is being duplicated.
Within the receiver function:
try:
    my_instance = MyModel.objects.get(field1=value1, field2=sender)
except:
    my_instance = MyModel(field1=value1, field2=sender)
    my_instance.save()
It's an obvious candidate for get_or_create, but aside from cleaning up that code, would using get_or_create help prevent this problem?
The signal is sent after a user action, but I don't believe that the originating request is being duplicated, because that would have triggered other actions.
The duplication has occurred a few times in thousands of instances. Is this necessarily caused by multiple requests or is there some way a duplicate thread could be created? And is there a way - perhaps with granular transaction management - to prevent the duplication?
Using Django 1.1, Python 2.4, PostgreSQL 8.1, and mod_wsgi on Apache2.
To prevent signal duplication, add a dispatch_uid parameter to the signal connection code, as described in the docs.
Make sure that you have a transaction open; otherwise the state of the table may change between the check (objects.get()) and the creation (save()).
Perhaps this answer may help. Apparently, a transaction is properly used with get_or_create but I've not confirmed this. mod_wsgi is multi-process and multi-threaded (both configurable), which means that race conditions can definitely occur. What I guess is happening in your application is that two separate requests are launched that will generate the same value for field1, and it just so happens that they execute with just the right timing to add 'duplicate' entries.
If the combination of MyModel(field1=value1, field2=sender) must be unique, then define a unique_together constraint on your model to further aid integrity.
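A sketch of that combination (the field types here are assumptions, and value1 comes from wherever your receiver currently derives it):

from django.db import IntegrityError, models


class MyModel(models.Model):
    field1 = models.CharField(max_length=100)  # assumed type
    field2 = models.CharField(max_length=100)  # assumed type

    class Meta:
        unique_together = ("field1", "field2")


# In the signal receiver:
def my_receiver(sender, instance, **kwargs):
    try:
        my_instance, created = MyModel.objects.get_or_create(
            field1=value1, field2=sender
        )
    except IntegrityError:
        # Another request won the race; the row now exists, so fetch it.
        my_instance = MyModel.objects.get(field1=value1, field2=sender)

With the unique constraint in place, the worst a race can do is raise IntegrityError, which you can then handle instead of ending up with duplicates.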
Is there a way to protect against concurrent modifications of the same database entry by two or more users?
It would be acceptable to show an error message to the user performing the second commit/save operation, but data should not be silently overwritten.
I think locking the entry is not an option, as a user might use the "Back" button or simply close their browser, leaving the lock held forever.
This is how I do optimistic locking in Django:
updated = Entry.objects.filter(Q(id=e.id) & Q(version=e.version)) \
    .update(updated_field=new_value, version=e.version + 1)
if not updated:
    raise ConcurrentModificationException()
The code listed above can be implemented as a method in Custom Manager.
I am making the following assumptions:
filter().update() will result in a single database query because filter is lazy
a database query is atomic
These assumptions are enough to ensure that no one else has updated the entry before. If multiple rows are updated this way you should use transactions.
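Wrapping the snippet above in a custom manager, as suggested, might look roughly like this (ConcurrentModificationException and the version field are assumed to exist in your project):

from django.db import models


class ConcurrentModificationException(Exception):
    pass


class OptimisticManager(models.Manager):
    def update_with_version(self, instance, **new_values):
        # Only touch the row if nobody has bumped the version in the meantime.
        updated = self.filter(pk=instance.pk, version=instance.version).update(
            version=instance.version + 1, **new_values
        )
        if not updated:
            raise ConcurrentModificationException()
        instance.version += 1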
WARNING (Django docs):
Be aware that the update() method is converted directly to an SQL statement. It is a bulk operation for direct updates. It doesn't run any save() methods on your models, or emit the pre_save or post_save signals.
This question is a bit old and my answer a bit late, but from what I understand this has been fixed in Django 1.4 using:
select_for_update(nowait=True)
see the docs
Returns a queryset that will lock rows until the end of the transaction, generating a SELECT ... FOR UPDATE SQL statement on supported databases.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Of course this will only work if the backend supports the "select for update" feature, which, for example, SQLite doesn't. Unfortunately, nowait=True is not supported by MySQL; there you have to use nowait=False, which will only block until the lock is released.
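On a recent Django, the non-blocking variant is typically used like this (the model and what you do on conflict are assumptions):

from django.db import DatabaseError, transaction

from myapp.models import Entry  # assumed model


def edit_entry(entry_id, new_value):
    try:
        with transaction.atomic():
            entry = Entry.objects.select_for_update(nowait=True).get(pk=entry_id)
            entry.updated_field = new_value
            entry.save(update_fields=["updated_field"])
        return True
    except DatabaseError:
        # Another transaction already holds the lock; report it to the user
        # instead of blocking.
        return False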
Actually, transactions don't help you much here ... unless you want to have transactions running over multiple HTTP requests (which you most probably don't want).
What we usually use in those cases is "Optimistic Locking". The Django ORM doesn't support that as far as I know. But there has been some discussion about adding this feature.
So you are on your own. Basically, what you should do is add a "version" field to your model and pass it to the user as a hidden field. The normal cycle for an update is :
the app reads the data and shows it to the user
the user modifies the data
the user posts the data back
the app saves it to the database.
To implement optimistic locking, when you save the data, you check if the version that you got back from the user is the same as the one in the database, and then update the database and increment the version. If they are not, it means that there has been a change since the data was loaded.
You can do that with a single SQL call with something like :
UPDATE ... WHERE version = 'version_from_user';
This call will update the database only if the version is still the same.
Django 1.11 has three convenient options to handle this situation depending on your business logic requirements:
Something.objects.select_for_update() will block until the model becomes free
Something.objects.select_for_update(nowait=True) and catch DatabaseError if the model is currently locked for update
Something.objects.select_for_update(skip_locked=True) will not return the objects that are currently locked
In my application, which has both interactive and batch workflows on various models, I found these three options to solve most of my concurrent processing scenarios.
The "waiting" select_for_update is very convenient in sequential batch processes - I want them all to execute, but let them take their time. The nowait is used when an user wants to modify an object that is currently locked for update - I will just tell them it's being modified at this moment.
The skip_locked is useful for another type of update, when users can trigger a rescan of an object - and I don't care who triggers it, as long as it's triggered, so skip_locked allows me to silently skip the duplicated triggers.
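For instance, the duplicated-trigger case could be handled roughly like this (the model and what a "rescan" does are assumptions):

from django.db import transaction

from myapp.models import Document  # assumed model


def rescan(document_pk):
    with transaction.atomic():
        # Empty queryset if another worker already holds the lock, so a
        # duplicated trigger silently does nothing.
        locked = Document.objects.select_for_update(skip_locked=True).filter(
            pk=document_pk
        )
        for document in locked:
            # ... perform the rescan ...
            document.save()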
For future reference, check out https://github.com/RobCombs/django-locking. It does locking in a way that doesn't leave everlasting locks, by a mixture of javascript unlocking when the user leaves the page, and lock timeouts (e.g. in case the user's browser crashes). The documentation is pretty complete.
You should probably use the Django transaction middleware at least, regardless of this problem.
As to your actual problem of having multiple users editing the same data... yes, use locking. OR:
Check what version a user is updating against (do this securely, so users can't simply hack the system to say they were updating the latest copy!), and only update if that version is current. Otherwise, send the user back a new page with the original version they were editing, their submitted version, and the new version(s) written by others. Ask them to merge the changes into one, completely up-to-date version. You might try to auto-merge these using a toolset like diff+patch, but you'll need to have the manual merge method working for failure cases anyway, so start with that. Also, you'll need to preserve version history, and allow admins to revert changes, in case someone unintentionally or intentionally messes up the merge. But you should probably have that anyway.
There's very likely a django app/library that does most of this for you.
Another thing to look for is the word "atomic". An atomic operation means that your database change will either happen successfully, or fail obviously. A quick search shows this question asking about atomic operations in Django.
The idea above
updated = Entry.objects.filter(Q(id=e.id) & Q(version=e.version)) \
    .update(updated_field=new_value, version=e.version + 1)
if not updated:
    raise ConcurrentModificationException()
looks great and should work fine even without serializable transactions.
The problem is how to augment the default .save() behavior so you don't have to do manual plumbing to call the .update() method.
I looked at the Custom Manager idea.
My plan is to override the Manager's _update method, which is called by Model.save_base() to perform the update.
This is the current code in Django 1.3
def _update(self, values, **kwargs):
    return self.get_query_set()._update(values, **kwargs)
What needs to be done IMHO is something like:
def _update(self, values, **kwargs):
    # TODO: get the version field value
    v = self.get_version_field_value(values[0])
    return self.get_query_set().filter(Q(version=v))._update(values, **kwargs)
A similar thing needs to happen on delete. However, delete is a bit more difficult, as Django implements quite some voodoo in this area through django.db.models.deletion.Collector.
It is weird that a modern tool like Django lacks guidance for optimistic concurrency control.
I will update this post when I solve the riddle. Hopefully the solution will be a nice pythonic one that does not involve tons of coding, weird views, or skipping essential pieces of Django.
To be safe the database needs to support transactions.
If the fields is "free-form" e.g. text etc. and you need to allow several users to be able to edit the same fields (you can't have single user ownership to the data), you could store the original data in a variable.
When the user committs, check if the input data has changed from the original data (if not, you don't need to bother the DB by rewriting old data),
if the original data compared to the current data in the db is the same you can save, if it has changed you can show the user the difference and ask the user what to do.
If the fields is numbers e.g. account balance, number of items in a store etc., you can handle it more automatically if you calculate the difference between the original value (stored when the user started filling out the form) and the new value you can start a transaction read the current value and add the difference, then end transaction. If you can't have negative values, you should abort the transaction if the result is negative, and tell the user.
I don't know django, so I can't give you teh cod3s.. ;)
From here:
How to prevent overwriting an object someone else has modified
I'm assuming that the timestamp will be held as a hidden field in the form you're trying to save the details of.
def save(self, *args, **kwargs):
    if self.id:
        foo = Foo.objects.get(pk=self.id)
        if foo.timestamp > self.timestamp:
            raise Exception("trying to save outdated Foo")
    super(Foo, self).save(*args, **kwargs)