There are a lot of topics on Django concurrency, but after reading through many of them, I don't feel I have found my answer when it comes to transactions.
Django version 1.3.1, PostgreSQL version 8.4.7.
A very simple version of my models could look like this:
class Member(models.Model):
    money = models.PositiveIntegerField(default=0)
    user = models.OneToOneField(User, related_name='member', primary_key=True)

class Bet(models.Model):
    total_money = models.PositiveIntegerField(default=0)
I also have a table Money, which is a relation between Member and Bet. It's not directly linked to my problem, but it helps me monitor it, because it can't be impacted by any concurrency issue: I just have to aggregate my Money rows to check whether the money field of Member and the total_money field of Bet are correct.
I can't rely only on the Money table, though; I need the fields themselves to be correct, because I filter on them a lot.
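For illustration, a consistency check along these lines is presumably what is meant (Money's fields bet and value are assumptions here, since the real model isn't shown):

from django.db.models import Sum

def bet_is_consistent(bet):
    # Sum the authoritative Money rows and compare with the denormalized field.
    total = Money.objects.filter(bet=bet).aggregate(Sum('value'))['value__sum'] or 0
    return total == bet.total_money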
My first try at the bid function was something like this (just with a lot more modifications to a lot more tables):
def bid(user_pk, bet_pk, value):
    # create Money object
    member = User.objects.get(pk=user_pk).member
    member.money = F('money') - value
    member.save()
    bet = Bet.objects.get(pk=bet_pk)
    bet.total_money = F('total_money') + value
    bet.save()
This version was working just fine until I got my first crash in the middle of a transaction.
I also had to copy-paste all the tests from my clean() methods into bid(), because I'm not really able to use clean() or full_clean() in this case (especially if bet raises after member has been saved).
So I decided to give Django transactions a try.
@transaction.commit_manually
def bid(user_pk, bet_pk, value):
    try:
        # create Money object
        member = User.objects.get(pk=user_pk).member
        member.money -= value
        member.clean()
        member.save()
        bet = Bet.objects.get(pk=bet_pk)
        bet.total_money += value
        bet.clean()
        bet.save()
    except:
        transaction.rollback()
        raise
    else:
        transaction.commit()
But without the possibility of using F() objects inside the manual transaction (which makes sense), I ended up with a lot of concurrency issues again.
I see only two solutions:
Only create the Money objects during bid()/the transaction, then have an asynchronous worker (Celery?) update the related fields on Member and Bet.
Keep a queue of bid() transactions (Redis?), and make all transactions that modify money-related fields synchronous, i.e. processed one at a time.
Am I missing an obvious and easier solution?
If not, what solution would you recommend, using which technology?
Would this work?

@transaction.commit_on_success
def bid(user_pk, bet_pk, value):
    Member.objects.filter(user__pk=user_pk).update(money=F('money') - value)
    Bet.objects.filter(pk=bet_pk).update(total_money=F('total_money') + value)
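Note that update() with an F() expression compiles to a single SQL UPDATE, so the read-modify-write race disappears. As a hedged variation (not part of the original answer), the same filter can also guard against overdrawing, by matching only members who can afford the bid and checking how many rows were touched:

from django.db import transaction
from django.db.models import F

@transaction.commit_on_success
def bid(user_pk, bet_pk, value):
    # The UPDATE matches zero rows if the member cannot afford the bid.
    updated = Member.objects.filter(
        user__pk=user_pk, money__gte=value
    ).update(money=F('money') - value)
    if not updated:
        raise ValueError("insufficient funds")  # the exception rolls the transaction back
    Bet.objects.filter(pk=bet_pk).update(total_money=F('total_money') + value)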
These are simplified models to demonstrate my problem:
class User(models.Model):
    username = models.CharField(max_length=30)
    total_readers = models.IntegerField(default=0)

class Book(models.Model):
    author = models.ForeignKey(User)
    title = models.CharField(max_length=100)

class Reader(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
So, we have Users, Books and Readers (Users, who have read a Book). Thus, Reader is basically a many-to-many relationship between Book and User.
Now let's say the current user reads a book. I'd like to update the number of total readers for all books by this book's author:
# get the book (as an example pk=1)
book = Book.objects.get(pk=1)
# save Reader object for this user and this book
Reader(user=request.user, book=book).save()
# count and save the total number of readers for this author in all his books
book.author.total_readers = Reader.objects.filter(book__author=book.author).count()
book.author.save()
By doing so, Django creates a LEFT OUTER JOIN query for PostgreSQL and we get the expected result. However, the database tables are huge and this has become a bottleneck.
In this example, we could simply increase the total_readers by one on each view, instead of actually counting the database rows. However, this is just a simplified model structure and we cannot do this in reality here.
What I can do is create another field in the Reader model called book_author_id. That way I denormalize the data and can count the Reader objects without PostgreSQL having to make the LEFT OUTER JOIN with the User table.
Finally, here's my question: Is it possible to create some sort of database index, so that PostgreSQL handles this denormalization automatically? Or do I really have to create this additional model field and redundantly store the author's PK in there?
EDIT - to point out the essential question: I got several great answers, which work for a lot of scenarios. However, they don't solve this actual problem. The only thing I'd like to know, is if it's possible to have PostgreSQL handle such a denormalization automatically - e.g. by creating some sort of database index.
Sometimes, this query can serve better:
book.author.total_readers = Reader.objects.filter(book__in=Book.objects.filter(author=book.author)).count()
That will generate a query with a sub-query; sometimes it will have better performance than the query with a join. You can even go further and end up creating 2 separate queries:
book.author.total_readers = Reader.objects.filter(book_id__in=Book.objects.filter(author=book.author).values_list('id', flat=True)).count()
That will generate 2 queries: one retrieves the list of all book IDs for that author, and the second retrieves the count of reads for books with an ID in that list.
A good solution may also be to create some batch task that runs, for example, once per hour and counts up all the reads, but that way you end up with a count of reads that is not refreshed live.
You can also create a Celery task that runs just after a read is created to generate the new value for the author. That way you won't have a long response time, and the delay between creating a read and counting it won't be too long. A minimal sketch of that idea follows.
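This is a hedged sketch, assuming a configured Celery app (shared_task exists from Celery 3.1 on; the models import path is hypothetical):

from celery import shared_task

from myapp.models import Reader, User  # hypothetical app label

@shared_task
def recount_readers(author_id):
    # Recompute the denormalized counter from the authoritative Reader rows.
    total = Reader.objects.filter(book__author=author_id).count()
    User.objects.filter(pk=author_id).update(total_readers=total)

# In the view, right after Reader(user=request.user, book=book).save():
# recount_readers.delay(book.author_id)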
It's always way better to solve bottlenecks of this sort with good design and maybe a little bit of caching rather than duplicating data in the way you suggest. The total_readers field is data you should generate instead of recording.
class User(models.Model):
    username = models.CharField(max_length=30)

    @property
    def total_readers(self):
        # caching_client stands in for your cache backend (e.g. django.core.cache)
        cached_value = caching_client.get("readers_" + self.username, None)
        if cached_value is None:
            cached_value = self.readers()
            caching_client.set("readers_" + self.username, cached_value)
        return cached_value

    def readers(self):
        return Reader.objects.filter(book__author=self).count()
There are libraries that do the caching via decorators, but I felt it was a pattern you would benefit from seeing expressly. You can also attach a TTL to the cache so that you ensure the value can't be wrong for longer than the TTL. You can also regenerate the cache upon creation of a Reader object, as sketched below.
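A hedged sketch of that regeneration with a post_save signal (Reader and caching_client as in the snippets above):

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Reader)
def refresh_readers_cache(sender, instance, created, **kwargs):
    # On every new Reader, recompute and overwrite the cached count.
    if created:
        author = instance.book.author
        caching_client.set("readers_" + author.username, author.readers())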
You might actually get some mileage out of declaring an m2m and defining through relationships, but I have no experience with it.
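As for the question's EDIT about having PostgreSQL keep the denormalized field up to date automatically: an index can't do that, but a trigger can. This is a hedged sketch, executed once over a raw cursor; the table names assume Django's defaults for a hypothetical app label myapp, and a symmetric DELETE trigger would be needed if Reader rows can be removed:

from django.db import connection

TRIGGER_SQL = """
CREATE OR REPLACE FUNCTION bump_total_readers() RETURNS trigger AS $$
BEGIN
    UPDATE myapp_user SET total_readers = total_readers + 1
    WHERE id = (SELECT author_id FROM myapp_book WHERE id = NEW.book_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reader_added AFTER INSERT ON myapp_reader
FOR EACH ROW EXECUTE PROCEDURE bump_total_readers();
"""

cursor = connection.cursor()
cursor.execute(TRIGGER_SQL)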
I have an import of objects where I want to check against the database whether each object has already been imported earlier; if it has, I will update it, and if not, I will create a new one. But what is the best way of doing this?
Right now I have this:
old_books = Book.objects.filter(foreign_source="import")
for book in new_books:
    try:
        old_book = old_books.get(id=book.id)
        # update book
    except Book.DoesNotExist:
        pass  # create book
But that creates a database call for each book in new_books. So I am looking for a way where it will only make one call to the database and then just fetch the objects from that queryset.
PS: I'm not looking for a get_or_create kind of thing, as the update and create functions are more complex than that :)
--- EDIT---
I guess I haven't been clear enough in my explanation, as the answers do not reflect what the problem is. So to make it clearer (I hope):
I want to pick out a single object from a queryset, based on the id of that object. I want the full object so I can update it and save it with its changed values. So let's say I have a queryset with 3 objects: A, B and C. Then I want a way to ask if the queryset has object B, and if it has, to get it without an extra database call.
Assuming new_books is another queryset of Book, you can filter on its ids:
old_books = Book.objects.filter(foreign_source="import").filter(id__in=[b.id for b in new_books])
With this, old_books contains the books that have already been created.
You can use values_list('id', flat=True) to get all the ids in a single DB call (it is much faster than fetching full model instances). Then you can use sets to find the intersection.
new_book_ids = new_books.values_list('id', flat=True)
old_book_ids = Book.objects.filter(foreign_source="import") \
                           .values_list('id', flat=True)

to_update_ids = set(new_book_ids) & set(old_book_ids)
to_create_ids = set(new_book_ids) - to_update_ids
-- EDIT (to include the updated part) --
I guess the problem you are facing is in bulk updating rather than bulk fetching.
If the updates are simple, then something like this might work:
old_book_ids = set(Book.objects.filter(foreign_source="import")
                               .values_list('id', flat=True))

to_update = []
to_create = []
for book in new_books:
    if book.id in old_book_ids:
        to_update.append(book.id)  # ids of books to update
    else:
        to_create.append(book)     # Book instances to bulk-create

# update books
Book.objects.filter(id__in=to_update).update(field='new_value')
Book.objects.bulk_create(to_create)
But if the updates are complex (the updated fields depend on related fields), then you can check the INSERT ... ON DUPLICATE KEY UPDATE option in MySQL and write a custom manager for Django.
Please leave a comment if the above is completely off track.
You'll have to do more than one query. You need two groups of objects; you can't fetch them both and split them up arbitrarily at the same time like that. There's no bulk_get_or_create method.
However, the example code you've given will do a query for every object, which really isn't very efficient (or Djangoic, for that matter). Instead, use the __in clause to create smart subqueries, and then you can limit the database hits to only two queries:
old_to_update = Book.objects.filter(foreign_source="import", pk__in=new_books)
# the books to create are the incoming ones whose pk doesn't match an existing import
to_create = new_books.exclude(pk__in=old_to_update)
Django is smart enough to know how to use the new_books queryset in those contexts (for pk__in it can also be a regular list of ids).
Update:
QuerySet objects are just a sort of list of objects, so all you need to do now is loop over them:
for book in old_to_update:
    pass  # update book

for book in to_create:
    pass  # create book
At this point it's fetching the books from the QuerySet, not from the database, which is a lot more efficient than using .get() for each and every one of them, and you get the same result: on each iteration you get an object to work with, the same as if you had got it from a direct .get() call.
The best solution I have found is using the Python next() function.
First evaluate the queryset into a set, and then pick the book you need with next():

old_books = set(Book.objects.filter(foreign_source="import"))
old_book = next((book for book in old_books if book.id == new_book.id), None)
That way the database is not queried every time you need to get a specific book from the queryset. And then you can just do:
if old_book:
    # update book
    old_book.save()
else:
    pass  # create new book
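A hedged alternative (not from the original answer): QuerySet.in_bulk() returns a pk-to-object dict in one query, so each lookup is O(1) instead of a linear scan with next():

old_books = Book.objects.filter(foreign_source="import").in_bulk(
    [b.id for b in new_books]
)
old_book = old_books.get(new_book.id)  # None if the book was never imported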
In Django 1.7 there is an update_or_create() method that might solve this problem in a better way: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.update_or_create
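A hedged usage sketch of update_or_create (the field names are illustrative):

book, created = Book.objects.update_or_create(
    id=new_book.id,
    defaults={'title': new_book.title, 'foreign_source': "import"},
)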
I have two models
class Subject(models.Model):
    name = models.CharField(max_length=100, choices=COURSE_CHOICES)
    created = models.DateTimeField('created', auto_now_add=True)
    modified = models.DateTimeField('modified', auto_now=True)
    syllabus = models.FileField(upload_to='syllabus')

    def __unicode__(self):
        return self.name
and
class Pastquestion(models.Model):
    subject = models.ForeignKey(Subject)
    year = models.PositiveIntegerField()
    questions = models.FileField(upload_to='pastquestions')

    def __unicode__(self):
        return str(self.year)
Each Subject can have one or more past questions, but a past question can have only one subject. I want to get a subject and then its related past questions for a particular year. I was thinking of fetching a subject and getting its related past questions.
Currently I'm implementing my code such that I rather get the past question whose subject and year correspond to the specified subject, like this:
this_subject=Subject.objects.get(name=the_subject)
thepastQ=Pastquestion.objects.get(year=2000,subject=this_subject)
I was thinking there is a better way to do this. Or is this already the better way? Please do tell.
I think what you want is the related_name property of the ForeignKey field. This creates a link back to the Subject object and provides a manager you can use to query the set.
So to use this functionality, change the foreignkey line to:
subject = models.ForeignKey(Subject, related_name='questions')
Then with an instance of Subject we'll call subj, you can:
subj.questions.filter(year=2000)
I don't think this performs much differently from the technique you have used. Roughly speaking, SQL performance boils down to a) whether there's an index and b) how many queries you're issuing, so you need to think about both. One way to find out what SQL your model usage is generating is to use SqlLogMiddleware, or alternatively to play with the options in How to show the SQL Django is running. It can be tempting, once you get going, to start issuing queries across relationships, e.g. q = Question.objects.get(year=2000, subject__name=SUBJ_MATHS), but unless you keep a close eye on these kinds of queries, you can and will badly hurt your app's performance.
Django's query syntax allows you to 'reach into' related objects.
past_questions = Pastquestion.objects.filter(year=2000, subject__name=subject_name)
I am implementing an e-auction toy app in Django and am confused about how best to handle concurrency in the code below. I am uncertain which of my solution candidates (or any other) fits best with the design of Django. I am fairly new to Django/Python, and my SQL know-how is rusty, so apologies if this is a no-brainer.
Requirement: Users may bid on products. Bids are only accepted if they are higher than the previous bids on the same product.
Here is a stripped down version of the models:
class Product(models.Model):
    name = models.CharField(max_length=20)

class Bid(models.Model):
    amount = models.DecimalField(max_digits=5, decimal_places=2)
    product = models.ForeignKey(Product)
and the bid view. This is where the race conditions occur (see comments):
def bid(request, product_id):
    p = get_object_or_404(Product, pk=product_id)
    form = BidForm(request.POST)
    if form.is_valid():
        amount = form.cleaned_data['amount']
        # the following code is subject to race conditions
        highest_bid_amount = Bid.objects.filter(product=product_id).aggregate(Max('amount')).get('amount__max')
        # race condition: a bid might have been inserted just now by another thread, so highest_bid_amount is already out of date
        if amount > highest_bid_amount:
            bid = Bid(amount=amount, product_id=product_id)
            # race condition: another user might just have placed a higher bid on the same product, so the save() below is incorrect
            bid.save()
    return HttpResponseRedirect(reverse('views.successful_bid'))
Solution candidates I considered so far:
I have read the Django docs about transactions, but I wouldn't know how to apply them to my problem. Since the database does not know about the requirement that bids must be ascending, it cannot cause Django to throw an IntegrityError. Is there a way to define this constraint during model definition? Or did I misunderstand the transaction API?
A stored procedure could take care of the bid logic. This seems to me the "best" choice so far, but it shifts handling of the race condition to the underlying database system. If this is a good approach, though, this solution might be combined with solution 1?
I considered using a select_for_update call to lock the bids for this product. However, this does not seem to be a solution, since in my understanding it would not affect any newly created bids.
Wish list:
If in any way possible, I would like to refrain from locking the entire Bid table, since bids on other products cannot be affected anyway.
If there is a good solution on application level, I would like to keep the code independent from the underlying database system.
Many thanks for your thoughts!
Would it be possible for you to add a highest_bid column to Product? If my logic is not off, you could then update the highest bid where product_id = x and highest_bid < current_bid. If this query indicates that a row has been updated, then you add the new record to the Bid table. This would probably mean that you have to have a default value for the highest_bid column. A minimal sketch of the idea:
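This is a hedged sketch (highest_bid is the hypothetical new column; the function returns whether the bid was accepted):

from django.db import transaction

@transaction.commit_on_success
def place_bid(product_id, amount):
    # Atomic conditional UPDATE: matches only if this bid beats the current
    # maximum (relies on highest_bid having a default such as 0).
    updated = Product.objects.filter(
        pk=product_id, highest_bid__lt=amount
    ).update(highest_bid=amount)
    if updated:
        Bid.objects.create(product_id=product_id, amount=amount)
    return bool(updated)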
Have you checked out Celery? You might process your queries asynchronously, queuing them and then handing results or errors back when they're available. That seems like a likely path to take if you want to avoid locking.
Otherwise, it does seem like some locking would need to occur, though it can be confined to a single row, as sketched below.
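A hedged sketch of row-level locking (select_for_update requires Django 1.4+; bids on other products are unaffected, because only the one Product row is locked):

from django.db import transaction

@transaction.commit_on_success
def bid(request, product_id):
    # Competing bids on this product block on the row lock until we commit,
    # so the max-amount check and the insert happen without a race.
    p = Product.objects.select_for_update().get(pk=product_id)
    # ... validate the form and check/insert the Bid exactly as before ...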
I have developed a few Django apps, all pretty straightforward in terms of how I am interacting with the models.
I am building one now that has several different views which, for lack of a better term, are "canned" search-result pages. These pages all return results from the same model, but they are filtered on different columns. On one page we might be filtering on type, on another on type and size, and on yet another on size only, etc.
I have written a function in views.py which is used by each of these pages; it takes kwargs containing the criteria upon which to search. The minimum is one filter, but one of the views has up to 4.
I simply check whether the kwargs dict contains one of the filter types; if so, I filter the result on that value (I just wrote this code now, so I apologize for any errors, but you should get the point):
def get_search_object(**kwargs):
    q = Entry.objects.all()
    if kwargs.__contains__('the_key1'):
        q = q.filter(column1=kwargs['the_key1'])
    if kwargs.__contains__('the_key2'):
        q = q.filter(column2=kwargs['the_key2'])
    return q.distinct()
Now, according to the Django docs (http://docs.djangoproject.com/en/dev/topics/db/queries/#id3), this is fine, in that the DB will not be hit until the queryset is evaluated. Lately, though, I have heard that this is not the most efficient way to do it and that one should probably use Q objects instead.
I guess I am looking for an answer from other developers out there. My way currently works fine; if it is totally wrong from a resources point of view, then I will change it ASAP.
Thanks in advance.
Resource-wise, you're fine, but there are a lot of ways it can be stylistically improved to avoid using the double-underscore methods and to make it more flexible and easier to maintain.
If the kwargs being used are the actual column names, then you should be able to simplify it pretty easily, since what you're doing is deconstructing the kwargs and rebuilding them manually, but only for specific keywords.
def get_search_object(**kwargs):
    entries = Entry.objects.filter(**kwargs)
    return entries.distinct()
The main difference is that this doesn't enforce that the keys be actual columns, and it pretty badly needs some exception handling. If you want to restrict it to a specific set of fields, you can specify that list and then build up a dict with the valid entries.
def get_search_object(**kwargs):
    valid_fields = ['the_key1', 'the_key2']
    filter_dict = {}
    for key in kwargs:
        if key in valid_fields:
            filter_dict[key] = kwargs[key]
    entries = Entry.objects.filter(**filter_dict)
    return entries.distinct()
If you want a fancier solution that just checks that it's a valid field on that model, you can (ab)use _meta:
def get_search_object(**kwargs):
    valid_fields = [field.name for field in Entry._meta.fields]
    filter_dict = {}
    for key in kwargs:
        if key in valid_fields:
            filter_dict[key] = kwargs[key]
    entries = Entry.objects.filter(**filter_dict)
    return entries.distinct()
In this case, your usage is fine from an efficiency standpoint. You would only need to use Q objects if you needed to OR your filters instead of ANDing them, e.g.:
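A hedged example of such an OR (column names as in the question):

from django.db.models import Q

# entries matching either criterion instead of both
Entry.objects.filter(
    Q(column1='some value') | Q(column2='other value')
).distinct()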