I am implementing an e-auction toy app in Django and am confused about how to best handle concurrency in the code below. I am not sure which of my solution candidates (or any other) fits best with the design of Django. I am fairly new to Django/Python and my SQL know-how is rusty, so apologies if this is a no-brainer.
Requirement: Users may bid on products. Bids are only accepted if they are higher than the previous bids on the same product.
Here is a stripped down version of the models:
class Product(models.Model):
    name = models.CharField(max_length=20)

class Bid(models.Model):
    amount = models.DecimalField(max_digits=5, decimal_places=2)
    product = models.ForeignKey(Product)
and the bid view. This is where the race conditions occur (see comments):
from django.core.urlresolvers import reverse
from django.db.models import Max
from django.http import HttpResponseRedirect
from django.shortcuts import get_object_or_404

def bid(request, product_id):
    p = get_object_or_404(Product, pk=product_id)
    form = BidForm(request.POST)
    if form.is_valid():
        amount = form.cleaned_data['amount']
        # the following code is subject to race conditions
        highest_bid_amount = Bid.objects.filter(product=product_id).aggregate(Max('amount')).get('amount__max')
        # race condition: a bid might have been inserted just now by another thread, so highest_bid_amount is already out of date
        if amount > highest_bid_amount:
            bid = Bid(amount=amount, product_id=product_id)
            # race condition: another user might have just bid on the same product with a higher amount, so the save() below is incorrect
            bid.save()
            return HttpResponseRedirect(reverse('views.successful_bid'))
Solution candidates I considered so far:
I have read the Django docs about transactions, but I wouldn't know how to apply them to my problem. Since the database does not know about the requirement that bids must be ascending, it cannot cause Django to throw an IntegrityError. Is there a way to define this constraint in the model definition? Or did I misunderstand the transaction API?
A stored procedure could take care of the bid logic. This seems to me the "best" choice so far, but it shifts handling the race condition to the underlying database system. If this is a good approach, though, this solution might be combined with solution 1?
I considered using a select_for_update call to lock the bids for this product. However, as I understand it, this would not affect newly created bids, so it does not seem to solve the problem.
Wish list:
If in any way possible, I would like to refrain from locking the entire bid table, since bids on other products cannot be affected anyway.
If there is a good solution at the application level, I would like to keep the code independent of the underlying database system.
Many thanks for your thoughts!
Would it be possible for you to add a highest_bid column to Product? If my logic is not off, you could then update the highest bid where product_id = x and highest_bid < current_bid. If that query reports that a row was updated, you add the new record to the bid table. This would probably mean that the highest_bid column needs a default value.
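A minimal sketch of that idea (highest_bid on Product and the helper place_bid are hypothetical names, not part of the original models):

def place_bid(product_id, amount):
    # The WHERE clause of this UPDATE performs the "must be higher" check
    # atomically in the database, so two concurrent bidders cannot both win.
    rows = Product.objects.filter(
        pk=product_id, highest_bid__lt=amount,
    ).update(highest_bid=amount)
    if rows:
        # Exactly one row changed: this bid beat the previous highest bid.
        Bid.objects.create(product_id=product_id, amount=amount)
        return True
    return False

Because the comparison and the write happen in a single UPDATE statement, the race between reading the current maximum and saving the bid disappears, and only that one Product row is touched.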
Have you checked out Celery? You might process your queries asynchronously, queuing the queries and then handing results or errors back when they're available. That seems like a likely path to take if you want to avoid locking.
Otherwise, it does seem like some locking would need to occur.
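If some locking is acceptable, a hedged sketch (not from the answers above) is to lock only the parent Product row with select_for_update inside a transaction. That serialises bids per product without locking the whole Bid table, assuming a Django version with transaction.atomic and a database that supports row-level locks:

from django.db import transaction
from django.db.models import Max

def place_bid_locked(product_id, amount):
    with transaction.atomic():
        # Concurrent bidders on the same product block on this row lock until
        # the transaction commits; bids on other products are unaffected.
        product = Product.objects.select_for_update().get(pk=product_id)
        highest = Bid.objects.filter(product=product).aggregate(
            Max('amount'))['amount__max'] or 0
        if amount > highest:
            Bid.objects.create(product=product, amount=amount)
            return True
        return False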
Related
I have a Django ListView that lets users paginate through 'active' People.
The (simplified) models:
class Person(models.Model):
    name = models.CharField()
    # ...
    active_schedule = models.ForeignKey('Schedule', related_name='+', null=True,
                                        on_delete=models.SET_NULL)

class Schedule(models.Model):
    field = models.PositiveIntegerField(default=0)
    # ...
    person = models.ForeignKey(Person, related_name='schedules', on_delete=models.CASCADE)
The Person table contains almost 700,000 rows and the Schedule table contains just over 2,000,000 rows (on average every Person has 2-3 Schedule records, although many have none and a lot have more). For an 'active' Person, the active_schedule ForeignKey is set; there are about 5,000 of those at any time.
The ListView is supposed to show all active Persons, sorted by field on Schedule (plus some other conditions that don't seem to matter for this case).
The query then becomes:
(Person.objects
    .filter(active_schedule__isnull=False)
    .select_related('active_schedule')
    .order_by('active_schedule__field'))
Specifically the order_by on the related field makes this query terribly slow (that is: it takes about a second, which is too slow for a web app).
I was hoping the filter condition would narrow things down to the 5,000 active records, which would then be relatively easy to sort. But when I run EXPLAIN on this query, it shows that the (Postgres) database is touching many more rows:
Gather Merge (cost=224316.51..290280.48 rows=565366 width=227)
Workers Planned: 2
-> Sort (cost=223316.49..224023.19 rows=282683 width=227)
Sort Key: exampledb_schedule.field
-> Parallel Hash Join (cost=89795.12..135883.20 rows=282683 width=227)
Hash Cond: (exampledb_person.active_schedule_id = exampledb_schedule.id)
-> Parallel Seq Scan on exampledb_person (cost=0.00..21263.03 rows=282683 width=161)
Filter: (active_schedule_id IS NOT NULL)
-> Parallel Hash (cost=67411.27..67411.27 rows=924228 width=66)
-> Parallel Seq Scan on exampledb_schedule (cost=0.00..67411.27 rows=924228 width=66)
I recently changed the models to this design. In a previous version I had a model containing just the ~5,000 active Persons. Doing the order_by on that small table was considerably faster! I am hoping to achieve the same speed with the current models.
I tried retrieving just the fields needed for the ListView (using values()), which helps a little, but not much. I also tried setting the related_name on active_schedule and approaching the problem from Schedule, but that makes no difference. I tried putting a db_index on Schedule.field, but that seems only to make things slower. Conditional queries also did not help (although I probably have not tried all possibilities). I'm at a loss.
The SQL statement generated by the ORM query:
SELECT
"exampledb_person"."id",
"exampledb_person"."name",
...
"exampledb_person"."active_schedule_id",
"exampledb_person"."created",
"exampledb_person"."updated",
"exampledb_schedule"."id",
"exampledb_schedule"."person_id",
"exampledb_schedule"."field",
...
"exampledb_schedule"."created",
"exampledb_schedule"."updated"
FROM
"exampledb_person"
INNER JOIN
"exampledb_schedule"
ON ("exampledb_person"."active_schedule_id" = "exampledb_schedule"."id")
WHERE
"exampledb_person"."active_schedule_id" IS NOT NULL
ORDER BY
"exampledb_schedule"."field" ASC
(Some fields were left out, for simplicity.)
Is it possible to speed up this query, or should I revert to using a special model for the active Persons?
EDIT: When I change the query, just for comparison/testing, to sort on an unindexed field on Person, the query is equally slow. However, if I then add an index to that field, the query is fast! I had to try this, because the SQL statement indeed shows that it's ordering on "exampledb_schedule"."field" - a field without an index - but like I said: adding an index on that field makes no difference.
EDIT: I suppose it's also worth noting that a much simpler sort query directly on Schedule, whether on an indexed field or not, is MUCH faster. For instance, for this test I added an index to Schedule.field; then the following query is blazing fast:
Schedule.objects.order_by('field')
Somewhere in here lies the solution...
The comments by @guarav and my edits pointed me in the direction of the solution, which had been staring me in the face for a while...
The filter clause in my question - filter(active_schedule__isnull=False) - seems to prevent the database from using the indexes. I wasn't aware of this, and had hoped a database expert would point me in this direction.
The solution is to filter on Schedule.field, which is 0 for inactive Person records and >0 for active ones:
(Person.objects
    .select_related('active_schedule')
    .filter(active_schedule__field__gte=1)
    .order_by('active_schedule__field'))
This query properly uses the indexes and is fast (20ms as opposed to ~1000ms).
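As a side note that goes beyond the original answer: since the filter now selects only the ~5,000 rows with field >= 1, a partial index matching that condition could keep the sort cheap even as Schedule grows. A sketch, assuming Django 2.2+ for conditional indexes:

from django.db import models
from django.db.models import Q

class Schedule(models.Model):
    field = models.PositiveIntegerField(default=0)
    person = models.ForeignKey('Person', related_name='schedules',
                               on_delete=models.CASCADE)

    class Meta:
        indexes = [
            # Index only the "active" rows, i.e. those with field >= 1.
            models.Index(fields=['field'], name='schedule_active_field_idx',
                         condition=Q(field__gte=1)),
        ]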
These are simplified models to demonstrate my problem:
class User(models.Model):
    username = models.CharField(max_length=30)
    total_readers = models.IntegerField(default=0)

class Book(models.Model):
    author = models.ForeignKey(User)
    title = models.CharField(max_length=100)

class Reader(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
So, we have Users, Books and Readers (Users, who have read a Book). Thus, Reader is basically a many-to-many relationship between Book and User.
Now let's say, the current user reads a book. Now, I'd like to update the number of total readers for all books of this book's author:
# get the book (as an example pk=1)
book = Book.objects.get(pk=1)
# save Reader object for this user and this book
Reader(user=request.user, book=book).save()
# count and save the total number of readers for this author in all his books
book.author.total_readers = Reader.objects.filter(book__author=book.author).count()
book.author.save()
By doing so, Django creates a LEFT OUTER JOIN query for PostgreSQL and we get the expected result. However, the database tables are huge and this has become a bottleneck.
In this example, we could simply increase the total_readers by one on each view, instead of actually counting the database rows. However, this is just a simplified model structure and we cannot do this in reality here.
What I could do is create another field in the Reader model called book_author_id. That way I denormalize the data and can count the Reader objects without PostgreSQL having to do the LEFT OUTER JOIN with the User table.
Finally, here's my question: Is it possible to create some sort of database index, so that PostgreSQL handles this denormalization automatically? Or do I really have to create this additional model field and redundantly store the author's PK in there?
EDIT - to point out the essential question: I got several great answers, which work for a lot of scenarios. However, they don't solve this actual problem. The only thing I'd like to know, is if it's possible to have PostgreSQL handle such a denormalization automatically - e.g. by creating some sort of database index.
Sometimes a query like this can perform better:
book.author.total_readers = Reader.objects.filter(book__in=Book.objects.filter(author=book.author)).count()
That will generate a query with a sub-query; sometimes it will have better performance than the query with the join. You can even go further and split it into 2 separate queries:
book.author.total_readers = Reader.objects.filter(book_id__in=Book.objects.filter(author=book.author).values_list('id', flat=True)).count()
That will generate 2 queries: one will retrieve the list of all book IDs for that author, and the second will retrieve the count of reads for books with an ID in that list.
A good solution may also be to create a batch task that runs, for example, once per hour and counts up all the reads, but that way you end up with a reader count that is not refreshed live.
You can also create a Celery task that runs right after a read is created to generate the new value for the author. That way you won't have a long response time, and the delay between creating a read and counting it won't be so long.
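A rough sketch of such a task (the task name and the .delay() call are made up for illustration):

from celery import shared_task

@shared_task
def update_total_readers(author_id):
    # Recompute the denormalised counter from the Reader table.
    total = Reader.objects.filter(book__author_id=author_id).count()
    User.objects.filter(pk=author_id).update(total_readers=total)

It would be queued right after the Reader row is saved, e.g. update_total_readers.delay(book.author_id), so the request itself stays fast.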
It's always way better to solve bottlenecks of this sort with good design and maybe a little bit of caching, rather than duplicating data in the way you suggest. The total_readers field is data you should generate rather than record.
class User(models.Model):
    username = models.CharField(max_length=30)

    @property
    def total_readers(self):
        cached_value = caching_client.get("readers_" + self.username, None)
        if cached_value is None:
            cached_value = self.readers()
            caching_client.set("readers_" + self.username, cached_value)
        return cached_value

    def readers(self):
        return Reader.objects.filter(book__author=self).count()
There are libraries that do the caching via decorators, but I felt this was a pattern you would benefit from seeing spelled out. You can also attach a TTL to the cache so that you ensure the value can't be wrong for longer than the TTL. You can also regenerate the cache upon creation of a Reader object.
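Regenerating the cache on creation could, for example, be wired up with a post_save signal (a sketch reusing the hypothetical caching_client from above):

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Reader)
def refresh_reader_count(sender, instance, created, **kwargs):
    if created:
        author = instance.book.author
        # Overwrite the cached value so the next total_readers lookup is fresh.
        caching_client.set("readers_" + author.username, author.readers())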
You might actually get some mileage out of declaring an m2m with a through relationship, but I have no experience of it.
I have made a previous post related to this problem here, but because this is a related yet new problem, I thought it would be best to make another post for it.
I'm using Django 1.8
I have a User model and a UserAction model. A User has a type. A UserAction has a time, which indicates how long the action took, as well as a start_time, which indicates when the action began. They look like this:
class User(models.Model):
    user_type = models.IntegerField()

class UserAction(models.Model):
    user = models.ForeignKey(User)
    time = models.IntegerField()
    start_time = models.DateTimeField()
Now what I want to do is get all users of a given type and the sum of time of their actions, optionally filtered by the start_time.
What I am doing is something like this:
from datetime import datetime, timedelta

from django.db.models import Sum
from django.db.models.functions import Coalesce

# stubbing in a start time to filter by
start_time = datetime.now() - timedelta(days=2)
# stubbing in a type
type = 2
# this gives me the users and the sum of the time of their actions, or 0 if no
# actions exist
q = User.objects.filter(user_type=type).values('id').annotate(
    total_time=Coalesce(Sum('useraction__time'), 0))
# now I try to add the filter for start_time of the actions to be greater than
# or equal to start_time
q = q.filter(useraction__start_time__gte=start_time)
Now what this does, of course, is an INNER JOIN on UserAction, thus removing all the users without actions. What I really want is the equivalent of my LEFT JOIN with a WHERE clause, but for the life of me I can't find how to do that. I've looked at the docs and at the source but am not finding an answer. I'm (pretty) sure this is something that can be done, I'm just not seeing how. Could anyone point me in the right direction? Any help would be very much appreciated. Thanks much!
I'm having the same kind of problem as you. I haven't found any proper way of solving the problem yet, but I've found a few fixes.
One way would be to loop through all the users:
q = User.objects.filter(user_type=type)
for u in q:
    u.time_sum = UserAction.objects.filter(
        user=u, start_time__gte=start_time,
    ).aggregate(time_sum=Sum('time'))['time_sum']
This method, however, does one database query per user. It might do the trick if you don't have many users, but it can get very time-consuming with a large database.
Another way of solving the problem would be using the extra method of the QuerySet API. This is a method that is detailed in this blog post by Timmy O'Mahony.
valid_actions = UserAction.objects.filter(start_time__gte=start_time)
# note: this assumes valid_actions is non-empty, otherwise the IN () clause is invalid SQL
action_ids = ",".join(str(uAction.id) for uAction in valid_actions)
q = User.objects.filter(user_type=type).extra(select={
    "time_sum": """
        SELECT SUM(time)
        FROM userAction
        WHERE userAction.user_id = user.id
        AND userAction.id IN (%s)
    """ % action_ids
})
This method, however, relies on hard-coding the SQL table names, which is very un-Django - if you change the db_table of one of your models or the db_column of one of its columns, this code will no longer work. It only requires 2 queries, though: the first one retrieves the list of valid UserActions and the other sums them up per matching user.
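As a side note not from the original answer: on Django 1.8 the same "LEFT JOIN plus condition" can often be expressed with conditional aggregation, which keeps users without matching actions in the result and avoids both the raw SQL and the extra query. A sketch reusing the names from the question:

from django.db.models import Case, F, IntegerField, Sum, When
from django.db.models.functions import Coalesce

q = (User.objects
     .filter(user_type=type)
     .values('id')
     .annotate(total_time=Coalesce(
         Sum(Case(
             When(useraction__start_time__gte=start_time,
                  then=F('useraction__time')),
             default=0,
             output_field=IntegerField(),
         )), 0)))

Users with no actions at all still appear in the result, with total_time equal to 0.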
I have two models
class Subject(models.Model):
    name = models.CharField(max_length=100, choices=COURSE_CHOICES)
    created = models.DateTimeField('created', auto_now_add=True)
    modified = models.DateTimeField('modified', auto_now=True)
    syllabus = models.FileField(upload_to='syllabus')

    def __unicode__(self):
        return self.name
and
class Pastquestion(models.Model):
    subject = models.ForeignKey(Subject)
    year = models.PositiveIntegerField()
    questions = models.FileField(upload_to='pastquestions')

    def __unicode__(self):
        return str(self.year)
Each Subject can have one or more past questions, but a past question can have only one subject. I want to get a subject and its related past questions for a particular year. I was thinking of fetching the subject and getting its related past questions from there.
Currently I am implementing my code such that I rather get the past question whose subject and year correspond to the specified subject, like this:
this_subject = Subject.objects.get(name=the_subject)
thepastQ = Pastquestion.objects.get(year=2000, subject=this_subject)
I was thinking there is a better way to do this. Or is this already a better way? Please do tell.
I think what you want is the related_name property of the ForeignKey field. This creates a link back to the Subject object and provides a manager you can use to query the set.
So to use this functionality, change the foreignkey line to:
subject=models.ForeignKey(Subject, related_name='questions')
Then with an instance of Subject we'll call subj, you can:
subj.questions.filter(year=2000)
I don't think this performs much differently from the technique you have used. Roughly speaking, SQL performance boils down to a) whether there's an index and b) how many queries you're issuing, so you need to think about both. One way to find out what SQL your model usage is generating is to use SqlLogMiddleware - or, alternatively, play with the options in How to show the SQL Django is running. It can be tempting, once you get going, to start issuing queries across relationships - e.g. q = Pastquestion.objects.get(year=2000, subject__name=SUBJ_MATHS) - but unless you keep a close eye on these types of queries, you can and will kill your app's performance, badly.
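For instance, a quick way to inspect the generated SQL during development, without any extra middleware (this only records queries when DEBUG=True, and the subject name is just an example value):

from django.db import connection, reset_queries

reset_queries()
list(Pastquestion.objects.filter(year=2000, subject__name='Mathematics'))
for query in connection.queries:
    print(query['sql'])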
Django's query syntax allows you to 'reach into' related objects.
past_questions = Pastquestion.objects.filter(year=2000, subject__name=subject_name)
There are a lot of topics on Django concurrency, but after checking many of them, I don't feel I have found my answer when it comes to transactions.
Django version 1.3.1. PostgreSQL version 8.4.7.
A very simple version of my models could look like this:
class Member(models.Model):
    money = models.PositiveIntegerField(default=0)
    user = models.OneToOneField(User, related_name='member', primary_key=True)

class Bet(models.Model):
    total_money = models.PositiveIntegerField(default=0)
I also have a table Money, which is a relation between Member and Bet. It's not directly linked to my problem, but it helps me monitor it, because it can't be affected by any concurrency issues: I just have to count the rows in Money to test whether the money field of Member and the total_money field of Bet are correct.
I can't rely only on the table Money, though, and I need my fields to be correct, because I filter a lot on them.
My first try at the bid function was something like this (just with a lot more modifications to a lot more tables):
def bid(user_pk, bet_pk, value):
    # create Money object
    member = User.objects.get(pk=user_pk).member
    member.money = F('money') - value
    member.save()
    bet = Bet.objects.get(pk=bet_pk)
    bet.total_money = F('total_money') + value
    bet.save()
This version was working just fine until I got my first crash in the middle of a transaction.
I also had to copy-paste all the tests from my clean() functions into bid(), because I'm not really able to use clean() or full_clean() in this case (especially if bet raises after member has been saved).
So I decided to give Django transactions a try.
@transaction.commit_manually
def bid(user_pk, bet_pk, value):
    try:
        # create Money object
        member = User.objects.get(pk=user_pk).member
        member.money -= value
        member.clean()
        member.save()
        bet = Bet.objects.get(pk=bet_pk)
        bet.total_money += value
        bet.clean()
        bet.save()
    except:
        transaction.rollback()
        raise
    else:
        transaction.commit()
But without the possibility of using F() objects inside a manual transaction (which makes sense), I ended up with a lot of concurrency issues.
I see only two solutions:
Only create Money objects during the bid()/transaction, then have an asynchronous worker (Celery?) update the related fields in Member and Bet.
Create a queue of bid()/transactions (Redis?), and make all transactions that modify money-related fields synchronous.
Am I missing an obvious and easier solution?
If not, what solution would you recommend, and with which technology?
Would this work?
@transaction.commit_on_success
def bid(user_pk, bet_pk, value):
    Member.objects.filter(user__pk=user_pk).update(money=F('money') - value)
    Bet.objects.filter(pk=bet_pk).update(total_money=F('total_money') + value)
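One possible refinement, not part of the original answer: make the debit conditional so a member's balance can never go negative, and use the number of updated rows to decide whether the bet side should run at all (a sketch using the same decorator):

from django.db import transaction
from django.db.models import F

@transaction.commit_on_success
def bid(user_pk, bet_pk, value):
    # The WHERE clause refuses the debit if the member cannot afford it.
    debited = Member.objects.filter(
        user__pk=user_pk, money__gte=value,
    ).update(money=F('money') - value)
    if not debited:
        raise ValueError("insufficient funds")
    Bet.objects.filter(pk=bet_pk).update(total_money=F('total_money') + value)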