Is get_or_create() thread safe

Is get_or_create() thread safe - django

I have a Django model that can only be accessed using get_or_create(session=session), where session is a foreign key to another Django model.
Since I am only accessing through get_or_create(), I would imagine that I would only ever have one instance with a key to the session. However, I have found multiple instances with keys to the same session. What is happening? Is this a race condition, or does get_or_create() operate atomically?

NO, get_or_create is not atomic.
It first asks the DB if a satisfying row exists; database returns, python checks results; if it doesn't exist, it creates it. In between the get and the create anything can happen - and a row corresponding to the get criteria be created by some other code.
For instance wrt to your specific issue if two pages are open by the user (or several ajax requests are performed) at the same time this might cause all get to fail, and for all of them to create a new row - with the same session.
It is thus important to only use get_or_create when the duplication issue will be caught by the database through some unique/unique_together, so that even though multiple threads can get to the point of save(), only one will succeed, and the others will raise an IntegrityError that you can catch and deal with.
If you use get_or_create with (a set of) fields that are not unique in the database you will create duplicates in your database, which is rarely what you want.
More in general: do not rely on your application to enforce uniqueness and avoid duplicates in your database! THat's the database job!
(well unless you wrap your critical functions with some OS-valid locks, but I would still suggest to use the database).
With thes warnings, used correctly get_or_create is an easy to read, easy to write construct that perfectly complements the database integrity checks.
Refs and citations:
http://groups.google.com/group/django-developers/browse_thread/thread/905f79e350525c95/0af3a41de4f4ce06
http://groups.google.com/group/django-developers/browse_thread/thread/f0b3381b2620e7db/8eae2f6087e550bb

Actualy it's not thread-safe, you can look at the code of the get_or_create method of the QuerySet object, basicaly what it does is the following :
try:
return self.get(**lookup), False
except self.model.DoesNotExist:
params = dict([(k, v) for k, v in kwargs.items() if '__' not in k])
params.update(defaults)
obj = self.model(**params)
sid = transaction.savepoint(using=self.db)
obj.save(force_insert=True, using=self.db)
transaction.savepoint_commit(sid, using=self.db)
return obj, True
So two threads might figure-out that the instance does not exists in the DB and start creating a new one, before saving them consecutively.

Threading is one problem, but get_or_create is broken for any serious usage in default isolation level of MySQL:
How do I deal with this race condition in django?
Why doesn't this loop display an updated object count every five seconds?
https://code.djangoproject.com/ticket/13906
http://www.no-ack.org/2010/07/mysql-transactions-and-django.html

I was having this problem with a view that calls get_or_create.
I was using Gunicorn with multiple workers, so to test it I changed the number of workers to 1 and this made the problem disappeared.
The simplest solution I found was to lock the table for access. I used this decorator to do the lock per view (for PostgreSQL):
http://www.caktusgroup.com/blog/2009/05/26/explicit-table-locking-with-postgresql-and-django/
EDIT: I wrapped the lock statement in that decorator in a try/except to deal with DB engines with no support for it (SQLite while unit testing in my case):
try:
cursor.execute('LOCK TABLE %s IN %s MODE' % (model._meta.db_table, lock))
except DatabaseError:
pass

Related

Does Model.update method in django locks the table before saving the instances?

I have a scenario in which I need to copy the values of one column into an another column. I am trying to do
Model.objects.select_related('vcsdata').all().update(charging_status_v2=F('charging_status'))
Does using F expression along with the update to copy the values would create any downtime? does it locks the table while performing the operation?
related_question_for_more_context

Short Answer:
No, it doesn't.
The only thing Django does in update process (whether you use F expression or not) is keeping the previous state of your record(s) in case if something goes wrong it can rollback to the previous state.
def update(self, **kwargs):
"""
Update all elements in the current QuerySet, setting all the given
fields to the appropriate values.
"""
self._not_support_combined_queries('update')
assert not self.query.is_sliced, \
"Cannot update a query once a slice has been taken."
self._for_write = True
query = self.query.chain(sql.UpdateQuery)
query.add_update_values(kwargs)
# Clear any annotations so that they won't be present in subqueries.
query.annotations = {}
with transaction.mark_for_rollback_on_error(using=self.db):
rows = query.get_compiler(self.db).execute_sql(CURSOR)
self._result_cache = None
return rows
Basically in the line with transaction.mark_for_rollback_on_error(using=self.db), it keeps the previous state of your record, but it does not lock your table or any kind of partial locks.
For example if you have two simultaneous updates at the same time, (suppose one of them is going to take much longer than the other and also slower one hits your database before faster one) then the faster one is going to hit your database regardless of the slower one and does the operation. Then slower one is going to do some other operation on your table (this example is enough for proving that update does not lock your table).
Also note that calling update for updating multiple objects (if this is a doable thing) is the most efficient way for updating multiple objects as far as I know (comparing to calling save on each instance or bulk update).

Synchronized Model instances in Django

I'm building a model for a Django project (my first Django project) and noticed
that instances of a Django model are not synchronized.
a_note = Notes.objects.create(message="Hello") # pk=1
same_note = Notes.objects.get(pk=1)
same_note.message = "Good day"
same_note.save()
a_note.message # Still is "Hello"
a_note is same_note # False
Is there a built-in way to make model instances with the same primary key to be
the same object? If yes, (how) does this maintain a globally consistent state of all
model objects, even in the case of bulk updates or changing foreign keys
and thus making items enter/exit related sets?
I can imagine some sort of registry in the model class, which could at least handle simple cases (i.e. it would fail in cases of bulk updates or a change in foreign keys). However, the static registry makes testing more difficult.
I intend to build the (domain) model with high-level functions to do complex
operations which go beyond the simple CRUD
actions of Django's Model class. (Some classes of my model have an instance
of a Django Model subclass, as opposed to being an instance of subclass. This
is by design to prevent direct access to the database which might break consistencies and to separate the business logic from the purely data access related Django Model.) A complex operation might touch and modify several components. As a developer
using the model API, it's impossible to know which components are out of date after
calling a complex operation. Automatically synchronized instances would mitigate this issue. Are there other ways to overcome this?

TL;DR "Is there a built-in way to make model instances with the same primary key to be the same object?" No.
A python object in memory isn't the same thing as a row in your database. So when you create a_note and then fetch same_note from the db, those are two different objects in memory, even though they are the same representation of the underlying row in your database. When you fetch same_note, in fact, you instantiate a new Notes object and initialise it with the values fetched from the database.
Then you change and save same_note, but the a_note object in memory isn't changed. If you did a_note.refresh_from_db() you would see that a_note.message was changed.
Now a_note is same_note will always be False because the location in memory of these two objects will always be different. Two variables are the same (is is True) if they point to the same object in memory.
But a_note == same_note will return True at any time, since Django defines two model instances to be equal if their pk is the same.
Note that if the complexity you're talking about is that in the case of multiple requests one request might change underlying values that are being used by another request, then use F to avoid race conditions.
Within one request, since everything is sequential and single threaded, there's not risk of variables going out of sync: You know the order in which things are done and therefore can always call refresh_from_db() when you know a previous method call might have changed the database value.
Note also: Having two variables holding the same row means you'll have performed two queries to your db, which is the one thing you want to avoid at all cost. So you should think why you have this situation in the first place.

How to properly increment a counter in my postgreSQL database?

Let's say i want to implement a "Like/Unlike" system in my app. I need to count each like for sorting purposes later. Can i simply insert the current value + 1 ? I think it's too simple.
What if two user click on the same time ? How to prevent my counter to be disturbed ?
I read i need to implement transactions by a simple decorator #transaction.atomic but i am wonder if this can handle my concern.
Transactions are designed to execute a "bloc" of operations triggered by one user, whereas in my case i need be able to handle multiple request at the same time and safely update the counter.
Any advise ?

You can use F() expression, eg.
content.likes_count = F('likes_count') + 1
content.save()
So the operation will be excuted in database not in python.
From the django documentation.
Another useful benefit of F() is that having the database - rather
than Python - update a field’s value avoids a race condition.
If two Python threads execute the code in the first example above, one
thread could retrieve, increment, and save a field’s value after the
other has retrieved it from the database. The value that the second
thread saves will be based on the original value; the work of the
first thread will simply be lost.
If the database is responsible for updating the field, the process is
more robust: it will only ever update the field based on the value of
the field in the database when the save() or update() is executed,
rather than based on its value when the instance was retrieved.

Celery/django duplicate key violations

I have a single celery worker with 5 threads. It's scraping websites and saving domains to DB via django's ORM.
Here is roughly how it looks like:
domain_all = list(Domain.objects.all())
needs_domain = set()
for x in dup_free_scrape:
domain = x['domain']
if any(domain.lower() == s.name.lower() for s in domain_all):
x['domainn'] = [o for o in domain_all if domain.lower() == o.name.lower()][0]
else:
print('adding: {}'.format(domain))
needs_domain.add(domain)
create_domains = [Domain(name=b.lower()) for b in needs_domain]
create_domains_ids = Domain.objects.bulk_create(create_domains)
Probably not the best way, but it checks domains in one dict(dup_free_scrape) against all domains already in database.
It can go over hundreds or even thousands before encountering the error, but sometimes it does:
Task keywords.domains.model_work[285c3e74-8e47-4925-9ab6-a99540a24665]
raised unexpected: IntegrityError('duplicate key value violates unique
constraint "keywords_domain_name_key"\nDETAIL: Key
(name)=(domain.com) already exists.\n',)
django.db.utils.IntegrityError: duplicate key value violates unique
constraint "keywords_domain_name_key"
The only reason for this issue I can think of would be: One thread saved domain to DB while another was in the middle of code above?
I can't find any good solutions, but here is and idea(not sure if any good): Wrap whole thing in transaction and if databaise raises error simplty retry(query database for "Domain.objects.all()" again).

If you are creating these records in bulk and multiple threads are at it, it's indeed very likely that IntegrityErrors are caused by different threads inserting the same data. Do you really need multiple threads working on this? If yes you could try:
create_domains = []
create_domain_ids = []
for x in dup_free_scrape:
domain = x['domain']
new_domain, created = Domain.objects.get_or_create(name = domain.lower()
if created:
create_domains.append(domain.lower())
created_domain_ids.append(new_domain.pk)
Note that this is all the code. The select all query which you had right at the start is not needed. Domain.objects.all() is going to be very inefficient because you are reading the entire table there.
Also note that your list comprehension for x['domain'] appeared to be completely redundant.
create_domains and create_domain_ids lists may not be needed unless you want to keep track of what was being created.
Please make sure that you have the proper index on domain name. From get_or_create docs:
This method is atomic assuming correct usage, correct database
configuration, and correct behavior of the underlying database.
However, if uniqueness is not enforced at the database level for the
kwargs used in a get_or_create call (see unique or unique_together),
this method is prone to a race-condition which can result in multiple
rows with the same parameters being inserted simultaneously.

Do I need to commit transactions in Django 1.6?

I want to create and object, save it to DB, then check if there is another row on the DB with the same token with execution_time=0. If there is, I want to delete the object created then restart the process.
transfer = Transfer(token = generateToken(size=9))
transfer.save()
while (len(Transfer.objects.filter(token=transfer.token, execution_time=0))!=1):
transfer.delete()
transfer = Transfer(token = generateToken(size=9))
transfer.save()
Do I need to commit the transaction between every loop? For example calling commit() at the end of every loop?
while (len(Transfer.objects.filter(token=transfer.token, execution_time=0))!=1):
transfer.delete()
transfer = Transfer(token = generateToken(size=9))
transfer.save()
commit()
#transaction.commit_manually
def commit():
transaction.commit()

From what you've described I don't think you need to use transactions. You're basically recreating a transaction rollback manually with your code.
I think the best way to handle this would be to have a database constraint enforce the issue. Is it the case that token and execution_time should be unique together? In that case you can define the constraint in Django with unique_together. If the constraint is that token should be unique whenever execution_time is 0, some databases will let you define a constraint like that as well.
If the constraint were in the database you could just do a get_or_create() in a loop until the Transfer was created.
If you can't define the constraint in the database for whatever reason then I think your version would work. (One improvement would be to use .count() instead of len.)

I want to create and object, save it to DB, then check if there is
another row on the DB with the same token with execution_time=0. If
there is, I want to delete the object created then restart the
process.
There are few ways you can approach this, depending on what your end goal is:
Do you want that no other record is written while you are writing yours (to prevent duplicates?) If so, you need to get a lock on your table, and to do that, you need to execute an atomic transaction, with #transaction.atomic (new in 1.6)
If you want to make sure that no duplicate records are created given a combination of fields, you need to enforce this at the database level with unique_together
I believe combining the above two will solve your problem; however, if you want a more brute force approach; you can override the save() method for your object, and then raise an appropriate exception when a record is trying to be created (or updated) that violates your constraints.
In your view, you would then catch this exception and then take the appropriate action.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js