Multi-DB Transactions - django

Django Version 1.10.5 with Postgres 9.6.1
For the last year I've been working with multiple schemas in a single default database. However, things have grown to the point where I've decided to split that single database into 3 databases.
I've got things working with a master/slave router for all 3 databases.
I am not using the 'default' database key. Instead I have 'db1', 'db2', and 'db3'.
The part I am confused about is with transactions in this multi-database environment.
In this example it fails as expected, caused of course by not using @transaction.atomic(using='db1'), which is clear to me.
@transaction.atomic()
def edit(self, context):
    """Edit
    :param dict context: Context
    :return: None
    """
    # Check if employee exists
    try:
        result = Passport.objects.get(pk=self.user.employee_id)
    except Passport.DoesNotExist:
        return False
    result.name = context.get('name')
    result.save()
However I have this strange example, simply because I'm trying to understand... I would have expected this to fail but it does not:
@transaction.atomic(using='db1')
def edit(self, context):
    """Edit
    :param dict context: Context
    :return: None
    """
    # Check if employee exists
    try:
        result = Passport.objects.get(pk=self.user.employee_id)
    except Passport.DoesNotExist:
        return False
    result.name = context.get('name')
    with transaction.atomic(using='db2'):
        result.save()
The model Passport does not exist in DB2 models at all.
My router is set up so that all writes go to their respective DBs.
So what is the purpose of setting using='db1' on the atomic transaction? I've looked at the source and I see it defaults to 'default' when using is not specified.
In the above example I even opened another transaction inside the initial one, this time with using='db2', where the model doesn't even exist. I figured that would have failed, but it didn't, and the data was written to the proper database.
I bring this up because there will be situations where I need to interact with all 3 databases, and if a single problem occurs when writing to any of them, all 3 need to be rolled back; on success of everything, all 3 should of course be committed.
Perhaps someone can help break this down for me so I can understand?

You're interpreting transaction.atomic(using='X') to mean: run the following database commands on X, inside a transaction.
In fact, it just means: open a transaction on database X, and then either commit it or roll it back at the end of the block.
Or, as the documentation puts it:
Under the hood, Django’s transaction management code:
opens a transaction when entering the outermost atomic block;
commits or rolls back the transaction when exiting the outermost block.
The question of which database to use for a given command is determined by your router, not the using clause. So your transaction.atomic(using='db2') block is pointless (it will simply open a transaction on db2 and then close it), but not an error.
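To address the all-or-nothing concern across the three databases, a commonly used pattern is to nest one atomic block per database alias, so that an exception raised anywhere inside rolls back every still-open transaction. This is only a sketch under the question's setup (aliases db1, db2, db3; the per-database write helpers are hypothetical), and it is not a true two-phase commit: a crash during the final commits can still leave the databases inconsistent.
from django.db import transaction

def edit_everywhere(context):
    # One transaction per database; an exception raised in the body
    # rolls all three back, because each block is still open.
    with transaction.atomic(using='db1'):
        with transaction.atomic(using='db2'):
            with transaction.atomic(using='db3'):
                write_to_db1(context)  # hypothetical per-database helpers
                write_to_db2(context)
                write_to_db3(context)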

Related

How to make obj.save() without reversing object values in the db in django

I have a recursive function and obj.save() is inside it.
How can I prevent a query to the db on every iteration?
Does django transaction.atomic do that?
If you're using django >= 2.2 (which you should be, since ALL other versions of django are 100% out of support as of this writing, Jan 5, 2020), you can do this:
objs = []
for obj in Entry.objects.filter(...):
    if not obj.condition:
        continue
    obj.headline = 'something!!!'
    obj.author = 'John Smith'
    objs.append(obj)

with transaction.atomic():
    Entry.objects.bulk_update(objs, ['headline', 'author'])
Couple of things to note:
all the work is done outside of the transaction.atomic block
transaction.atomic means that if anything fails inside that block, it will roll back the WHOLE work (transaction) and not keep a piece of it around. Example: you have 2 authors to save; the first one saves successfully, the second one does not. Because this is inside the atomic block, neither of them is committed. It has nothing to do with doing it all in one query.
More information can be found here: https://docs.djangoproject.com/en/3.1/ref/models/querysets/#bulk-update
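As a minimal sketch of that rollback behavior (the entries and the failing constraint are assumptions for illustration):
from django.db import IntegrityError, transaction

def save_both(first_entry, second_entry):
    # If the second save raises, the first save is rolled back as well,
    # because both run inside the same atomic block.
    try:
        with transaction.atomic():
            first_entry.save()
            second_entry.save()  # suppose this violates a constraint
    except IntegrityError:
        pass  # neither entry was committed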

Django when looping over a queryset, when does the db read happen?

I am looping through my database and updating all my Company objects.
for company in Company.objects.filter(updated=False):
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True
    company.save()
My problem is that it's taking too long, so I wanted to run another instance of this same code, but I'm curious when the actual db reads happen. If company.visited gets changed to True while this loop is running, will it still be visited by this loop? What if I added a second check for visited? I don't want to start a second loop if the first instance isn't going to recognize the work of the second instance:
for company in Company.objects.filter(updated=False):
    if company.visited:
        continue
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True
    company.save()
Company.objects.filter(updated=False) translates to an ordinary SQL query:
SELECT * FROM appName_company WHERE updated is false
This SQL query is executed when you start iterating through Company objects. It's executed only once. The second server will not recognize the work of the first one, because they both will go through the same Company objects.
Lock rows to avoid race conditions using atomic transactions and select_for_update():
from django.db import transaction

for company in Company.objects.filter(updated=False):
    with transaction.atomic():
        # Re-fetch the row with a lock so concurrent workers wait here
        # and we check the latest value of `visited`.
        company = Company.objects.select_for_update().get(id=company.id)
        if company.visited:
            continue
        driver.get(company.company_url)
        company.adress = driver.find_element_by_id("address").text
        company.visited = True
        company.save()
You can run this code on multiple servers. Each Company will be processed just once.
If you need to execute this code regularly, I highly recommend using Celery. Dispatch a task for each company, and let multiple workers do the work in parallel:
from celery import shared_task
from django.db import transaction

@shared_task
def dispatch_tasks():
    for company in Company.objects.filter(updated=False):
        process_company.delay(company.id)

@shared_task
@transaction.atomic
def process_company(company_id):
    company = Company.objects.select_for_update().get(id=company_id)
    if company.visited:
        return
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True
    company.save()
Edit: oh, I see that you've tagged the question with the sqlite tag. I recommend switching to PostgreSQL, as SQLite is really bad at concurrency. My answer should work with SQlite, but locks may slow down the database.

Commit manually in Django data migration

I'd like to write a data migration where I modify all rows in a big table in smaller batches in order to avoid locking issues. However, I can't figure out how to commit manually in a Django migration. Every time I try to run commit I get:
TransactionManagementError: This is forbidden when an 'atomic' block is active.
AFAICT, the database schema editor always wraps Postgres migrations in an atomic block.
Is there a sane way to break out of the transaction from within the migration?
My migration looks like this:
from django.db import migrations, transaction

def modify_data(apps, schema_editor):
    counter = 0
    BigData = apps.get_model("app", "BigData")
    for row in BigData.objects.iterator():
        # Modify row [...]
        row.save()
        # Commit every 1000 rows
        counter += 1
        if counter % 1000 == 0:
            transaction.commit()
    transaction.commit()

class Migration(migrations.Migration):
    operations = [
        migrations.RunPython(modify_data),
    ]
I'm using Django 1.7 and Postgres 9.3. This used to work with South and older versions of Django.
The best workaround I found is manually exiting the atomic scope before running the data migration:
def modify_data(apps, schema_editor):
    schema_editor.atomic.__exit__(None, None, None)
    # [...]
In contrast to resetting connection.in_atomic_block manually, this allows using the atomic context manager inside the migration. There doesn't seem to be a much saner way.
One can contain the (admittedly messy) transaction break out logic in a decorator to be used with the RunPython operation:
from functools import wraps

def non_atomic_migration(func):
    """
    Close a transaction from within code that is marked atomic. This is
    required to break out of a transaction scope that is automatically wrapped
    around each migration by the schema editor. This should only be used when
    committing manually inside a data migration. Note that it doesn't re-enter
    the atomic block afterwards.
    """
    @wraps(func)
    def wrapper(apps, schema_editor):
        if schema_editor.connection.in_atomic_block:
            schema_editor.atomic.__exit__(None, None, None)
        return func(apps, schema_editor)
    return wrapper
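A minimal usage sketch under the assumptions above (the app and model names are the placeholders from the question): decorate the RunPython function, and since the migration's wrapping transaction has been exited, each inner atomic block commits on its own.
from django.db import migrations, transaction

@non_atomic_migration
def modify_data(apps, schema_editor):
    BigData = apps.get_model("app", "BigData")
    for row in BigData.objects.iterator():
        with transaction.atomic():
            # Modify row [...]
            row.save()  # committed per row, outside the migration's transaction

class Migration(migrations.Migration):
    operations = [
        migrations.RunPython(modify_data),
    ]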
Update
Django 1.10 will support non-atomic migrations.
From the documentation about RunPython:
By default, RunPython will run its contents inside a transaction on databases that do not support DDL transactions (for example, MySQL and Oracle). This should be safe, but may cause a crash if you attempt to use the schema_editor provided on these backends; in this case, pass atomic=False to the RunPython operation.
So, instead of what you've got:
class Migration(migrations.Migration):
    operations = [
        migrations.RunPython(modify_data, atomic=False),
    ]
For others coming across this: you can have both schema changes and data operations (RunPython) in the same migration. Just make sure all the ALTER TABLE operations go first; you cannot run the RunPython before any ALTER TABLE.
First you need to set Migration.atomic = False
class Migration(migrations.Migration):
    atomic = False
Then in your function you can wrap certain blocks of code inside transaction.atomic() to make only those blocks atomic:
from django.db import transaction

for row in rows:
    with transaction.atomic():
        do_something(row)
    # Changes made by `do_something` will be committed by this point
Here's the relevant documentation: https://docs.djangoproject.com/en/4.1/howto/writing-migrations/#non-atomic-migrations
Gotcha: migrations.RunPython(forwards_func, atomic=False) does NOT do what you want. It prevents Django from putting your migration code inside a transaction, which it doesn't do for PostgreSQL anyway. This atomic=False option is meant for DBs that don't support DDL transactions, as stated in their documentation: https://docs.djangoproject.com/en/4.1/ref/migration-operations/#runpython
By default, RunPython will run its contents inside a transaction on databases that do not support DDL transactions (for example, MySQL and Oracle). This should be safe, but may cause a crash if you attempt to use the schema_editor provided on these backends; in this case, pass atomic=False to the RunPython operation.
On databases that do support DDL transactions (SQLite and PostgreSQL), RunPython operations do not have any transactions automatically added besides the transactions created for each migration.

Django-celery task and django transaction

I have a question regarding transactions and celery tasks. So it's no mystery to me that of course if you have a transaction and a celery task accessing the same table/records we'll have a race condition.
However, consider the following piece of code:
def f(self):
    # function of module that inherits from models.Model
    self.field_a = datetime.now()
    self.save()
    transaction.commit_unless_managed()
    # depending on the configuration of this module
    # this might return None or a datetime object.
    eta = self.get_task_eta()
    if eta:
        celery_task_do_something.apply_async(args=(self.pk, self.__class__),
                                             eta=eta)
    else:
        celery_task_do_something.delay(self.pk, self.__class__)
Here's the celery task:
def celery_task_do_something(pk, cls):
    o = cls.objects.get(pk=pk)
    if o.field_a:
        # perform something
        return True
    return False
As you can see, before creating the task we call transaction.commit_unless_managed and it should commit, since django transaction is not currently managed.
However, when running celery task the field field_a is not set.
My question:
Since we do commit before creating the task, is it still possible that there's a race condition?
Additional info
We're using Postgres version 9.1
Every transaction is run with READ COMMITTED isolation level
On a different db with engine dowant.lib.db.backends.postgresql_psycopg2_debugger field_a is already set and the task works as expected. With engine dowant.lib.db.backends.postgresql_psycopg2_hstore_ready the described issue appears (not sure if it's related with the engine).
Celery version is 2.2
I tried different databases. Still the same behavior, except when the engines change. So that's why I mentioned this.
Thanks a lot.
Try to add self.__class__.objects.select_for_update().get(pk=self.pk) before save and see what happens.
It should block other writes and locking reads of this row until the commit is done.
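A sketch of what that suggestion might look like inside the question's f(), written with the modern transaction.atomic API for clarity (the question's code predates it); the lock is only held while the transaction is open:
from datetime import datetime
from django.db import transaction

def f(self):
    with transaction.atomic():
        # Lock the row; concurrent writers and locking readers wait
        # until this transaction commits.
        locked = self.__class__.objects.select_for_update().get(pk=self.pk)
        locked.field_a = datetime.now()
        locked.save()
    # Enqueue the task only after the commit, so the worker sees field_a.
    celery_task_do_something.delay(self.pk, self.__class__)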
This is late, but since Django 1.9 there is transaction.on_commit:
transaction.on_commit(lambda: enqueue_atask())
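A sketch of how that applies to the question's code (names follow the question; this assumes the save happens inside an atomic block):
from datetime import datetime
from django.db import transaction

def f(self):
    with transaction.atomic():
        self.field_a = datetime.now()
        self.save()
        # Runs only after the surrounding transaction commits, so the
        # worker is guaranteed to see the saved field_a.
        transaction.on_commit(
            lambda: celery_task_do_something.delay(self.pk, self.__class__)
        )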

How do I deal with this race condition in django?

This code is supposed to get or create an object and update it if necessary. The code is in production use on a website.
In some cases - when the database is busy - it will throw the exception "DoesNotExist: MyObj matching query does not exist".
# Model:
class MyObj(models.Model):
    thing = models.ForeignKey(Thing)
    owner = models.ForeignKey(User)
    state = models.BooleanField()

    class Meta:
        unique_together = (('thing', 'owner'),)

# Update or create myobj
@transaction.commit_on_success
def create_or_update_myobj(owner, thing, state):
    try:
        myobj, created = MyObj.objects.get_or_create(owner=owner, thing=thing)
    except IntegrityError:
        myobj = MyObj.objects.get(owner=owner, thing=thing)
        # Will sometimes throw "DoesNotExist: MyObj matching query does not exist"
    myobj.state = state
    myobj.save()
I use an innodb mysql database on ubuntu.
How do I safely deal with this problem?
This could be an off-shoot of the same problem as here:
Why doesn't this loop display an updated object count every five seconds?
Basically get_or_create can fail - if you take a look at its source, there you'll see that it's: get, if-problem: save+some_trickery, if-still-problem: get again, if-still-problem: surrender and raise.
This means that if there are two simultaneous threads (or processes) running create_or_update_myobj, both trying to get_or_create the same object, then:
first thread tries to get it - but it doesn't yet exist,
so, the thread tries to create it, but before the object is created...
...second thread tries to get it - and this obviously fails
now, because of the default AUTOCOMMIT=OFF for the MySQLdb database connection, and the REPEATABLE READ isolation level, both threads have frozen their views of the MyObj table.
subsequently, first thread creates its object and returns it gracefully, but...
...second thread cannot create anything as it would violate unique constraint
what's funny, subsequent get on the second thread doesn't see the object created in the first thread, due to the frozen view of MyObj table
So, if you want to safely get_or_create anything, try something like this:
@transaction.commit_on_success
def my_get_or_create(...):
    try:
        obj = MyObj.objects.create(...)
    except IntegrityError:
        transaction.commit()
        obj = MyObj.objects.get(...)
    return obj
Edited on 27/05/2010
There is also a second solution to the problem - using the READ COMMITTED isolation level, instead of REPEATABLE READ. But it's less tested (at least in MySQL), so there might be more bugs/problems with it - but at least it allows tying views to transactions, without committing in the middle.
Edited on 22/01/2012
Here are some good blog posts (not mine) about MySQL and Django, related to this question:
http://www.no-ack.org/2010/07/mysql-transactions-and-django.html
http://www.no-ack.org/2011/05/broken-transaction-management-in-mysql.html
Your exception handling is masking the error. You should pass a value for state in get_or_create(), or set a default in the model and database.
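A sketch of that suggestion, passing state through get_or_create's defaults so a newly created row never lacks it (the decorator and field names follow the question's Django 1.x code):
from django.db import transaction

@transaction.commit_on_success
def create_or_update_myobj(owner, thing, state):
    myobj, created = MyObj.objects.get_or_create(
        owner=owner,
        thing=thing,
        defaults={'state': state},  # used only when the row is created
    )
    if not created:
        myobj.state = state
        myobj.save()
    return myobj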
One (dumb) way might be to catch the error and simply retry once or twice after waiting a small amount of time. I'm not a DB expert, so there might be a signaling solution.
Since 2012, Django has had select_for_update, which locks rows until the end of the transaction.
To avoid race conditions in Django + MySQL under the default circumstances (REPEATABLE READ in MySQL, READ COMMITTED in Django), you can use this:
with transaction.atomic():
    instance = YourModel.objects.select_for_update().get(id=42)
    instance.evolve()
    instance.save()
The second thread will wait for the first one (the lock), and only when the first is done will the second read the data saved by the first, so it works on updated data.
Then together with get_or_create:
def select_for_update_or_create(...):
    instance = YourModel.objects.filter(
        ...
    ).select_for_update().first()
    if instance is None:
        instance = YourModel.objects.create(...)
    return instance
The function must be called inside a transaction block; otherwise, you will get this from Django:
TransactionManagementError: select_for_update cannot be used outside of a transaction
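A minimal usage sketch under that constraint (the lookup fields and the state assignment are placeholders):
from django.db import transaction

with transaction.atomic():
    instance = select_for_update_or_create(owner=owner, thing=thing)
    instance.state = True
    instance.save()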
Also, sometimes it's good to use refresh_from_db().
In a case like:
instance = YourModel.objects.create(**kwargs)
response = do_request_which_lasts_few_seconds(instance)
instance.attr = response.something
you'd like to see:
instance = MyModel.objects.create(**kwargs)
response = do_request_which_lasts_few_seconds(instance)
instance.refresh_from_db() # 3
instance.attr = response.something
and that # 3 will greatly reduce the time window of possible race conditions, and thus the chance of one occurring.