Commit manually in Django data migration - django

I'd like to write a data migration where I modify all rows in a big table in smaller batches in order to avoid locking issues. However, I can't figure out how to commit manually in a Django migration. Every time I try to run commit I get:
TransactionManagementError: This is forbidden when an 'atomic' block is active.
AFAICT, the database schema editor always wraps Postgres migrations in an atomic block.
Is there a sane way to break out of the transaction from within the migration?
My migration looks like this:
    from django.db import migrations, transaction


    def modify_data(apps, schema_editor):
        counter = 0
        BigData = apps.get_model("app", "BigData")
        for row in BigData.objects.iterator():
            # Modify row [...]
            row.save()
            # Commit every 1000 rows
            counter += 1
            if counter % 1000 == 0:
                transaction.commit()
        transaction.commit()


    class Migration(migrations.Migration):

        operations = [
            migrations.RunPython(modify_data),
        ]
I'm using Django 1.7 and Postgres 9.3. This used to work with South and older versions of Django.

The best workaround I found is manually exiting the atomic scope before running the data migration:
    def modify_data(apps, schema_editor):
        schema_editor.atomic.__exit__(None, None, None)
        # [...]

In contrast to resetting connection.in_atomic_block manually, this allows using the atomic context manager inside the migration. There doesn't seem to be a much saner way.
One can contain the (admittedly messy) transaction break out logic in a decorator to be used with the RunPython operation:
    from functools import wraps


    def non_atomic_migration(func):
        """
        Close a transaction from within code that is marked atomic. This is
        required to break out of a transaction scope that is automatically wrapped
        around each migration by the schema editor. This should only be used when
        committing manually inside a data migration. Note that it doesn't re-enter
        the atomic block afterwards.
        """
        @wraps(func)
        def wrapper(apps, schema_editor):
            if schema_editor.connection.in_atomic_block:
                schema_editor.atomic.__exit__(None, None, None)
            return func(apps, schema_editor)
        return wrapper
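Applied to the data migration from the question, usage might look like this (a minimal sketch reusing the names above):

    @non_atomic_migration
    def modify_data(apps, schema_editor):
        # batched updates with manual transaction.commit() calls, as above
        pass


    class Migration(migrations.Migration):

        operations = [
            migrations.RunPython(modify_data),
        ]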
Update
Django 1.10 will support non-atomic migrations.

From the documentation about RunPython:
By default, RunPython will run its contents inside a transaction on databases that do not support DDL transactions (for example, MySQL and Oracle). This should be safe, but may cause a crash if you attempt to use the schema_editor provided on these backends; in this case, pass atomic=False to the RunPython operation.
So, instead of what you've got:
    class Migration(migrations.Migration):

        operations = [
            migrations.RunPython(modify_data, atomic=False),
        ]

For others coming across this: you can have both schema operations and data migrations (RunPython) in the same migration. Just make sure all the ALTER TABLE operations go first; you cannot put the RunPython before any ALTER TABLE. A minimal sketch of that ordering follows.
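Here the app label, field, and helper function are illustrative, not from the thread:

    from django.db import migrations, models


    def mark_processed(apps, schema_editor):
        # Hypothetical data migration that relies on the column added
        # by the AddField operation below.
        apps.get_model("app", "BigData").objects.update(processed=True)


    class Migration(migrations.Migration):

        dependencies = [("app", "0001_initial")]

        operations = [
            # Schema changes (ALTER TABLE) first ...
            migrations.AddField(
                model_name="bigdata",
                name="processed",
                field=models.BooleanField(default=False),
            ),
            # ... then the data migration that uses the new column.
            migrations.RunPython(mark_processed, migrations.RunPython.noop),
        ]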

First you need to set Migration.atomic = False:

    class Migration(migrations.Migration):
        atomic = False

Then, in your function, you can wrap specific blocks of code in transaction.atomic() so that only those blocks run atomically:

    from django.db import transaction

    for row in rows:
        with transaction.atomic():
            do_something(row)
        # Changes made by `do_something` will be committed by this point
Here's the relevant documentation: https://docs.djangoproject.com/en/4.1/howto/writing-migrations/#non-atomic-migrations
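Putting that together, a complete non-atomic data migration that commits in batches might look like this (a sketch; the app label, model, dependency, and batch size are illustrative):

    from django.db import migrations, transaction


    def modify_data(apps, schema_editor):
        BigData = apps.get_model("app", "BigData")
        ids = list(BigData.objects.values_list("pk", flat=True))
        batch_size = 1000
        for start in range(0, len(ids), batch_size):
            # Each batch runs in its own transaction, committed when the
            # atomic block exits.
            with transaction.atomic():
                for row in BigData.objects.filter(pk__in=ids[start:start + batch_size]):
                    # Modify row [...]
                    row.save()


    class Migration(migrations.Migration):

        atomic = False

        dependencies = [("app", "0001_initial")]

        operations = [
            migrations.RunPython(modify_data, migrations.RunPython.noop),
        ]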
Gotcha: migrations.RunPython(forwards_func, atomic=False) does NOT do what you want. It prevents Django from manually putting your migration code inside a transaction, which it doesn't do for PostgreSQL anyway. This atomic=False option is meant for DBs that don't support DDL transactions, as stated in their documentation: https://docs.djangoproject.com/en/4.1/ref/migration-operations/#runpython
By default, RunPython will run its contents inside a transaction on databases that do not support DDL transactions (for example, MySQL and Oracle). This should be safe, but may cause a crash if you attempt to use the schema_editor provided on these backends; in this case, pass atomic=False to the RunPython operation.
On databases that do support DDL transactions (SQLite and PostgreSQL), RunPython operations do not have any transactions automatically added besides the transactions created for each migration.

Related

Commit SQL even inside atomic transaction (django)

How can I always commit a insert even inside an atomic transaction? In this case I need just one point to be committed and everything else rolled back.
For example, my view's decorator contains with transaction.atomic() and other stuff:

    @my_custom_decorator_with_transaction_atomic
    def my_view(request):
        my_core_function()
        return ...


    def my_core_function():
        # many sql operations that need to roll back in case of error
        try:
            another_operation()
        except MyException:
            insert_activity_register_on_db()  # This needs to be in the DB, and not rolled back
            raise MyException()

I would rather not make another decorator for my view without transaction.atomic and handle the transaction manually in the core function. Is there a way?
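One common approach (not from the original thread, so treat this as an assumption) is to write the audit row through a second database alias that points at the same database. Writes on that alias use their own connection, so they sit outside the view's atomic block on 'default' and are committed even if the outer transaction rolls back:

    # settings.py -- hypothetical alias pointing at the same database
    # (in practice you would spell out the full configuration explicitly)
    DATABASES["logging"] = dict(DATABASES["default"])

    # wherever the activity register is written; ActivityRegister is a
    # hypothetical model standing in for insert_activity_register_on_db()
    def insert_activity_register_on_db():
        ActivityRegister.objects.using("logging").create(message="operation failed")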

Multi-DB Transactions

Django Version 1.10.5 with Postgres 9.6.1
For the last year I've been working in a multi-schema default database environment. However things are beginning to grow to the point I've decided to split the single database into 3 databases.
I've got things working with a master/slave router for all 3 databases.
I am not using the 'default' database key. Instead I have 'db1', 'db2', and 'db3'
The part I am confused about is with transactions in this multi-database environment.
In this example it fails as expected. Caused of course by not using @transaction.atomic(using='db1'), which is clear to me.
    @transaction.atomic()
    def edit(self, context):
        """Edit

        :param dict context: Context
        :return: None
        """
        # Check if employee exists
        try:
            result = Passport.objects.get(pk=self.user.employee_id)
        except Passport.DoesNotExist:
            return False

        result.name = context.get('name')
        result.save()
However I have this strange example, simply because I'm trying to understand... I would have expected this to fail but it does not:
    @transaction.atomic(using='db1')
    def edit(self, context):
        """Edit

        :param dict context: Context
        :return: None
        """
        # Check if employee exists
        try:
            result = Passport.objects.get(pk=self.user.employee_id)
        except Passport.DoesNotExist:
            return False

        result.name = context.get('name')
        with transaction.atomic(using='db2'):
            result.save()
The model Passport does not exist in DB2 models at all.
My router is set up so that all writes go to their respective DB.
So what is the purpose of setting using='db1' in the atomic transaction? I've looked at the source and I see it defaults to 'default' when "using" is not given.
In the above example I even made another transaction inside of the initial transaction but this time using='db2' where the model doesn't even exist. I figured that would have failed, but it didn't and the data was written to the proper database.
I bring this up because there will be situations where I need to interact with all 3 databases and if a single problem occurs when writing to all 3 databases, all 3 need to be rolled back or if on success of everything, then committed of course.
Perhaps someone can help break this down for me so I can understand?
You're interpreting transaction.atomic(using='X') to mean: run the following database commands on X, inside a transaction.
In fact, it just means: open a transaction on database X, and then either commit it or roll it back at the end of the block.
Or, as the documentation puts it:
Under the hood, Django’s transaction management code:
opens a transaction when entering the outermost atomic block;
commits or rolls back the transaction when exiting the outermost block.
The question of which database to use for a given command is determined by your router, not the using clause. So your transaction.atomic(using='db2') block is pointless (it will simply open a transaction on db2 and then close it), but not an error.
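If you do need writes to all three databases to succeed or fail together, a common best-effort pattern is to nest one atomic block per database (a sketch; Django has no two-phase commit, so a failure during the final commits can still leave the databases inconsistent):

    from django.db import transaction

    def edit_everywhere(context):
        # Open a transaction on each database. If the code inside raises,
        # all three are rolled back. Which database each query hits is
        # still decided by the router, not by `using`.
        with transaction.atomic(using='db1'), \
             transaction.atomic(using='db2'), \
             transaction.atomic(using='db3'):
            update_db1_models(context)  # hypothetical helpers
            update_db2_models(context)
            update_db3_models(context)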

Django migrations using RunPython to commit changes

I want to alter a foreign key in one of my models that can currently have NULL values to not be nullable.
I removed the null=True from my field and ran makemigrations
Because I'm altering a table that already has rows which contain NULL values in that field, I am asked to provide a one-off value right away or edit the migration file and add a RunPython operation.
My RunPython operation is listed BEFORE the AlterField operation and does the required update for this field so it no longer contains NULL values (only rows that currently contain a NULL value are updated).
But, the migration still fails with this error:
django.db.utils.OperationalError: cannot ALTER TABLE "my_app_site" because it has pending trigger events
Here's my code:
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals

    from django.db import models, migrations


    def add_default_template(apps, schema_editor):
        Template = apps.get_model("my_app", "Template")
        Site = apps.get_model("my_app", "Site")

        accept_reject_template = Template.objects.get(name="Accept/Reject")
        Site.objects.filter(template=None).update(template=accept_reject_template)


    class Migration(migrations.Migration):

        dependencies = [
            ('my_app', '0021_auto_20150210_1008'),
        ]

        operations = [
            migrations.RunPython(add_default_template),
            migrations.AlterField(
                model_name='site',
                name='template',
                field=models.ForeignKey(to='my_app.Template'),
                preserve_default=False,
            ),
        ]
If I understand correctly this error may occur when a field is altered to be not-nullable but the field contains null values.
In that case, the only reason I can think of why this happens is because the RunPython operation transaction didn't "commit" the changes in the database before running the AlterField.
If this is indeed the reason - how can I make sure the changes reflect in the database?
If not - what can be the reason for the error?
Thanks!
This happens because Django creates constraints as DEFERRABLE INITIALLY DEFERRED:
    ALTER TABLE my_app_site
        ADD CONSTRAINT "[constraint_name]"
        FOREIGN KEY (template_id)
        REFERENCES my_app_template(id)
        DEFERRABLE INITIALLY DEFERRED;
This tells PostgreSQL that the foreign key does not need to be checked right after every command, but can be deferred until the end of transactions.
So, when a transaction modifies both content and structure, the constraints are checked in parallel with the structure changes, or the checks are scheduled to be done after altering the structure. Both of these states are bad and the database will abort the transaction instead of making any assumptions.
You can instruct PostgreSQL to check constraints immediately in the current transaction by calling SET CONSTRAINTS ALL IMMEDIATE, so structure changes won't be a problem (refer to SET CONSTRAINTS documentation). Your migration should look like this:
    operations = [
        migrations.RunSQL('SET CONSTRAINTS ALL IMMEDIATE',
                          reverse_sql=migrations.RunSQL.noop),
        # ... the actual migration operations here ...
        migrations.RunSQL(migrations.RunSQL.noop,
                          reverse_sql='SET CONSTRAINTS ALL IMMEDIATE'),
    ]
The first operation is for applying (forward) migrations, and the last one is for unapplying (backwards) migrations.
EDIT: Constraint deferring is useful to avoid insertion sorting, especially for self-referencing tables and tables with cyclic dependencies. So be careful when bending Django.
LATE EDIT: on Django 1.7 and newer versions there is a special SeparateDatabaseAndState operation that allows data changes and structure changes on the same migration. Try using this operation before resorting to the "set constraints all immediate" method above. Example:
    operations = [
        migrations.SeparateDatabaseAndState(
            database_operations=[
                # put your sql, python, whatever data migrations here
            ],
            state_operations=[
                # field/model changes go here
            ],
        ),
    ]
Yes, I'd say it's the transaction bounds which are preventing the data change in your migration from being committed before the ALTER is run.
I'd do as @danielcorreia says and implement it as two migrations (see the sketch below), as it looks like even the SchemaEditor is bound by transactions, via the context manager you'd be obliged to use.
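A sketch of that split into two migration files (file names are illustrative, not from the thread; both files use the usual from django.db import models, migrations):

    # 0022_fill_default_template.py -- data migration, committed on its own
    class Migration(migrations.Migration):

        dependencies = [('my_app', '0021_auto_20150210_1008')]

        operations = [
            migrations.RunPython(add_default_template),
        ]


    # 0023_template_not_null.py -- schema change in its own transaction
    class Migration(migrations.Migration):

        dependencies = [('my_app', '0022_fill_default_template')]

        operations = [
            migrations.AlterField(
                model_name='site',
                name='template',
                field=models.ForeignKey(to='my_app.Template'),
                preserve_default=False,
            ),
        ]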
Adding null to the field giving you a problem should fix it. In your case the "template" field. Just add null=True to the field. The migration should then look like this:
    class Migration(migrations.Migration):

        dependencies = [
            ('my_app', '0021_auto_20150210_1008'),
        ]

        operations = [
            migrations.RunPython(add_default_template),
            migrations.AlterField(
                model_name='site',
                name='template',
                field=models.ForeignKey(to='my_app.Template', null=True),
                preserve_default=False,
            ),
        ]

Django South - schema and data migration at the same time

Isn't it possible to do something like the following with South in a schemamigration?
    def forwards(self, orm):
        ## CREATION
        # Adding model 'Added'
        db.create_table(u'something_added', (
            (u'id', self.gf('django.db.models.fields.AutoField')(primary_key=True)),
            ('foo', self.gf('django.db.models.fields.related.ForeignKey')(to=orm['something.Foo'])),
            ('bar', self.gf('django.db.models.fields.related.ForeignKey')(to=orm['something.Bar'])),
        ))
        db.send_create_signal(u'something', ['Added'])

        ## DATA
        # Create Added for every Foo
        for f in orm.Foo.objects.all():
            self.prev_orm.Added.objects.create(foo=f, bar=f.bar)

        ## DELETION
        # Deleting field 'Foo.bar'
        db.delete_column(u'something_foo', 'bar_id')
Note the prev_orm that would allow me to access f.bar and do it all in one migration. I find that having to write 3 migrations for that is pretty heavy...
I know this is not the "way to do it", but to my mind this would honestly be much cleaner.
Would there be a real problem with doing so, btw?
I guess your objective is to ensure that deletion does not run before the data-migration. For this you can use the dependency system in South.
You can break the above into three parts:
001_app1_addition_migration (in app 1)
then
001_app2_data_migration (in app 2, where the Foo model belongs)
and then
002_app1_deletion_migration (in app 1) with something like the following:
    class Migration:

        depends_on = (
            ("app2", "001_app2_data_migration"),
        )

        def forwards(self):
            ## DELETION
            # Deleting field 'Foo.bar'
            db.delete_column(u'something_foo', 'bar_id')
First of all, the orm provided by South is the one that you are migrating to. In other words, it matches the schema after the migration is complete. So you can just write orm.Added instead of self.prev_orm.Added. The other implication of this fact is that you cannot reference foo.bar since it is not present in the final schema.
The way to get around that (and to answer your question) is to skip the ORM and just execute raw SQL directly.
In your case, the create statement that reads the soon-to-be-deleted column would look something like:

    from django.db import connection

    cursor = connection.cursor()
    cursor.execute('SELECT "id", "bar_id" FROM "something_foo"')
    for foo_id, bar_id in cursor.fetchall():
        orm.Added.objects.create(foo_id=foo_id, bar_id=bar_id)
South migrations are using transaction management.
When doing several migrations at once, the code is similar to:
    for migration in migrations:
        south.db.db.start_transaction()
        try:
            migration.forwards(migration.orm)
            south.db.db.commit_transaction()
        except:
            south.db.db.rollback_transaction()
            raise
so... while it is not recommended to mix schema and data migrations, once you commit the schema with db.commit_transaction() the tables should be available for you to use (a sketch follows below). Be mindful to provide a backwards() method that does the correct steps in reverse.
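A sketch of that pattern inside a single South migration, committing the schema part before touching data (only the start_transaction/commit_transaction calls shown above are assumed; the raw SQL copy reuses the earlier answer):

    from django.db import connection
    from south.db import db


    def forwards(self, orm):
        # Schema part: create the new table
        db.create_table(u'something_added', (
            (u'id', self.gf('django.db.models.fields.AutoField')(primary_key=True)),
            # ... the foo/bar foreign keys as in the question ...
        ))
        db.send_create_signal(u'something', ['Added'])

        # Commit the schema change so the new table is usable,
        # then open a fresh transaction for the rest.
        db.commit_transaction()
        db.start_transaction()

        # Data part: copy via raw SQL, since Foo.bar is gone from the final ORM
        cursor = connection.cursor()
        cursor.execute('SELECT "id", "bar_id" FROM "something_foo"')
        for foo_id, bar_id in cursor.fetchall():
            orm.Added.objects.create(foo_id=foo_id, bar_id=bar_id)

        # Schema part: drop the old column
        db.delete_column(u'something_foo', 'bar_id')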

Django-celery task and django transaction

I have a question regarding transactions and celery tasks. So it's no mystery to me that of course if you have a transaction and a celery task accessing the same table/records we'll have a race condition.
However, consider the following piece of code:
    def f(self):
        # method of a model that inherits from models.Model
        self.field_a = datetime.now()
        self.save()
        transaction.commit_unless_managed()

        # depending on the configuration of this module
        # this might return None or a datetime object.
        eta = self.get_task_eta()

        if eta:
            celery_task_do_something.apply_async(args=(self.pk, self.__class__),
                                                 eta=eta)
        else:
            celery_task_do_something.delay(self.pk, self.__class__)
Here's the celery task:
    def celery_task_do_something(pk, cls):
        o = cls.objects.get(pk=pk)
        if o.field_a:
            # perform something
            return True
        return False
As you can see, before creating the task we call transaction.commit_unless_managed, and it should commit, since the Django transaction is not currently managed.
However, when running the celery task the field field_a is not set.
My question:
Since we do commit before creating the task, is it still possible that there's a race condition?
Additional info
We're using Postgres version 9.1
Every transaction is run with READ COMMITTED isolation level
On a different db with engine dowant.lib.db.backends.postgresql_psycopg2_debugger field_a is already set and the task works as expected. With engine dowant.lib.db.backends.postgresql_psycopg2_hstore_ready the described issue appears (not sure if it's related with the engine).
Celery version is 2.2
I tried different databases. Still the same behavior, except when the engines change. So that's why I mentioned this.
Thanks a lot.
Try to add self.__class__.objects.select_for_update().get(pk=self.pk) before save and see what happens.
It should block other writes and locking reads of this row until the commit is done.
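A sketch of that suggestion applied to the method above (assuming a Django version with transaction.atomic, since select_for_update must run inside a transaction):

    from datetime import datetime

    from django.db import transaction

    def f(self):
        with transaction.atomic():
            # Re-fetch the row with a lock; concurrent writers (and other
            # SELECT ... FOR UPDATE readers) block until this commits.
            obj = self.__class__.objects.select_for_update().get(pk=self.pk)
            obj.field_a = datetime.now()
            obj.save()
        # The transaction is committed here; enqueue the task only now.
        celery_task_do_something.delay(self.pk, self.__class__)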
This is late, but since Django 1.9:

    transaction.on_commit(lambda: enqueue_atask())
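Applied to the code from the question, that might look like this (a sketch; transaction.on_commit is Django API since 1.9, the rest reuses the names from the question):

    from datetime import datetime

    from django.db import transaction

    def f(self):
        self.field_a = datetime.now()
        self.save()

        eta = self.get_task_eta()

        # The task is only enqueued after the surrounding transaction
        # commits, so the worker is guaranteed to see field_a.
        if eta:
            transaction.on_commit(
                lambda: celery_task_do_something.apply_async(
                    args=(self.pk, self.__class__), eta=eta))
        else:
            transaction.on_commit(
                lambda: celery_task_do_something.delay(self.pk, self.__class__))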