How to test that nothing in a django-accessed database has changed? - django

I'm refactoring some code and adding an atomic.transaction block. We have some code that performs auto-updates en-masse, after the atomic block. (It can't be in the atomic block, because it traverses some many-related tables, which requires queries since the ORM doesn't store the distinct relations.)
According to my reading of the code, it seems impossible that any changes could occur, but having a test to prove it would be nice in case someone makes a bad code edit in the future.
So how would I go about writing a test that proves nothing in the database has changed between the start of the test and the end of the test?
The database is postgres. We currently have it configured for 2 databases (the database proper and a "validation" database which I'd created after we discovered the loading code had side-effects in dry-run mode. I'm in the middle of a refactor that has fixed the side-effects issue and I just added the atomic transaction block. So I would like to write a test like this:
def test_no_side_effects_in_dry_run_mode(self):
orig_db_state = self.get_db_state() # How do I write this?
call_command(
"load_animals_and_samples",
animal_and_sample_table_filename=("data.xlsx"),
dry_run=True,
)
post_db_state = self.get_db_state() # How do I write this?
self.assertNoDatabaseChange(orig_db_state, post_db_state, msg="Oops, database changed unexpectedly.") # How do I write this?
I have previously written a test that saves the counts of all the records in every table and saves them in a dict and then compared the dicts, but that doesn't account for fields changed in a record, and that's something I'd like to check. Besides that, a complementary insert/delete would be missed.
So is there like a file LMD I can check to see if anything at all in the database has changed?

Related

Django helper function executed only in single test when PostgreSQL used

Recently I've changed db engine from SQLite to PostgreSQL. I have successfully migrated whole db design to PostgreSQL (just simple makemigaretions, migrate). Once I ran tests some have failed for some unknown reason (errors suggest that some objects were not created). Failure doesn't apply to all tests, just selected few. Everything has been working before.
I'd start investigating what's going on on test-by-test basis, but some bizarre behavior has appeared. Let's say my test is in class MyTestClass and test is called test_do_something and in MyTestClass there are other tests as well.
When I'm running python manage.py test MyTestClass I'm getting info that test_do_something has failed.
When I'm running python manage.py test MyTestClass.test_do_something everything passes.
On SQLite both ways pass.
I'm assuming that setUpTestData() and setUp() methods work same way in SQLite and PostgreSQL. Or they don't?
Any clue why such discrepancy might be happening?
EDIT
I think I've noticed what might be wrong, but I don't understand why. The problem is because my function I've used to call to create object which is later used only once. Which differs from SQLite execution.
What I mean in my test I have something like this:
def create_object(self):
self.client.post(reverse('myurl'), kwargs={'myargs':arg})
def test_mytest1(self):
# Do something
self.create_object()
# Do something
def test_mytest2(self):
# Do something
self.create_object()
# Do something
def test_mytest3(self):
# Do something
self.create_object()
# Do something
Only for one test create_object() will be executed.
I believe I've fond cause of those failures. As a matter of fact it hasn't been an issue with one-time execution of support function as I expected. The problem has been with hardcoded ids I've used for various reasons. It appears that object I've been hoping to see didn't exist.
Let me explain a bit more what I've experienced. E.g. I had test where I've been referring to particular object passing this object id in URL kwargs. Before this operation I've created object and passed id=1 as kwargs, because I assumed that if this is only place within this test and setUp() it will be 1. It appears that with PostgreSQL it's not so clear. It seems that ids are incremented despite DB flush. Which is completely different behavior then SQLite has been providing.
I'd very much appreciate if someone could provide some more detailed answer why is this happening. Is ID counter not zeroed in PostgreSQL on flush? It would look so.

Use atomic blocks with error handling in Django

I have a Django 1.9 application that is running a section of code where changes are made to the database based on the results of queries to certain remote APIs. For instance, this might be data about commits, file changes, reviewers, pull requests, etc. that I want to save as entities in my database.
commit_data = commit_API_client.get_commit_info(argument1, argument2)
new_commit = models.Commit.Create(**commit_data)
#if the last API failed, this will fail
#I will need to run this again to get these files, so I need
#to get the commit all over again, too
files = file_API_client.get_file_info(new_commit.id)
new_files = models.Files.Create(**files)
#do some more stuff here
It's very possible that one of the few APIs I'm calling will return some error instead of valid data. I essentially need to make this section into a single atomic transaction so that if there are no errors returned from the HTTP requests, I commit all the changes to the database. Otherwise, I might lose some data if 2 APIs return correctly, but the 3rd doesn't.
I saw that Django supports commit hooks for database transactions, but I was wondering if that would applicable to this situation and how I would go about implementing that.
If you want to make that into an atomic operation, just wrap it in transaction.atomic(). If any of your code raises an exception the whole block will roll back. If you're only doing reads from the remote API that should work fine. A typical pattern is:
def my_function():
try:
with transaction.atomic():
# Do your work here
pass
except Exception:
# Do some error handling
pass
Django does have an on_commit hook, but that isn't really applicable here. Its purpose is to run some code after a transaction has completed successfully. For example, you might use it to write some log data to a remote API if the transaction was successful.

Why should I care about django-revision operation being atomic?

I want to start using django-reversion. It seems the easiest way is to use their middleware. But it gives the following warning:
Warning: Due to changes in the Django 1.6 transaction handling, revision data will be saved in a separate database transaction to the one used to save your models, even if you set ATOMIC_REQUESTS = True.
What are the caveats if the requests are not atomic? It seems to indicate that there might be some kind of race conditions. How could they look like? What do I need to watch out for?
Thank you for your time. Sorry for spelling mistakes I'm not a native speaker.
As mentioned in the warning, due to some changes in the way django handles transactions since 1.6, the middlewares are no longer wrapped in the same transaction as the view function.
This is discussed at the following issue at django-reversion.
In practice, since the RevisionMiddleware runs outside of the transaction where the models are saved, no strict guarantee can be provided at the database level that reversion data will also be saved.
The usage of RevisionMiddleware was then discouraged. The following practice is advised:
If you need to ensure that your models and revisions are saved in the
save transaction, please use the reversion.create_revision() context
manager or decorator in combination with transaction.atomic()
This way, you can be sure that reversion_data will always be saved alongside model data. I hope this helps.

What is a sane way to perform a radical Django Model migration in a production environment?

I have an existing django web app that is in use. I have to radically migrate one key model in my design to a completely new design, but I want to cache all of the existing data for that model and migrate them to the new records in production when ready to deploy.
I can afford to bring my website down for a few hours one night and do whatever I need to do to migrate. What are some sane ways I can do this migration?
It seems any migration would need to:
1) Dump all of the existing data into some format, such as SQL, JSON, XML
2) Migrate the model to the new format
3) Reload the data into the new model using a conversion script
I also thought of trying to store all of the existing data in some other model called "OldModel" (if Model is the name of the existing model) and then migrating the data live.
There is a project to help with migrations that I've heard of: South.
Having said that, I admit we've not used it. We still plan our migrations using a file of SQL statements. Madness, I know, but it has the advantage of testability. You can run it as many times as necessary during development and staging testing before the "big deploy". It can be source controlled, diffed, etc. It can also, therefore, be called from a larger deployment script. Of course, we back up production before running it :-)
If your database does journaling, using the old-fashioned method has the added advantage that there is a transaction history that can be rolled back.
Experiments we've run with JSON, XML and "OldModel" -> "NewModel" style dumps have scaled pretty poorly. Mind you, YMMV... we have quite a large database. By using a script, you can run on your production database without having to offload or reload vast amounts of data. This way even a complicated migration can take seconds, rather than hours.
There are around 5 or 6 tools to help automate some portion of migrations. Several of them are listed in this question and I'll add the others just for completeness.
Next, see S. Lott's answer to this question about migration workflows for a great idea on using version numbers in the model name to make migrations easier, including structuring a standalone script to properly convert the tables. To my mind this is vastly superior to serializing the data for export and then trying to build your new tables by importing.
Finally, I haven't been able to think of a way to do a hot migration properly and haven't seen any hints from anywhere else either, so maintenance downtime is inevitable.
Make all migrations in steps!
If you need to add a field, go ahead and add it, with a default value or being optional. This is safe.
If you need to make an existing optional field required, give it a default first.
If you need to make an existing field with a default not have a default, drop the default after fixing all the code that creates instances.
If you need to change the type of a field, add a new field that inherits the value from the current one, first. Then, run a script to update the existing instances to populate the new field. Thirdly, Remove all the code that uses the old field to use the new one. Finally, which no code is left using the original, you can drop it.
For every situation there is a small step you can make. For every bigger change, you can break it down into little ones. This is one place iterative development pays off. Keep good backups in place and don't be afraid to push often! Make the small changes quickly to see if they work.
If you are more comfortable with the Django ORM than with raw SQL, you might consider using Model -> BackupModel -> TestModel -> Model, where all but the last step can be performed without dropping data.
def backup(InModel,OutModel):
in_objs = InModel.objects.all()
for obj in in_objs:
out_obj = OutModel.convert_from(InModel,obj)
out_obj.save()
Here, you would just make sure that all your models have convert_from methods implemented. These should all be trivial conversions except for BackupModel -> TestModel. In the other cases, nothing but the class would change, all data being identically preserved.
The advantage to this is that before you go rewriting all your interfaces, you can play around with TestModel and make sure that your conversions were what you thought they'd be. If everything goes wrong, you convert from BackupModel->Model, and everything is okay. In a worst-case scenario, you give up on Django's ORM, run back to SQL, and simply rename all your tables that begin with backupmodel__* to model__* in your database.
Disclaimer: I've never done this.

Django: How can I protect against concurrent modification of database entries

If there a way to protect against concurrent modifications of the same data base entry by two or more users?
It would be acceptable to show an error message to the user performing the second commit/save operation, but data should not be silently overwritten.
I think locking the entry is not an option, as a user might use the "Back" button or simply close his browser, leaving the lock for ever.
This is how I do optimistic locking in Django:
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
The code listed above can be implemented as a method in Custom Manager.
I am making the following assumptions:
filter().update() will result in a single database query because filter is lazy
a database query is atomic
These assumptions are enough to ensure that no one else has updated the entry before. If multiple rows are updated this way you should use transactions.
WARNING Django Doc:
Be aware that the update() method is
converted directly to an SQL
statement. It is a bulk operation for
direct updates. It doesn't run any
save() methods on your models, or emit
the pre_save or post_save signals
This question is a bit old and my answer a bit late, but after what I understand this has been fixed in Django 1.4 using:
select_for_update(nowait=True)
see the docs
Returns a queryset that will lock rows until the end of the transaction, generating a SELECT ... FOR UPDATE SQL statement on supported databases.
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
Of course this will only work if the back-end support the "select for update" feature, which for example sqlite doesn't. Unfortunately: nowait=True is not supported by MySql, there you have to use: nowait=False, which will only block until the lock is released.
Actually, transactions don't help you much here ... unless you want to have transactions running over multiple HTTP requests (which you most probably don't want).
What we usually use in those cases is "Optimistic Locking". The Django ORM doesn't support that as far as I know. But there has been some discussion about adding this feature.
So you are on your own. Basically, what you should do is add a "version" field to your model and pass it to the user as a hidden field. The normal cycle for an update is :
read the data and show it to the user
user modify data
user post the data
the app saves it back in the database.
To implement optimistic locking, when you save the data, you check if the version that you got back from the user is the same as the one in the database, and then update the database and increment the version. If they are not, it means that there has been a change since the data was loaded.
You can do that with a single SQL call with something like :
UPDATE ... WHERE version = 'version_from_user';
This call will update the database only if the version is still the same.
Django 1.11 has three convenient options to handle this situation depending on your business logic requirements:
Something.objects.select_for_update() will block until the model become free
Something.objects.select_for_update(nowait=True) and catch DatabaseError if the model is currently locked for update
Something.objects.select_for_update(skip_locked=True) will not return the objects that are currently locked
In my application, which has both interactive and batch workflows on various models, I found these three options to solve most of my concurrent processing scenarios.
The "waiting" select_for_update is very convenient in sequential batch processes - I want them all to execute, but let them take their time. The nowait is used when an user wants to modify an object that is currently locked for update - I will just tell them it's being modified at this moment.
The skip_locked is useful for another type of update, when users can trigger a rescan of an object - and I don't care who triggers it, as long as it's triggered, so skip_locked allows me to silently skip the duplicated triggers.
For future reference, check out https://github.com/RobCombs/django-locking. It does locking in a way that doesn't leave everlasting locks, by a mixture of javascript unlocking when the user leaves the page, and lock timeouts (e.g. in case the user's browser crashes). The documentation is pretty complete.
You should probably use the django transaction middleware at least, even regardless of this problem.
As to your actual problem of having multiple users editing the same data... yes, use locking. OR:
Check what version a user is updating against (do this securely, so users can't simply hack the system to say they were updating the latest copy!), and only update if that version is current. Otherwise, send the user back a new page with the original version they were editing, their submitted version, and the new version(s) written by others. Ask them to merge the changes into one, completely up-to-date version. You might try to auto-merge these using a toolset like diff+patch, but you'll need to have the manual merge method working for failure cases anyway, so start with that. Also, you'll need to preserve version history, and allow admins to revert changes, in case someone unintentionally or intentionally messes up the merge. But you should probably have that anyway.
There's very likely a django app/library that does most of this for you.
Another thing to look for is the word "atomic". An atomic operation means that your database change will either happen successfully, or fail obviously. A quick search shows this question asking about atomic operations in Django.
The idea above
updated = Entry.objects.filter(Q(id=e.id) && Q(version=e.version))\
.update(updated_field=new_value, version=e.version+1)
if not updated:
raise ConcurrentModificationException()
looks great and should work fine even without serializable transactions.
The problem is how to augment the deafult .save() behavior as to not have to do manual plumbing to call the .update() method.
I looked at the Custom Manager idea.
My plan is to override the Manager _update method that is called by Model.save_base() to perform the update.
This is the current code in Django 1.3
def _update(self, values, **kwargs):
return self.get_query_set()._update(values, **kwargs)
What needs to be done IMHO is something like:
def _update(self, values, **kwargs):
#TODO Get version field value
v = self.get_version_field_value(values[0])
return self.get_query_set().filter(Q(version=v))._update(values, **kwargs)
Similar thing needs to happen on delete. However delete is a bit more difficult as Django is implementing quite some voodoo in this area through django.db.models.deletion.Collector.
It is weird that modren tool like Django lacks guidance for Optimictic Concurency Control.
I will update this post when I solve the riddle. Hopefully solution will be in a nice pythonic way that does not involve tons of coding, weird views, skipping essential pieces of Django etc.
To be safe the database needs to support transactions.
If the fields is "free-form" e.g. text etc. and you need to allow several users to be able to edit the same fields (you can't have single user ownership to the data), you could store the original data in a variable.
When the user committs, check if the input data has changed from the original data (if not, you don't need to bother the DB by rewriting old data),
if the original data compared to the current data in the db is the same you can save, if it has changed you can show the user the difference and ask the user what to do.
If the fields is numbers e.g. account balance, number of items in a store etc., you can handle it more automatically if you calculate the difference between the original value (stored when the user started filling out the form) and the new value you can start a transaction read the current value and add the difference, then end transaction. If you can't have negative values, you should abort the transaction if the result is negative, and tell the user.
I don't know django, so I can't give you teh cod3s.. ;)
From here:
How to prevent overwriting an object someone else has modified
I'm assuming that the timestamp will be held as a hidden field in the form you're trying to save the details of.
def save(self):
if(self.id):
foo = Foo.objects.get(pk=self.id)
if(foo.timestamp > self.timestamp):
raise Exception, "trying to save outdated Foo"
super(Foo, self).save()