Is it safe to allow multiple instances of a Django application to run the same database migration at the same time?
Scenario description
This is a setup where multiple instances of a Django application are running behind a load balancer. When an updated version of the Docker image is available, each of the old Docker containers is replaced with the new version.
If new Django migrations exist, they need to be run. This leads me to the question: is it safe to allow multiple containers to run the migration (python manage.py migrate) at the same time?
I have two hypotheses about what the answer to this question might be.
Yes, it is safe. Due to database-level locking, the migrations can't conflict, and in the end one migration script will run while the other reports that there are no migrations to apply.
No, this is not safe. The two migrations can possibly conflict with each other as they try to modify the database.
No, it's not safe to run the migration in all containers at the same time, since you might end up applying the same migration twice.
There are two possible cases:
Applying the migration twice (e.g. adding a table column) violates a database constraint, so only the first container that runs the migration manages to finish it. In this case the other containers will die, although your orchestration system will probably restart them.
Applying the migration twice doesn't violate any constraint and therefore can be applied multiple times. In this case you can end up with duplicated data.
In any case, you should try to have only one container applying migrations at a time.
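One way to have only one container apply migrations at a time is to serialize the callers with a database lock. The following is a minimal sketch, assuming the database is PostgreSQL (the question does not say which database is used); it relies on the built-in `pg_advisory_lock` and `pg_advisory_unlock` functions. The key value and the names `migration_lock` and `migrate_exclusively` are made up for illustration.

```python
from contextlib import contextmanager

# Hypothetical lock key: any integer that every container agrees on.
MIGRATION_LOCK_KEY = 815001


@contextmanager
def migration_lock(cursor, key=MIGRATION_LOCK_KEY):
    """Hold a PostgreSQL session-level advisory lock for the duration of the block."""
    # Blocks until no other database session holds the lock for this key.
    cursor.execute("SELECT pg_advisory_lock(%s)", [key])
    try:
        yield
    finally:
        # Release the lock even if the migration raised an exception.
        cursor.execute("SELECT pg_advisory_unlock(%s)", [key])


def migrate_exclusively(cursor, run_migrate):
    """Call `run_migrate` while holding the lock, so concurrent callers run one by one."""
    with migration_lock(cursor):
        run_migrate()
```

In a Django project the cursor could come from `django.db.connection.cursor()` and `run_migrate` could be `lambda: call_command("migrate")`. A container that had to wait for the lock then runs `migrate` after the first one has finished and should find nothing left to apply.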
Related
What is the recommended deployment strategy for running database migrations with ECS Fargate?
I could update the container command to run migrations before starting the gunicorn server. But this can result in concurrent migrations executing at the same time if more than one instance is provisioned.
I also have to consider the fact that images are already running. If I figure out how to run migrations before the new images are up and running, I have to consider the fact that the old images are still running on old code and may potentially break or cause strange data-corruption side effects.
I was thinking of creating a new ECS::TaskDefinition that runs a one-off migration script and then exits. I would then update all of the other TaskDefinitions to have a DependsOn for it, so that they won't start until it finishes.
I could update the container command to run migrations before starting the gunicorn server. But this can result in concurrent migrations executing at the same time if more than one instance is provisioned.
That is one possible solution. To avoid the concurrency issue you would have to add some sort of distributed locking in your container script to grab a lock from DynamoDB or something before running migrations. I've seen it done this way.
Another option I would propose is running your Django migrations from an AWS CodeBuild task. You could either trigger it manually before deployments, or automatically before a deployment as part of a larger CI/CD deployment pipeline. That way you would at least not have to worry about more than one running at a time.
I also have to consider the fact that images are already running. If I figure out how to run migrations before the new images are up and running, I have to consider the fact that the old images are still running on old code and may potentially break or cause strange data-corruption side effects.
That's a problem with every database migration in every system that has ever been created. If you are very worried about it you would have to do blue-green deployments with separate databases to avoid this issue. Or you could just accept some down-time during deployments by configuring ECS to stop all old tasks before starting new ones.
I was thinking of creating a new ECS::TaskDefinition. Have that run a one-off migration script that runs the migrations. Then the container closes. And I update all of the other TaskDefinitions to have a DependsOn for it, so that they wont start until it finishes.
This is a good idea, but I'm not aware of any way to set DependsOn for separate tasks. The only DependsOn setting I'm aware of in ECS is for multiple containers in a single task.
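Since the ordering cannot be declared between separate tasks, it can instead be enforced by the deployment script: run the one-off migration task, wait for it to stop, and only then update the services. A minimal sketch, assuming `ecs` is a boto3 ECS client (for example `boto3.client("ecs")`) and that the task runs on Fargate; the function name and every argument value are hypothetical.

```python
def run_migration_task(ecs, cluster, task_definition, container_name, subnets):
    """Start a one-off ECS task that runs `manage.py migrate` and wait for it to stop.

    Returns the container's exit code (0 means the migration command succeeded).
    """
    response = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "DISABLED"}
        },
        overrides={
            "containerOverrides": [
                # Reuse the application image, but override its command.
                {"name": container_name, "command": ["python", "manage.py", "migrate"]}
            ]
        },
    )
    if response.get("failures") or not response.get("tasks"):
        raise RuntimeError(f"could not start migration task: {response.get('failures')}")
    task_arn = response["tasks"][0]["taskArn"]

    # Block until the task has stopped, then read the container's exit code.
    ecs.get_waiter("tasks_stopped").wait(cluster=cluster, tasks=[task_arn])
    described = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    return described["tasks"][0]["containers"][0].get("exitCode")
```

A pipeline step could call this once per deployment and abort the rollout when the returned exit code is not 0, which also keeps the migration from running more than once at a time.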
I often work on projects where there are multiple Dockerized Django application servers behind a load balancer, and I frequently need to deploy to them. I use Watchtower to do pull-based deploys: I build a new image, push it to Docker Hub, and Watchtower is responsible for pulling those images down to the servers and replacing the running containers. That all works pretty well.
I would like to start automating the running of django migrations. One way I could accomplish this would be to simply add a run of the manage.py migrate to the entrypoint, and have every container automatically attempt a migration when the container comes online. This would work, and it would avoid the hassle of needing to come up with a way to do a lockout or leader election; but without some sort of way to prevent multiple runs, there is a risk that multiple instances of the migration could run at the same time. If I went this route, is there any chance that multiple migrations running at the same time could cause problems? Should I be looking for some other way to kick off these migrations once and only once?
Running migrations at the same time in parallel is not safe. I've tested it by running the migrate command in parallel on a data migration, and Django will run the migration twice. It will add 2 rows to the django_migrations table.
Check out this Google Groups post discussing the issue
And this article about running migrations on container startup.
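To check whether a database has already been affected, the duplicate rows described above can be queried directly. The sketch below assumes the standard `django_migrations` columns (`id`, `app`, `name`, `applied`) and uses an in-memory SQLite table with made-up app and migration names as a stand-in for a real database.

```python
import sqlite3

DUPLICATES_SQL = """
    SELECT app, name, COUNT(*) AS times_recorded
    FROM django_migrations
    GROUP BY app, name
    HAVING COUNT(*) > 1
    ORDER BY app, name
"""


def find_duplicate_migrations(conn):
    """Return (app, name, count) for every migration recorded more than once."""
    return conn.execute(DUPLICATES_SQL).fetchall()


def build_demo_db():
    """Create a stand-in django_migrations table in which one migration ran twice."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE django_migrations ("
        " id INTEGER PRIMARY KEY, app TEXT, name TEXT, applied TEXT)"
    )
    conn.executemany(
        "INSERT INTO django_migrations (app, name, applied) VALUES (?, ?, ?)",
        [
            ("shop", "0001_initial", "2024-01-01 10:00:00"),
            ("shop", "0002_backfill_prices", "2024-01-02 10:00:00"),
            ("shop", "0002_backfill_prices", "2024-01-02 10:00:01"),  # parallel run
        ],
    )
    return conn


if __name__ == "__main__":
    for app, name, count in find_duplicate_migrations(build_demo_db()):
        print(f"{app}.{name} was recorded {count} times")
```

The same query can be run against the real database; any row it returns points to a migration that two processes recorded as applied.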
We have a Clojure service that runs in a Docker container on Amazon ECS. When the container is deployed, the Clojure service connects to the database and always runs migrations on startup.
The problem is that if we need to rollback the code deploy, the deployed container has the old code in it, and does not have access to the rollback migrations that the latest container has.
This is a problem that doesn't happen often, but when it does, how do we perform the DB Rollback?
The best we can think of right now is to do it manually.
Does anyone have experience doing this programmatically?
It seems like you should consider separating your migrations from your actual deployable. Everyone will have their own preference for managing migrations, but you lose flexibility when you package your migrations into your application. A dedicated migration tool can act more intelligently on its own. For example, some database migrations are impossible to roll back without some sort of snapshot system, e.g. any migration that removes data. Additionally, it's bad practice for your application to have the permissions necessary to perform migrations. You also cannot easily audit which user performed the migration.
I am facing a challenge here. I inherited the models from previous developers, and the tables were not properly built. I added some constraints and new tables in order to normalize those tables. Before pushing the application to Heroku I tested it on my local machine, and it actually broke my database.
Now the Heroku website is already in production, so it holds user information. How should I approach this? Do I need to destroy the existing database, create a new one, and run the migrations?
Be very, very careful. Applying migrations on production servers can cause irreversible damage if you are not careful, and so you should be prepared for every possible situation.
My best recommendation would be to create an entire duplicate copy of your live DB (on Heroku this is as simple as a PG dump/backup). You can then create a new staging site using the same code, load the backup into a new database instance, and test against that. Live environments are not always the same as local ones. You can then run your migrations on the staging site and see if there are any unexpected effects (the best way to do this would be by utilizing Django test cases). If there are any issues, be sure to understand how the rollback process works with Django migrations.
A good tutorial that is fairly recent can be found here: https://realpython.com/django-migrations-a-primer/
On Heroku, as soon as you push new code, the web-serving instances restart... even if the underlying database schema additions/changes (via syncdb or south migrate) haven't yet been applied.
In many cases, this might just cause harmless errors until the syncdb/migrate is run soon afterward. But I'm concerned that in some cases, new code might half-work, making unexpected changes in the pre-migration database.
What's the right way to be safe against this risk?
One technique might be to add the syncdb/migrate to the Procfile so it's run before web restart. But, in the case of multiple instances, or maybe even a case where the one old-code-instance is left running until the moment the one new-code-instance is known-up, there's still a variant of the issue where code is talking to a DB with a mismatched schema.
Is there a 'hold all web instances' feature (or common best practice) for letting the migrate complete without web traffic?
Or am I being overly concerned about a risk that is negligible in practice?
The safest way to handle migrations of this nature, Heroku or no, is to strictly adopt a compatibility approach with your schema and code:
Every additive or transformative schema change must be backwards-compatible;
Every destructive schema change must be performed after the code that depends on it has been removed;
Every code change must either be:
durable against the possibility that associated schema changes have not yet been made (for instance, removing a model or a field on a model) or
made only after the associated schema change has been performed (adding a model or a field on a model)
If you need to make a significant transformation of a model, this approach might require the following steps:
Create a new database table to hold your new model structure, and deploy that migration
Create a new model with the new structure, and code to copy changes from the old model to the new model when the old model changes, and deploy that code
Execute a migration or code action to copy all old model data to the new model
Update your codebase to use the new model rather than the old model, deleting the old model, and deploy that code
Execute a migration to delete the old model structure from the database
With some thought and planning, it can be used for more drastic changes as well:
Deploy code that completely removes dependence on a section of the database, presumably replacing those sections of the site with maintenance pages
Deploy a migration that makes drastic changes that would not for whatever reason work with the above dual-model workflow
Deploy code that brings the affected sections back with the new model structure supported
This can be hard to organize and requires strict discipline and firm understanding of your code's interaction with your database, but in practice, it does allow for most changes to be made with no more downtime than the server restart itself imposes.
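The five-step dual-model workflow described above can be sketched as follows. This is only an illustration: it uses SQLite from the standard library instead of Django models and migrations, and the table and column names (`customer`, `customer_v2`, `legacy_id`, `display_name`) are hypothetical. Each function stands for something that would be deployed separately.

```python
import sqlite3


def create_new_structure(conn):
    # Step 1: additive, backwards-compatible change; old code ignores this table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_v2 ("
        " id INTEGER PRIMARY KEY,"
        " legacy_id INTEGER NOT NULL UNIQUE,"
        " display_name TEXT NOT NULL)"
    )


def save_customer(conn, legacy_id, name):
    # Step 2: while both structures exist, every write goes to the old and the new table.
    conn.execute(
        "INSERT OR REPLACE INTO customer (id, name) VALUES (?, ?)", (legacy_id, name)
    )
    conn.execute(
        "INSERT OR REPLACE INTO customer_v2 (legacy_id, display_name) VALUES (?, ?)",
        (legacy_id, name),
    )


def backfill(conn):
    # Step 3: copy rows that predate the dual-write code; safe to run repeatedly.
    conn.execute(
        "INSERT INTO customer_v2 (legacy_id, display_name) "
        "SELECT id, name FROM customer "
        "WHERE id NOT IN (SELECT legacy_id FROM customer_v2)"
    )


def drop_old_structure(conn):
    # Step 5: destructive change, run only after step 4 (no deployed code reads `customer`).
    conn.execute("DROP TABLE customer")
```

Step 4, switching the application code to the new table, is a code deploy and therefore has no function here. Because the backfill only copies rows that are missing, it can be re-run if a deploy is interrupted.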
Looks like fast-database changeovers are the way to go, but it requires a dedicated database.
http://devcenter.heroku.com/articles/fast-database-changeovers
Alternatively, here's a tutorial for copying the data from one database (e.g., production) to another database (e.g., staging), doing the schema/data migration (e.g., using django/south), then switching the app to use the newly-updated database instance.
http://devcenter.heroku.com/articles/migrating-data-between-plans
Seems reasonable, but potentially slow if there's a large amount of data.
The recommended method is this:
Add database changes for your new features to your existing code
Make the existing code compatible with the new schema
Deploy
Add the new features to your codebase
Deploy
This means that your database changes are already in place when the code starts to require them.
However....
There are a couple of issues with this. First, I know of no development shop that is organised enough to handle this, as features just get built ad hoc; and second, you're not really saving anything.
Generally speaking, unless you're making big changes to a massive database, your changes won't take long to apply and are usually over in a couple of seconds, which a developer can work around quite happily by issuing restarts etc. when needed. The risk is that a user might get an error page. If the changes are larger, you have some alternatives. One is using maintenance mode to turn the site off for a few seconds.
To be honest, there is no clear-cut way to handle this nicely, since by definition your code needs to be in place for your database changes to start. The best way I've found to approach the problem is to look at each change individually and work out the smoothest path for each on a case-by-case basis.
Rehearsing deployments on a staging environment will mitigate the risk of a deploy going bad, and give you an idea of the impact.
Heroku recently released "buildpacks" which are the scripts they use to set up an environment for your application, from managing dependencies to restarting the instances. Essentially it's a more comprehensive Procfile which you can customize.
You can fork the Python buildpack and modify the script to run in the sequence you want. Append the command you use to run syncdb to the end of bin/steps/django. Commit and put this repo on GitHub.
Unfortunately as of now it's not possible to modify the buildpack of an existing Heroku app, so you'll have to delete it and recreate one that points to your buildpack repo:
heroku create --stack cedar --buildpack git@github.com:...
This is the best solution because it
Doesn't cost anything at all
Doesn't require you to adapt your code to Heroku
Only syncs the db once per deployment
Hope this helps.