What is the recommended deployment strategy for running database migrations with ECS Fargate?
I could update the container command to run migrations before starting the gunicorn server. But if more than one instance is provisioned, this can result in several containers running the migrations concurrently.
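Roughly, the entrypoint I have in mind looks like this (simplified; myproject and the bind address are stand-ins for our actual project, and nothing here prevents two containers from migrating at once):

#!/usr/bin/env python
"""Run pending migrations, then exec gunicorn."""
import os

import django
from django.core.management import call_command

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
django.setup()

# Apply any pending migrations before the web server comes up.
call_command("migrate", interactive=False)

# Replace this process with gunicorn so it receives signals directly.
os.execvp("gunicorn", ["gunicorn", "myproject.wsgi:application", "--bind", "0.0.0.0:8000"])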
I also have to consider that containers are already running. Even if I figure out how to run migrations before the new images are up and running, the old images are still running on old code and may break or cause strange data-corruption side effects.
I was thinking of creating a new ECS::TaskDefinition that runs a one-off migration script and then exits. I would then update all of the other TaskDefinitions to have a DependsOn on it, so that they won't start until it finishes.
I could update the container command to run migrations before starting the gunicorn server. But if more than one instance is provisioned, this can result in several containers running the migrations concurrently.
That is one possible solution. To avoid the concurrency issue, you would have to add some sort of distributed locking to your container script, grabbing a lock from DynamoDB or something similar before running migrations. I've seen it done this way.
Another option I would propose is running your Django migrations from an AWS CodeBuild task. You could either trigger it manually before deployments, or automatically before a deployment as part of a larger CI/CD deployment pipeline. That way you would at least not have to worry about more than one running at a time.
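If you go the CodeBuild route, the build itself just runs manage.py migrate; kicking it off and waiting for it from a deploy script might look something like this (the project name run-django-migrations is a placeholder):

import time

import boto3

codebuild = boto3.client("codebuild")

# Start the one-off migration build.
build_id = codebuild.start_build(projectName="run-django-migrations")["build"]["id"]

# CodeBuild has no built-in waiter for builds, so poll until it finishes.
while True:
    status = codebuild.batch_get_builds(ids=[build_id])["builds"][0]["buildStatus"]
    if status != "IN_PROGRESS":
        break
    time.sleep(10)

if status != "SUCCEEDED":
    raise RuntimeError(f"migration build ended with status {status}")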
I also have to consider that containers are already running. Even if I figure out how to run migrations before the new images are up and running, the old images are still running on old code and may break or cause strange data-corruption side effects.
That's a problem with every database migration in every system that has ever been created. If you are very worried about it, you would have to do blue-green deployments with separate databases to avoid this issue. Or you could just accept some downtime during deployments by configuring ECS to stop all old tasks before starting new ones.
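The stop-everything-first behaviour is just a deployment setting on the service. For example, with boto3 (cluster and service names are placeholders), letting the service drop to zero healthy tasks during a deploy:

import boto3

ecs = boto3.client("ecs")

# Let the deployment stop every old task before any new one starts,
# at the cost of downtime while no tasks are running.
ecs.update_service(
    cluster="production",
    service="web",
    deploymentConfiguration={
        "minimumHealthyPercent": 0,
        "maximumPercent": 100,
    },
)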
I was thinking of creating a new ECS::TaskDefinition that runs a one-off migration script and then exits. I would then update all of the other TaskDefinitions to have a DependsOn on it, so that they won't start until it finishes.
This is a good idea, but I'm not aware of any way to set DependsOn for separate tasks. The only DependsOn setting I'm aware of in ECS is for multiple containers in a single task.
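For reference, this is what the within-one-task dependency looks like: a non-essential migration container gates the app container via dependsOn with a SUCCESS condition. A sketch with boto3 (image, names, and sizes are placeholders); note that every task in the service still runs its own migrate container, so this alone does not solve the concurrency concern from the question.

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="web",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    containerDefinitions=[
        {
            # One-off migration container: runs migrate and exits.
            "name": "migrate",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
            "essential": False,
            "command": ["python", "manage.py", "migrate", "--noinput"],
        },
        {
            # The web container starts only after "migrate" exits with code 0.
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
            "essential": True,
            "command": ["gunicorn", "myproject.wsgi:application", "--bind", "0.0.0.0:8000"],
            "dependsOn": [{"containerName": "migrate", "condition": "SUCCESS"}],
        },
    ],
)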
Related
What are the best practices for managing Django schema/data migrations in a containerized deployment? We are having issues with multiple containers attempting to run the migrate command on deploy, which runs our migrations in parallel. I was surprised to learn that Django doesn't coordinate this internally to prevent the migrations from being run by two containers at the same time. We are using AWS ECS and struggling to automatically identify one container as the master node from which to run our migrations.
I've run into this exact problem in the past, and I was also surprised to find that Django doesn't use a database lock to manage this like other DB migration tools do.
My solution (inspired by Terraform's state lock mechanism) was to create a DynamoDB table to use for distributed locks. I use boto3 to perform a write to the lock table with a ConditionExpression asserting that the lock doesn't already exist (and a while loop to wait if it does), then I call the Django migrate command, then I delete the lock.
I bundled that logic into a Django management command, and the startup script in the Docker image I deploy to ECS runs this command before it starts the Django app.
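Roughly, the management command looks like this (table name and key are simplified placeholders; a production version should also expire locks left behind by a crashed container, e.g. with a DynamoDB TTL):

import time

import boto3
from botocore.exceptions import ClientError
from django.core.management import call_command
from django.core.management.base import BaseCommand

LOCK_TABLE = "migration-locks"   # placeholder table with partition key "LockID"
LOCK_ID = "django-migrations"


class Command(BaseCommand):
    help = "Run migrate under a DynamoDB distributed lock"

    def handle(self, *args, **options):
        table = boto3.resource("dynamodb").Table(LOCK_TABLE)

        # Spin until the conditional write succeeds, i.e. nobody holds the lock.
        while True:
            try:
                table.put_item(
                    Item={"LockID": LOCK_ID, "acquired_at": int(time.time())},
                    ConditionExpression="attribute_not_exists(LockID)",
                )
                break
            except ClientError as exc:
                if exc.response["Error"]["Code"] != "ConditionalCheckFailedException":
                    raise
                time.sleep(5)  # someone else is migrating; wait and retry

        try:
            call_command("migrate", interactive=False)
        finally:
            table.delete_item(Key={"LockID": LOCK_ID})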
So, I often work on projects where there are multiple Dockerized Django application servers behind a load balancer, and I frequently need to deploy to them. I usually use Watchtower to do pull-based deploys: I build a new image, push it to Docker Hub, and Watchtower is responsible for pulling those images down to the servers and replacing the running containers. That all works pretty well.
I would like to start automating the running of Django migrations. One way I could accomplish this would be to simply add manage.py migrate to the entrypoint and have every container automatically attempt a migration when it comes online. This would work, and it would avoid the hassle of coming up with a lockout or leader-election scheme; but without some way to prevent multiple runs, there is a risk that multiple instances of the migration could run at the same time. If I went this route, is there any chance that concurrent migrations could cause problems? Should I be looking for some other way to kick off these migrations once and only once?
Running migrations in parallel is not safe. I've tested it by running the migrate command in parallel on a data migration, and Django ran the migration twice: it added two rows to the django_migrations table.
Check out this Google Groups post discussing the issue
And this article about running migrations on container startup.
We have a Clojure service that runs in a Docker container in Amazon ECS. When the container is deployed, the Clojure service connects to the database and always runs migrations on startup.
The problem is that if we need to roll back the code deploy, the deployed container has the old code in it and does not have access to the rollback migrations that the latest container has.
This is a problem that doesn't happen often, but when it does, how do we perform the DB rollback?
The best we can think of right now is to do it manually.
Anyone have experience doing this programmatically?
It seems like you should consider separating your migrations from your actual deployable. Everyone will have their own preference for managing migrations, but you lose flexibility when you package your migrations into your application. A dedicated migration tool can act more intelligently when it runs on its own. For example, some database migrations are impossible to roll back without some sort of snapshot system, e.g. any migration that removes data. Additionally, it's bad practice for your application to have the permissions necessary to perform migrations. You also cannot easily audit which user performed the migration.
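On ECS, one way to keep that separation is to run migrations (or a rollback) as a one-off task with its own task definition and database credentials, launched from the deploy pipeline. A sketch with boto3; cluster, task definition, network settings, and the Flyway-style command are all placeholders for whichever dedicated tool you pick:

import boto3

ecs = boto3.client("ecs")

# Launch a single one-off task that only applies (or reverts) migrations,
# then exits; the long-running service never needs schema-change permissions.
ecs.run_task(
    cluster="production",
    launchType="FARGATE",
    taskDefinition="db-migrations",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {
                "name": "migrations",
                # Swap in the tool's rollback command when reverting a deploy.
                "command": ["flyway", "migrate"],
            }
        ]
    },
)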
Is it safe to allow multiple instances of a Django application to run the same database migration at the same time?
Scenario description
This is a setup where multiple instances of a Django application run behind a load balancer. When an updated version of the Docker image is available, each of the old containers is replaced with the new version.
If new Django migrations exist, they need to be run. This leads me to the question: is it safe to allow multiple containers to run the migration (python manage.py migrate) at the same time?
I have two hypotheses about what the answer to this question might be.
Yes, it is safe. Due to database-level locking, the migrations can't conflict, and in the end one migration script will run while the other reports that there are no migrations to apply.
No, this is not safe. The two migrations can conflict with each other as they try to modify the database.
No, it's not safe to run the migration in all containers at the same time, since you might end up applying the same migration twice.
There are two possible cases:
Applying the migration twice (e.g. adding a table column) violates a database constraint, so only the first container that runs the migration manages to finish it. In this case the other containers will die, although your orchestration system will probably restart them.
Applying the migration twice doesn't violate any constraint, so it can be applied multiple times. In this case you can end up with duplicated data.
In any case, you should try to have only one container applying migrations at any given time.
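If you happen to be on PostgreSQL (an assumption on my part), one lightweight way to enforce that is a session-level advisory lock taken before calling migrate, so any extra containers simply wait their turn:

from django.core.management import call_command
from django.db import connection

MIGRATION_LOCK_ID = 712361  # arbitrary application-chosen advisory lock number

with connection.cursor() as cursor:
    # Blocks until no other database session holds the same advisory lock.
    cursor.execute("SELECT pg_advisory_lock(%s)", [MIGRATION_LOCK_ID])
    try:
        call_command("migrate", interactive=False)
    finally:
        cursor.execute("SELECT pg_advisory_unlock(%s)", [MIGRATION_LOCK_ID])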
We have a Rails app that has been working fine for months. Today we discovered some inconsistencies with leader election. Primarily:
su - "leader_only bundle exec rake db:migrate" webapp
After many hours of trial and error (and dozens of deployments), none of the instances in our dev application ran this migration. /usr/bin/leader_only looks for an environment variable that is never set on any instance (the dev app has only one instance).
Setting the application deployment to one instance at a time and providing the value that /usr/bin/leader_only expects as an env var works, but that is not how it used to work or how it should work. (Now all instances are leaders, so they will fruitlessly run db:migrate, and with one-at-a-time deployments this will slow us down if we have many instances.)
We thought maybe it was due to some issues with the code and/or app, so we rebuilt it. No change.
I even cloned our test application's RDS server, created a new application from a saved configuration, and deployed a new git hash, and it still never ran db:migrate. It attempts to and shows the leader_only line, but it never runs. That rules out code, configuration, and artifacts.
Also, for what it's worth, it never says it is skipping migrations due to RAILS_SKIP_MIGRATIONS, which has a value of false. This means that it is in fact trying to run db:migrate but isn't, because the instance is not designated as the leader.
We have been in contact with the AWS support team. It seems as though EB leader election is very fragile.
Per the tech:
Also, as explained before: the leader is the first instance in an auto-scaling group, and if it is removed we lose the leader; even using leader_only: true in container_commands, db:migrate doesn't work.
What happened is that we lost all instances. The leader is elected once, and is passed through instance rotation. If you do not lose all instances, everything is fine.
I did not mention one detail. We have many non-production environments, and through Elastic Beanstalk autoscaling settings we use timed scaling to set our instance count to 0 at night and back up to the expected 1-2 instances during the day. We do this for our dev, test, and UAT environments to make sure we don't run at full speed 24/7. Because of this, we lost the leader and never got it back.
Per the follow up from the tech:
We have a feature request in place to overcome the issue of losing the leader when the very first instance is deleted.
"Elastic Beanstalk uses leader election to determine which instance in
your worker environment queues the periodic task. Each instance
attempts to become leader by writing to a DynamoDB table. The first
instance that succeeds is the leader, and must continue to write to
the table to maintain leader status. If the leader goes out of
service, another instance quickly takes its place."
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-periodictasks
In Elastic Beanstalk, you can run a command on a single "leader" instance. Just create a .ebextensions file that contains container_commands and then deploy it. Make sure you set the leader_only value to true.
For example:
.ebextensions/00_db_migration.config
container_commands:
  00_db_migrate:
    command: "rake db:migrate"
    leader_only: true
The working directory of this command will be that of your new application version.
The leader-instance environment variable is set by the Elastic Beanstalk agent at deployment time; it is not exported to a normal SSH shell.