We have a Rails app that has been working fine for months. Today we discovered some inconsistencies with leader election. Primarily:
su - "leader_only bundle exec rake db:migrate" webapp
After many hours of trial and error (and dozens of deployments), none of the instances in our dev application run this migration. /usr/bin/leader_only looks for an environment variable that is never set on any instance (the dev app has only one instance).
Setting the deployment policy to one instance at a time and supplying the value that /usr/bin/leader_only expects as an environment variable does work, but not the way it has been and should: now every instance considers itself the leader and fruitlessly runs db:migrate, and deploying one instance at a time will slow us down once we have many instances.
We thought maybe it was due to some issues with the code and/or app, so we rebuilt it. No change.
I even cloned our test application's RDS server, created a new application from a saved configuration, and deployed a new git hash, and it still never ran db:migrate. It attempts to, and shows the leader_only line, but the migration never runs. That rules out code, configuration, and artifacts.
Also, for what it's worth, it never says it is skipping migrations due to RAILS_SKIP_MIGRATIONS, which is set to false. This means it is in fact trying to run db:migrate but isn't, because the instance is not designated as the leader.
We have been in contact with the AWS support teams. It seems as though EB leader election is very fragile.
Per the tech:
Also, as explained before: the leader is the first instance in an auto-scaling group, and if it is removed we lose the leader; even using leader_only: true in container_commands, db:migrate doesn't work.
What happened is that we lost all of our instances. The leader is elected once, and leader status is passed on through instance rotation. If you do not lose all instances at once, everything is fine.
I did not mention one detail. We have many non-production environments, and through Elastic Beanstalk's Auto Scaling settings we use timed scaling to set our instance count to 0 at night and back up to the expected 1-2 instances during the day. We do this for our dev, test, and UAT environments to make sure we don't run at full capacity 24/7. Because of this, we lost the leader and never got it back.
Per the follow up from the tech:
We have a feature request in place to overcome the issue of losing the leader when the very first instance is deleted.
"Elastic Beanstalk uses leader election to determine which instance in
your worker environment queues the periodic task. Each instance
attempts to become leader by writing to a DynamoDB table. The first
instance that succeeds is the leader, and must continue to write to
the table to maintain leader status. If the leader goes out of
service, another instance quickly takes its place."
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-periodictasks
In Elastic Beanstalk, you can run a command on a single "leader" instance. Just create a .ebextensions file that contains container_commands and then deploy it. Make sure you set the leader_only value to true.
For example:
.ebextensions/00_db_migration.config
container_commands:
  00_db_migrate:
    command: "rake db:migrate"
    leader_only: true
The working directory of this command will be your new application's directory.
The leader-instance environment variable is set by the Elastic Beanstalk agent only at deployment time; it is not exported to a normal SSH shell.
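If you want to confirm on an instance whether the container command actually executed, one quick check is to grep the deployment log. This assumes the older Amazon Linux platform, where container_commands output lands in /var/log/eb-activity.log; the path differs on newer platform versions, and the environment name below is a placeholder:
eb ssh my-env                                               # hypothetical environment name
sudo grep -B 2 -A 5 'db:migrate' /var/log/eb-activity.log   # run inside the SSH session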
Related
What is the recommended deployment strategy for running database migrations with ECS Fargate?
I could update the container command to run migrations before starting the gunicorn server. But this can result in concurrent migrations executing at the same time if more than one instance is provisioned.
I also have to consider the fact that images are already running. If I figure out how to run migrations before the new images are up and running, I have to consider the fact that the old images are still running on old code and may potentially break or cause strange data-corruption side effects.
I was thinking of creating a new ECS::TaskDefinition. Have that run a one-off migration script that runs the migrations. Then the container closes. And I update all of the other TaskDefinitions to have a DependsOn for it, so that they won't start until it finishes.
I could update the container command to run migrations before starting the gunicorn server. But this can result in concurrent migrations executing at the same time if more than one instance is provisioned.
That is one possible solution. To avoid the concurrency issue you would have to add some sort of distributed locking in your container script to grab a lock from DynamoDB or something before running migrations. I've seen it done this way.
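As an illustration only (not from the original answer), here is a rough sketch of that locking approach using the AWS CLI, assuming a hypothetical DynamoDB table named migration-locks with a string partition key LockId:
# try to create the lock item; the conditional write fails if it already exists
if aws dynamodb put-item \
    --table-name migration-locks \
    --item '{"LockId": {"S": "django-migrate"}}' \
    --condition-expression "attribute_not_exists(LockId)"
then
  python manage.py migrate --noinput
  # release the lock so the next deployment can take it
  aws dynamodb delete-item \
    --table-name migration-locks \
    --key '{"LockId": {"S": "django-migrate"}}'
else
  echo "another instance holds the migration lock; skipping migrations"
fi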
Another option I would propose is running your Django migrations from an AWS CodeBuild task. You could either trigger it manually before deployments, or automatically as part of a larger CI/CD deployment pipeline. That way you would at least not have to worry about more than one running at a time.
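For the CodeBuild route, the trigger itself can be a single CLI call. A minimal sketch, where the project name and branch are placeholders and the project's buildspec is assumed to run python manage.py migrate:
# hypothetical CodeBuild project whose buildspec runs the Django migrations
aws codebuild start-build \
  --project-name django-db-migrations \
  --source-version my-release-branch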
I also have to consider the fact that images are already running. If I figure out how to run migrations before the new images are up and running, I have to consider the fact that the old images are still running on old code and may potentially break or cause strange data-corruption side effects.
That's a problem with every database migration in every system that has ever been created. If you are very worried about it, you would have to do blue-green deployments with separate databases to avoid this issue. Or you could just accept some downtime during deployments by configuring ECS to stop all old tasks before starting new ones.
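For the stop-old-tasks-first option, a sketch of the relevant ECS deployment configuration (cluster and service names are placeholders); minimumHealthyPercent=0 with maximumPercent=100 makes ECS stop the old tasks before it starts the new ones, at the cost of a brief outage:
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration "maximumPercent=100,minimumHealthyPercent=0"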
I was thinking of creating a new ECS::TaskDefinition. Have that run a one-off migration script that runs the migrations. Then the container closes. And I update all of the other TaskDefinitions to have a DependsOn for it, so that they won't start until it finishes.
This is a good idea, but I'm not aware of any way to set DependsOn for separate tasks. The only DependsOn setting I'm aware of in ECS is for multiple containers in a single task.
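What you can do instead is sequence it yourself from a deploy script: run the migration task definition once, wait for it to stop, and only then update the application services. A rough sketch with placeholder names:
# run the one-off migration task and capture its ARN
TASK_ARN=$(aws ecs run-task \
  --cluster my-cluster \
  --task-definition my-migrate-task \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=DISABLED}" \
  --query 'tasks[0].taskArn' --output text)

# block until the migration task has exited
aws ecs wait tasks-stopped --cluster my-cluster --tasks "$TASK_ARN"

# (in a real script you would also check the container exit code with describe-tasks)
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-app-task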
I have two AWS ElastiCache instances. One of them (let's say instance A) holds very important data sets and connections, and downtime is unacceptable. Because of this, instead of doing a normal migration (preventing new writes to the source, taking a dump of it, and restoring it to the new instance), I'm trying to sync instance A's data to another ElastiCache instance (let's say instance B). As I said, this process must be downtime-free. To do that, I tried RedisShake, but because AWS restricts users from running certain commands (bgsave, config, replicaof, slaveof, sync, etc.), RedisShake does not work with AWS ElastiCache. It gives the error below.
2022/04/04 11:58:42 [PANIC] invalid psync response, continue, ERR unknown command `psync`, with args beginning with: `?`, `-1`,
[stack]:
2 github.com/alibaba/RedisShake/redis-shake/common/utils.go:252
github.com/alibaba/RedisShake/redis-shake/common.SendPSyncContinue
1 github.com/alibaba/RedisShake/redis-shake/dbSync/syncBegin.go:51
github.com/alibaba/RedisShake/redis-shake/dbSync.(*DbSyncer).sendPSyncCmd
0 github.com/alibaba/RedisShake/redis-shake/dbSync/dbSyncer.go:113
github.com/alibaba/RedisShake/redis-shake/dbSync.(*DbSyncer).Sync
... ...
I've tried rump for this as well, but it isn't stable enough to handle any important process. First of all, it doesn't run as a background process: when the first sync finishes, it exits with signal: exit done, so it won't pick up ongoing changes after that first pass.
Second of all, it only recognizes created/modified keys/values on each run. For example, on the first run the key apple equals pear and is synced to the destination as-is, but when I delete the key apple and its value at the source and run the rump sync script again, it is not deleted at the destination. So basically it isn't literally syncing the source and the destination. Plus, the last commit to the rump GitHub repo was about 3 years ago; it seems like a somewhat outdated project to me.
After all this information and these attempts, my question is: is there a way to sync two ElastiCache for Redis instances? As I said, there is no room for downtime in my case. If anyone with this kind of experience has a bulletproof suggestion, I would much appreciate it. I tried but unfortunately didn't find one.
Thank you very much,
Best Regards.
If those two ElastiCache Redis clusters exist in the same account but in different regions, you can consider using AWS ElastiCache Global Datastore.
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastores-Console.html
It has some restrictions on regions and node types, and both clusters must have the same configuration in terms of number of nodes, etc.
Limitations - https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Redis-Global-Datastores-Getting-Started.html
Otherwise, there's a simple brute-force mechanism that I believe you could code yourself (a rough sketch follows after the note below):
Create a client EC2 instance (let's call it Sync-er) that subscribes to a pub-sub channel on your ElastiCache Redis instance A.
Whenever there is new data, Sync-er issues the corresponding WRITE commands against ElastiCache Redis instance B.
NOTE - You'll have to make sure that the clusters are in connectable VPCs.
ElastiCache is only reachable from resources within its VPC. If your instance A and instance B are in different VPCs, you'll have to peer them or connect them via Transit Gateway.
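A very rough sketch of that Sync-er idea, purely as an illustration: the endpoints are placeholders, only SET events on string keys in database 0 are handled, and keyspace notifications must be enabled on instance A through its ElastiCache parameter group (notify-keyspace-events), since CONFIG SET is blocked. Deletes, expirations, and other data types would need extra handling:
SRC=instance-a.xxxxxx.cache.amazonaws.com   # hypothetical source endpoint
DST=instance-b.xxxxxx.cache.amazonaws.com   # hypothetical destination endpoint

# listen for SET events on instance A and replay them against instance B
redis-cli -h "$SRC" --csv psubscribe '__keyevent@0__:set' | while read -r line; do
  key=$(printf '%s' "$line" | awk -F'","' '{print $4}' | tr -d '"')
  [ -n "$key" ] || continue                 # skip the subscription confirmation line
  val=$(redis-cli -h "$SRC" GET "$key")
  redis-cli -h "$DST" SET "$key" "$val" > /dev/null
done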
I have a REST API running on Cloud Run that implements a cache, which needs to be cleared maybe once a week when I update a certain property in the database. Is there any way to send an HTTP request to all running instances of my application? Right now my understanding is that even if I send multiple requests and there are 5 instances, they could all go to one instance. So is there a way to do this?
Let's go back to basics:
Cloud Run instances start based on a revision/image.
If you have the above use case, where say you have 5 instances running and you suddenly need to restart them all (since restarting the instances resolves your use case of clearing/rebuilding the cache), what you need to do is trigger a change in the service configuration so that a new revision gets created. This will automatically stop and relaunch all of your instances on the fly.
You have a couple of options here, choose which is suitable for you:
If you have your services defined as YAML files, the easiest is to run the replace service command:
gcloud beta run services replace myservice.yaml
Otherwise, add an environment variable, such as a date that you increase; this will yield a new revision, since a change in the environment means a new configuration, and a new configuration means a new revision:
gcloud run services update SERVICE --update-env-vars KEY1=VALUE1,KEY2=VALUE2
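For example, bumping a dummy variable with the current timestamp forces a new revision (the service name, variable name, and region below are placeholders):
gcloud run services update my-api \
  --update-env-vars CACHE_BUST="$(date +%s)" \
  --region us-central1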
As these operations are executed, you will see a new revision created, and your active instances will be replaced on their next request with fresh new instances that will build the new cache.
You can't directly reach all the active instances; that's the magic (and the tradeoff) of serverless: you don't really know what is running! If you implement a cache on Cloud Run, you need a way to invalidate it.
Either based on duration: when an entry expires, refresh it
Or by explicit invalidation, which you can't do on Cloud Run.
The other way to see this use case is that you have a cache shared between all your instances, and thus you need a shared cache, something like Memorystore. You can have a single Cloud Run instance invalidate and recreate it, and all the other instances will use it.
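A minimal sketch of that shared-cache setup, assuming Memorystore for Redis reached through a Serverless VPC Access connector; the names, region, and IP range are placeholders:
# create the shared Redis instance
gcloud redis instances create shared-cache --size=1 --region=us-central1
# create a connector so Cloud Run can reach the VPC
gcloud compute networks vpc-access connectors create run-connector \
  --network=default --region=us-central1 --range=10.8.0.0/28
# attach the connector to the Cloud Run service
gcloud run services update my-api \
  --vpc-connector=run-connector --region=us-central1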
At the moment I have a load balancer which runs a Compute Engine Instance Group which has a minimum of 1 server and a maximum of 5 servers.
This is running auto scaling and uses a pre-built Ubuntu template with all the base software needed.
When an instance boots up, it registers a runner with the GitLab project and then triggers the job to update the instance to the latest copy of the code.
This is fine and works well.
The issue comes when I make a change to the git branch and push the changes: the job only seems to be picked up by one random instance out of the 5 that have loaded.
I was under the impression that GitLab would push the job out to all of the registered runners, but this doesn't seem to be the case.
I have seen answers on here that cover multiple runners on a single server, but I haven't come across my particular situation.
Has anyone come across this before? I would assume that this is a pretty normal situation, and weird that it doesn't just work.
For each job that runs in GitLab, only 1 runner receives the job. The mechanism is PULL based -- the runners constantly ask GitLab whether there are any jobs available to run. GitLab never initiates communication with the runners.
Therefore, your load balancer rules do nothing to affect which runner receives a job, and there is no "fairness" in distributing jobs across servers. Runners will keep asking for jobs every few seconds as long as they are able to take them (according to concurrency settings in the config.toml), and GitLab will hand them out on a first-come, first-served basis.
If you set the concurrency to 1 and start multiple jobs, you should see multiple servers pick up the jobs.
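For reference, that setting lives in each runner's config.toml; a minimal excerpt (the path may vary by install):
# /etc/gitlab-runner/config.toml -- one per instance
concurrent = 1   # this runner takes at most one job at a time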
I have an Amazon EC2 instance that I'd like to use as a development server for client projects as well as to run JIRA. I have a domain pointed to the EC2 server's IP. I'm new to Docker, so I'm unsure if my approach is correct.
I'd like to have a JIRA container installed (with another jiradb MySQL container) running at jira.domain.com, as well as the potential to host client staging websites at client.domain.com that point to the clients' Docker containers.
I've been trying to use this JIRA Docker image with the provided command
docker run --detach --publish 8080:8080 cptactionhank/atlassian-jira:latest
but the container always stops running mid-setup (setup takes a while between steps). When I run the container again, it goes back to the start of the setup.
Once I have JIRA set up how would I run it under a subdomain? And how could I then have client.domain.com point to a separate docker container?
Thanks in advance!
As you probably know, there are two considerations for getting Jira set up, whether as a server or a container:
1. You need to enter a license key early in the setup process (and it requires an Internet connection for verification), even if it's an evaluation
2. By default Jira will use its built-in (H2, IIRC) database, unless you configure an external one
So, in the case of 2) you probably want to make sure you have your external database ready and set up.
See Connecting Jira applications to external databases for preparatory steps for a variety of databases.
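As an illustration (not part of the original answer), here is a minimal sketch of running a MySQL container for Jira alongside it on a user-defined Docker network; the names, passwords, and image tags are placeholders, and MySQL still needs the collation and JDBC driver preparation from the Atlassian docs:
docker network create jira-net

# hypothetical database container; during Jira setup, point it at host "jiradb", database "jiradb"
docker run --detach --name jiradb --network jira-net \
  -e MYSQL_ROOT_PASSWORD=changeme \
  -e MYSQL_DATABASE=jiradb \
  -e MYSQL_USER=jira \
  -e MYSQL_PASSWORD=changeme \
  mysql:5.7

docker run --detach --publish 8080:8080 --network jira-net \
  cptactionhank/atlassian-jira:latest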
You didn't mention at what stage your first setup run fails. However, once you've gotten past step 1) (or any further successful setup), one of the first things I did, so as not to lose all the work I'd done, was to commit the container!
docker commit -a 'My Name' -m 'Jira configured and set up' <container ID> myrepo/myjira:mytag
That way you don't lose all your previous work and you save your container into a new image in one fell swoop.
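You can then start a container from the committed image instead of the original one, for example:
docker run --detach --publish 8080:8080 myrepo/myjira:mytag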