Automated setup for multi-server RethinkDB cluster via an ECS service

I'm attempting to set up a RethinkDB cluster with 3 servers total, spread evenly across 3 private subnets, each in a different AZ in a single region.
Ideally, I'd like to deploy the DB software via ECS and provision the EC2 instances with Auto Scaling, but I'm having trouble figuring out how to instruct the RethinkDB instances to join the cluster.
To create or join a cluster in RethinkDB, you start a new instance and pass it the host:port of one of the other machines already in the cluster. This is where I'm running into problems. The Auto Scaling service creates a new primary ENI for each EC2 instance, with a random IP from my subnet's range, so I can't know an instance's IP ahead of time. On top of that, I'm using awsvpc task networking, so ECS creates a dedicated secondary ENI for each Docker container and attaches it to the instance at deploy time; those also get new IPs I don't know ahead of time.
So far I've worked out one possible solution: don't use an Auto Scaling group and instead manually deploy 3 EC2 instances across the private subnets, which would let me assign my own predetermined private IPs. As I understand it, this still doesn't help if I'm using awsvpc task networking, because each container running on my instances gets its own dedicated secondary ENI and I won't know that ENI's IP ahead of time. I think I can switch my task networking to bridge mode to get around this; that way I can use the predetermined IP of the EC2 instance (the primary ENI) in the RethinkDB join command.
So, in conclusion, the only way I can figure out to achieve this is to give up Auto Scaling and awsvpc task networking, both of which would otherwise be very desirable features. Can anyone think of a better way to do this?

As mentioned in the comments, this is more of an issue around the fact that you need to start a single RethinkDB instance once to bootstrap the cluster, and then handle discovery of the existing cluster members when joining new members to the cluster.
I would have thought RethinkDB would have published a good pattern for this in their docs, because it's going to be pretty common when setting up clusters, but I couldn't see anything useful there. If someone does know of an official recommendation then you should definitely use that rather than what I'm about to propose, especially as I have no experience with running RethinkDB.
This is more just spit-balling and is completely untested (at least for now), but the principle is: start a single, one-off instance of RethinkDB to bootstrap the cluster, have more cluster members join it, and then ditch the special-case bootstrap member (the one that never attempted to join a cluster), leaving the remaining cluster members to carry on.
The bootstrap instance is easy enough. You just need a RethinkDB container image and an ECS task that runs it in stand-alone mode, with the ECS service running only one instance of the task. To let the second set of cluster members easily discover existing members, including this bootstrap instance, it's probably easiest to use a service discovery mechanism such as the one offered by ECS, which uses Route 53 records under the covers. The ECS service should register itself in the RethinkDB namespace.
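To make that concrete, here's a rough, untested sketch of what wiring the bootstrap service up to ECS service discovery might look like with boto3; the namespace ID, cluster name, task definition and subnet IDs are all placeholders rather than anything from a real setup:

```python
# Untested sketch: all IDs, names and subnets below are placeholders.
import boto3

sd = boto3.client("servicediscovery")
ecs = boto3.client("ecs")

# Assumes a private DNS namespace (e.g. "rethinkdb.local") already exists.
NAMESPACE_ID = "ns-xxxxxxxxxxxxxxxx"

# A Cloud Map service that publishes an A record for each task ENI.
discovery = sd.create_service(
    Name="rethinkdb-bootstrap",
    NamespaceId=NAMESPACE_ID,
    DnsConfig={
        "RoutingPolicy": "MULTIVALUE",
        "DnsRecords": [{"Type": "A", "TTL": 10}],
    },
)

# One stand-alone RethinkDB task, registered against that Cloud Map service.
ecs.create_service(
    cluster="rethinkdb-cluster",              # placeholder cluster name
    serviceName="rethinkdb-bootstrap",
    taskDefinition="rethinkdb-standalone",    # task that starts rethinkdb without --join
    desiredCount=1,
    launchType="EC2",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],  # placeholders
        }
    },
    serviceRegistries=[{"registryArn": discovery["Service"]["Arn"]}],
)
```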
Then create another ECS service that's basically the same as the first, but whose entrypoint script lists the services in the RethinkDB namespace, resolves them, discards the container's own IP address, and then passes a discovered host to --join when starting RethinkDB in the container.
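Something along these lines for the entrypoint, assuming a hypothetical Cloud Map name of rethinkdb.rethinkdb.local and RethinkDB's default intracluster port of 29015 (shown in Python purely for illustration; a shell script would do the same job):

```python
# Untested entrypoint sketch. "rethinkdb.rethinkdb.local" is a hypothetical
# service discovery name and 29015 is RethinkDB's default intracluster port.
import os
import socket

SERVICE_DNS = "rethinkdb.rethinkdb.local"
CLUSTER_PORT = 29015

# The container's own IP, so it never tries to join itself.
own_ip = socket.gethostbyname(socket.gethostname())

# Resolve every A record published for the service and drop our own address.
records = socket.getaddrinfo(SERVICE_DNS, CLUSTER_PORT, proto=socket.IPPROTO_TCP)
peers = sorted({addr[4][0] for addr in records if addr[4][0] != own_ip})

cmd = ["rethinkdb", "--bind", "all"]
if peers:
    # Join one discovered member; RethinkDB gossips the rest of the topology.
    cmd += ["--join", f"{peers[0]}:{CLUSTER_PORT}"]

# Replace this process with RethinkDB so it receives signals from ECS directly.
os.execvp(cmd[0], cmd)
```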
I'd then set the non-bootstrap ECS service to just 1 task at first so it can discover the bootstrap instance, and then you should be able to keep adding tasks to the service one at a time until you're happy with the size of the non-bootstrap cluster, leaving you with n + 1 members in the cluster including the original bootstrap instance.
After that I'd remove the bootstrap ECS service entirely.
If an ECS task in the non-bootstrap ECS service dies for whatever reason, it should be able to rejoin automatically without any issue, as its replacement will just find a running RethinkDB task and join that.
You could probably expand the check for which cluster member to join by verifying that the RethinkDB port is open and accepting connections before using that member, so that multiple tasks starting at the same time are handled correctly (with my original suggestion, a new task could find another task that is itself still trying to join the cluster and attempt to join that first, with them all potentially deadlocking if none of them happened to pick an existing cluster member by chance).
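That extra check could be something as simple as a TCP probe of the intracluster port before treating an address as a joinable member; again this is untested and the port is an assumption:

```python
# Untested helper: only treat an address as joinable if RethinkDB's
# intracluster port (assumed 29015) is already accepting TCP connections.
import socket

def is_live_member(ip: str, port: int = 29015, timeout: float = 2.0) -> bool:
    """Return True if something is already listening on ip:port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. peers = [ip for ip in discovered_ips if is_live_member(ip)]
```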
As mentioned, this answer comes with the big caveat that I haven't got any experience running RethinkDB and I've only played with the service discovery mechanism that was recently released for ECS, so I might be missing something here, but the general principles should hold fine.

Related

How are users on an application running in a container split over the ec2 instances?

So I want to launch a web application, and run it on containers in AWS.
I want to give users access to the tool through a log in page.
I don't understand how AWS manages the relationship of containers and the instances backing them.
My main questions are -
Will multiple containers run on a single EC2 instance?
If the compute power required by a container exceeds the processing power of a single instance, and I have auto-scaling enabled, will it launch multiple instances to support a single container? Or will I need to go in and upgrade my EC2 instance type?
Finally, when users log in to the app, will AWS deploy a new container for each user, and subsequently a new instance to run it on? Or can one container support multiple users?
Also a link to a page where I can find this information would be tremendously helpful.
I will try to answer your questions, but as @Ermiya Eskandary said, the documentation will answer all your questions about containers in AWS.
Yes. If, for example, you have an EC2 instance with 2 GB of memory and 1 vCPU and your container needs 500 MB of memory and 0.25 vCPU, you can run many containers on that instance. You can set a task placement strategy to tell AWS how to place containers onto your EC2 instances: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement.html
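As a hedged illustration of what that looks like in practice (all names below are placeholders), a placement strategy can be attached to the ECS service when it is created:

```python
# Untested sketch; cluster, service and task definition names are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",
    serviceName="my-web-app",
    taskDefinition="my-web-app:1",   # task sized at roughly 0.25 vCPU / 500 MB
    desiredCount=4,
    launchType="EC2",
    placementStrategy=[
        # Spread tasks across Availability Zones first...
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        # ...then pack them onto as few instances as possible by memory.
        {"type": "binpack", "field": "memory"},
    ],
)
```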
No. If your container's resource requirements exceed what your EC2 instance has, it is impossible to pool the resources of multiple EC2 instances to run one single container. If you are using ECS with the EC2 launch type, the EC2 instance always needs to be bigger than the container.
No. One container will serve multiple users; if you are running out of resources, auto scaling will increase the number of running tasks, placing them using the strategy I mentioned in the first point.
Finally, based on my experience, if your use case doesn't require working at the infrastructure level (a custom AMI, machine cores, or anything else at the OS level on Linux/Windows), I would use Fargate rather than the EC2 launch type.
Fargate has less operational overhead, since with ECS on EC2 you need to orchestrate auto scaling for both the EC2 instances and your tasks.
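To illustrate the difference, with Fargate you only describe the task and its networking and AWS supplies the capacity; a rough boto3 sketch with placeholder names:

```python
# Untested sketch; cluster, task definition and subnet IDs are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.run_task(
    cluster="my-cluster",
    launchType="FARGATE",
    taskDefinition="my-web-app:1",   # Fargate tasks declare task-level cpu/memory
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```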

ECS: is there a way to avoid downtime when I change the instance type in CloudFormation?

I have created a cluster to run our test environment on AWS ECS. Everything seems to work fine, including zero-downtime deploys, but I realised that when I change instance types in CloudFormation for this cluster, it brings all the instances down and my ELB starts to fail because there are no instances running to serve the requests.
The cluster is running on Spot Instances, so my question is: is there by any chance a way to update the instance type for Spot Instances without taking the whole cluster down?
Do you have an Auto Scaling group? That would allow you to change the launch template or launch configuration to use the new instance type. Then you would set the ASG desired and minimum counts to a higher number, let the new instance type spin up and go into service in the target group, then delete the old instances and set your Auto Scaling counts back to normal.
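A rough, untested boto3 sketch of that sequence (the ASG name, launch template ID and instance ID are placeholders):

```python
# Untested sketch; the ASG name, launch template ID and instance ID are placeholders.
import boto3

asg = boto3.client("autoscaling")
GROUP = "ecs-spot-asg"

# 1. Point the ASG at a launch template version with the new instance type.
asg.update_auto_scaling_group(
    AutoScalingGroupName=GROUP,
    LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0", "Version": "$Latest"},
)

# 2. Temporarily raise capacity so new instances register with the ECS cluster
#    and the target group before anything is removed.
asg.update_auto_scaling_group(AutoScalingGroupName=GROUP, MinSize=4, DesiredCapacity=4)

# 3. Once the new instances are in service and running tasks, terminate the old
#    ones and shrink back to the normal size.
asg.terminate_instance_in_auto_scaling_group(
    InstanceId="i-0123456789abcdef0", ShouldDecrementDesiredCapacity=True
)
asg.update_auto_scaling_group(AutoScalingGroupName=GROUP, MinSize=2, DesiredCapacity=2)
```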
Without an ASG, you could launch a new instance manually and place it in the ECS target group. Confirm that it joins the cluster and is running your service and tasks, then delete the old instance.
You might want to break this activity into smaller chunks and do it one instance at a time. You can write a small CloudFormation template as well; by default, updating the instance type will replace your instances, so to avoid downtime you may have to do it one at a time.
However, there are two other ways I can think of here, but both will cost you money.
ASG: Create a new Auto Scaling group, or use the existing one and change the launch configuration.
Blue/Green deployment: Create an identical set of resources, this time with the updated instance type, and use Route 53's weighted routing policy to shift the traffic (a sketch of the Route 53 side follows below).
It solely depends upon your requirements: if you can spend the money, go with the two approaches above; otherwise stick with the small incremental deployments.
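For the blue/green option mentioned above, the Route 53 side of the cut-over might look roughly like this (untested; zone ID, record name and load balancer DNS names are placeholders):

```python
# Untested sketch; zone ID, record name and load balancer DNS names are placeholders.
import boto3

r53 = boto3.client("route53")

def weighted_record(identifier: str, target: str, weight: int) -> dict:
    """Build an UPSERT for one weighted CNAME pointing at a load balancer."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "CNAME",
            "SetIdentifier": identifier,
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }

r53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",
    ChangeBatch={
        "Changes": [
            # Keep 90% of traffic on the old (blue) stack, send 10% to the new one.
            weighted_record("blue", "old-elb.us-east-1.elb.amazonaws.com", 90),
            weighted_record("green", "new-elb.us-east-1.elb.amazonaws.com", 10),
        ]
    },
)
```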

Replace ECS container instances in terraform setup

We have a terraform deployment that creates an auto-scaling group for EC2 instances that we use as docker hosts in an ECS cluster. On the cluster there are tasks running. Replacing the tasks (e.g. with a newer version) works fine (by creating a new task definition revision and updating the service -- AWS will perform a rolling update). However, how can I easily replace the EC2 host instances with newer ones without any downtime?
I'd like to do this to have a change to the ASG launch configuration take effect, for example switching to a different EC2 instance type.
I've tried a few things, here's what I think gets closest to what I want:
1. Drain one instance. The tasks will be distributed to the remaining instances.
2. Once no tasks are running on that instance anymore, terminate it.
3. Wait for the ASG to spin up a new instance.
Repeat steps 1 to 3 until all instances are new.
This almost works. The problems are that:
It's manual and therefore error prone.
After this process one of the instances (the last one that was spun up) is running 0 (zero) tasks.
Is there a better, automated way of doing this? Also, is there a way to re-distribute the tasks in an ECS cluster (without creating a new task revision)?
Prior to making changes, make sure the ASG spans multiple Availability Zones, and that the containers do too. This ensures high availability when instances go down in one zone.
You can configure an UpdatePolicy on the Auto Scaling group with AutoScalingRollingUpdate, where you can set MinInstancesInService and MinSuccessfulInstancesPercent to higher values to maintain a slow and safe rolling upgrade.
You may go through this documentation to find further tweaks. To automate this process, you can use Terraform to update the ASG launch configuration; this will update the ASG with a new version of the launch configuration and trigger a rolling upgrade.
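If you'd rather script the drain-and-replace loop from the question instead, an untested sketch with boto3 might look like this (the cluster name is a placeholder):

```python
# Untested sketch automating the drain-and-replace loop from the question.
# The cluster name is a placeholder.
import time
import boto3

ecs = boto3.client("ecs")
asg = boto3.client("autoscaling")
CLUSTER = "my-ecs-cluster"

for ci_arn in ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]:
    # 1. Drain: ECS reschedules this instance's tasks onto the other hosts.
    ecs.update_container_instances_state(
        cluster=CLUSTER, containerInstances=[ci_arn], status="DRAINING"
    )

    # 2. Wait until no tasks remain on the instance.
    while True:
        detail = ecs.describe_container_instances(
            cluster=CLUSTER, containerInstances=[ci_arn]
        )["containerInstances"][0]
        if detail["runningTasksCount"] == 0:
            break
        time.sleep(15)

    # 3. Terminate without decrementing desired capacity so the ASG launches a
    #    replacement built from the updated launch configuration.
    asg.terminate_instance_in_auto_scaling_group(
        InstanceId=detail["ec2InstanceId"], ShouldDecrementDesiredCapacity=False
    )
```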

Kubernetes - adding more nodes

I have a basic cluster, which has a master and 2 nodes. The 2 nodes are part of an aws autoscaling group - asg1. These 2 nodes are running application1.
I need to be able to have further nodes, that are running application2 be added to the cluster.
Ideally, I'm looking to maybe have a multi-region setup, whereby aplication2 can be run in multiple regions, but be part of the same cluster (not sure if that is possible).
So my question is, how do I add nodes to a cluster, more specifically in AWS?
I've seen a couple of articles whereby people have spun up the instances and then manually logged in to install the kubelet and various other things, but I was wondering if it could be done in a more automated way?
Thanks
If you followed these instructions, you should have an autoscaling group for your minions.
Go to the AWS console and scale up the autoscaling group. That should do it.
If you did it somehow manually, you can clone a machine by selecting an existing minion/slave and choosing "Launch more like this".
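If you'd rather not click through the console, the same scale-up can be done programmatically; the ASG name below is a placeholder for whatever your cluster setup created:

```python
# Untested sketch; the ASG name is a placeholder for whatever your setup created.
import boto3

asg = boto3.client("autoscaling")

asg.set_desired_capacity(
    AutoScalingGroupName="kubernetes-minion-group",
    DesiredCapacity=4,       # e.g. grow from 2 nodes to 4
    HonorCooldown=False,
)
```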
As Pablo said, you should be able to add new nodes (in the same availability zone) by scaling up your existing ASG. This will provision new nodes that will be available for you to run application2. Unless your applications can't share the same nodes, you may also be able to run application2 on your existing nodes without provisioning new nodes if your nodes are big enough. In some cases this can be more cost effective than adding additional small nodes to your cluster.
To your other question, Kubernetes isn't designed to be run across regions. You can run a multi-zone configuration (in the same region) for higher availability applications (which is called Ubernetes Lite). Support for cross-region application deployments (Ubernetes) is currently being designed.

AWS Elastic Beanstalk - why would I use leader_only for a command?

I am writing a django app which I plan on deploying to AWS via Elastic Beanstalk. I am trying to understand why I would need to specify 'leader_only' for a container command I want to run for my app. More details about this can be found here.
It says:
Additionally, you can use leader_only. One instance is chosen to be
the leader in an Auto Scaling group. If the leader_only value is set
to true, the command runs only on the instance that is marked as the
leader.
If I have several instances running my app because I want to scale it, wouldn't using 'leader_only' run the command on only one instance, and not affect the rest? I am probably misunderstanding the purpose of it, but that seems non-ideal because the environment in the leader may differ from the other instances, and the end user may get different results depending on which instance they happen to connect to.
From a technical point of view, an Elastic Beanstalk environment is an Auto Scaling group, and when you deploy something you need to assume that your commands can potentially be executed simultaneously on several EC2 instances.
The main goal of the leader_only option is to make sure that your commands are executed on only one EC2 instance. It is useful for things such as running DB migration scripts, creating a database, etc., that should be executed just once, on one EC2 instance. So leader_only is just a marker that some commands will be executed on that instance only.
However, you need to keep in mind that the leader attribute is set once, at the creation of your environment, and if the leader dies and is replaced by a new instance, you can end up with no leader at all in the Auto Scaling group.
I've done considerable testing of this recently, with both leader_only and EB_IS_COMMAND_LEADER, on both Apache 1 and Apache 2 setups.
The two named values above can be found in many discussions, guides and documents, but the situation is basically this:
You cannot trust being able to reliably detect a leader in a multi-instance EC2 environment, except during deployment and scale-up.
That means you cannot rely on testing either of the values above to confirm that a command will run on exactly one (not zero, not 2+) instance as part of a cron job or scheduled task.
Recent improvements and changes to the way leader status is managed may well mean that a leader is always available during deployments and scale up, but at other times, including after instance replacement, there may not be a leader instance to be found.
There are two main options available if you really need to only run a scheduled task once while managing multiple instances.
1. A worker environment specifically for scheduled tasks, or another external service like Lambda with EventBridge (CloudWatch Events).
2. Set up crons to run on all instances in the deployment configs. Include a small amount of code before the cron runs which connects to the AWS API, gets the list of current instances, and checks the ID of the first one returned against its own ID to see if it should run the cron (a sketch of this check follows below).
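A minimal, untested sketch of that guard, assuming the instance metadata endpoint is reachable (IMDSv1 here) and the environment name is known; both are assumptions to adapt:

```python
# Untested sketch; the environment name is a placeholder and IMDSv1 access is assumed.
import urllib.request
import boto3

ENV_NAME = "my-eb-environment"

def run_the_cron_job() -> None:
    # Placeholder for the actual scheduled work.
    print("Running the job on the elected instance.")

# This instance's own ID, from the EC2 instance metadata endpoint.
my_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

# All instances currently attached to the Elastic Beanstalk environment.
eb = boto3.client("elasticbeanstalk")
resources = eb.describe_environment_resources(EnvironmentName=ENV_NAME)
ids = sorted(i["Id"] for i in resources["EnvironmentResources"]["Instances"])

# Only the instance whose ID sorts first actually does the work this run.
if ids and my_id == ids[0]:
    run_the_cron_job()
else:
    print("Not the elected instance; skipping this run.")
```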