AWS ECS container networking: communication between services and within containers

I am having trouble understanding the point of the AWS implementation of service discovery in ECS when using bridge mode, and more generally a path forward to (relatively basic) container networking, despite the numerous AWS blog posts on the subject.
Service discovery seems to me to be about making dynamically created containers (in tasks) reachable, so that, similar to Docker user-defined networks, I can access different tasks on a cluster via a predefined canonical hostname.namespace, within a VPC.
I've made sure in the VPC that:
DNS hostnames: Enabled
DNS resolution: Enabled
When service discovery is defined while using bridge mode:
1. It still tacks on a dynamic portion to the name that I did not specify:
{
    "Name": "my-service.my-namespace.",
    "Type": "SRV",
    "SetIdentifier": "4b46cb82ba434dasdb163c1f06ca5c083",
    "MultiValueAnswer": true,
    "TTL": 60,
    "ResourceRecords": [
        {
            "Value": "1 1 27017 4b46cb82ba434dasdb163c1f06ca5c083.my-service.my-namespace."
        }
    ],
    "HealthCheckId": "862bd287-2b41-43ac-8442-a3d27042482b"
},
So I need to manually look up the record each time a service is created or updated. I cannot dig my-service.my-namespace, for example; that record does not exist.
And:
2. Every time the service is updated, the record is regenerated...
To get at this record I need to run:
$ aws servicediscovery list-namespaces
$ aws route53 list-resource-record-sets --hosted-zone-id $ZONE_ID --region us-east-1
My application currently accesses task hosts via injected environment variables, but if the record refreshes on every service update, this is a non-starter. All the documentation and forums I've come across seem to say either: create some kind of dynamic SRV lookup workaround (seems hackish?), or just switch to awsvpc mode. But then why is this service discovery available at all under bridge/host mode?
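For reference, the dynamic SRV lookup workaround would presumably look something like this (an untested sketch, assuming the third-party dnspython package and a client running inside the VPC, where the private namespace resolves):

# Untested sketch of the dynamic SRV lookup workaround.
# Assumes the dnspython package (pip install dnspython) and that this
# runs inside the VPC, where the private namespace is resolvable.
import dns.resolver

def resolve_service(name="my-service.my-namespace"):
    """Return (host, port) pairs for every registered task instance."""
    targets = []
    for rdata in dns.resolver.resolve(name, "SRV"):
        # Each SRV record carries priority, weight, port, and a target host.
        targets.append((str(rdata.target).rstrip("."), rdata.port))
    return targets

if __name__ == "__main__":
    for host, port in resolve_service():
        print(f"{host}:{port}")

Doing this on every connection (or caching it and re-resolving on failure) is exactly the part that feels hackish to me.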
Clearly I'm missing something fundamental.
In addition, I'm using dynamic port mapping. If I don't, things like rolling updates fail with port-already-in-use errors; similarly, attempting to run a new instance of a task via scheduling produces the same error.
I can connect from within a given Docker container in a task using the instance's internal private DNS name, i.e. ip-172-31-52-141.ec2.internal, but here I'm outside of the VPC (?), i.e. now I need to specify the dynamically mapped port, so this is a non-starter as well.
All of this sits behind a public ALB (for dynamic port resolution etc.), and this has been working fine: requests from outside AWS resolve correctly to the target groups / targeted services.
If I switch to awsvpc mode, and enable service discovery, I can have multiple tasks/services communicate privately.
However, what if I want multiple services to communicate, while additionally a single service/task houses multiple Docker containers (e.g. a localized Redis cache)? I cannot specify the 'link' between these containers without the network mode being 'bridge' again.
Here's the TL;DR question:
I have 2 tasks, and a service associated with each task. There may be multiple instances of each task, therefore ports need to be dynamic. In each task I have 2 containers.
What is the general approach here for allowing different services to communicate via predefined host.namespace DNS resolution, and for the containers inside each task to communicate with each other?
Apologies for the long post, but as a novice to ECS/AWS, I'm really struggling here ;)
Any feedback or advice is really appreciated.

Related

AWS ECS: Load Balancing multiple containers under a single service

Suppose I have a task definition with 2 nginx containers. Now when I make a service (EC2 launch type) using this task definition and try to add an ALB to it, I am able to choose only one "container to load balance". How can I set it up so that I am able to load balance the other container too, using the same ALB?
What I have tried and understood is: suppose I am running 2 tasks using the above definition under a single service, so there will be 4 containers running (2 each of container_1 and container_2). When I choose container_1 to load balance during service ALB creation, ECS will create a target group and put the instance and the dynamic ports of container_1 as targets. This target group will then be mapped to the ALB's rules. This works for me.
But now if I want to have a rule setup for my other set of containers the only way I see is to make the target group myself by hardcoding the dynamic ports of container_2's containers, which is less than ideal.
A use case that I can think of right now is suppose one container is running the frontend and another is running the backend. So I want url's like /my-app/pages/* to go to frontend containers and /my-app/apis/* to go to backend containers. I'm sure there are better examples/use cases but this is all I can think of right now.
So, how would I go about setting this up ?
Thanks!
Let's follow the documentation and recommendations from AWS on why more than one container is allowed in a task. From the documentation [1]:
When the following conditions are required, we recommend that you deploy your containers in a single task definition:
- Your containers share a common lifecycle (that is, they are launched and terminated together).
- Your containers must run on the same underlying host (that is, one container references the other on a localhost port).
- You require that your containers share resources.
- Your containers share data volumes.
Otherwise, you should define your containers in separate task definitions so that you can scale, provision, and deprovision them separately.
[1] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/application_architecture.html
For your use case of frontend and backend applications, two different services are the right way: two ECS services, each with one task definition containing one container.
As to how the two should integrate with each other, via URLs or service names (a microservice mesh), that is another topic and discussion.
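That said, for the URL-routing part of the question, the listener rules could be created with boto3 along these lines; a rough sketch, where all ARNs are placeholders and each target group is the one registered for the corresponding ECS service:

# Sketch: path-based ALB listener rules forwarding to the two services.
# All ARNs below are placeholders for values from your own account.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

LISTENER_ARN = "arn:aws:elasticloadbalancing:<placeholder-listener-arn>"
FRONTEND_TG = "arn:aws:elasticloadbalancing:<placeholder-frontend-tg-arn>"
BACKEND_TG = "arn:aws:elasticloadbalancing:<placeholder-backend-tg-arn>"

for priority, pattern, target_group in [
    (10, "/my-app/pages/*", FRONTEND_TG),
    (20, "/my-app/apis/*", BACKEND_TG),
]:
    elbv2.create_rule(
        ListenerArn=LISTENER_ARN,
        Priority=priority,  # lower numbers are evaluated first
        Conditions=[{"Field": "path-pattern", "Values": [pattern]}],
        Actions=[{"Type": "forward", "TargetGroupArn": target_group}],
    )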

Automated setup for multi-server RethinkDB cluster via an ECS service

I'm attempting to set up a RethinkDB cluster with 3 servers total, spread evenly across 3 private subnets, each in a different AZ, in a single region.
Ideally, I'd like to deploy the DB software via ECS and provision the EC2 instances with auto scaling, but I'm having trouble trying to figure out how to instruct the RethinkDB instances to join a RethinkDB cluster.
To create/join a cluster in RethinkDB, when you start up a new instance of RethinkDB, you specify a host:port combination of one of the other machines in the cluster. This is where I'm running into problems. The Auto Scaling service is creating new primary ENIs for my EC2 instances and using a random IP in my subnet's range, so I can't know the IP of the EC2 instance ahead of time. On top of that, I'm using awsvpc task networking, so ECS is creating new secondary ENIs dedicated to each Docker container and attaching them to the instances when it deploys them, and those are also getting new IPs, which I don't know ahead of time.
So far I've worked out one possible solution, which is to not use an Auto Scaling group, but instead to manually deploy 3 EC2 instances across the private subnets, which would let me assign my own predetermined private IPs. As I understand it, this still doesn't help me if I'm using awsvpc task networking, though, because each container running on my instances will get its own dedicated secondary ENI and I won't know that ENI's IP ahead of time. I think I can switch my task networking to bridge mode to get around this; that way I can use the predetermined IP of the EC2 instance (the primary ENI) in the RethinkDB join command.
So, in conclusion, the only way I can figure out to achieve this is to use neither Auto Scaling nor awsvpc task networking, both of which would otherwise be very desirable features. Can anyone think of a better way to do this?
As mentioned in the comments, this is more of an issue around the fact that you need to start a single RethinkDB instance once to bootstrap the cluster, and then handle discovery of the existing cluster members when joining new members to the cluster.
I would have thought RethinkDB would have published a good pattern for this in their docs, because it's going to be pretty common when setting up clusters, but I couldn't see anything useful there. If someone does know of an official recommendation then you should definitely use that rather than what I'm about to propose, especially as I have no experience with running RethinkDB.
This is more just spit-balling and will be completely untested (at least for now), but the principle is: start a single, one-off instance of RethinkDB to bootstrap the cluster, have more cluster members join, then ditch the special-case bootstrap member that never attempted to join a cluster, leaving the remaining cluster members to work.
The bootstrap instance is easy enough to consider. You just need a RethinkDB container image and an ECS task that runs it in stand-alone mode, with the ECS service running only one instance of the task. To enable the second set of cluster members to easily discover existing members, including this bootstrap instance, it's probably easiest to use a service discovery mechanism such as the one offered by ECS, which uses Route 53 records under the covers. The ECS service should register the service in the RethinkDB namespace.
Then you should create another ECS service that's basically the same as the first, but whose entrypoint script lists the services in the RethinkDB namespace, resolves them, discards the container's own IP address, and then uses a discovered host with --join when starting RethinkDB in the container.
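Sketching that entrypoint logic in Python (untested; assumes the third-party dnspython package, an SRV name of rethinkdb.my-namespace maintained by ECS service discovery, and that socket reports the container's own address):

# Untested sketch of the joining container's entrypoint logic.
import os
import socket

import dns.resolver  # third-party: pip install dnspython

SERVICE_NAME = "rethinkdb.my-namespace"  # assumed service discovery name

def discover_peer():
    """Return host:port of an existing cluster member, skipping ourselves."""
    my_ip = socket.gethostbyname(socket.gethostname())
    for rdata in dns.resolver.resolve(SERVICE_NAME, "SRV"):
        host = str(rdata.target).rstrip(".")
        if socket.gethostbyname(host) != my_ip:
            return f"{host}:{rdata.port}"
    return None

peer = discover_peer()
args = ["rethinkdb", "--bind", "all"]
if peer:
    args += ["--join", peer]  # join the discovered member
os.execvp("rethinkdb", args)  # replace this process with rethinkdb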
I'd then set the non-bootstrap ECS service to just 1 task at first, to allow it to discover the bootstrap instance, and then you should be able to keep adding tasks to the service one at a time until you're happy with the size of the non-bootstrapped cluster, leaving you with n + 1 instances in the cluster including the original bootstrap instance.
After that I'd remove the bootstrap ECS service entirely.
If an ECS task in the non-bootstrap ECS service dies for whatever reason, it should be able to rejoin automatically without any issue, as it will just find a running RethinkDB task and join that.
You could probably expand the check for which cluster member to join by verifying that the RethinkDB port is open and serving before using it as the member to join, so it will handle multiple tasks being started at the same time (with my original suggestion, a task could find another task that is itself still looking to join the cluster and try to join that first, with them all potentially deadlocking if they all failed, by chance, to pick an existing cluster member).
As mentioned, this answer comes with the big caveat that I don't have any experience running RethinkDB, and I've only played with the service discovery mechanism that was recently released for ECS, so I might be missing something here, but the general principles should hold fine.

AWS: How to deploy docker instance near instantly on ec2 closest to my customer?

I have a basic Docker image containing a Python script that comes in under 100 MB. I'm not sure what distro I'm going to use, but preferably one that results in as small an image as possible.
The goal is to deploy a docker image on t2.nano ec2 instance but it must meet the following conditions:
from the time a customer requests access via URL, it should respond as quickly as possible, preferably under a few seconds.
the latency between the customer and the newly deployed Docker EC2 instance should be as small as possible, meaning an EC2 instance in the closest region.
Is this possible?
It's not possible to deploy an EC2 instance in under a few seconds, especially not t2.nano type instances. EC2 instances follow the same rules as physical compute resources, thus larger instance types boot faster, and t2.nano is currently the smallest/least powerful/slowest instance size. That said, even the fastest instances would take a minute or so to be provisioned and fully booted.
It sounds like you should look into using AWS Lambda. It's their proprietary, managed, containerized compute resource service, and designed for the kind of workflow you are describing. You don't manage the containers themselves, but deploy the code (of which Python is a supported language) and its dependent libraries to the service, and it handles launching it in a container on demand, with overhead in the sub-second range.
Note that it is not designed for hosting "websites" directly, if that's your intention. Lambda functions are invoked by other AWS services, one of which is API Gateway, which would be the best route in providing a public interface to your Lambda functions. This could be used in conjunction with something like S3 static website hosting to provide the building blocks for a "serverless" web application.
As for your second question, Route 53 does support latency based routing, but I don't believe it supports API Gateway endpoints as targets yet. So if global latency is a big concern, your best bet may be to deploy a few full-time EC2 instances around the world and use latency based routing. If it's mainly static assets you're worried about, CloudFront can cache these at edge locations, as an alternative.

How to make a HTTP call reaching all instances behind amazon AWS load balancer?

I have a web app which runs behind an Amazon AWS Elastic Load Balancer with 3 instances attached. The app has a /refresh endpoint to reload reference data. It needs to be run whenever new data is available, which happens several times a week.
What I have been doing is assigning a public address to all instances and doing the refresh independently (using ec2-url/refresh). I agree with Michael's answer on a different question that EC2 instances behind an ELB shouldn't allow direct public access. Now my problem is: how can I make an elb-url/refresh call reach all instances behind the load balancer?
It would also be nice if I could collect the HTTP responses from the multiple instances. But I don't mind doing the refresh blindly for now.
One of the ways I'd solve this problem:
1. Write the data to an S3 bucket.
2. Trigger an AWS Lambda function automatically from the S3 write.
3. From the Lambda function, use the AWS SDK to identify the instances attached to the ELB, e.g. using boto3 from Python or the AWS Java SDK.
4. Call /refresh on the individual instances from the Lambda (see the sketch after this list).
5. Ensure that when a new instance is created (due to autoscaling or deployment), it fetches the data from the S3 bucket during startup.
6. Ensure that the private subnets the instances are in allow traffic from the subnets attached to the Lambda.
7. Ensure that the security groups attached to the instances allow traffic from the security group attached to the Lambda.
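A rough sketch of such a Lambda handler (assuming an ALB target group with instance targets rather than a classic ELB; the target group ARN is a placeholder):

# Sketch of the Lambda: find the instances behind the target group and
# call /refresh on each. Assumes instance-type targets; the ARN is a
# placeholder.
import urllib.request

import boto3

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:<placeholder-tg-arn>"

def handler(event, context):
    elbv2 = boto3.client("elbv2")
    ec2 = boto3.client("ec2")

    # Instance IDs currently registered with the target group.
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    instance_ids = [t["Target"]["Id"] for t in health["TargetHealthDescriptions"]]

    # Resolve each instance ID to its private IP and hit /refresh.
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            ip = instance["PrivateIpAddress"]
            with urllib.request.urlopen(f"http://{ip}/refresh", timeout=10) as resp:
                print(ip, resp.status)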
The key wins of this solution are:
- The process is fully automated from the instant the data is written to S3.
- It avoids data inconsistency due to autoscaling/deployment.
- It is simple to maintain (you don't have to hardcode instance IP addresses anywhere).
- You don't have to expose instances outside the VPC.
- It is highly available (AWS ensures the Lambda is invoked on the S3 write; you don't have to worry about running a script on an instance and ensuring that instance is up and running).
Hope this is useful.
While this may not be possible given the constraints of your application and circumstances, it's worth noting that best-practice application architecture for instances running behind an AWS ELB (particularly if they are part of an Auto Scaling group) is to ensure that the instances are not stateful.
The idea is to make it so that you can scale out by adding new instances, or scale-in by removing instances, without compromising data integrity or performance.
One option would be to change the application to store the results of the reference data reload into an off-instance data store, such as a cache or database (e.g. Elasticache or RDS), instead of in-memory.
If the application was able to do that, then you would only need to hit the refresh endpoint on a single server - it would reload the reference data, do whatever analysis and manipulation is required to store it efficiently in a fit-for-purpose way for the application, store it to the data store, and then all instances would have access to the refreshed data via the shared data store.
While there is a latency increase adding a round-trip to a data store, it is often well worth it for the consistency of the application - under your current model, if one server lags behind the others in refreshing the reference data, if the ELB is not using sticky sessions, requests via the ELB will return inconsistent data depending on which server they are allocated to.
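For illustration, under that model the refresh endpoint might do something like this (a sketch assuming the redis package and an ElastiCache for Redis endpoint; all names are illustrative):

# Sketch: /refresh writes reference data to a shared ElastiCache Redis
# node so every instance reads the same refreshed copy. Names are
# illustrative; assumes the redis package.
import json

import redis

cache = redis.Redis(host="my-cache.example.use1.cache.amazonaws.com", port=6379)

def load_reference_data():
    """Placeholder for the app's actual reload-and-transform step."""
    return {"rates": {"USD": 1.0, "EUR": 0.92}}

def refresh():
    # One write makes the data visible to all instances at once.
    cache.set("reference-data", json.dumps(load_reference_data()))

def get_reference_data():
    raw = cache.get("reference-data")
    return json.loads(raw) if raw else None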
You can't make these requests through the load balancer, so you will have to open up the security group of the instances to allow incoming traffic from sources other than the ELB. That doesn't mean you need to open it to all direct traffic, though. You could simply whitelist an IP address in the security group to allow requests from your specific computer.
If you don't want to add public IP addresses to these servers then you will need to run something like a curl command on an EC2 instance inside the VPC. In that case you would only need to open the security group to allow traffic from some server (or group of servers) that exist in the VPC.
I solved it differently, without opening up new traffic in security groups or resorting to external resources like S3. It's flexible in that it will dynamically notify instances added through ECS or ASG.
The ELB's target group offers a periodic health check to ensure the instances behind it are live. This is a URL that your server responds on. The endpoint can include a timestamp parameter for the most recent configuration. Every server in the target group will receive the health-check ping within the configured interval threshold. If the parameter in the ping changes, it signals a refresh.
A URL may look like:
/is-alive?last-configuration=2019-08-27T23%3A50%3A23Z
Above I passed a UTC timestamp of 2019-08-27T23:50:23Z
A service receiving the request will check if the in-memory state is at least as recent as the timestamp parameter. If not, it will refresh its state and update the timestamp. The next health-check will result in a no-op since your state was refreshed.
Implementation notes
If refreshing the state can take more time than the interval window or the target group's health timeout, you need to offload it to another thread to prevent concurrent updates or outright service disruption, as the health checks need to return promptly. Otherwise the node will be considered off-line.
If you are using the traffic port for this purpose, make sure the URL is secured by making it impossible to guess. Anything publicly exposed can be subject to a DoS attack.
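To make this concrete, here is a minimal sketch of such a health-check handler (Python standard library only; the actual reload work is a placeholder, and per the note above it runs on a background thread so the health check returns promptly):

# Sketch: health-check endpoint that doubles as a refresh signal.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

state_lock = threading.Lock()
current_stamp = ""  # timestamp of the configuration currently loaded

def reload_state(stamp):
    global current_stamp
    with state_lock:
        if stamp <= current_stamp:
            return  # another thread already refreshed
        # ... fetch and apply the new configuration here ...
        current_stamp = stamp

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/is-alive":
            self.send_response(404)
            self.end_headers()
            return
        stamp = parse_qs(url.query).get("last-configuration", [""])[0]
        if stamp > current_stamp:
            # Kick off the refresh without blocking the health check.
            threading.Thread(target=reload_state, args=(stamp,)).start()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 80), Handler).serve_forever()

Comparing the ISO-8601 UTC timestamps as strings works here because they sort lexicographically.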
As you are using S3 you can automate your task by using the ObjectCreated notification for S3.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-notification.html
You can install the AWS CLI and write a simple Bash script that will watch for that ObjectCreated notification; start a cron job that looks for the S3 notification of a newly created object.
Set up a condition in that script to curl http://127.0.0.1/refresh whenever it detects a new object created in S3, and you're done; you don't have to do it manually each time.
I personally like the answer by @redoc, but I wanted to give another alternative for anyone who is interested, which is a combination of theirs and the accepted answer. Using S3 object-creation events, you can trigger a Lambda, but instead of discovering the instances and calling them, which requires the Lambda to be in the VPC, you can have the Lambda use SSM (aka Systems Manager) to execute commands via a PowerShell or Bash document on EC2 instances that are targeted via tags. The document then calls http://127.0.0.1/refresh, like the accepted answer does.
The benefit of this is that your Lambda doesn't have to be in the VPC, and your EC2 instances don't need inbound rules to allow the traffic from the Lambda. The downside is that it requires the instances to have the SSM agent installed, which sounds like more work than it really is. There are AWS AMIs already optimized with the SSM agent, but installing it yourself in the user data is very simple.
Another potential downside, depending on your use case, is that SSM uses an exponential ramp-up for simultaneous executions: if you're targeting 20 instances, it runs 1, then 2 at once, then 4, then 8, until they are all done or it reaches whatever you set as the maximum. This is because of the error recovery it has built in; it doesn't want to destroy all your stuff if something is wrong, like slowly putting your weight on ice.
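For illustration, the SSM call from such a Lambda might look like this (a sketch; the tag key/value and the command are illustrative, while AWS-RunShellScript is the built-in Run Command document):

# Sketch: Lambda triggered by the S3 event runs the refresh via SSM Run
# Command on instances matching a tag, so the Lambda needs no VPC access
# and the instances need no new inbound rules.
import boto3

def handler(event, context):
    ssm = boto3.client("ssm")
    ssm.send_command(
        Targets=[{"Key": "tag:Role", "Values": ["web"]}],  # illustrative tag
        DocumentName="AWS-RunShellScript",  # built-in SSM document
        Parameters={"commands": ["curl -s http://127.0.0.1/refresh"]},
        MaxConcurrency="50%",  # optional cap on simultaneous executions
    )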
You could make the call multiple times in rapid succession to call all the instances behind the Load Balancer. This would work because the AWS Load Balancers use round-robin without sticky sessions by default, meaning that each call handled by the Load Balancer is dispatched to the next EC2 Instance in the list of available instances. So if you're making rapid calls, you're likely to hit all the instances.
Another option is that if your EC2 instances are fairly stable, you can create a Target Group for each EC2 Instance, and then create a listener rule on your Load Balancer to target those single instance groups based on some criteria, such as a query argument, URL or header.

How to manage and connect to dynamic IPs of EC2 instances?

When writing a web app with Django or such, what's the best way to connect to dynamic EC2 instances, such as a cluster of Redis or memcache instances? IP addresses change between reboots, etc. Elastic IPs are limited to 5 by default - what are some other options for auto-discovering/auto-updating which machines are available?
Late answer, but use Boto: http://boto.cloudhackers.com/en/latest/index.html
You can use security groups, tags, and other means to hit the EC2 API and pick the instances/IPs for each thing (DB Server, caching server, etc.) at load-time. We do this with great success in deployment, and are moving that way with our Django settings.py, as well.
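For illustration with the current boto3 (the successor to the boto library linked above), tag-based discovery at load time might look like this sketch; the role tag and the Django cache wiring are illustrative assumptions:

# Sketch: discover cache-server private IPs by tag at load time.
import boto3

def private_ips_for_role(role):
    ec2 = boto3.client("ec2")
    result = ec2.describe_instances(
        Filters=[
            {"Name": "tag:role", "Values": [role]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["PrivateIpAddress"]
        for reservation in result["Reservations"]
        for instance in reservation["Instances"]
    ]

# e.g. in Django settings.py (cache backend name is illustrative):
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": [f"{ip}:11211" for ip in private_ips_for_role("cache")],
    }
}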
One method that I heard mentioned recently in an AWS webinar was to store this sort of information in SimpleDB. Essentially, you would use SimpleDB as the central configuration location, and each instance that you launch would register its IP etc. with this configuration, so you would always have a complete description of all of your instances in one place. I haven't seen this in practice so I don't know what the best practices would be exactly, but the idea sounds reasonable. I suppose you could use SNS or something to signal all the other instances whenever the configuration changes, so everyone could refresh their in-memory cache of the configuration.
I don't know the AWS administrative APIs yet really, but there's probably an API call to list your EC2 instances, at which point you could use some sort of custom protocol to ping each of them and ask it what it is -- part of the memcache cluster, Redis, etc.
I'm having a similar problem and haven't found a solution yet, because we also need to map load balancer addresses.
For your problem, there are two good alternatives:
1. If you are not using EC2 micro instances or load balancers, you should definitely use Amazon Virtual Private Cloud, because it lets you control instance IPs and routing tables (check all the limitations before using this service).
2. If you are only using EC2 instances, you could write a script that uses the EC2 API tools to run ec2-describe-instances, find all instances and their public/private IPs, map instance names to hosts, and update /etc/hosts. Finally, you would put the script in the crontab of every computer/instance that needs to access the EC2 instances (see ec2-describe-instances, and the sketch below).
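A sketch of such a script in Python, with boto3 standing in for the legacy ec2-describe-instances tool (the convention of using the Name tag as the hostname is an assumption):

# Sketch: rewrite a managed block of /etc/hosts from EC2 instance data.
# Run from cron; needs write access to /etc/hosts and EC2 describe rights.
import boto3

MARKER = "# --- managed by ec2-hosts-sync ---"

def build_entries():
    ec2 = boto3.client("ec2")
    entries = []
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if "Name" in tags:  # Name tag doubles as the hostname
                entries.append(f'{instance["PrivateIpAddress"]} {tags["Name"]}')
    return entries

def rewrite_hosts(entries):
    with open("/etc/hosts") as f:
        # Keep everything above our marker; drop the old managed block.
        kept = f.read().split(MARKER)[0].rstrip("\n")
    with open("/etc/hosts", "w") as f:
        f.write(kept + "\n" + MARKER + "\n" + "\n".join(entries) + "\n")

if __name__ == "__main__":
    rewrite_hosts(build_entries())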
If you want to stay with plain EC2 instances (I'm in the same boat; I've read that you can do such things with their VPC, or use an S3 bucket, or something like that), I'm in the middle of writing stuff like this. It's all really simple up until the part where you need to contact the server from a server in your data center or somewhere else. The way I'm doing it currently is using the API to create the instance and start it; then, once it's ready, I contact the server to execute a PowerShell script that I have on it. The PowerShell renames the computer and reboots it, which takes care of needing the hostname and MAC for our data center firewalls. I haven't found a way yet to rename a computer remotely.
As far as knowing the IP goes, Elastic IPs are the way to go. They say you're only allowed 5 and have to apply for more, but we've been regularly requesting more and they keep granting them; we're up to about 15 now and they haven't complained yet.
Another option, if you don't want to do all the computer renaming and such: you could use DHCP and set your computer up so that when it boots it gets its name and everything else from DHCP. I'm not sure how to do this exactly, but during my research on Amazon I've come across very smart people telling me that's the way to do it.
I would definitely recommend that you get into the Amazon API. I've been working with it for less than a month and I can do all kinds of crazy things. My code can detect areas of our system that are getting stressed, spin up 10 Amazon servers all configured to act as whatever needs stress relief, and be ready to send jobs to them all in less than 7 minutes. Brings a tear to my eye.
The documentation is very complete, the API itself is a work of art and a joy to program against, and I've very much enjoyed working with it. (And no, I don't work for them, lol.)
Do it the traditional way: with DNS. This is what it was built for, so use it! When a machine boots, have it ask for the domain name(s) related to its function, and use that for your configuration. If it stops responding, re-resolve the DNS (or just do that periodically anyway).
I think Route 53 and the Elastic Load Balancing stuff can be used to do this, if you want to stick to Amazon solutions.