ECS ELB Health Checks

My main issue is trying to work out why my health checks are failing on ECS.
My setup
I have successfully set up an ECS cluster using an EC2 auto-scaling group. All the EC2 instances are in private subnets with NAT gateways.
I have a load balancer connected to a target group, which in turn is linked to the ECS service.
When I try and get an HTTP response from the load balancer from my local machine, it times out. So I am obviously not getting responses back from the containers.
I have been able to SSH into the EC2 instances and confirmed the following (rough commands are sketched below):
ECS is deploying containers onto the EC2 instances, then after some time killing them and starting them up again
I can curl the health check endpoint from the EC2 instance (localhost) and it responds successfully
I can reach the internet from the EC2 instance, e.g. curl google.com returns an HTML response
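For reference, the checks I ran from the instance looked roughly like this (the port 8080 and the /health path are placeholders for whatever the container actually exposes):
sudo docker ps                                    # confirms ECS has placed containers on this instance
curl -v http://localhost:8080/health              # container health endpoint answers locally
curl -sS https://www.google.com > /dev/null && echo "outbound OK"   # NAT gateway path works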
My question: there seem to be two different types of health check going on, and I can't figure out which is which.
ELB health-checks
The ELB seems, as far as I can tell, to use the health-checks defined in the target group.
The target group is defined as a list of EC2 instances. So does that mean the ELB is sending requests to the instances to see if they are running?
This would of course fail because we cannot guarantee that ECS will have deployed a container to each instance.
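One way to see exactly what the target group thinks of each target, and why it is marked unhealthy, is to ask it directly (the ARN below is a placeholder):
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/my-tg/abcdef1234567890
The output lists each registered target with its port, its TargetHealth state, and a reason code such as Target.Timeout or Target.FailedHealthChecks.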
ECS health-checks
ECS however is responsible for deploying containers into these instances, in what could turn out to be a many-to-many relationship.
So surely ECS would be querying the actual running containers to find out if they are healthy and then killing them if required.
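And on the ECS side, the service's event log usually says why tasks were stopped (for example, failing ELB health checks); the cluster and service names below are placeholders:
aws ecs describe-services --cluster my-cluster --services my-service \
  --query 'services[0].events[0:5].message' --output text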
My confusion / question
I don't really understand what role the ELB has in managing the EC2 instances in this context.
It doesn't seem like the EC2 instances are being stopped and started. However, the docs seem to indicate that the ASG / ELB will manage the EC2 instances and restart them if they fail the health check.
Does ECS somehow override this default behaviour and take responsibility for running the healthchecks instead of the ELB?
And if not, won't the health check just fail on any EC2 instance that happens not to have a container running on it?

Related

AWS target groups turn unhealthy with no data

I have a backend server deployed on AWS on a single EC2 instance via Elastic Beanstalk. The server has IP whitelisting and hence does not respond to ALB health checks, so all target groups always remain unhealthy.
According to the official AWS docs on health checks,
If a target group contains only unhealthy registered targets, the load balancer nodes route requests across its unhealthy targets.
This is what keeps my application running even though the ALB target groups are always unhealthy.
This changed last night and I faced an outage where all requests started getting rejected with 503s for reasons I'm not able to figure out. I was able to get things to work again by provisioning another EC2 instance by increasing minimum capacity of elastic beanstalk.
During the outage window, CloudWatch shows neither healthy nor unhealthy instances, even though nothing actually changed: the same single EC2 instance had been running untouched for the past few months.
In that gap, I can still find metrics for TCP connections, though.
I don't really understand what happened here. Can someone explain what happened, or how to debug this?
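One thing that can help when debugging a window like this is pulling the target group's HealthyHostCount / UnHealthyHostCount metrics straight from CloudWatch; all names, dimension values and the time range below are placeholders:
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/my-tg/abcdef1234567890 \
               Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --start-time 2020-01-01T00:00:00Z --end-time 2020-01-01T06:00:00Z \
  --period 300 --statistics Minimum Maximum
A gap where neither metric is published usually means the target group briefly had no registered targets at all, rather than targets failing health checks.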

AWS EKS Workers never register with ELB

I have an AWS EKS cluster up and running, with a single worker running our application. The load balancer was created correctly in our private subnets using the service annotations and subnet tags. But the worker node remains OutOfService, with just the default ping health check.
Everything else appears to be working fine. I just can't get the worker to pass the healthcheck.
EDIT: I have tried with an internet-facing ELB as well, same issue. Verified that the worker instance security group is being altered to allow for all traffic from the load balancer, so it should be working, but it's not.
Figured this out. At some point while testing, I had the wrong target port in the service.
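For anyone hitting the same symptom, a quick way to catch a port mismatch like this is to compare the Service spec with its endpoints (my-service is a placeholder name):
kubectl describe svc my-service        # check Port / TargetPort / NodePort against the port the pod actually listens on
kubectl get endpoints my-service       # no endpoints usually points at a selector mismatch or unready pods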

Should health-check for my Application Load Balancer be EC2 when using ECS?

I've been trying to configure a Cloudformation template for ECS along with Application Load Balancer (ALB) with dynamic ports.
Does the AutoScalingGroup's (ASG) health check type need to be EC2? The examples seem to use EC2 and when I set it to ELB the health check seems to fail.
If it does indeed need to be set to EC2 then does ECS manage the health of the containers itself and the ALB only manages the health of the container instances and not the containers?
Edit:
Having thought about this a bit more, it probably makes sense to use the EC2 health check, since if I had multiple containers on the container instance, one unhealthy container shouldn't cause the whole container instance to go down. However, if the ALB only monitors the instance, then does ECS monitor the health of the containers?
Googling my question I came across this AWS blog but it references using ELB for health checks...
Your Auto Scaling Group health check is independent of the ECS / load balancer monitoring. I'm not exactly sure which ASG health check setting you mean.
In any case, for your ECS monitoring to be aware of the health of your container, you'll want to set the health check settings on your target groups that are connected to your services. ECS will use the information that's visible in the target group to kill containers that are not considered healthy.
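If it helps, those target group health check settings can also be adjusted from the CLI; the ARN and the /health path below are placeholders for your own values:
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/my-tg/abcdef1234567890 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3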
The templates here are great:
http://templates.cloudonaut.io/en/stable/ecs/
The ECS templates for the cluster, and for the service on top of it, include everything you need: auto-scaling, load balancing, health checks, you name it.
They require a bit of tweaking but they should get you started well even out of the box.
Pay attention to the stack dependencies. Before running the ECS service template, you need to install the stacks for vpc, vpc-s3-endpoint, alert, nat-gateway (if you're building a service confined to private subnets), and the cluster layer itself, roughly in the order sketched below.
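Something like this, where the stack names and template file names are illustrative placeholders rather than the exact cloudonaut paths:
aws cloudformation create-stack --stack-name vpc             --template-body file://vpc.yaml
aws cloudformation create-stack --stack-name vpc-s3-endpoint --template-body file://vpc-s3-endpoint.yaml
aws cloudformation create-stack --stack-name alert           --template-body file://alert.yaml
aws cloudformation create-stack --stack-name nat-gateway     --template-body file://nat-gateway.yaml
aws cloudformation create-stack --stack-name ecs-cluster     --template-body file://cluster.yaml --capabilities CAPABILITY_IAM
aws cloudformation create-stack --stack-name ecs-service     --template-body file://service.yaml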
Have fun!

Why can't my ECS service register available EC2 instances with my ELB?

I've got an EC2 launch configuration based on the ECS-optimized AMI. I've got an auto scaling group that ensures I've got at least two available instances at all times. Finally, I've got a load balancer.
I'm trying to create an ECS service that distributes my tasks across the instances in the load balancer.
After reading the documentation for ECS load balancing, it's my understanding that my ASG should not automatically register my EC2 instances with the ELB, because ECS takes care of that. So, my ASG does not specify an ELB. Likewise, my ELB does not have any registered EC2 instances.
When I create my ECS service, I choose the ELB and also select the ecsServiceRole. After creating the service, I never see any instances available in the ECS Instances tab. The service also fails to start any tasks, with a very generic error of ...
service was unable to place a task because the resources could not be found.
I've been at this for about two days now and can't seem to figure out what configuration settings are not properly configured. Does anybody have any ideas as to what might be causing this to not work?
Update # 06/25/2015:
I think this may have something to do with the ECS_CLUSTER user data setting.
In my EC2 auto scaling launch configuration, if I leave the user data input completely empty, the instances are created with an ECS_CLUSTER value of "default". When this happens, I see an automatically-created cluster, named "default". In this default cluster, I see the instances and can register tasks with the ELB like expected. My ELB health check (HTTP) passes once the tasks are registered with the ELB and all is good in the world.
But, if I change that ECS_CLUSTER setting to something custom I never see a cluster created with that name. If I manually create a cluster with that name, the instances never become visible within the cluster. I can't ever register tasks with the ELB in this scenario.
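For what it's worth, a quick way to check which cluster the agent on a given instance actually registered with is the agent's local introspection endpoint, plus the config file itself:
curl -s http://localhost:51678/v1/metadata    # shows the Cluster and ContainerInstanceArn the agent registered with
cat /etc/ecs/ecs.config                       # should contain ECS_CLUSTER=<your cluster name>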
Any ideas?
I had similar symptoms but ended up finding the answer in the log files:
/var/log/ecs/ecs-agent.2016-04-06-03:
2016-04-06T03:05:26Z [ERROR] Error registering: AccessDeniedException: User: arn:aws:sts::<removed>:assumed-role/<removed>/<removed> is not authorized to perform: ecs:RegisterContainerInstance on resource: arn:aws:ecs:us-west-2:<removed>:cluster/MyCluster-PROD
status code: 400, request id: <removed>
In my case, the resource existed but was not accessible. It sounds like OP is pointing at a resource that doesn't exist or isn't visible. Are your clusters and instances in the same region? The logs should confirm the details.
In response to other posts:
You do NOT need public IP addresses.
You do need the ecsServiceRole or an equivalent IAM role assigned to the EC2 instance so it can talk to the ECS service. You must also specify the ECS cluster; this can be done via user data during instance launch, or in the launch configuration definition, like so:
#!/bin/bash
echo ECS_CLUSTER=GenericSericeECSClusterPROD >> /etc/ecs/ecs.config
If you fail to do this on newly launched instances, you can add it after the instance has launched and then restart the agent service, as sketched below.
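For example, something along these lines on the instance itself (the cluster name is a placeholder, and the restart command depends on the AMI generation):
echo "ECS_CLUSTER=my-cluster" | sudo tee -a /etc/ecs/ecs.config
sudo stop ecs && sudo start ecs      # upstart, original ECS-optimized Amazon Linux AMI
# or: sudo systemctl restart ecs     # systemd, Amazon Linux 2 based AMIs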
In the end, it turned out that my EC2 instances were not being assigned public IP addresses. It appears ECS needs to be able to directly communicate with each EC2 instance, which would require each instance to have a public IP. I was not assigning my container instances public IP addresses because I thought I'd have them all behind a public load balancer, and each container instance would be private.
Another problem that might arise is not assigning a role with the proper policy to the Launch Configuration. My role didn't have the AmazonEC2ContainerServiceforEC2Role policy (or the permissions that it contains) as specified here.
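Attaching that managed policy to the instance role can be done from the CLI as well; the role name below is a placeholder, and the ARN is the standard AWS-managed policy:
aws iam attach-role-policy \
  --role-name my-ecs-instance-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role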
You definitely do not need public IP addresses for each of your private instances. The correct (and safest) way to do this is to set up a NAT Gateway and attach that gateway to the routing table that is attached to your private subnet.
This is documented in detail in the VPC documentation, specifically Scenario 2: VPC with Public and Private Subnets (NAT).
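A minimal sketch of that setup with the CLI, using placeholder IDs throughout (note the NAT gateway itself lives in a public subnet, while the route is added to the private subnet's route table):
aws ec2 create-nat-gateway --subnet-id subnet-0public000000000 --allocation-id eipalloc-0abc000000000000
aws ec2 create-route --route-table-id rtb-0private0000000000 \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0123456789abcdef0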
It might also be that the ECS agent creates a file in /var/lib/ecs/data that stores the cluster name.
If the agent first starts up with the cluster name of 'default', you'll need to delete this file and then restart the agent.
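Something like the following on the container instance (the exact checkpoint file name is an assumption and may vary by agent version):
sudo stop ecs                                     # or: sudo systemctl stop ecs
sudo rm /var/lib/ecs/data/ecs_agent_data.json     # agent checkpoint that pins the old cluster name
sudo start ecs                                    # agent re-registers using the current ECS_CLUSTER value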
There were several layers of problems in our case. I will list them out so it might give you some idea of the issues to pursue.
My goal was to have one ECS host. But the load balancer forces you to have two subnets in your VPC (in different Availability Zones), each with a Docker host instance; I was trying to have just one Docker host in one Availability Zone and could not get it to work.
The other issue was that only one of the subnets had an internet-facing gateway attached to it, so the other one was not reachable from the public internet.
The end result was that DNS was serving two IPs for my load balancer, and one of the IPs would work while the other did not. So I was seeing random 404s when accessing the load balancer via its public DNS name.

AWS ELB not associating with EC2 every time it's switched on

I have a personal website hosted on AWS EC2 behind an ELB. Today I started my EC2 instances (I had turned them off due to non-usage and, of course, to save some cost) and tried to load my website via the AWS Elastic Load Balancer's public DNS URL, but it was not coming up in my browser; instead of the webpage I got a blank white page. So I checked my EC2 instances and the ELB.
In the Elastic Load Balancer section, I can see the status message showing the registered EC2 instances as “Out of Service”! I tried changing the health check parameter values; nothing happened. So I deregistered the EC2 instances from the load balancer and registered them again. After a few minutes the instances came up as “In Service” (it took some time because the instances have to register with the load balancer and pass the health check). Finally I brought my website back up.
Solutions tried --
If you have launched your instance in EC2-VPC, by default the IP address associated with your instance does not change when you stop and then start the instance. However, when you stop and then start your EC2-VPC instance, your load balancer might take some time to recognize that the stopped instance has started. During this time your load balancer is not connected to the restarted instance. I recommend that you reregister your restarted instance with the load balancer.
My instance is in EC2-VPC and I tried the above; when I re-register, the instance falls back into the load balancer, but otherwise I am just waiting to no avail. Any reason?
This is a very common issue with AWS ELB. What you can do is add the following lines at the
end of your /etc/rc.local (assuming you are running a Linux box):
elb-deregister-instances-from-lb <load_balancer_name> --instances <instance-id>
elb-register-instances-with-lb <load_balancer_name> --instances <instance-id>
This first deregisters your instance from the ELB and then registers it back.
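If you're on the current AWS CLI rather than the old standalone ELB tools, the equivalent for a classic load balancer would be (names and instance ID are placeholders):
aws elb deregister-instances-from-load-balancer --load-balancer-name my-elb --instances i-0123456789abcdef0
aws elb register-instances-with-load-balancer   --load-balancer-name my-elb --instances i-0123456789abcdef0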
Regards
Rajarshi Haldar