Cannot reach AWS Fargate task through ALB

I've got a web application running as an AWS ECS Fargate task. The task consists of 2 Docker containers - nginx exposing port 80, running as a reverse proxy and forwarding requests to an ASP.NET Core web application exposing port 5000. The URL configured in nginx.conf for the upstream server is 127.0.0.1:5000, and the task is set up with container networking (awsvpc).
The ECS Service is defined as an auto-scaling group of 1 task. When I run the service, AWS sets up an ENI with a public and a private IP. I can hit that public IP in a browser and get back a response from my web app, so the ECS part seems to be set up properly.
Next - I've defined an ALB with an HTTP port 80 listener forwarding to a target group for the ECS Service. The target group shows the private IP of the task ENI, so it appears to be set up correctly. Health checks are configured as a simple "/", and both the task and the ALB target group report it as healthy.
However - when I navigate to the DNS name for the LB, I'm unable to get a response.
Additionally - this is running in a non-default VPC. Route table includes an IGW.
Not sure what else I should be checking, so would appreciate some help in troubleshooting further.
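One way to sanity-check the load-balancer side of this (the ALB's scheme, the subnets it is attached to, and its security group's ingress rules) is a short boto3 script. This is only a sketch; the load balancer name below is a placeholder.

# Sketch: dump the ALB's scheme, subnets and security-group ingress rules.
# "my-alb" is a placeholder name; boto3 and AWS credentials are assumed.
import boto3

elbv2 = boto3.client("elbv2")
ec2 = boto3.client("ec2")

lb = elbv2.describe_load_balancers(Names=["my-alb"])["LoadBalancers"][0]
print("scheme:", lb["Scheme"])
print("subnets:", [az["SubnetId"] for az in lb["AvailabilityZones"]])

for sg in ec2.describe_security_groups(GroupIds=lb["SecurityGroups"])["SecurityGroups"]:
    for rule in sg["IpPermissions"]:
        print(sg["GroupId"], rule.get("FromPort"), rule.get("ToPort"),
              [r["CidrIp"] for r in rule.get("IpRanges", [])])

For an internet-facing ALB serving HTTP, the things to confirm are an "internet-facing" scheme, subnets whose route table points at the IGW, and a TCP 80 ingress rule from 0.0.0.0/0.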

Related

EC2 nginx config and SSL for ALB

I am trying to add an AWS ALB for my EC2 instances. I created an Application Load Balancer for two EC2 instances, as well as an ALB with an auto-scaling group, but neither works. The individual EC2 instances run fine when I test them directly, but the ALB's public address returns an error page. I wonder whether the nginx on the EC2 instances needs to be configured differently, and whether SSL should be added to the ALB or to both EC2 instances. I am hosting a React/Node.js app on the EC2 instances. Can anyone give me some direction on how to troubleshoot and fix this issue? Thanks
I wonder whether the nginx on the EC2 instances needs to be configured differently, and whether SSL should be added to the ALB or to both EC2 instances
Usually you add SSL to the ALB. There are only a few use cases where SSL on the instances would be needed (e.g. strict regulatory requirements for end-to-end encryption). So in the general case you would have:
Client --- (HTTPS) ---> ALB ---- (HTTP) ---> Instances
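As a rough boto3 sketch of that layout (all ARNs below are placeholders, and an existing ACM certificate is assumed): TLS terminates on the ALB's HTTPS listener, while the target group it forwards to stays plain HTTP towards the instances.

# Sketch: HTTPS listener terminating TLS on the ALB, forwarding plain HTTP
# to the instances' target group. All ARNs are placeholders.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef",
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:us-east-1:123456789012:certificate/abcd-ef01-2345"}],
    DefaultActions=[{
        "Type": "forward",
        # The target group itself is HTTP on the instances' port.
        "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-instances/abcdef1234567890",
    }],
)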

AWS: Access to the application through the Load Balancer or by the Public IP specified in the Task from the service

There is an ECS service with Auto-assign public IP enabled, configured to run a task from a Task Definition - the task starts as a web application and becomes accessible via its public IP.
If I create the exact same service that launches a task from the same Task Definition, but with a Load Balancer, the task also launches and the application starts successfully (which is also visible in CloudWatch Logs), but I cannot access the application via the public IP of the running task.
In the Load Balancer's Target Group the task is registered, but after the unsuccessful health check the task stops (in the Target Group it passes through the statuses initial > unhealthy > draining). I tried increasing the health check time in the hope of reaching the task.
Also, in the security group used by the service, I added an inbound rule: Custom TCP 8080 (as my application listens on port 8080) with the load balancer's security group as the source.
The question is: should the application be accessible via the public IP specified in the Task if the service is created with a Load Balancer? I can't understand why in the service without an ELB the application is accessible via the Task's public IP, while in the service created with an ELB it is not.
Moreover, because the task is stopped by the failing health check, it is not available through the ELB either.
I followed these instructions when creating the service with an ELB: https://aws.amazon.com/premiumsupport/knowledge-center/create-alb-auto-register/
Please recommend in which direction to look for a solution.
If you have a load balancer then you should definitely not be exposing the task publicly. It should be private, and if possible its security group should only allow access from the load balancer.
If the task fails its health checks then you can find out the reason it failed by checking the targets tab of the target group:
If the target shows it failed due to a timeout, this is a security group issue.
If the target shows a status code mismatch, this is your application not returning the response that the health check expects.
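The same details can be read from the API. A small boto3 sketch (the target group ARN is a placeholder) that prints the state, reason and description for each registered target:

# Sketch: read the target group's health details via the API instead of the
# console "Targets" tab. The target group ARN is a placeholder.
import boto3

elbv2 = boto3.client("elbv2")

resp = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-service-tg/0123456789abcdef"
)
for target in resp["TargetHealthDescriptions"]:
    health = target["TargetHealth"]
    # Reason is e.g. Target.Timeout (security group problem) or
    # Target.ResponseCodeMismatch (the app not returning the expected code).
    print(target["Target"]["Id"], health["State"],
          health.get("Reason"), health.get("Description"))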

AWS ALB (Application Load Balancer) - "502 Bad Gateway" Issue

We are using a multi-container Docker environment for our project to deploy microservices (Scala) in AWS. We are using AWS ECS (Elastic Container Service) to deploy and manage the application in the AWS Cloud. We have placed the 5 microservices in separate Task Definitions and launched them using ECS.
We have set up an ALB (Application Load Balancer), mapped it to ECS, and got the ALB (CNAME) domain. We have created new listener rules to route requests to the target APIs (path-based routing):
http://umojify-alb-1987551880.us-east-1.elb.amazonaws.com
Finally, we got the response "502 Bad Gateway" and "Status code: 405". Please guide us on this issue.
Where does the issue come from, and why? Is it the ALB or the API?
API URL:
http://umojify-alb-1987551880.us-east-1.elb.amazonaws.com/save-user-rating
AWS ECS uses dynamic ports to connect to the microservice containers. Please check that those ports are open on the container hosts (instances). I faced the same issue and had to open all the TCP ports for the ALB. See the AWS documentation for configuring the security group rules for container instances -
AWS security group rules for container instances
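A boto3 sketch of that rule, with placeholder group IDs: the source is the ALB's security group, and the range is the default ECS dynamic host-port range, which is tighter than opening all TCP ports.

# Sketch: open the ECS dynamic (ephemeral) host-port range on the container
# instances' security group, but only to traffic coming from the ALB's
# security group. Both group IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # container instances' security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 32768,            # default ECS/Docker ephemeral port range
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210"}],  # ALB's SG
    }],
)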

How do I work out why an ECS health-check is failing?

Outline:
I have a very simple ECS container which listens on port 5000 and writes out HelloWorld, plus the hostname of the instance it is running on. I want to deploy many of these containers using ECS and load-balance them, mainly to learn more about how this works. It is working to a certain extent, but my health check is failing (timing out), which causes the container tasks to be bounced up and down.
Current configuration:
1 VPC ( 10.0.0.0/19 )
1 Internet gateway
3 private subnets, one for each AZ in eu-west-1 (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24)
3 public subnets, one for each AZ in eu-west-1 (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
3 NAT instances, one in each of the public subnets, routing 0.0.0.0/0 to the Internet gateway and each assigned an Elastic IP
3 ECS instances, again one in each private subnet with a route to the NAT instance in the corresponding public subnet in the same AZ as the ECS instance
1 ALB load balancer (Internet facing) which is registered with my 3 public subnets
1 Target group (with no instances registered, as per ECS documentation) but a health check set up on the 'traffic' port at /health (see the sketch just after this list)
1 Service bringing up 3 tasks spread across AZs and using dynamic ports (which are then mapped to 5000 in the docker container)
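A boto3 sketch of that target group (the name and VPC ID are placeholders): the health check runs against the dynamic "traffic" port at /health and expects a 200.

# Sketch: target group with a health check on the dynamic ("traffic") port
# at /health, expecting HTTP 200. Name and VPC ID are placeholders.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_target_group(
    Name="hello-world-tg",
    Protocol="HTTP",
    Port=80,                          # overridden per target by the dynamic host port
    VpcId="vpc-0123456789abcdef0",
    HealthCheckProtocol="HTTP",
    HealthCheckPort="traffic-port",
    HealthCheckPath="/health",
    HealthCheckIntervalSeconds=30,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=5,
    Matcher={"HttpCode": "200"},
)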
Routing
Each private subnet has a rule to 10.0.0.0/19, and a default route for 0.0.0.0/0 to the NAT instance in the public subnet in the same AZ.
Each public subnet has the same 10.0.0.0/19 route and a default route for 0.0.0.0/0 to the internet gateway.
Security groups
My instances are in a group that allows egress to anywhere and ingress on ports 32768 - 65535 from the security group the ALB is in.
The ALB is in a security group that allows ingress on port 80 only, but egress to the security group my ECS instances are in on any port/protocol.
What happens
When I bring all this up, it actually works - I can take the public DNS record of the ALB, refresh, and see responses coming back from my container app telling me the hostname. This is exactly what I want to achieve. However, it fails the health check and the container is drained and replaced - with another one that also fails the health check. This continues in a cycle; I have never seen a single successful health check.
What I've tried
Tweaked the health check intervals so that ECS requires about 5 minutes of solidly failed health checks before killing the task. I thought this would rule out the check being overly sensitive while the task starts up. This still goes on to trigger the tear-down, despite me being able to view the application running in my browser throughout.
Confirmed the /health URL endpoint in a number of ways. I can retrieve it publicly via the ALB (as well as view the main app root URL at '/'), and curl tells me it has a proper 200 OK response (which the health check is set to look for by default). I have SSH'ed into my ECS instances and performed a curl --head {url} on '/' and '/health', and both give a 200 OK response. I've even spun up another instance in the public subnet, granted it the same access to my instances as the ALB security group, and been able to curl the health check from there.
Summary
I can view my application correctly load-balanced across AZs and private subnets, on both its main URL '/' and its health check URL '/health', through the load balancer, from the ECS instance itself, and by using the instance's private IP and port from another machine within the public subnet the ALB is in. Yet the ECS health check never succeeds even once; it always times out. What on earth could I be missing??
For any who follow: I managed to accidentally break the app in my container and it started throwing a 500 error. Crucially though, the health check started reporting this 500 error, therefore it was NOT a network timeout. Which means that when the health check contacts the endpoint in my app, the response is not being handled properly. This appears to be a problem related to Nancy (the API framework I was using) and Go, which sometimes reports "Client.Timeout exceeded while awaiting headers", and I am sure ECS is interpreting this as a network timeout. I'm going to tcpdump the network traffic to see what the health check is sending and what Nancy is responding with, and compare that to a container that works. Perhaps there is a Nancy fix, or maybe ECS needs to not be so fussy.
edit:
Simply updating all the NuGet packages that my Nancy app was using to the latest available versions made everything suddenly start working!
More questions than answers, but maybe they will take you in the right direction.
You say that you can access the container app via the ALB, but then the node fails the health check. The ALB should not be allowing connections to the node until its health check succeeds. So if you are connecting to the node via the ALB, then the ALB must have tested it and decided it was healthy. Is it a different health check that is killing the node?
Have you checked CloudTrail to see if it has any clues about what is triggering the tear-down? Is the tear-down being triggered by the ALB or the auto scaling group? Might it be that the auto scaling group has the wrong scale-in criteria?
Good luck

Registering an ELB to an ECS service with random host port

I'm working with the ECS service on AWS and I have this problem: the Docker containers I need to run on ECS are web services, and each container should have its internal port 80 mapped to a random port on the container host. I don't want to specify the host port for container port 80 beforehand; I'd like to let the Docker daemon find a host port for the container.
But how does the ELB fit in here? It looks to me like I have to know the host port to be able to create the ELB for the service.
Is it so?
This is now possible using an Application Load Balancer.
However, if you need to open up inbound traffic on the security group, note that the security group's port rules are not updated automatically.
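A rough boto3 sketch of that wiring (cluster, names and the target group ARN are placeholders): the service's load balancer entry only names the container and its container port, and ECS registers whatever host port Docker assigns with the target group.

# Sketch: ECS service behind an ALB target group with dynamic host ports.
# The task definition maps containerPort 80 to hostPort 0; ECS registers the
# actual host port with the target group. All names/ARNs are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",
    serviceName="web",
    taskDefinition="web-task:1",
    desiredCount=3,
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-tg/abcdef0123456789",
        "containerName": "web",
        "containerPort": 80,          # no host port specified here
    }],
)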
An ELB does not allow binding to a random port.
We recently implemented service discovery with ECS and Consul. We had to introduce Zuul as an intermediate layer between the ELB and our apps.
The ELB maps to Zuul on a static port, and Zuul discovers the backend services dynamically and routes traffic to them.
You need a service discovery system, such as HashiCorp's Consul, and then you need to integrate it with the AWS infrastructure: https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/