Attempting to reach AWS Network Load Balancer leads to timeout

I have an internal (not Internet-facing) NLB set up in a VPC. The NLB routes to a target group containing only one target, and health checks are succeeding.
However, I am unable to make JDBC calls using the NLB's DNS name. The NLB has a listener on port 10000, and I have EC2 instances running an application in the same VPC. When these EC2 instances attempt to make a JDBC call to `jdbc:hive2://nlb-dns-name.com:10000/orchard`, they time out trying to connect. I've logged into the EC2 instances and attempted to ping the NLB DNS record, which also times out.
Please let me know if there is something obvious I'm overlooking here. Thank you!
Edit: The EC2 instances' Security Groups allow all outbound traffic to the same VPC. The SGs of the NLB's target allow inbound traffic, and the health check is passing. The NLB listens on port 10000 and routes to a target group containing the master node of an EMR cluster, which listens for JDBC connections on port 10000.
However, I'm reasonably sure the problem is not in the NLB -> target routing, since the health checks pass. I believe the problem is in the instance -> NLB leg, given the timeout, and I'm not sure whether I'm doing that part correctly.
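To isolate the instance -> NLB leg, a plain TCP connection test against the listener port is more informative than ping (an NLB does not answer ICMP). A minimal sketch, run from one of the EC2 instances, using the placeholder DNS name and port from the question:

```python
# Minimal TCP reachability check from an EC2 instance to the NLB listener.
# The host and port below are placeholders taken from the question.
import socket

HOST = "nlb-dns-name.com"  # the NLB's DNS name
PORT = 10000               # the NLB listener port

try:
    # A successful connect() means the instance can reach the listener;
    # a timeout here points at security groups, NACLs, or routing.
    with socket.create_connection((HOST, PORT), timeout=5):
        print("TCP connection to the NLB listener succeeded")
except OSError as exc:
    print(f"TCP connection failed: {exc}")
```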

Realized I never posted an update: the issue was that I was adding my targets by instance ID. For some reason, when I added them by IP address instead, connections started succeeding.
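For reference, a minimal boto3 sketch of the fix described above: registering the backend by private IP address rather than by instance ID (this requires the target group to have been created with TargetType='ip'; the ARN, IP, and port below are placeholders):

```python
# Register the backend by IP address in the NLB target group (boto3 sketch).
# All identifiers below are placeholders.
import boto3

elbv2 = boto3.client("elbv2")

TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/example/abc123"

elbv2.register_targets(
    TargetGroupArn=TG_ARN,
    Targets=[{"Id": "10.0.1.25", "Port": 10000}],  # private IP of the backend
)

# Confirm the target's health state afterwards.
health = elbv2.describe_target_health(TargetGroupArn=TG_ARN)
print(health["TargetHealthDescriptions"])
```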

Related

Cannot call API with ALB's DNS

I have an API on AWS ECS, connected with an Application Load Balancer. It has two target groups for blue/green deployment with CodeDeploy. The deployment works and the targets are healthy, so I assume the app runs and the ports are configured correctly. The port I use is 3000, and the listener is set to HTTP:3000 as well.
The load balancer is assigned to the default VPC security group, and for testing purposes I added an inbound rule that accepts all traffic from 0.0.0.0/0, so in theory it should be accessible to anyone. When I try to call the health check endpoint at {alb_dns}/rest/health (the same path the health checker tests successfully), I get an ECONNREFUSED error. Why can't I access it?
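One way to sanity-check what the load balancer is actually exposing is to list its listeners and the inbound rules on its security groups. A minimal boto3 sketch, with a placeholder load balancer ARN:

```python
# Inspect the ALB's listeners and attached security-group rules (boto3 sketch).
# The load balancer ARN below is a placeholder.
import boto3

elbv2 = boto3.client("elbv2")
ec2 = boto3.client("ec2")

ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example/abc123"

for lsn in elbv2.describe_listeners(LoadBalancerArn=ALB_ARN)["Listeners"]:
    print("listener:", lsn["Protocol"], lsn["Port"])

alb = elbv2.describe_load_balancers(LoadBalancerArns=[ALB_ARN])["LoadBalancers"][0]
for sg in ec2.describe_security_groups(GroupIds=alb["SecurityGroups"])["SecurityGroups"]:
    for perm in sg["IpPermissions"]:
        print("inbound:", perm.get("FromPort"), perm.get("ToPort"), perm.get("IpRanges"))
```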

AWS Network Load Balancer fails to connect to an EC2 instance if the instance is not open to the public

I have been struggling with this problem for 2 days but couldn't get it working.
I have this flow:
external world --> AWS API Gateway --> VPC Link --> Network Load Balancer --> my single EC2 instance
Before introducing the API Gateway, I want to first make sure the Network Load Balancer --> my single EC2 instance part works.
I have set up the EC2 instance correctly. There is a TypeScript / ExpressJS API service running on port 3001.
I have also set up a Network Load Balancer and a Target Group, the NLB is listening and forwarding requests to port 3001 of the target group (which contains the EC2 instance).
Here is the NLB setup: note that the NLB has a VPC! This raises the question below and I find it so confusing. The NLB's listener forwards requests to a target group called docloud-backend-service, and that target group's health check has passed.
I have configured the security group of my EC2 instance with this rule:
1. Allow All protocol traffic on All ports from my VPC
(specified using CIDR notation `171.23.0.0/16`);
Now, when I do `curl docloud-backend-xxxxx.elb.ap-northeast-1.amazonaws.com:3001/api/user`, the command fails with a timeout.
Then, after I add this rule:
2. Allow All protocol traffic on All ports from ANY source (`0.0.0.0/0`);
now, when I do `curl docloud-backend-xxxxx.elb.ap-northeast-1.amazonaws.com:3001/api/user`, the API service receives the request and I can see logs generated on the EC2 instance.
Question:
The second rule opens up the EC2 instance to the public, which is dangerous.
I want to limit access to my EC2 instance port 3001 such that only the AWS API Gateway, or the NLB can access it.
The NLB has no security group to be configured. It has a VPC though. If I limit the EC2 instance such that only its own VPC can access it, it should be fine, right?
The first rule does exactly that. Why does it fail?
The NLB has a VPC. Requests go from API Gateway to NLB, then from NLB to EC2 instance. So from the EC2 instance's perspective, the requests come from an entity in the VPC. So the first rule should work, right?
Otherwise, why would AWS assign a VPC to the NLB at all? Why would I see the VPC on the NLB's description console?
I want to limit access to my EC2 instance port 3001 such that only the AWS API Gateway, or the NLB can access it.
For both instance-based and IP-based target groups, we can enable or disable whether the requester's (client's) IP address is preserved:
This setting can be found by going to the target group -> Actions -> Edit Target attributes.
What does this mean from the perspective of the Security Group of our application?
If we enable it (which is the default for instance-type target groups), the application will see the traffic as if it came directly from the end client. This means we have to allow inbound traffic from 0.0.0.0/0 on port 3001.
If we disable it, the application will see the source of the traffic as the private IP address of the Network Load Balancer. In this case, we can limit the inbound traffic to the private IP address of the NLB or to the CIDR range of the subnet in which the NLB is placed.
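A minimal boto3 sketch of that second option, with placeholder identifiers: turn off client IP preservation on the target group, then allow port 3001 only from the CIDR of the subnets the NLB is placed in:

```python
# Disable client IP preservation, then scope the instance's security group
# to the NLB subnet range (boto3 sketch; all identifiers are placeholders).
import boto3

elbv2 = boto3.client("elbv2")
ec2 = boto3.client("ec2")

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:ap-northeast-1:111122223333:targetgroup/docloud-backend-service/abc123",
    Attributes=[{"Key": "preserve_client_ip.enabled", "Value": "false"}],
)

# With preservation off, traffic reaches the instance from the NLB's private
# addresses, so an ingress rule scoped to the NLB's subnet CIDR is enough.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the EC2 instance's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3001,
        "ToPort": 3001,
        "IpRanges": [{"CidrIp": "171.23.1.0/24", "Description": "NLB subnet"}],
    }],
)
```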

IP whitelisting for local machine on ec2 instance using inbound rules in security group

I have configured the security group on the EC2 instance to allow only my IP, but I am getting a 504 Gateway Timeout error.
When I open it to the world, i.e. 0.0.0.0/0, it works fine.
I checked my IP address on the EC2 instance using "who am i" and it matches the one in the security group.
Please suggest how to make it work only for my machine.
I have followed the steps mentioned in "possible to whitelist ip for inbound communication to an ec2 instance behind an aws load balancer?"
This is how my inbound rule for the security group looks:
Type: All traffic | Protocol: All | Port range: All | Source: 123.201.54.223/32 | Description: Dev Security Rule
Security groups will not let you restrict access on a machine-by-machine basis, only by IP address and by security group. For example, if you limit ingress by IP, any other machine using that same IP address (usually on the same network/access point, etc.) will also be allowed in, not just your machine.
If you are using a load balancer, then it is the load balancer that should have access to your instance via its security group, and your access via IP should be controlled in the load balancer's security group. So you should use the settings you have quoted (at least to begin with!) on your LB security group, not your instance security group.
In the security groups of the instance or group of instances behind the load balancer, you want to allow ingress only from the load balancer's security group; there is no need to set an IP address ingress rule (unless you want to allow e.g. SSH access from specific IP addresses, or want them to talk to a database instance).
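A minimal boto3 sketch of that arrangement, with placeholder group IDs and port: the instance's security group references the load balancer's security group as the source, instead of any IP range:

```python
# Allow traffic into the instance only from the load balancer's security group
# (boto3 sketch; group IDs and the port are placeholders).
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0aaa11112222bbbb3",  # instance security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 80,
        "ToPort": 80,
        "UserIdGroupPairs": [{
            "GroupId": "sg-0ccc44445555dddd6",  # load balancer security group
            "Description": "HTTP from the load balancer only",
        }],
    }],
)
```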
A 504 Gateway Timeout error means your LB is not able to communicate with the desired instance, while you are able to communicate with the LB.
The rule allowing all traffic from 123.201.54.223/32 ("Dev Security Rule") will only allow traffic from your IP, not from the Load Balancer's IP.
You do not need to add your IP to the EC2 instance's security group; you have to allow traffic from the LB, which is in 10.0.0.0/16.
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause 1: The application takes longer to respond than the configured idle timeout.
Solution 1: Monitor the HTTPCode_ELB_5XX and Latency metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete. For more information, see Configure the Idle Connection Timeout for Your Classic Load Balancer and How do I troubleshoot Elastic Load Balancing high latency.
Cause 2: Registered instances closing the connection to Elastic Load Balancing.
Solution 2: Enable keep-alive settings on your EC2 instances and make sure that the keep-alive timeout is greater than the idle timeout settings of your load balancer.
ts-elb-errorcodes-http504
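If the remedy from Solution 1 is to increase the configured idle timeout, on an Application Load Balancer that is a load balancer attribute; a minimal boto3 sketch with a placeholder ARN (a Classic Load Balancer exposes the same setting via the elb client's ConnectionSettings attribute instead):

```python
# Raise the ALB idle timeout so long-running requests can complete
# (boto3 sketch; the ARN and the value are placeholders).
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example/abc123",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "120"}],
)
```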

504 Gateway Time-out on ALB

I created an EC2 instance with Apache installed and allowed HTTP/SSH traffic from my system only. I was able to access the web page using the EC2 instance's public IP. Then I configured an ALB with the same SG and registered the same EC2 instance with it. When I tried to access the web page using the ALB's DNS name, I got a 504 Gateway Time-out error.
I increased the timeout interval to see if that would resolve the issue; it didn't. Then I revisited the lesson and thought: let's allow HTTP traffic from everyone in the SG (since that's how it was set in the lecture) to see if it works, and yes, it worked. I changed the SG back to allow traffic only from my system and it failed again.
In this configuration, your security group needs to allow traffic from itself -- create rules that allow the appropriate ports, but use the security group sg-xxxx in place of an IP address, as the source. Merely being members of the same security group does not allow two systems to communicate with each other.
A better configuration would be for the balancer to have its own security group, and your instance's group would allow traffic from the balancer's group.
Note also that without the security group configuration being correct, you should also find that the health checks on the balancer are failing.
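A minimal boto3 sketch of the self-referencing rule described above, with a hypothetical group ID standing in for sg-xxxx: the group allows HTTP from members of the same group, so the balancer and the instance can reach each other:

```python
# Self-referencing rule: members of the same security group may reach each
# other on port 80 (boto3 sketch; the group ID is a placeholder).
import boto3

ec2 = boto3.client("ec2")

GROUP_ID = "sg-0a1b2c3d4e5f67890"

ec2.authorize_security_group_ingress(
    GroupId=GROUP_ID,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 80,
        "ToPort": 80,
        # The source is the same group, not an IP range.
        "UserIdGroupPairs": [{"GroupId": GROUP_ID, "Description": "HTTP from members of this group"}],
    }],
)
```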

How do I work out why an ECS health-check is failing?

Outline:
I have a very simple ECS container which listens on port 5000 and writes out HelloWorld, plus the hostname of the instance it is running on. I want to deploy many of these containers using ECS and load balance them, just to really learn more about how this works. And it is working to a certain extent, but my health check is failing (timing out), which causes the container tasks to be bounced up and down.
Current configuration:
1 VPC ( 10.0.0.0/19 )
1 Internet gateway
3 private subnets, one for each AZ in eu-west-1 (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24)
3 public subnets, one for each AZ in eu-west-1 (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
3 NAT instances, one in each of the public subnets, routing 0.0.0.0/0 to the Internet gateway and each assigned an Elastic IP
3 ECS instances, again one in each private subnet with a route to the NAT instance in the corresponding public subnet in the same AZ as the ECS instance
1 ALB load balancer (Internet facing) which is registered with my 3 public subnets
1 Target group (with no instances registered as per ECS documentation) but a health check set up on the 'traffic' port at /health (a sketch of this follows the list below)
1 Service bringing up 3 tasks spread across AZs and using dynamic ports (which are then mapped to 5000 in the docker container)
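A minimal boto3 sketch of that target group and health-check setup (the name, thresholds, and VPC ID are placeholders; ECS registers each task's dynamic host port itself, and 'traffic-port' makes the health check follow whatever port a task received):

```python
# Target group whose health check probes /health on each task's dynamic port
# (boto3 sketch; the name and VPC ID are placeholders).
import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_target_group(
    Name="helloworld-tg",
    Protocol="HTTP",
    Port=5000,                        # default; ECS overrides it with the dynamic host port
    VpcId="vpc-0123456789abcdef0",
    TargetType="instance",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/health",
    HealthCheckPort="traffic-port",   # probe the port each target registered with
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=5,
    Matcher={"HttpCode": "200"},
)
```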
Routing
Each private subnet has a rule to 10.0.0.0/19, and a default route for 0.0.0.0/0 to the NAT instance in public subnet in the same AZ as it.
Each public subnet has the same 10.0.0.0/19 route and a default route for 0.0.0.0/0 to the internet gateway.
Security groups
My instances are in a group that allows egress to anywhere and ingress on ports 32768 - 65535 from the security group the ALB is in.
The ALB is in a security group that allows ingress on port 80 only, but egress to the security group my ECS instances are in on any port/protocol.
What happens
When I bring all this up, it actually works - I can take the public DNS record of the ALB, refresh, and see responses coming back to me from my container app telling me the hostname. This is exactly what I want to achieve. However, it fails the health check and the container is drained and replaced with another one that also fails the health check. This continues in a cycle; I have never seen a single successful health check.
What I've tried
Tweaked the health check intervals to make ECS require about 5 minutes of solid failed health checks before killing the task. I thought this would eliminate it being a bit sensitive when the task starts up. This still goes on to trigger the tear-down, despite me being able to view the application running in my browser throughout.
Confirmed the /health url end point in a number of ways. I can retrieve it publicly via the ALB (as well as view the main app root url at '/'), and curl tells me it has a proper 200 OK response (which the health check is set to look for by default). I have ssh'ed into my ECS instances and performed a curl --head {url} on '/' and '/health', and both give a 200 OK response. I've even spun up another instance in the public subnet, granted it the same access as the ALB security group to my instances, and been able to curl the health check from there.
Summary
I can view my application correctly load-balanced across AZs and private subnets on both its main url '/' and its health check url '/health' through the load balancer, from the ECS instance itself, and by using the instance's private IP and port from another machine within the public subnet the ALB is in. The ECS service just cannot complete this health check even once without timing out. What on earth could I be missing?
For any that follow: I managed to break the app in my container accidentally and it started throwing a 500 error. Crucially, the health check started reporting this 500 error, so it was NOT a network timeout. That means when the health check contacts the endpoint in my app, the response was not being handled properly. This appears to be a problem between Nancy (the API framework I was using) and Go, which sometimes reports "Client.Timeout exceeded while awaiting headers", and I am sure ECS is interpreting this as a network timeout. I'm going to tcpdump the network traffic, see what the health check is sending and what Nancy is responding, and compare that to a container that works. Perhaps there is a Nancy fix, or maybe ECS needs to not be so fussy.
Edit:
Simply updating all the NuGet packages that my Nancy app was using to the latest available versions made everything start working!
More questions than answers, but maybe they will take you in the right direction.
You say that you can access the container app via the ALB, but then the node fails the health check. The ALB should not be allowing connections to the node until its health check succeeds. So if you are connecting to the node via the ALB, then the ALB must have tested it and decided it was healthy. Is it a different health check that is killing the node?
Have you checked CloudTrail to see if it has any clues about what is triggering the tear-down? Is the tear-down being triggered by the ALB or the Auto Scaling group? Might it be that the Auto Scaling group has the wrong scale-in criteria?
Good luck