ELB timeout when source host is the same as the target host

I'm having an issue where ELB gives me a timeout when connecting to a service on the same EC2 instance.
I have an ECS cluster with two EC2 instances (launched through the ECS wizard). I'm currently running two services: a RabbitMQ broker and two Celery workers. I put an internal Network Load Balancer in front of the RabbitMQ container.
The Celery worker on the other EC2 instance can connect without issues, but the worker on the same host as the RabbitMQ container can't:
[2018-01-24 12:00:55,128: ERROR/MainProcess] consumer: Cannot connect to amqp://user:**@rabbitmq-abcdefghijklmnop.elb.eu-central-1.amazonaws.com:5672//: timed out.
I've checked the flow logs for the VPC, and all packets are accepted (.157 being the EC2 instance, .136 the ELB).

A network load balancer presents the connection to the server as though it came from the client machine's IP address. Replies are magically mangled back to the correct address/port pairs by the network infrastructure.
But when the server tries to reply, it replies directly to that source address... and in your configuration, that source address is the same machine, which didn't try to connect to itself; it tried to connect to a different machine. Concretely: the worker on .157 opens a connection to the NLB at .136:5672 and expects replies from .136:5672, but the NLB forwards that connection to the RabbitMQ target on .157 with the source address still .157, so the reply goes from .157 straight back to .157, where no socket matches that address/port pair. The forward-path and return-path source and destination address/port pairs don't correlate, and the connection times out.
This is a known limitation of Network Load Balancer when the client source address is preserved: an instance cannot reach itself through the NLB. Any balancer that preserves the client address this way has the same hairpinning limitation.
See also https://forums.aws.amazon.com/thread.jspa?messageID=805583&#805583

Related

amqplib & AWS MQ: Socket closed abruptly during opening handshake

Previously I was able to connect to my RabbitMQ cluster hosted on AWS MQ without issues. After I changed its visibility to private, I had to add some configuration to access the cluster from outside the VPC.
Current example of how the cluster is accessed:
mq.example.com -> Load balancer (w/target group to cluster host IP & TLS port 5671) in public VPC -> Cluster in private VPC.
I've done the same thing for the web console, and the web console works perfectly, so the issue isn't with the load balancing or the certificate. I then checked whether the issue could be in the code I wrote, but that is also not the case: from inside the services the connection sometimes succeeds and sometimes fails, throwing the error "Socket closed abruptly during opening handshake".
I have an idea of where the issue may arise, but no clear view of how to solve it. I believe it has to do with the fact that the service has to go through the load balancer before it can reach the RabbitMQ cluster. I just don't know what to do about it, and most documentation on amqplib is obscure as it is. I haven't found any documented case of a similar issue with AWS MQ behind a load balancer.
So my question, specifically, is: how can I resolve the fact that my services sometimes connect to the cluster and sometimes don't when they go through the load balancer?
Good to know: I use AWS MQ for RabbitMQ, amqplib for the client connection, and amqps as the protocol; the web console works with the same setup but the services don't.
For people who run into this issue later on, I have found a solution:
When creating a Network Load Balancer to route traffic to your cluster, you have to assign it a target group. Make sure NOT to do this: do not register both port 5671 (amqps) and port 443 (web console) to the same target group, or routing issues like the above will arise.
Instead do the following:
Create two target groups in AWS EC2:
TG1: Register: TLS - 443 (web console)
TG2: Register: TLS - 5671 (amqps)
Your NLB, configured with simple routing and an alias record for IPv4 connections, then needs the following listeners:
Listener 1: TLS - 443 and assign it to TG1
Listener 2: TLS - 5671 and assign it to TG2
This should then make sure that whenever you connect, there is no confusion about which port the connection to the cluster should be routed to.
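This isn't from the original answer, but a minimal AWS CDK (TypeScript) sketch of that layout could look like the following; the VPC lookup, certificate ARN, and broker IP (10.0.1.25) are placeholder assumptions, not values from the question.

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import * as targets from 'aws-cdk-lib/aws-elasticloadbalancingv2-targets';

export class MqNlbStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Placeholders: look up your own VPC and ACM certificate.
    const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { vpcName: 'public-vpc' });
    const certArn = 'arn:aws:acm:region:account:certificate/placeholder';

    const nlb = new elbv2.NetworkLoadBalancer(this, 'MqNlb', { vpc, internetFacing: true });

    // TG1 for the web console, TG2 for amqps: one port per target group.
    const consoleTg = new elbv2.NetworkTargetGroup(this, 'ConsoleTg', {
      vpc, port: 443, protocol: elbv2.Protocol.TLS, targetType: elbv2.TargetType.IP,
    });
    const amqpsTg = new elbv2.NetworkTargetGroup(this, 'AmqpsTg', {
      vpc, port: 5671, protocol: elbv2.Protocol.TLS, targetType: elbv2.TargetType.IP,
    });

    // Register the broker host IP in both groups (placeholder address).
    consoleTg.addTarget(new targets.IpTarget('10.0.1.25'));
    amqpsTg.addTarget(new targets.IpTarget('10.0.1.25'));

    // Listener 1: TLS 443 -> TG1; Listener 2: TLS 5671 -> TG2.
    nlb.addListener('Console', {
      port: 443, protocol: elbv2.Protocol.TLS,
      certificates: [elbv2.ListenerCertificate.fromArn(certArn)],
      defaultTargetGroups: [consoleTg],
    });
    nlb.addListener('Amqps', {
      port: 5671, protocol: elbv2.Protocol.TLS,
      certificates: [elbv2.ListenerCertificate.fromArn(certArn)],
      defaultTargetGroups: [amqpsTg],
    });
  }
}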
You can then connect to your web console with your subdomain:
e.g. webconsole.example.com
and to your services with e.g. amqps://cluster.example.com:5671 as the host (how the host is formatted depends on the library you're using on the client side).
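For the client side, a minimal amqplib connection sketch in TypeScript might look like this; the hostname, credentials, and queue name are placeholders, not values from the question.

import * as amqp from 'amqplib';

async function main(): Promise<void> {
  // amqps:// on port 5671 goes through the NLB's TLS listener for TG2.
  const conn = await amqp.connect('amqps://user:password@cluster.example.com:5671');
  const channel = await conn.createChannel();
  // Declare a throwaway queue just to prove the handshake completed.
  await channel.assertQueue('connection-test', { durable: false });
  console.log('connected through the NLB');
  await channel.close();
  await conn.close();
}

main().catch((err) => {
  console.error('connection failed:', err);
  process.exit(1);
});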

AWS ECS: ECS instance is not registered to ALB target group

I created an ECS service; it runs one ECS instance, and I can see the instance registered as a target of the load balancer.
Now I trigger the Auto Scaling Group (by just incrementing the desired instance count) to launch a new instance.
The instance is launched and added to the ECS cluster. (I can see it on the ECS instances tab.)
But the instance is not added to the ALB target group: the console shows only one registered target where I expect two.
I can edit the Auto Scaling Group so that the new instance is attached to the target group, and it then shows up there.
But the health check fails; it seems port 80 is not reachable.
Although I have port 80 open to the public in the security group for the instance. (Also, the task placed by the ECS service uses dynamic port mapping, but the instance registered by the ASG does not.)
So the Auto Scaling Group can launch a new instance, but my load balancer never sends traffic to it.
I tried https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-unhealthy-checks-ecs/?nc1=h_ls, and it shows I can connect from the host to the Docker container on port 80 with something like curl -v http://${IPADDR}/health.
So it must be the case that there's something wrong with host port 80 (the load balancer can't connect to it).
But the security group setting can't be wrong either, because the working instance and the non-working instance use the same SG.
Edit
Because I used dynamic mapping, my web server is running on some random port.
The task started by the ECS service has registered itself with the target group on a random port, whereas the instance added by the ASG has registered itself on port 80.
The instance will not be added to the target group if it's not healthy. So you need to fix the health check first.
On your first instance the mapped port is 32769, so I assume that if this is the same target group and the same application, the new instance should also register with a dynamically mapped port like 32769.
When you curl the endpoint (curl -I -v http://${IPADDR}/health), is the HTTP status code 200? If it is 200, the target should be healthy; if it is not, update the backend to return 200, or update the health check's expected HTTP status code.
I assume that you are also running ECS on both instances, so ECS creates a target group for each ECS service. Are you running some mix of services such that you need a separate target group on the Auto Scaling Group? If you are using dynamic ports, set the health check port to the traffic port.
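Not part of the answer above, but for illustration, a minimal Express (TypeScript) backend route that returns the 200 the health check expects might look like this; the /health path and port are assumptions.

import express from 'express';

const app = express();

// The ALB marks the target healthy only when this route returns the
// status code configured on the health check (200 by default).
app.get('/health', (_req, res) => {
  res.status(200).send('ok');
});

// With dynamic port mapping the container still listens on its fixed
// containerPort; the host port is chosen at task launch.
app.listen(80, () => console.log('listening on 80'));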
Now let's look at the official description of dynamic port mapping, and then at the possible causes of HTTP 502 Bad Gateway:
Dynamic port mapping is a feature of container instances in Amazon Elastic Container Service (Amazon ECS):
Dynamic port mapping with an Application Load Balancer makes it easier to run multiple tasks on the same Amazon ECS service on an Amazon ECS cluster.
With the Classic Load Balancer, you must statically map port numbers on a container instance. The Classic Load Balancer does not allow you to run multiple copies of a task on the same instance because the ports conflict. An Application Load Balancer uses dynamic port mapping so that you can run multiple tasks from a single service on the same container instance.
A target group you create by hand will not work with dynamic ports; you have to bind the target group to the ECS service, as sketched below.
dynamic-port-mapping-ecs
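To make "bind the target group with ECS services" concrete: with the EC2 launch type and bridge networking, a hostPort of 0 gives dynamic port mapping, and attaching the service to an ALB listener lets ECS register each task's ephemeral host port for you. A hedged CDK (TypeScript) sketch, with the instance type, image, and health check path as placeholders:

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

export class EcsDynamicPortStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });
    cluster.addCapacity('Asg', { instanceType: new ec2.InstanceType('t3.micro') });

    const taskDef = new ecs.Ec2TaskDefinition(this, 'TaskDef');
    const container = taskDef.addContainer('web', {
      image: ecs.ContainerImage.fromRegistry('nginx'), // placeholder image
      memoryLimitMiB: 256,
    });
    // hostPort 0 = dynamic port mapping: each task gets an ephemeral host port.
    container.addPortMappings({ containerPort: 80, hostPort: 0 });

    const service = new ecs.Ec2Service(this, 'Service', { cluster, taskDefinition: taskDef });

    const alb = new elbv2.ApplicationLoadBalancer(this, 'Alb', { vpc, internetFacing: true });
    const listener = alb.addListener('Http', { port: 80 });
    // Binding the service (not individual instances) to the listener's target
    // group is what keeps the dynamic ports registered as tasks come and go.
    listener.addTargets('Web', {
      port: 80,
      targets: [service],
      healthCheck: { path: '/health' }, // placeholder path
    });
  }
}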
HTTP 502: Bad Gateway
Possible causes:
The load balancer received a TCP RST from the target when attempting to establish a connection.
The load balancer received an unexpected response from the target, such as "ICMP Destination unreachable (Host unreachable)", when attempting to establish a connection. Check whether traffic is allowed from the load balancer subnets to the targets on the target port.
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.
The target response is malformed or contains HTTP headers that are not valid.
The load balancer encountered an SSL handshake error or SSL handshake timeout (10 seconds) when connecting to a target.
The deregistration delay period elapsed for a request being handled by a target that was deregistered. Increase the delay period so that lengthy operations can complete.
http-502-issues
It seems you know the root cause, which is that port 80 is failing the health check, and that's why the instance is never added to the ALB. Here is what you can try.
First, check that your service is listening on port 80 on the new host. You can use a command like netcat:
nc -v localhost 80
Once you know that the service is listening, the recommended way to allow your ALB to connect to your host is to add an inbound rule to the instance's security group allowing traffic from your ALB's security group on port 80, as sketched below.
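In CDK (TypeScript) that rule could be sketched as follows; albSg and instanceSg stand in for the ALB's and the instance's actual security groups.

import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Allow the ALB's security group to reach the instance on port 80.
// Security groups are stateful, so the replies need no extra rule.
export function allowAlbToInstance(
  albSg: ec2.ISecurityGroup,
  instanceSg: ec2.SecurityGroup,
): void {
  instanceSg.addIngressRule(albSg, ec2.Port.tcp(80), 'ALB health checks and traffic');
}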

Jenkins Agent JNLP Behind AWS NLB Connection Timeout

I am trying to set up a Jenkins agent running on a Mac mini that will connect to the master instance in AWS using JNLP. I have created a Network Load Balancer that listens on port 50000 and forwards the traffic to the master instance. The security group attached to the master instance allows traffic from the public IP of the Mac mini. In the node configuration I specified the tunnel connection with the DNS name of the NLB.
While trying to connect I receive an IOException saying the operation timed out. When I run tcpdump on the master instance I can see traffic coming from the Mac mini.
If I connect the Mac mini over a VPN, bypassing the NLB, the agent connects fine, so I believe something is wrong with how I set up the NLB. Is there a way to increase the connection timeout, or to set the TCP keepalive interval?

Strange Ignite Client behavior; ELB Discovery fails, Client scanned subnet and joined another Server Cluster

We are using Amazon ELB Based Discovery to connect an Ignite Client to a Server Cluster. The Cluster servers in question came up and registered themselves with an ELB; however, a configuration problem with the cloud-init routine left them without a running Ignite session. The Client in question is in the same subnet as this cluster of Servers.
The Client seemed to keep looping through the list on the ELB, finding no viable hosts, which is expected. However, when we deregistered the Servers from the ELB and the list of servers on the ELB became empty, the Client seemed to scan its local subnet, where it found another Cluster of servers and attached to them. This was unexpected, as we were using the discoverySpi configured in line with the instructions at https://apacheignite-mix.readme.io/docs/amazon-aws#section-amazon-elb-based-discovery
Question: if the list of servers from the AWS ELB is empty, does the Ignite Client begin a scan of the subnet? The very second the ELB list became empty, the Client found and connected to the other Server Cluster in the subnet; the timing is highly suspicious, even though I can't see in the code that it would do this.

Websocket timeouts using AWS Application Load Balancer

I'm getting gateway time-outs when trying to use a port specifically for websockets using an Application Load Balancer inside an Elastic Beanstalk environment.
The web application and WebSocket server are held within a Docker container; the application runs fine, however wss://domain.com:8080 will just time out.
The load balancer listeners use the SSL cert for wss.
The target group it points to accepts a 'Protocol' of HTTP (I've tried HTTPS) and forwards to port 8080 on an EC2 instance. Or... it should. (There doesn't appear to be an option for TCP on Application Load Balancers.)
I've looked over the Application Load Balancer logs, and it looks like the request reaches the target group but times out on the connection to the EC2 instance, and I'm stumped as to why.
All AWS security groups have been opened to all traffic for the time being. I've checked the host and found that the port is open and being listened on by Nginx, which routes to the correct port on the Docker container.
docker ps shows the container running with the expected port mapping, and once inside the container I can see that the port is being listened on by the WebSocket server.
So it can't be the EC2 instance itself, can it? Is there an issue routing websockets via ports in an ALB?
-- Edit --
(Screenshots of the ALB security group and the EC2 instance security group omitted.)
The accepted answer here seems to be "open Security Groups for EC2 (web server) and ALB inbound & outbound communication on required ports, since websockets need two-way communication."
This is incorrect, and the reason it solved the problem is coincidental.
Let me explain:
"Websockets need two-way communication..." Sure, but the TCP session is only ever opened from one side: by the client.
You don't have to allow any outbound connections from the EC2 instance (web server) in order to use web sockets; security groups are stateful, so replies on an established connection are allowed automatically.
Of course the ALB needs to be able to open TCP connections to the EC2 instance, but not to the client. Why? The ALB accepts TCP connections from clients (usually on ports 80 and 443) and terminates the session the client initiated. It then opens a new TCP session to the web server behind the ALB, on whatever port you chose for the web server to listen on. The security group around the ALB needs to allow outbound connections on that port to the web server. That is why "open up everything" worked; it has nothing to do with "two-way communication".
You could use any ports, of course, but you don't need any port other than 80 and 443 (such as 8080) on either the load balancer or the EC2 instance. A minimal sketch of the required security group rules follows.
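A minimal sketch of those rules in CDK (TypeScript), assuming the web server listens on port 8080 behind an HTTPS listener on 443; all names here are placeholders rather than the asker's actual configuration.

import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

export function wireSecurityGroups(scope: Construct, vpc: ec2.IVpc) {
  const albSg = new ec2.SecurityGroup(scope, 'AlbSg', { vpc, allowAllOutbound: false });
  const webSg = new ec2.SecurityGroup(scope, 'WebSg', { vpc });

  // Clients (including WebSocket upgrades) connect inbound to the ALB on 443.
  albSg.addIngressRule(ec2.Peer.anyIpv4(), ec2.Port.tcp(443), 'HTTPS/WSS from clients');

  // The ALB opens its own TCP session to the web server's listening port.
  albSg.addEgressRule(webSg, ec2.Port.tcp(8080), 'ALB to web server');
  webSg.addIngressRule(albSg, ec2.Port.tcp(8080), 'accept traffic from the ALB');

  // No outbound rule is needed on the web server: security groups are stateful.
  return { albSg, webSg };
}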
Websockets need two-way communication; make sure the security groups attached to all resources (EC2 & ALB) allow both inbound and outbound communication on the required ports.