ELB Connection Draining Configuration - amazon-web-services

So, we are a bit lost with the AWS ELB connection draining feature.
We have an Auto Scaling Group and an application with independent sessions (a session on every instance). We configured the ELB listener over HTTP on port 80, forwarding to port 8080 (the port where the application is deployed), and we created an LBCookieStickinessPolicy. We also enabled connection draining for 120 seconds.
The behavior we want:
We want to scale down an instance, but since each session is sticky to a particular instance, we want to keep that session alive for 120 seconds (i.e., the connection draining timeout).
The behavior we have:
We have tried deregistering an instance, setting it to standby, terminating it, stopping it, and marking it unhealthy. No matter what we do, the instance shuts down immediately, ending the session abruptly. We also changed the ELB listener configuration to work over TCP, with no luck.
Thoughts?

Connection draining refers to open TCP connections with the client; it has nothing to do with sessions on your instance. You may be able to do something with keep-alives if you use TCP passthrough instead of an HTTP listener.
The best route is to set up sessions to be shared between your instances and then disable stickiness on the load balancer.
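As an illustration only (the answer does not prescribe a specific technology, and the question does not say what the application is written in): if the application happened to run on the JVM with Spring, one common way to share sessions is Spring Session backed by Redis, so any instance can serve any session and draining an instance no longer destroys state. The Redis endpoint below is a placeholder.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;

// Hypothetical configuration: stores HTTP sessions in Redis instead of in instance memory,
// so the LBCookieStickinessPolicy can be dropped and instances can be drained freely.
@Configuration
@EnableRedisHttpSession
public class SharedSessionConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        // Placeholder endpoint, e.g. an ElastiCache for Redis node.
        return new LettuceConnectionFactory("sessions.example.internal", 6379);
    }
}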

Related

Network load balancer never becomes healthy

I am trying to host a service on Fargate that exposes a TCP port.
Even this simple example that exposes HTTP on port 80 never becomes healthy on Fargate.
var loadBalancedFargateService = NetworkLoadBalancedFargateService.Builder.create(this, "ServiceSample")
        .cluster(fargateCluster)
        .publicLoadBalancer(true)
        .memoryLimitMiB(1024)
        .cpu(512)
        .listenerPort(80)
        .taskImageOptions(NetworkLoadBalancedTaskImageOptions.builder()
                .image(ContainerImage.fromRegistry("amazon/amazon-ecs-sample"))
                .containerPort(80)
                .build())
        .build();
The error I get is:
service dev-shopapi-redis-ServiceSampleService16E525F0-ASe7w3oUlGf9 port 80 is unhealthy in target-group dev-sh-Servi-EFOUJ7LG0YPP due to (reason Health checks failed).
My intention is to expose another service with a TCP protocol and this is a simplified version that exposes HTTP.
What am I doing wrong?
Try these troubleshooting steps:
- If your container is mapped to port 80, confirm that your container security group allows inbound traffic on port 80 from the load balancer.
- Confirm that the ping port value for your load balancer health check is configured correctly. If this port isn't configured correctly, the load balancer could deregister the container.
- Define a minimum health check grace period. This instructs the service scheduler to ignore Elastic Load Balancing health checks for a predefined period after a task has been instantiated (see the sketch after this list).
- Monitor the CPU and memory metrics of the service. For example, high CPU can make your application unresponsive and result in a 502 error.
- Check your application logs for application errors.
- Check that the ping port and the health check path are configured correctly.
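As a hedged illustration of the grace period step, reusing the question's own CDK snippet (healthCheckGracePeriod is assumed to be available on this builder, as on the other ECS patterns constructs; the 60-second value is arbitrary):

// Duration comes from the core CDK package (software.amazon.awscdk.core in CDK v1, software.amazon.awscdk in v2).
var serviceWithGracePeriod = NetworkLoadBalancedFargateService.Builder.create(this, "ServiceSample")
        .cluster(fargateCluster)
        .publicLoadBalancer(true)
        .memoryLimitMiB(1024)
        .cpu(512)
        .listenerPort(80)
        // Ignore load balancer health checks for the first 60 seconds after a task starts,
        // so a slow-starting container is not deregistered before it is ready.
        .healthCheckGracePeriod(Duration.seconds(60))
        .taskImageOptions(NetworkLoadBalancedTaskImageOptions.builder()
                .image(ContainerImage.fromRegistry("amazon/amazon-ecs-sample"))
                .containerPort(80)
                .build())
        .build();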
Unlike ApplicationLoadBalancedFargateService, NetworkLoadBalancedFargateService does not automatically open the container port in the task's security group (a Network Load Balancer has no security group of its own for the CDK to reference).
So just add the following in the CDK:
loadBalancedFargateService.getService().getConnections().allowFromAnyIpv4(Port.tcp(80)); // 80, since the container is listening on port 80
Source: https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-health-check-failures/

AWS Security Group connection tracking failing for responses with a body in ASP.NET Core app running in ECS + Fargate

In my application:
ASP.NET Core 3.1 with Kestrel
Running in AWS ECS + Fargate
Services run in a public subnet in the VPC
Tasks listen only on port 80
Public Network Load Balancer with SSL termination
I want to set the Security Group to allow inbound connections from anywhere (0.0.0.0/0) to port 80, and disallow any outbound connection from inside the task (except, of course, to respond to the allowed requests).
As Security Groups are stateful, the connection tracking should allow the egress of the response to the requests.
In my case, this connection tracking only works for responses without body (just headers). When the response has a body (in my case, >1MB file), they fail. If I allow outbound TCP connections from port 80, they also fail. But if I allow outbound TCP connections for the full range of ports (0-65535), it works fine.
I guess this is because when ASP.NET Core + Kestrel writes the response body it initiates a new connection which is not recognized by the Security Group connection tracking.
Is there any way I can allow only responses to requests, and no other type of outbound connection initiated by the application?
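As a concrete illustration of the intended setup (purely an assumption that the infrastructure could be expressed with the AWS CDK in Java; the question does not say how it is defined), the security group described above would look roughly like this:

// Hypothetical sketch: inbound 0.0.0.0/0 on port 80, no outbound rules at all,
// relying on security group connection tracking to let the responses out.
// Requires SecurityGroup, Peer, and Port from software.amazon.awscdk.services.ec2.
SecurityGroup taskSecurityGroup = SecurityGroup.Builder.create(this, "TaskSg")
        .vpc(vpc)                 // 'vpc' is assumed to exist elsewhere in the stack
        .allowAllOutbound(false)  // no explicit egress rules
        .build();
taskSecurityGroup.addIngressRule(Peer.anyIpv4(), Port.tcp(80), "HTTP from anywhere");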
So we're talking about something like that?
Client 11.11.11.11 ----> AWS NLB/ELB public 22.22.22.22 ----> AWS ECS network router or whatever (kubernetes) --------> ECS server instance running a server application 10.3.3.3:8080 (kubernetes pod)
Do you configure the security group on the AWS NLB or on the AWS ECS? (I guess both?)
Security groups should allow incoming traffic if you allow 0.0.0.0/0 port 80.
They are indeed stateful. They will allow the connection to proceed both ways after it is established (meaning the application can send a response).
However, firewall state is typically not kept for more than 60 seconds (I'm not sure what technology AWS is using), so the connection can be "lost" if the server takes more than a minute to reply. Does the HTTP server take a while to generate the response? If it's a websocket or TCP server instead, does it sometimes go whole minutes without sending or receiving any traffic?
The way I see it, we've got two stateful firewalls: the first with the NLB, the second with ECS.
ECS is an equivalent to Kubernetes, so it must be doing a ton of iptables magic to distribute traffic and track connections. (For reference, regular Kubernetes works heavily with iptables, and iptables has a bunch of very important settings like connection durations and timeouts.)
The good news: if it breaks when you open inbound 0.0.0.0/0:80 but works when you open inbound 0.0.0.0/0:80 plus outbound to all ports, this is definitely the firewall dropping the connection, most likely because it loses state (or it's not stateful in the first place, but I'm pretty sure security groups are stateful).
The drop could happen on either of the two firewalls. I've never had an issue with a single bare NLB/ELB, so my guess is the problem is in ECS or in the interaction of the two together.
Unfortunately we can't debug that, and we have very little information about how this works internally. Your only option will be to work with AWS support to investigate.

Marking a compute instance as busy to prevent disrupting connections

I have a Golang service using TCP running on GCP compute VMs with autoscaling. When CPU usage spikes, new instances are created and deployed (as expected), but when CPU usage settles again the instances are destroyed. This is entirely reasonable behavior, but destroying instances does not take established TCP connections into account and thus disconnects users.
I'd like to keep the VM instances running until the last connection has been closed to prevent disconnecting users. Is there a way to mark the instance as "busy" telling the autoscaler not to remove that instance until it isn't busy? I have implemented health checks but these do not signal the busyness of the instance, only whether the instance is alive or not.
You need to enable Connection Draining for your auto-scaling group:
If the group is part of a backend service that has enabled connection draining, it can take up to 60 seconds after the connection draining duration has elapsed before the VM instance is removed or deleted.
Here are the steps on how to achieve this:
Go to the Load balancing page in the Google Cloud Console.
Click the Edit button for your load balancer or create a new load balancer.
Click Backend configuration.
Click Advanced configurations at the bottom of your backend service.
In the Connection draining timeout field, enter a value from 0 - 3600. A setting of 0 disables connection draining.
Currently you can request a connection draining timeout of up to 3600 s (one hour), which should suffice for your requirements.
see: https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions

Websocket timeouts using AWS Application Load Balancer

I'm getting gateway time-outs when trying to use a port specifically for websockets using an Application Load Balancer inside an Elastic Beanstalk environment.
The web application and websocket server are held within a Docker container. The application runs fine; however, wss://domain.com:8080 just times out.
The load balancer listeners use the SSL cert for wss.
The target group it points to accepts a 'Protocol' of HTTP (I've tried HTTPS) and forwards to port 8080 on an EC2 instance. Or at least it should. (There doesn't appear to be an option for TCP on Application Load Balancers.)
I've had a look over the Application Load Balancer logs and it looks like it reaches the target group but times out on its connection to the EC2 instance, and I'm stumped as to why.
All AWS Security Groups have been opened to all traffic for the time being. I've checked the host and found that the port is open and being listened to by Nginx, which routes to the correct port on the Docker container. docker ps also confirms this, and once inside the container I can see that the port is being listened to by the websocket server.
So it can't be the EC2 instance itself, can it? Is there an issue routing websockets via ports in an ALB?
-- Edit --
Added the current SG of the ALB and the EC2 instance SG (screenshots; both currently allow all traffic).
Accepted answer here seems to be "open Security Groups for EC2 (web server) and ALB inbound & outbound communication on required ports since websockets need two way communication."
This is incorrect and the reason why it solved the problem is coincidental.
Let me explain:
"Websockets needs two way communication..." - Sure but the TCP sessions is only ever opened from one way - from the client.
You don't have to allow any outbound connections from the EC2 instance (web server) in order to use web sockets.
Of course the ALB needs to be able to do TCP connections to the EC2 instance. But not to the client. Why? Well the ALB is accepting TCP connections (usually on port 80 and 443). It is setting up a TCP session that was initiated by the client. It is then trying to set up a new TCP session to the web server behind the ALB. This should be done on the port that you decided to have the web server listening on. The Security Group around the ALB needs to be able to do outbound connections on this port to the web server. This is the reason why "open up everything" worked. It has nothing to do with "two way communication".
You could of course use any ports, but you don't need any ports other than 80 and 443 (such as 8080) on either the load balancer or the EC2 instance.
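To make that concrete, here is a hypothetical AWS CDK (Java) sketch of the security group rules this explanation implies; none of the construct names come from the question, and the 443/8080 ports simply mirror the setup described above:

// Requires SecurityGroup, Peer, and Port from software.amazon.awscdk.services.ec2;
// 'vpc' is assumed to be defined elsewhere in the stack.
SecurityGroup albSg = SecurityGroup.Builder.create(this, "AlbSg")
        .vpc(vpc)
        .allowAllOutbound(false)
        .build();
SecurityGroup instanceSg = SecurityGroup.Builder.create(this, "InstanceSg")
        .vpc(vpc)
        .build();

// ALB: accepts client traffic on 443 and only needs outbound access to the
// instance on the target port (8080); it never opens connections back to the client.
albSg.addIngressRule(Peer.anyIpv4(), Port.tcp(443), "HTTPS/WSS from clients");
albSg.addEgressRule(instanceSg, Port.tcp(8080), "Forward to the websocket target");

// Instance: only needs to accept traffic from the ALB on 8080; no outbound rule
// toward the client is required for websockets to work.
instanceSg.addIngressRule(albSg, Port.tcp(8080), "Websocket port from the ALB");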
Websockets need two way communication, make sure security groups attached to all resources (EC2 & ALB) allow both inbound & outbound communication on required ports.

Trying to understand how does the AWS scaling work

There is one thing about scaling that I do not yet understand. Assume a simple scenario: ELB -> EC2 front-end -> EC2 back-end.
When there is high traffic new front-end instances are created, but, how is the connection to the back-end established?
How does the back-end application keep track of which EC2 it is receiving from, so that it can respond to the right end-user?
Moreover, what happens if a connection was established from one of the automatically created instances and then, when traffic is low again, the instance is removed? Is the connection to the end user lost?
FWIW, the connection between the servers is through WebSocket.
Assuming that, for example, your EC2 'front-ends' are web servers and your back-end is a database server, when new front-end instances are spun up they must either be created from a 'gold' AMI that you previously set up with all the required software and configuration, OR install all of your customizations as part of the machine's startup (either approach is valid). With either approach they will know how to find the back-end server, either by IP address or perhaps via a DNS record from the configuration information on the newly started machine.
You don't need to worry about the back-end keeping track of the clients: every client talking to the back-end will have an IP address, and TCP/IP will take care of that handshaking for you.
As far as shutting down instances, you can enable connection draining to make sure existing conversations/connections are not lost:
When Connection Draining is enabled and configured, the process of deregistering an instance from an Elastic Load Balancer gains an additional step. For the duration of the configured timeout, the load balancer will allow existing, in-flight requests made to an instance to complete, but it will not send any new requests to the instance. During this time, the API will report the status of the instance as InService, along with a message stating that “Instance deregistration currently in progress.” Once the timeout is reached, any remaining connections will be forcibly closed.
https://aws.amazon.com/blogs/aws/elb-connection-draining-remove-instances-from-service-with-care/
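For reference, a hedged sketch of enabling connection draining programmatically rather than through the console, using the AWS SDK for Java v2 classic ELB client (the load balancer name and the 120-second timeout are placeholders):

import software.amazon.awssdk.services.elasticloadbalancing.ElasticLoadBalancingClient;
import software.amazon.awssdk.services.elasticloadbalancing.model.ConnectionDraining;
import software.amazon.awssdk.services.elasticloadbalancing.model.LoadBalancerAttributes;
import software.amazon.awssdk.services.elasticloadbalancing.model.ModifyLoadBalancerAttributesRequest;

public class EnableConnectionDraining {
    public static void main(String[] args) {
        try (ElasticLoadBalancingClient elb = ElasticLoadBalancingClient.create()) {
            // Give in-flight requests up to 120 seconds to complete after an instance
            // is deregistered, before any remaining connections are forcibly closed.
            elb.modifyLoadBalancerAttributes(ModifyLoadBalancerAttributesRequest.builder()
                    .loadBalancerName("my-classic-elb")   // placeholder name
                    .loadBalancerAttributes(LoadBalancerAttributes.builder()
                            .connectionDraining(ConnectionDraining.builder()
                                    .enabled(true)
                                    .timeout(120)
                                    .build())
                            .build())
                    .build());
        }
    }
}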