AWS ALB health check passes for HTTP but not for WebSocket

I have an application deployed in a Docker Swarm which has two publicly reachable services, HTTP and WS.
I created two target groups, one for each service, and the registered instances are the managers of the Docker Swarm. Then I created the ALB and added two HTTPS listeners, each one pointing to the specific target group.
Now comes the problem. The HTTP health check passes without any problem, but the WebSocket check is always unhealthy, and I don't know why. According to http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html, using an HTTP/HTTPS listener should work for WS/WSS as well.
In the WS check, I have tried as path both / and the path the application is actually using /ws. Neither of them passes the health check.
It's not a problem related to firewall either. Security groups are wide open and there are no iptables rules, so connection is possible in both directions.
I launched the WebSocket container outside of Docker Swarm, just to test whether it was something related to Swarm (which I was pretty sure it was not, but hell, for testing's sake), and it did not work either, so now I'm a little out of hope.
What configuration might I be missing, so that HTTP services work but WebSocket services don't?
UPDATE
I'm back with this issue, and after further research, the problem seems to be the Target Group, not the ALB per se. Reading through the documentation at http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html, I realized I had forgotten to enable the stickiness option. However, I just did, and the problem persists.
UPDATE 2
It looks like the ELB is not upgrading the connection from HTTP to WebSocket.

ALBs do not support WebSocket health checks, per:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
"Health checks do not support WebSockets."
The issue is that, although AWS says the ALB supports HTTP/2, it actually downgrades everything to HTTP/1, does its thing, and then upgrades back to HTTP/2, which breaks everything.

I had a similar issue. When the ALB checks the WS service it receives an HTTP status of 101 (Switching protocols) from it. And, as noted in other answers, that's not a good-enough response for the ALB. I attempted changing the matcher code in the health check config but it doesn't allow anything outside the 200-299 range.
In my setup I had Socket.io running on top of Express.js, so I solved it by changing the Socket.io path (do not confuse path with namespace) to /ws and letting Express.js answer the requests for /. I pointed the health check to / and that did it.
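To illustrate that split, here is a minimal Python sketch using only the standard library. It is not the Express.js/Socket.io code from the answer. It assumes the health check targets / and that /ws is handled elsewhere by a WebSocket library; the names HealthHandler, start_server, HEALTH_PATH and WS_PATH are made up for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_PATH = "/"   # assumed path the ALB health check is pointed at
WS_PATH = "/ws"     # assumed path left to the WebSocket library (not implemented here)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == HEALTH_PATH:
            # Plain 200 response, which a health check can accept.
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # A real deployment would hand WS_PATH to the WebSocket server;
            # this sketch simply does not serve it.
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def start_server(port: int = 0) -> HTTPServer:
    """Start the health server on 127.0.0.1 in a background thread."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The point is only that the health check never touches the path that answers with 101.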

Related

How to create a health check for websocket in Google cloud?

I am trying to set up a load balancer in GCP for my application, which communicates over WebSocket. When creating a health check to determine which instances are up and working, the only header I can modify for an HTTP health check is the host. So I cannot add the Upgrade header and other WebSocket-related headers (like in here) needed to establish a connection.
The documentation mentions that WebSockets are supported by default, but does not mention how the health check rule should be defined. What is the best practice for using a GCP load balancer with WebSockets? Is there a way to work around this on my end, e.g. by defining an endpoint that automatically upgrades to WebSocket, or by any other method?
The load balancer can accept and forward the WebSocket traffic to the correct service/backend. However, from my experience, the load balancer can't perform health checks over WebSocket.
A health check is meant to be a simple HTTP endpoint that answers "OK, all is running well" or "Argh, I have a problem; if this continues, restart (or replace) me". Doing this over a WebSocket (meaning continuous communication/streaming) is "strange".
If you look at the documentation, the success criterion is a 200 response (and you can optionally add content validation if you want). When you request a WebSocket endpoint, the HTTP answer is 101 Switching Protocols, which is not 200 and thus not valid from the load balancer's point of view.
Add a standard endpoint to your service that performs the health check inside your app and answers with the correct HTTP code.
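One way such an in-app check could look is sketched below in Python. This is only an illustration under assumptions: the function name health_response and the idea of passing named check callables are invented here, and returning 503 for a failing check is a common convention rather than something the answer prescribes.

```python
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each named check and return (HTTP status, body) for a health endpoint."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as failing
        if not ok:
            failed.append(name)
    if failed:
        # Anything other than 200 makes the load balancer treat us as unhealthy.
        return 503, "failing: " + ", ".join(sorted(failed))
    return 200, "ok"
```

The HTTP handler serving the health path would call this and send the returned status and body, for example with checks such as a database ping or a test that the WebSocket server is accepting connections.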

AWS Security Group connection tracking failing for responses with a body in ASP.NET Core app running in ECS + Fargate

In my application:
ASP.NET Core 3.1 with Kestrel
Running in AWS ECS + Fargate
Services run in a public subnet in the VPC
Tasks listen only on port 80
Public Network Load Balancer with SSL termination
I want to set the Security Group to allow inbound connections from anywhere (0.0.0.0/0) to port 80, and disallow any outbound connection from inside the task (except, of course, to respond to the allowed requests).
As Security Groups are stateful, the connection tracking should allow the egress of the response to the requests.
In my case, this connection tracking only works for responses without body (just headers). When the response has a body (in my case, >1MB file), they fail. If I allow outbound TCP connections from port 80, they also fail. But if I allow outbound TCP connections for the full range of ports (0-65535), it works fine.
I guess this is because when ASP.NET Core + Kestrel writes the response body it initiates a new connection which is not recognized by the Security Group connection tracking.
Is there any way I can allow only responses to requests, and no other type of outbound connection initiated by the application?
So we're talking about something like this?
Client 11.11.11.11 ----> AWS NLB/ELB public 22.22.22.22 ----> AWS ECS network router or whatever (kubernetes) --------> ECS server instance running a server application 10.3.3.3:8080 (kubernetes pod)
Do you configure the security group on the AWS NLB or on the AWS ECS? (I guess both?)
Security groups should allow incoming traffic if you allow 0.0.0.0/0 port 80.
They are indeed stateful. They will allow the connection to proceed both ways after it is established (meaning the application can send a response).
However firewall state is not kept for more than 60 seconds typically (not sure what technology AWS is using), so the connection can be "lost" if the server takes more than 1 minute to reply. Does the HTTP server take a while to generate the response? If it's a websocket or TCP server instead, does it spend whole minutes at times without sending or receiving any traffic?
The way I see it, we've got two stateful firewalls: the first with the NLB, the second with ECS.
ECS is an equivalent to kubernetes, it must be doing a ton of iptables magic to distribute traffic and track connections. (For reference, regular kubernetes works heavily with iptables and iptables have a bunch of -very important- settings like connection durations and timeouts).
The good news is that if it breaks when you open only inbound 0.0.0.0:80, but works when you open inbound 0.0.0.0:80 plus outbound 0.0.0.0:*, this is definitely an issue with the firewall dropping the connection, most likely due to losing state (or it's not stateful in the first place, but I'm pretty sure security groups are stateful).
The drop could happen on either of the two firewalls. I've never had an issue with a single bare NLB/ELB, so my guess is the problem is in the ECS or the interaction of the two together.
Unfortunately we can't debug that, and we have very little information about how this works internally. Your only option will be to work with AWS support to investigate.

Health checking redis container with ALB

I have deployed a redis container using Amazon ECS, behind an application load balancer. It seems the health checks are failing, though the container is running and ready to accept connections. It seems to be failing because the health check is HTTP, and redis of course isn't an http server.
# Possible SECURITY ATTACK detected. It looks like somebody is sending
POST or Host: commands to Redis. This is likely due to an attacker
attempting to use Cross Protocol Scripting to compromise your Redis
instance. Connection aborted.
Fair enough.
Classic load balancers, I figure, would be fine since I can explicitly ping TCP. Is it feasible to use Redis with an ALB?
Change your health check to protocol HTTPS. All Amazon Load Balancers support this. The closer your health check is to what the user accesses the better. Checking an HTML page is better than a TCP check. Checking a page that requires backend services to respond is better. TCP will sometimes succeed even if your web server is not serving pages.
Deploy your container with nginx installed and direct the health check to nginx handling port.
I encountered a similar problem recently: My Redis container was up and working correctly, but the # Possible SECURITY ATTACK detected message appeared in the logs once every minute. The healthcheck was curl -fs http://localhost:6379 || exit 1; this was rejected by the Redis code (search for "SECURITY ATTACK").
My solution was to use a non-CURL healthcheck: redis-cli ping || exit 1 (taken from this post). The healthcheck status shows "healthy", and the logs are clean.
I know the solution above will not be sufficient for all parties, but hopefully it is useful in forming your own solution.
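For cases where redis-cli is not available in the image, roughly the same check can be sketched in Python by speaking the Redis protocol directly instead of HTTP. This is a minimal sketch assuming an unauthenticated Redis on the given host and port; the function name redis_ping is made up for the example.

```python
import socket


def redis_ping(host: str = "127.0.0.1", port: int = 6379, timeout: float = 2.0) -> bool:
    """Send an inline PING command and report whether Redis answered +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")   # inline command, no HTTP involved
            reply = sock.recv(64)
    except OSError:
        return False                    # refused, timed out, or reset
    return reply.startswith(b"+PONG")
```

Because nothing HTTP-like is sent, this should not trigger the "SECURITY ATTACK" log line. If the instance requires a password, the reply will be an error rather than +PONG and the function returns False.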

aws ELB log missing backend_status_code & elb_status_code

When I activated the logging on my ELB instance, I noticed that both the ELB_status_code and backend_status_code were missing.
I have a setup where the ELB redirects everything to 2 HAProxy instances.
In the HAProxy log the status code is visible.
The ELB is doing TCP:80 > TCP:80 using the proxy protocol.
Is there anything that I have to do specifically to enable the status code logging?
Those fields only apply to HTTP listeners.
When ELB is running in TCP mode, it's not aware that the protocol running through it happens to be HTTP, so those status codes can't be logged.
If you really want to see them, you'll need the ELB in HTTP mode... but whether that's the right choice depends on why you are using TCP mode -- web sockets, for example, require TCP mode.
Note also that if you switch the ELB to HTTP mode, some of the Tq/Tw/Tc/Tr/Tt timers in the HAProxy logs will show values that are initially confusing, because the ELB holds connections open to the back-end (which is HAProxy) for reuse in a way that differs somewhat from the way browsers tend to. Logging the %Ci and %Cp parameters in HAProxy will help make some sense of these by allowing you to correlate them.
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html#access-log-entry-format

Why does Elastic Load Balancing report 'Out of Service'?

I am trying to set up Elastic Load Balancing (ELB) in AWS to split the requests between multiple instances. I have created several images of my webserver based on the same AMI, and I am able to ssh into each individually and access the site via each distinct public DNS.
I have added each of my instances to the load balancer, but they all come back with the Status: Out of Service because they failed the health check. I'm mostly confused because I can access each instance from its public DNS, but I get a timeout whenever I visit the load balancer DNS name.
I've been trying to read through all the docs and googling it, but I'm stuck. Any pointers or links in the right direction would be greatly appreciated.
I contacted AWS support about this same issue. Apparently their system doesn't know how to handle cases where all of the instances behind the ELB are stopped for an extended amount of time. AWS support can manually refresh the statuses if you need them up immediately.
The suggested fix is to de-register the EC2 instances from the ELB instead of just stopping them, and to re-register them when you start them again.
The health check is (by default) made by accessing index.html on each instance registered with the load balancer. If you don't have an index.html in the document root of the instance, the default health check will fail. You can set a custom protocol, port and path for the health check when creating the Elastic Load Balancer.
Finally I got this working. The issue was with the Amazon Security Groups: I had restricted access to port 80 to a few machines in my development area, so the load balancer could not access the Apache server on the instance. Once the load balancer gained access to my instance, it went In Service.
I checked it with tail -f /var/log/apache2/access.log in my instance, to verify if the load balancer was trying to access my server, and to see the answer the server is giving to the load balancer.
Hope this helps.
If your web server is running fine, then it means the health check hits a URL that doesn't return 200.
A trick that works for me: log on to the instance and type curl localhost:80/pathofyourhealthcheckurl
Afterwards you can adapt your health check URL so that it always returns a 200 response.
In my case, the rules on security groups assigned to the instance and the load balancer were not allowing traffic to pass between the two. This caused the health check to fail.
I too faced the same issue. I changed the Ping Protocol from HTTPS to SSL and it worked!
Go to Health Check -> click on Edit Health Check -> change Ping Protocol from HTTPS to SSL
Ping Target SSL:443
Timeout 5 seconds
Interval 30 seconds
Unhealthy Threshold 5
Healthy Threshold 10
For anyone else that sees this thread as this isn't listed:
Check that the health check is checking the port that the responding server is listening on.
E.g. node.js running on port 3000 -> Point healthcheck to port 3000;
Not port 80 or 443. Those are what your ALB will be using.
I spent a morning on this. Yes.
I would like to offer a general way to solve this problem. Once you have set up your web server, such as Apache or nginx, read the access log file to see what happened. In my case, it reported a 401 error because I had added basic auth in nginx. Of course, as #ivankoni pointed out, it may also be because the document you are checking does not exist.
I was working on the AWS Tutorial on hosting a web app and ran into this problem. Step 7b states the following:
"Set Ping Path to /. This sends queries to your default page, whether
it is named index.html or something else."
They could have put the forward slash in quotation marks like this: "/". Make sure you have that in your health checks and not this: "/.".
Adding this because I've spent hours trying to figure it out...
If you configured your health check endpoint but it still says Out of Service, it might be because your server is redirecting the request (i.e. returning a 301 or 302 response).
For example, if your endpoint is supposed to be /app/health/ but you only enter /app/health (no trailing slash) into the health check endpoint field on your ELB, you will not get a 200 response, so the health check will fail.
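A quick way to spot such a redirect is to request the path without following redirects and look at the raw status. The sketch below uses Python's http.client, which does not follow redirects on its own. The function names probe and would_pass are made up for the example, and treating only 200 as healthy is an assumption that matches the description above.

```python
import http.client
from typing import Optional, Tuple


def probe(host: str, port: int, path: str, timeout: float = 5.0) -> Tuple[int, Optional[str]]:
    """GET the path once, without following redirects; return (status, Location header)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def would_pass(status: int) -> bool:
    """Assumption: the health check only accepts a plain 200."""
    return status == 200
```

Run from the instance itself, for example probe("localhost", 80, "/app/health"). A 301 or 302 with a Location header ending in a slash points to the trailing-slash problem.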
I had a similar issue. The problem appears to have been caused by my using an HTTP health check while also using .htaccess to password-protect the site.
I got the same error. In my case I had to copy the particular HTML file from the S3 bucket to the "/var/www/html" location. It is the same HTML file referenced in the load balancer's health check path.
The issue was resolved after copying the HTML file.
I had this issue too, and it was due to both my inbound and outbound rule for the Load Balancer's Security Group only allowing HTTP traffic on port 80. I needed to add another rule for HTTPS traffic on port 443.
I was also facing the same issue,
where the ELB (Classic Load Balancer) tries to request /index.html rather than / (root) during the health check.
If it is unable to find the /index.html resource, it reports 'OutOfService'. Make sure index.html is available.