aws ELB log missing backend_status_code & elb_status_code - amazon-web-services

When I enabled logging on my ELB, I noticed that both the elb_status_code and backend_status_code fields were missing.
I have a setup where the ELB forwards everything to two HAProxy instances.
In the HAProxy logs the status code is visible.
The ELB is doing TCP:80 > TCP:80 using the proxy protocol.
Is there anything I have to do specifically to enable status code logging?

Those fields only apply to HTTP listeners.
When ELB is running in TCP mode, it's not aware that the protocol running through it happens to be HTTP, so those status codes can't be logged.
If you really want to see them, you'll need the ELB in HTTP mode... but whether that's the right choice depends on why you are using TCP mode -- web sockets, for example, require TCP mode.
Note also that if you switch the ELB to HTTP mode, some of the Tq/Tw/Tc/Tr/Tt timers in the HAProxy logs will show values that are initially confusing, because the ELB holds connections open to the back-end (which is HAProxy) for reuse in a way that differs somewhat from what browsers tend to do. Logging the %ci and %cp fields (the accepted connection's source IP and port) in HAProxy will help make sense of these by letting you correlate which log entries belong to the same ELB connection.
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html#access-log-entry-format
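If you do go the HTTP-mode route, a minimal sketch of an HAProxy log line carrying those fields might look like this (the frontend name and the overall format string are illustrative only; %ci/%cp, the timers and %ST come from HAProxy's log-format documentation, but the exact variable set depends on your HAProxy version):

frontend fe_http
    bind :80
    mode http
    log global
    # log the accepted connection's source ip:port (the ELB node, in HTTP mode),
    # the timers discussed above, the status code and the request line
    log-format "%ci:%cp %Tq/%Tw/%Tc/%Tr/%Tt %ST %{+Q}r"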

Related

Google cloud load balancer causing error 502 - failed_to_pick_backend

I get a 502 error when I use the Google Cloud load balancer with CDN. The thing is, I'm pretty sure I must have done something wrong setting up the load balancer, because when I remove the load balancer my website runs just fine.
This is how I configured my load balancer here.
Should I use an HTTP or an HTTPS health check? When I set up an HTTPS health check, my website was up for a bit and then went down again.
I have checked this link; they seem to have the same problem, but their fix is not working for me.
I have followed a tutorial from the OpenLiteSpeed forum to set Keep-Alive Timeout (secs) = 60s in the server admin panel and configured the instance to accept long-lived connections, but that didn't work for me either.
I have added these two firewall rules following these Google Cloud links to allow the Google health check IP ranges, but it still didn’t work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking the load balancer log messages, it shows an error saying failed_to_pick_backend. I have tried to re-configure the load balancer but it didn't help.
I just started learning Google Cloud and my knowledge is really limited, so it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!
Posting an answer based on OP's findings, to improve user experience.
The solution to the error 502 - failed_to_pick_backend was changing the load balancer from the HTTP to the TCP protocol and, at the same time, changing the health check from HTTP to TCP as well.
After that the LB passes through all incoming connections as it should, and the error disappeared.
Here's some more info about the various types of health checks and how to choose the correct one.
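For reference, the health-check side of that switch might look roughly like this with gcloud (the resource names are placeholders, and the load balancer's frontend and backend themselves also have to be TCP-based, as described in the linked docs):

# create a TCP health check and attach it to the backend service
gcloud compute health-checks create tcp my-tcp-hc --port=80
gcloud compute backend-services update my-backend-service \
    --global --health-checks=my-tcp-hc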
The error message that you're facing is failed_to_pick_backend.
This error means that the HTTP response code is generated when a GFE (Google Front End) was not able to establish a connection to a backend instance, or was not able to identify a viable backend instance to connect to.
I noticed in the image that your health check failed, causing the aforementioned error messages. This health check failure could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check whether the running services respond with a 200 (OK) to the health check probes, and verify your Backend Service timeout. The Backend Service timeout works together with the configured health check values to define the amount of time an instance has to respond before being considered unhealthy.
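A rough gcloud sketch for inspecting both (the backend service name is a placeholder; use --region instead of --global for regional backend services):

# shows timeoutSec and the attached health checks of the backend service
gcloud compute backend-services describe my-backend-service --global \
    --format="value(timeoutSec, healthChecks)"
# shows the health state currently reported for each backend
gcloud compute backend-services get-health my-backend-service --global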
Additionally, you can check this troubleshooting guide for the various error messages (including this one).
Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent mechanism used by GCP's load balancers; they serve much the same purpose, but are defined separately from the probes. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks
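To see which health checks the load balancer is actually using, independent of anything declared in your Kubernetes manifests, something like this works (the health check name is a placeholder):

gcloud compute health-checks list
gcloud compute health-checks describe my-health-check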

AWS Security Group connection tracking failing for responses with a body in ASP.NET Core app running in ECS + Fargate

In my application:
ASP.NET Core 3.1 with Kestrel
Running in AWS ECS + Fargate
Services run in a public subnet in the VPC
Tasks listen only on port 80
Public Network Load Balancer with SSL termination
I want to set the Security Group to allow inbound connections from anywhere (0.0.0.0/0) to port 80, and disallow any outbound connection from inside the task (except, of course, to respond to the allowed requests).
As Security Groups are stateful, the connection tracking should allow the egress of the response to the requests.
In my case, this connection tracking only works for responses without a body (just headers). When the response has a body (in my case, a >1MB file), they fail. If I allow outbound TCP connections from port 80, they also fail. But if I allow outbound TCP connections for the full range of ports (0-65535), it works fine.
I guess this is because when ASP.NET Core + Kestrel writes the response body it initiates a new connection which is not recognized by the Security Group connection tracking.
Is there any way I can allow only responses to requests, and no other type of outbound connection initiated by the application?
So we're talking about something like this?
Client 11.11.11.11 ----> AWS NLB/ELB public 22.22.22.22 ----> AWS ECS network router or whatever (kubernetes) --------> ECS server instance running a server application 10.3.3.3:8080 (kubernetes pod)
Do you configure the security group on the AWS NLB or on the AWS ECS? (I guess both?)
Security groups should allow incoming traffic if you allow 0.0.0.0/0 port 80.
They are indeed stateful. They will allow the connection to proceed both ways after it is established (meaning the application can send a response).
However, firewall state is typically not kept for more than 60 seconds (not sure what technology AWS is using), so the connection can be "lost" if the server takes more than a minute to reply. Does the HTTP server take a while to generate the response? If it's a websocket or TCP server instead, does it sometimes go whole minutes without sending or receiving any traffic?
The way I see it, we've got two stateful firewalls: the first with the NLB, the second with ECS.
ECS is an equivalent to Kubernetes; it must be doing a ton of iptables magic to distribute traffic and track connections. (For reference, regular Kubernetes works heavily with iptables, and iptables has a bunch of very important settings like connection durations and timeouts.)
The good news is: if it breaks when you only open inbound 0.0.0.0/0 port 80, but it works when you open inbound 0.0.0.0/0 port 80 plus outbound to all ports, then this is definitely an issue with the firewall dropping the connection, most likely due to losing state (or it's not stateful in the first place, but I'm pretty sure security groups are stateful).
The drop could happen on either of the two firewalls. I've never had an issue with a single bare NLB/ELB, so my guess is the problem is in ECS or in the interaction of the two together.
Unfortunately we can't debug that, and we have very little information about how this works internally. Your only option will be to work with AWS support to investigate.
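For completeness, the two security group configurations being compared might look like this with the AWS CLI (the group ID is a placeholder):

# intended final state: only inbound HTTP allowed
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 --protocol tcp --port 80 --cidr 0.0.0.0/0
# the workaround that makes large responses work: explicitly allow all outbound TCP
aws ec2 authorize-security-group-egress \
    --group-id sg-0123456789abcdef0 --protocol tcp --port 0-65535 --cidr 0.0.0.0/0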

Health checking redis container with ALB

I have deployed a Redis container using Amazon ECS, behind an Application Load Balancer. It seems the health checks are failing, even though the container is running and ready to accept connections. They seem to be failing because the health check is HTTP, and Redis of course isn't an HTTP server.
# Possible SECURITY ATTACK detected. It looks like somebody is sending
POST or Host: commands to Redis. This is likely due to an attacker
attempting to use Cross Protocol Scripting to compromise your Redis
instance. Connection aborted.
Fair enough.
Classic Load Balancers, I figure, would be fine since I can explicitly ping TCP. Is it feasible to use Redis with an ALB?
Change your health check to protocol HTTPS. All Amazon Load Balancers support this. The closer your health check is to what the user accesses the better. Checking an HTML page is better than a TCP check. Checking a page that requires backend services to respond is better. TCP will sometimes succeed even if your web server is not serving pages.
Deploy your container with nginx installed and direct the health check to the port nginx is handling.
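A minimal sketch of that nginx side (the port and path are arbitrary choices, and this only proves nginx is up, not Redis):

server {
    listen 8080;
    # lightweight endpoint for the ALB target group's HTTP health check
    location /healthz {
        return 200 'ok';
    }
}

The ALB target group would then health-check port 8080 with path /healthz.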
I encountered a similar problem recently: My Redis container was up and working correctly, but the # Possible SECURITY ATTACK detected message appeared in the logs once every minute. The healthcheck was curl -fs http://localhost:6379 || exit 1; this was rejected by the Redis code (search for "SECURITY ATTACK").
My solution was to use a non-CURL healthcheck: redis-cli ping || exit 1 (taken from this post). The healthcheck status shows "healthy", and the logs are clean.
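In an ECS task definition, that container-level health check might look something like this (interval/timeout/retry values are just examples):

"healthCheck": {
    "command": ["CMD-SHELL", "redis-cli ping || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3
}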
I know the solution above will not be sufficient for all parties, but hopefully it is useful in forming your own solution.

AWS ALB health check pass HTTP but not Websocket

I have an application deployed in a Docker Swarm which has two publicly reachable services, HTTP and WS.
I created two target groups, one for each service, and the registered instances are the managers of the Docker Swarm. Then I created the ALB and added two HTTPS listeners, each one pointing to the corresponding target group.
Now comes the problem. The HTTP health check passes without any problem, but the WebSocket check is always unhealthy, and I don't know why. According to http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html, using an HTTP/HTTPS listener should work for WS/WSS as well.
In the WS check I have tried both / and the path the application is actually using, /ws. Neither of them passes the health check.
It's not a problem related to the firewall either. Security groups are wide open and there are no iptables rules, so connections are possible in both directions.
I launched the WebSocket container outside of Docker Swarm, just to test whether it was something related to Swarm (which I was pretty sure it was not, but hell.. for testing's sake), and it did not work either, so now I'm a little out of hope.
What configuration might I be missing, so that HTTP services work but WebSocket services don't?
UPDATE
I'm back with this issue, and after further research the problem seems to be the Target Group, not the ALB per se. Reading through the documentation http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html, I noticed I had forgotten to enable the stickiness option. However, I just enabled it, and the problem persists.
UPDATE 2
It looks like the ELB is not upgrading the connection from HTTP to WebSocket.
ALBs do not support websocket health checks per:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
" Health checks do not support WebSockets."
The issue is that, despite AWS claiming that the ALB supports HTTP/2, it in fact downgrades everything to HTTP/1.1, does its thing, and then upgrades it back to HTTP/2 again, which breaks everything.
I had a similar issue. When the ALB checks the WS service it receives an HTTP status of 101 (Switching Protocols) from it. And, as noted in other answers, that's not a good enough response for the ALB. I attempted to change the matcher code in the health check config, but it doesn't allow anything outside the 200-299 range.
In my setup I had Socket.io running on top of Express.js, so I solved it by changing the Socket.io path (do not confuse path with namespace) to /ws and letting Express.js answer the requests for /. I pointed the health check at / and that did it.
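In other words, the health check ends up probing a plain HTTP endpoint; with the AWS CLI that adjustment might look like this (the target group ARN is a placeholder):

aws elbv2 modify-target-group \
    --target-group-arn <target-group-arn> \
    --health-check-path / \
    --matcher HttpCode=200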

AWS CloudWatch Web Server Metrics

I have a few EC2 instances with NGINX installed, using both ports 80 and 443. The instances serve different applications, so I'm not using an ELB.
I would like to create a CloudWatch alarm to make sure port 80 is always returning a 200 HTTP status code. I realize there are several commercial solutions for this, such as New Relic, but this is the task I have at hand at the moment.
None of the EC2 metrics look able to accomplish this, and I cannot use any ELB metrics since I have no ELB.
What's the best way to resolve this?
You can definitely do this manually: send a request and publish a custom metric directly to CloudWatch, then monitor that metric.
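A minimal sketch of that approach, run from cron on the instance (the namespace and metric name are made up, and the instance's role needs cloudwatch:PutMetricData permission):

# probe the local web server and publish 1/0 to a custom CloudWatch metric
code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost/)
if [ "$code" -eq 200 ]; then value=1; else value=0; fi
aws cloudwatch put-metric-data \
    --namespace "Custom/WebServer" --metric-name Http200 --value "$value"

An alarm on that metric (for example, minimum < 1 over a few periods) then alerts you when port 80 stops returning 200.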
Or you could look into Route53 health checks. You might get away with just configuring a health check there if you are already using Route53:
http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html
Create a Route53 Health Check. Supported protocols are TCP, HTTP, and HTTPS.
The HTTP/S protocol supports matching the response payload against a user-defined string so you can not only react to connectivity problems but also to unexpected content being returned to users.
For more advanced monitoring, enable latency metrics, which collect TTFB (time to first byte) and SSL handshake times.
You can then create alarms to get alerts when one of your apps becomes inaccessible.
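A rough sketch of creating such a check with the AWS CLI (the domain, caller reference and thresholds are placeholders; use Type=HTTP_STR_MATCH plus SearchString=... if you also want to match the response body):

aws route53 create-health-check \
    --caller-reference app1-port80-check-001 \
    --health-check-config Type=HTTP,FullyQualifiedDomainName=app1.example.com,Port=80,ResourcePath=/,RequestInterval=30,FailureThreshold=3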