Health checking redis container with ALB

Health checking redis container with ALB - amazon-web-services

I have deployed a redis container using Amazon ECS, behind an application load balancer. It seems the health checks are failing, though the container is running and ready to accept connections. It seems to be failing because the health check is HTTP, and redis of course isn't an http server.
# Possible SECURITY ATTACK detected. It looks like somebody is sending
POST or Host: commands to Redis. This is likely due to an attacker
attempting to use Cross Protocol Scripting to compromise your Redis
instance. Connection aborted.
Fair enough.
Classic load balancers I figure would be fine since I can explicitly ping TCP. Is is feasible to use redis with ALB?

Change your health check to protocol HTTPS. All Amazon Load Balancers support this. The closer your health check is to what the user accesses the better. Checking an HTML page is better than a TCP check. Checking a page that requires backend services to respond is better. TCP will sometimes succeed even if your web server is not serving pages.

Deploy your container with nginx installed and direct the health check to nginx handling port.

I encountered a similar problem recently: My Redis container was up and working correctly, but the # Possible SECURITY ATTACK detected message appeared in the logs once every minute. The healthcheck was curl -fs http://localhost:6379 || exit 1; this was rejected by the Redis code (search for "SECURITY ATTACK").
My solution was to use a non-CURL healthcheck: redis-cli ping || exit 1 (taken from this post). The healthcheck status shows "healthy", and the logs are clean.
I know the solution above will not be sufficient for all parties, but hopefully it is useful in forming your own solution.

Related

Google cloud load balancer causing error 502 - failed_to_pick_backend

I've got an error 502 when I use google cloud balancer with CDN, the thing is, I am pretty sure I must have done something wrong setting up the load balancer because when I remove the load balancer, my website runs just fine.
This is how I configure my load balancer
here
Should I use HTTP or HTTPS healthcheck, because when I set up HTTPS
healthcheck, my website was up for a bit and then it down again
I have checked this link, they seem to have the same problem but it is not working for me.
I have followed a tutorial from openlitespeed forum to set Keep-Alive Timeout (secs) = 60s in server admin panel and configure instance to accepts long-lived connections ,still not working for me.
I have added these 2 firewall rules following this google cloud link to allow google health check ip but still didn’t work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking load balancer log message, it shows an error saying failed_to_pick_backend . I have tried to re-configure load balancer but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!

Posting an answer - based on OP's finding to improve user experience.
Solution to the error 502 - failed_to_pick_backend was changing Load Balancer from HTTP to TCP protocol and at the same type changing health check from HTTP to TCP also.
After that LB passes through all incoming connections as it should and the error dissapeared.
Here's some more info about various types of health checks and how to chose correct one.

The error message that you're facing it's "failed_to_pick_backend".
This error message means that HTTP responses code are generated when a GFE was not able to establish a connection to a backend instance or was not able to identify a viable backend instance to connect to
I noticed in the image that your health-check failed causing the aforementioned error messages, this Health Check failing behavior could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check if the running services were responding with a 200 (OK) to the Health Check probes and Verify your Backend Service timeout. The Backend Service timeout works together with the configured Health Check values to define the amount of time an instance has to respond before being considered unhealthy.
Additionally, You can see this troubleshooting guide to face some error messages (Including this).

Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

amazon aws ec2 instance status is unhealthy. Website loading time very high

I am using aws ec2 m1.medium as my webserver. From last two days website loading very slow. Health check status in amazon route53 shows Unhealthy. The following status shows in health checkers
Failure: Resolved IP: [my ip]. The endpoint did not respond to the health checker request within the timeout limit.
When i check in mxToolBox
Failure - response over threshold (12.21s/10s)
Can anybody help please.

The AWS network in itself never takes this much time to reach to the servers and it must be your application responding slow. to make sure ping the node first and then telnet/nc on the specific port that your application is using.
telnet <ip> <port>
netcat -u <ip> <port>
if you find there is a significant difference, then you need to troubleshoot your application which might be having any kind of issue.
If not, then you may have faulty ELB sitting in between that is causing such high latency and you might wanna restart/replace that.

AWS ALB health check pass HTTP but not Websocket

I have an application deployed in a Docker Swarm which have two publicly reachable services, HTTP and WS.
I created two target groups, one for each service, and the registered instances are the managers of the Docker Swarm. Then I created the ALB and added two HTTPS listeners, each one pointing to the specific target group.
Now comes the problem. The HTTP health check passes without no problem, but the Websocket check is always unhealthy, and I don't know why. According to http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html, using a HTTP/HTTPS listener should work for WS/WSS as well.
In the WS check, I have tried as path both / and the path the application is actually using /ws. Neither of them passes the health check.
It's not a problem related to firewall either. Security groups are wide open and there are no iptables rules, so connection is possible in both directions.
I launched the websocket container out of Docker Swarm, just to test if it was something related to Swarm (which I was pretty sure it was not, but hell.. for testing's sake), and it did not work either, so now I'm a little out of hope.
What configuration might I be missing, so that HTTP services work but Websocket services don't?.
UPDATE
I'm back with this issue, and after further researching, the problem seems to be the Target Group, not the ALB per se. Reading through the documentation http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html, I had forgotten to enable the stickiness option. However, I just did, and the problem persists.
UPDATE 2
It looks like the ELB is not upgrading the connection from HTTP to WebSocket.

ALBs do not support websocket health checks per:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
" Health checks do not support WebSockets."
The issue is that despite AWS claiming that the ALB supports HTTP2 in fact it downsamples everything to HTTP1 and then does it's thing then upgrades it to HTTP2 again which breaks everything.

I had a similar issue. When the ALB checks the WS service it receives an HTTP status of 101 (Switching protocols) from it. And, as noted in other answers, that's not a good-enough response for the ALB. I attempted changing the matcher code in the health check config but it doesn't allow anything outside the 200-299 range.
In my setup I had Socket.io running on top of Express.js so I solved it by changing the Socket.io path (do not confuse path with namespace) to /ws and let the Express.js answer the requests for /. Pointed the health check to / and that did it.

aws ELB log missing backend_status_code & elb_status_code

When I activated the logging on my ELB instance, I noticed that both the ELB_status_code and backend_status_code were missing.
I have a setup where the ELB redirects everything to 2 ha-proxies.
in the ha-proxy log the status code is visible.
The ELB is doing TCP:80 > TCP:80 using the proxy protocol.
Is there anything that i have to do specifically to enable the status code logging?

Those fields only apply to [HTTP listener].
When ELB is running in TCP mode, it's not aware that the protocol running through it happens to be HTTP, so those status codes can't be logged.
If you really want to see them, you'll need the ELB in HTTP mode... but whether that's the right choice depends on why you are using TCP mode -- web sockets, for example, require TCP mode.
Note also that if you switch the ELB to HTTP mode, some of the Tq/Tw/Tc/Tr/Tt the timers in the HAProxy logs will show values that are initially confusing, because the ELB holds connections open to the back-end (which is HAProxy) for reuse in a way that differs somewhat from the way browsers tend to. Logging the %Ci and %Cp parameters in HAProxy will help make some sense of these by allowing you to correlate them.
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html#access-log-entry-format

How can I configure an automatic timeout for an Elastic Load Balancer?

Does anyone know of a way to make Amazon's Elastic Load Balancers timeout if an HTTP response has not been received from upstream in a set timeframe?
Occasionally Amazon's Elastic Beanstalk will fail an update and any requests to the specified resource (running Nginx + Node if tht's any use) will hang any request pages whilst the resource attempts to load.
I'd like to keep the request timeout under 2s, and if the upstream server has no response by then, to automatically fail over to a default 503 response.
Is this possible with ELB?
Cheers

You can Configure Health Check Settings for Elastic Load Balancing to achieve this:
Elastic Load Balancing routinely checks the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances. For more information on configuring health check, see Health Check.
For example, you simply need to specify an appropriate Ping Path for the HTTP health check, a Response Timeout of 2 seconds and an UnhealthyThreshold of 1 to approximate your specification.
See my answer to What does the Amazon ELB automatic health check do and what does it expect? for more details on how the ELB health check system work.

TLDR - Set your timeout in Nginx.
Let's see if we can walkthrough the issues.
Problem:
The client should be presented with something quickly. It's okay if it's a 500 page. However, the ELB currently waits 60 seconds until giving up (https://forums.aws.amazon.com/thread.jspa?messageID=382182) which means it takes a minute before the user is shown anything.
Solutions:
Change the timeout of the ELB
Looks like AWS support will help increase the timeout (https://forums.aws.amazon.com/thread.jspa?messageID=382182) so I imagine that you'll be able to ask for the reverse. Thus, we can see that it's not user/api tunable and requires you to interact with support. This takes a bit of lead time and more importantly, seems like an odd dial to tune when future developers working on this project will be surprised by such a short timeout.
Change the timeout of the nginx server
This seems like the right level of change. You can use proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout) to do what you're looking for. Tune it to something small (and in particular, you can set it for a particular location if you would like).
Change the way the request happens.
It may be beneficial to change how your client code works. You could imagine shipping a really simple html/js page that 1. pings to see if the job is done and 2. keeps the user updated on the progress. This takes a bit more work then just throwing the 500 page.

Recently, AWS added a way to configure timeouts for ELB. See this blog post:
http://aws.amazon.com/blogs/aws/elb-idle-timeout-control/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js