RabbitMQ MochiWeb on AWS behind Load Balancer - amazon-web-services

I have an AWS setup with an Elastic Load Balancer that talks to a RabbitMQ cluster of two nodes. There is a plugin called RabbitHub that runs on MochiWeb as a REST interface to RabbitMQ. My problem is that I get a lot of 504 GATEWAY_TIMEOUT errors, with or without the load balancer. I'm forwarding HTTPS to HTTP on 15670 through the load balancer, but even when I go directly to the server through a VPN, I'll get a 504.
It appears that most GET requests work (like the base URL), but I have a significant issue with POSTs. Sometimes it works...sometimes it doesn't. I had about 4 good hours today, then went back to a nasty 2 hours. I'm really at the end of my knowledge here. What could be causing this?
AWS docs say to increase the keep-alive on the web server. Is that possible on MochiWeb?
Thanks --
Robert

Well, forget I said anything. The problem was that a MochiWeb error occurred, which closed the connection immediately. As a result, the load balancer reports the 504. If this helps anybody, I checked the SASL logs to see that there was an actual Erlang error. From there, I could see what the issue was and address it.

Related

gcp classic loadbalancer vs modern loadbalancer doesn't work with websocket

We are having some issues with getting websockets to work with a load balancer in google cloud. We narrowed it down to a difference between the classic load balancer (works fine) and the Https Loadbalancer with advanced traffic management that is selected by default but marked as a preview (does not work).
We have an instance group that definitely supports websockets. I.e. we can connect to it via the ip address.
We set up a load balancer and went for the one with traffic management. That worked fine for normal requests but all the websocket requests fail with a 502. We did not select http/2 (which is documented as not working for this). We tried all sorts of things to get this working. Even though it is documented that this should work out of the box it clearly doesn't.
$ websocat wss://lb.tryformation.com/websocket/messages
websocat: WebSocketError: Received unexpected status code (502 Bad Gateway)
websocat: error running
As a last resort, I then set up a classic lb with the same configuration, same instance group, same health check, same certificate, etc. And this worked on the first try.
So, clearly the new style loadbalancer does not work as advertised when it comes to websockets. The question is: why? Is this a known issue or is there something I should configure to get websockets working with that?
We're fine using the classic lb as it works. But I would like to understand the issue.
FWIW:
Assuming you're using GCP's Global External HTTP(S) "modern" Load Balancer, the documentation states under GCP CLB Overview > WebSocket support states:
The global external HTTP(S) load balancer with advanced traffic management capability does not support Websockets. Websockets work with the global external HTTP(S) load balancer (classic) and regional external HTTP(S) load balancer as expected.
If you're using the regional "modern" LB, keep in mind that these "modern" Load Balancers are still in Preview. I'm sure you've seen this, but I'm only noting this because I've had experience with GCP products in the past that claimed to "support websockets" while in "Preview", but didn't work correctly until avaiable in GA.
Since you didn't provide more details It's impossibler to reproduce it - hence try to conclude anything - there are just too many variables.
From your description it looks like some issue with traffic management in https load balacing - if you can reproduce it you can at Google's IssueTracker - under the load balancing component and describe the issue in more detail; provide detailed reproductions steps and if possible your setup that you used (or any other details that - after that someone will get back to you :)

Google cloud load balancer causing error 502 - failed_to_pick_backend

I've got an error 502 when I use google cloud balancer with CDN, the thing is, I am pretty sure I must have done something wrong setting up the load balancer because when I remove the load balancer, my website runs just fine.
This is how I configure my load balancer
here
Should I use HTTP or HTTPS healthcheck, because when I set up HTTPS
healthcheck, my website was up for a bit and then it down again
I have checked this link, they seem to have the same problem but it is not working for me.
I have followed a tutorial from openlitespeed forum to set Keep-Alive Timeout (secs) = 60s in server admin panel and configure instance to accepts long-lived connections ,still not working for me.
I have added these 2 firewall rules following this google cloud link to allow google health check ip but still didn’t work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking load balancer log message, it shows an error saying failed_to_pick_backend . I have tried to re-configure load balancer but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!
Posting an answer - based on OP's finding to improve user experience.
Solution to the error 502 - failed_to_pick_backend was changing Load Balancer from HTTP to TCP protocol and at the same type changing health check from HTTP to TCP also.
After that LB passes through all incoming connections as it should and the error dissapeared.
Here's some more info about various types of health checks and how to chose correct one.
The error message that you're facing it's "failed_to_pick_backend".
This error message means that HTTP responses code are generated when a GFE was not able to establish a connection to a backend instance or was not able to identify a viable backend instance to connect to
I noticed in the image that your health-check failed causing the aforementioned error messages, this Health Check failing behavior could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check if the running services were responding with a 200 (OK) to the Health Check probes and Verify your Backend Service timeout. The Backend Service timeout works together with the configured Health Check values to define the amount of time an instance has to respond before being considered unhealthy.
Additionally, You can see this troubleshooting guide to face some error messages (Including this).
Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

AWS Application load balancer rule not working for cookies ? What could have gone wrong?

I work in the dev-ops team at my company. Recently, we shifted to aws's application load balancer and we are forwarding the request based on a cookie's value. For some reason, the rule isn't working and AWS doesn't support logs to get information on why a rule faied.
There could be 2 reasons for this, that we can think of:
Load balancer isn't able to read the cookie: We don't think this should be the issue as the applications under this load balancer are able to read and also print the cookies.
The load balancer doesn't read subsequent cookies after the first request: We have raised a concern with AWS on this and they are still to get back.
Meanwhile, can anyone point to any possible issues which we might be overlooking?

AWS returns 503 for some websites

I have just started playing around with AWS and I've been reading the docs as I go but I have run across a strange problem that I cannot explain and I was hoping someone experienced in AWS would be able to answer.
Certain websites return a 503 http response from my EC2 node, yet others do not. For instance: Canadian Company Capabilities returns a 503 via lynx and other tools yet Government Login does not.
Is one being blocked from outside of Canada or not? How can I diagnose the root cause of the 503?
EDIT
I should mention I am using a standard CentOs, free tier ec2 instance. The rest of the pipeline is out of the box AWS free-tier as well.
EDIT 2 I have connected through a VPN in the states and it works fine as well which leads me to believe it's something I am missing with AWS.
How can I diagnose the root cause of the 503
You contact the site's administrator.
This won't be an AWS issue. AWS does not block, screen, scan, filter, modify, or otherwise manipulate Internet traffic that you initiate. It's possible for your own misconfiguration to block traffic entirely, but 503 is an HTTP error, which implies that you're making a connection to the distant end.
The exception to the above is outbound TCP port 25, which is not blocked but is very aggressively rate-limited unless you take the necessary steps to remove the block... but I mention this only for thoroughness; it of course would not be relevant to the issue at hand.

How can I configure an automatic timeout for an Elastic Load Balancer?

Does anyone know of a way to make Amazon's Elastic Load Balancers timeout if an HTTP response has not been received from upstream in a set timeframe?
Occasionally Amazon's Elastic Beanstalk will fail an update and any requests to the specified resource (running Nginx + Node if tht's any use) will hang any request pages whilst the resource attempts to load.
I'd like to keep the request timeout under 2s, and if the upstream server has no response by then, to automatically fail over to a default 503 response.
Is this possible with ELB?
Cheers
You can Configure Health Check Settings for Elastic Load Balancing to achieve this:
Elastic Load Balancing routinely checks the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances. For more information on configuring health check, see Health Check.
For example, you simply need to specify an appropriate Ping Path for the HTTP health check, a Response Timeout of 2 seconds and an UnhealthyThreshold of 1 to approximate your specification.
See my answer to What does the Amazon ELB automatic health check do and what does it expect? for more details on how the ELB health check system work.
TLDR - Set your timeout in Nginx.
Let's see if we can walkthrough the issues.
Problem:
The client should be presented with something quickly. It's okay if it's a 500 page. However, the ELB currently waits 60 seconds until giving up (https://forums.aws.amazon.com/thread.jspa?messageID=382182) which means it takes a minute before the user is shown anything.
Solutions:
Change the timeout of the ELB
Looks like AWS support will help increase the timeout (https://forums.aws.amazon.com/thread.jspa?messageID=382182) so I imagine that you'll be able to ask for the reverse. Thus, we can see that it's not user/api tunable and requires you to interact with support. This takes a bit of lead time and more importantly, seems like an odd dial to tune when future developers working on this project will be surprised by such a short timeout.
Change the timeout of the nginx server
This seems like the right level of change. You can use proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout) to do what you're looking for. Tune it to something small (and in particular, you can set it for a particular location if you would like).
Change the way the request happens.
It may be beneficial to change how your client code works. You could imagine shipping a really simple html/js page that 1. pings to see if the job is done and 2. keeps the user updated on the progress. This takes a bit more work then just throwing the 500 page.
Recently, AWS added a way to configure timeouts for ELB. See this blog post:
http://aws.amazon.com/blogs/aws/elb-idle-timeout-control/