How can I configure an automatic timeout for an Elastic Load Balancer? - amazon-web-services

Does anyone know of a way to make Amazon's Elastic Load Balancers time out if an HTTP response has not been received from upstream in a set timeframe?
Occasionally Amazon's Elastic Beanstalk will fail an update, and any requests to the affected resource (running Nginx + Node, if that's any use) will hang while the resource attempts to load.
I'd like to keep the request timeout under 2s, and if the upstream server has no response by then, to automatically fail over to a default 503 response.
Is this possible with ELB?
Cheers

You can Configure Health Check Settings for Elastic Load Balancing to achieve this:
Elastic Load Balancing routinely checks the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances. For more information on configuring health check, see Health Check.
For example, you simply need to specify an appropriate Ping Path for the HTTP health check, a Response Timeout of 2 seconds and an UnhealthyThreshold of 1 to approximate your specification.
See my answer to What does the Amazon ELB automatic health check do and what does it expect? for more details on how the ELB health check system works.
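For reference, here is a minimal boto3 sketch of those settings. The load balancer name and the /health ping path are placeholders, and note that (as far as I know) the Classic ELB API enforces a minimum Unhealthy Threshold of 2, so the sketch uses 2:

```python
# A minimal sketch, assuming a Classic ELB named "my-load-balancer" and a
# lightweight /health page served on port 80; adjust to your setup.
import boto3

elb = boto3.client("elb")  # Classic Load Balancer API

elb.configure_health_check(
    LoadBalancerName="my-load-balancer",
    HealthCheck={
        "Target": "HTTP:80/health",   # ping path for the HTTP health check
        "Interval": 10,               # seconds between checks
        "Timeout": 2,                 # fail the check if no response within 2 seconds
        "UnhealthyThreshold": 2,      # API minimum; pulls a slow instance quickly
        "HealthyThreshold": 2,        # consecutive successes before re-registering
    },
)
```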

TLDR - Set your timeout in Nginx.
Let's see if we can walk through the issues.
Problem:
The client should be presented with something quickly. It's okay if it's a 500 page. However, the ELB currently waits 60 seconds before giving up (https://forums.aws.amazon.com/thread.jspa?messageID=382182), which means it takes a minute before the user is shown anything.
Solutions:
Change the timeout of the ELB
It looks like AWS support will help increase the timeout (https://forums.aws.amazon.com/thread.jspa?messageID=382182), so I imagine you'll be able to ask for the reverse. In other words, it's not user/API tunable and requires you to interact with support. That adds lead time and, more importantly, seems like an odd dial to tune: future developers working on this project will be surprised by such a short timeout.
Change the timeout of the nginx server
This seems like the right level to make the change. You can use proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout) to do what you're looking for. Tune it to something small (in particular, you can set it for a specific location if you like).
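A sketch of what that could look like, assuming the Node app listens on 127.0.0.1:3000 and that you also want the 503 fallback the question asks for (paths and ports are placeholders):

```nginx
location / {
    proxy_pass http://127.0.0.1:3000;
    proxy_connect_timeout 2s;   # give the upstream 2 seconds to accept the connection
    proxy_read_timeout 2s;      # and 2 seconds to start responding
    error_page 502 504 =503 /maintenance.html;   # map upstream failures to a 503
}

location = /maintenance.html {
    root /var/www/static;       # serve the static fallback page locally
    internal;
}
```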
Change the way the request happens.
It may be beneficial to change how your client code works. You could ship a really simple HTML/JS page that 1. polls to see whether the job is done and 2. keeps the user updated on the progress. This takes a bit more work than just throwing up the 500 page.

Recently, AWS added a way to configure the idle timeout for ELB. See this blog post:
http://aws.amazon.com/blogs/aws/elb-idle-timeout-control/
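A minimal sketch of that setting via boto3, assuming a Classic ELB named "my-load-balancer". Note that this controls the idle timeout (how long an idle connection is kept open), which is not quite the same as returning a 503 when the upstream is slow:

```python
import boto3

elb = boto3.client("elb")
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-load-balancer",
    LoadBalancerAttributes={
        "ConnectionSettings": {"IdleTimeout": 2}  # seconds; valid range is 1-4000
    },
)
```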

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using a Route 53 domain and SSL on my URL. I set the ALB to redirect requests on port 80 to 443 and to forward requests on port 443 to the target group (the EC2). However, the target group sometimes returns a 5xx error code when handling requests. Please see the screenshots for the metrics and configuration of the ALB.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteadily; sometimes it returns a 502 or 503 Service Unavailable (it seems like a connection timeout).
I have set the ALB idle timeout to 4000 seconds.
ALB configuration
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork MPM MaxClients to 1000, which should be enough to handle the requests to the application.
The EC2 is a t2.large instance; CPU and memory look sufficient to handle the processing.
It seems that if I request the IP address directly rather than the domain, the number of 5xx errors is significantly reduced (but they still occur).
I also have a WordPress application hosted on this EC2 under a subdomain (CNAME). I have never encountered any 5xx errors on that subdomain, which makes me suspect there might be errors in my application code rather than on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 to the target group to see if having at least one healthy instance would handle the requests. However, the application uses a third-party API with a strict IP whitelist policy, and from what I've researched, the Elastic IP I got from AWS cannot be attached to two different EC2 instances.
First of all, if your application is prone to stutters, increase the health check's unhealthy threshold and timeout; that addresses the flapping healthy/unhealthy status from your original question.
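A minimal sketch of relaxing those settings on the ALB target group with boto3 (the target group ARN and the exact numbers are placeholders; tune them to how long your app legitimately takes to answer):

```python
import boto3

elbv2 = boto3.client("elbv2")
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/my-app/0123456789abcdef",
    HealthCheckIntervalSeconds=30,
    HealthCheckTimeoutSeconds=10,   # give a slow app more time to answer the probe
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=5,      # require more consecutive failures before marking unhealthy
)
```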
From what I can see in your screenshots, most of your 5xx errors are due to either the server or the application (you obviously know the culprit better than I do, since you have access to their logs).
To answer your question about 5xx errors coming from the LB itself: this happens right after the LB kicks out an unhealthy instance and has none to replace it (which shouldn't be the case, because you're supposed to have an ASG if you enable evaluation of target health for the LB); it can't produce meaningful output and so falls back to a 5xx.
This should be enough information for you to make adjustments and investigate the logs.

Google cloud load balancer causing error 502 - failed_to_pick_backend

I get a 502 error when I use Google Cloud Load Balancer with CDN. The thing is, I'm pretty sure I must have done something wrong setting up the load balancer, because when I remove it my website runs just fine.
This is how I configured my load balancer (screenshot here).
Should I use an HTTP or HTTPS health check? When I set up an HTTPS health check, my website was up for a bit and then went down again.
I have checked this link; they seem to have the same problem, but it is not working for me.
I have followed a tutorial from the OpenLiteSpeed forum to set Keep-Alive Timeout (secs) = 60 in the server admin panel and configured the instance to accept long-lived connections, but it is still not working for me.
I have added these two firewall rules following the Google Cloud links below to allow the health check IP ranges, but it still didn't work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking the load balancer log messages, I see an error saying failed_to_pick_backend. I have tried to re-configure the load balancer, but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!
Posting an answer based on the OP's finding, to improve user experience.
The solution to the error 502 - failed_to_pick_backend was changing the Load Balancer from HTTP to TCP protocol, and at the same time changing the health check from HTTP to TCP as well.
After that the LB passes through all incoming connections as it should, and the error disappeared.
Here's some more info about the various types of health checks and how to choose the correct one.
The error message you're facing is "failed_to_pick_backend".
This error means the HTTP response code is generated when a GFE (Google Front End) was not able to establish a connection to a backend instance, or was not able to identify a viable backend instance to connect to.
I noticed in the image that your health check failed, causing the aforementioned error messages. This health check failure could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check whether the running services respond with a 200 (OK) to the health check probes, and verify your Backend Service timeout. The Backend Service timeout works together with the configured health check values to define the amount of time an instance has to respond before being considered unhealthy.
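If it helps, here is a small, generic Python probe you can run on the backend instance itself to confirm the health check endpoint really answers 200; the port and the /healthz path are assumptions, substitute your configured health check path:

```python
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:80/healthz", timeout=5) as resp:
        print("status:", resp.status)       # the probe expects a 200 here
except urllib.error.HTTPError as e:
    print("status:", e.code)                # e.g. 401/404/503 explains the failing check
except OSError as e:
    print("connection failed:", e)          # nothing listening, or a firewall in the way
```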
Additionally, you can see this troubleshooting guide for common error messages (including this one).
Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

How to change AWS ELB status to InService?

A WordPress application is deployed in AWS Elastic Beanstalk behind a load balancer. I sometimes see ELB 5XX errors. So that the instance only goes OutOfService after a higher number of failed checks, I set the Unhealthy Threshold to 10. But sometimes the health check still fails and the health is Severe. I sometimes get the error "% of the requests to the ELB are failing with HTTP 5xx". I checked the ELB access logs: sometimes a request gets a timeout (504) error, and after a consecutive number of 504s the ELB marks the instance OutOfService. I am trying to work out which request is failing.
What I don't know is whether it is possible to bring the instance back "InService" as quickly as possible, because sometimes the instance is OutOfService for 2-3 hours, which is really bad. Is there a good way to handle this situation? It looks like once the service is out, there is nothing I can do. I am relatively new to AWS. Please help.
To solve this issue:
1) HTTP 504 means timeout. The resource that the load balancer is accessing on your backend is failing to respond. Determine the health check path from the AWS console (you can also read it via the API, as sketched after this answer).
2) In your browser, verify that you can access the health check path going around the load balancer. This may mean temporarily assigning an EIP to the EC2 instance. If the load balancer health check is "/test/myhealthpage.php", then use "http://REPLACE_WITH_EIP/test/myhealthpage.php". For HTTPS listeners, use https in your path.
3) Debug why the path that you specified is timing out and fix it.
Note: Health check paths should not point to pages that perform complicated tests or operations. A health check should be a quick and simple GO / NO GO type of page.
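A minimal sketch of reading the same information from the API with boto3, assuming a Classic ELB named "my-load-balancer" (the name is a placeholder):

```python
import boto3

elb = boto3.client("elb")

lb = elb.describe_load_balancers(LoadBalancerNames=["my-load-balancer"])
hc = lb["LoadBalancerDescriptions"][0]["HealthCheck"]
print("Health check target:", hc["Target"])   # e.g. "HTTP:80/test/myhealthpage.php"

health = elb.describe_instance_health(LoadBalancerName="my-load-balancer")
for state in health["InstanceStates"]:
    # State is "InService" or "OutOfService"; Description often says why.
    print(state["InstanceId"], state["State"], state.get("Description"))
```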

AWS load balancer and maintenance page

I'm using an AWS Load Balancer with 3 EC2 servers, and I'm trying to serve a maintenance page when the site is under maintenance.
This page needs to return a 503 HTTP code, because that is the proper code for maintenance mode and will prevent possible problems with SEO.
When I return a 503 code from any of my servers, the Load Balancer marks it "Not In Service", and when all servers return 503, the website returns a blank page (because all servers are disconnected).
My questions are:
1) Is there any way to serve a custom static page with a message for visitors from the Load Balancer when there are no healthy servers?
2) Or how can I configure the Load Balancer's health check so that it does not treat a 503 as a reason to mark a server "unhealthy"?
Thanks!
I've been searching for a quick way to do this. We needed to return a 503 error to the world during a DB upgrade, but whitelist a few developer IPs so they could test it before opening back up to the public.
Found a one-stop solution:
Go to Load Balancers in the EC2 console and select the load balancer you would like to target. Below, you should see Listeners. Click on a listener and edit the rules. Create rules like this:
Now everyone gets a pretty maintenance page returned with a 503 error code, and only the two IP addresses in the first rule will be able to browse the site. Order is important: the two IP exceptions are on top, then it goes down the list. The last (default) rule is always there. A scripted equivalent is sketched below.
Listener Rules for Your Application Load Balancer:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-update-rules.html
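For reference, here is a minimal scripted version of those two rules with boto3, assuming an Application Load Balancer; the ARNs, rule priorities and IP addresses are placeholders for illustration only:

```python
import boto3

elbv2 = boto3.client("elbv2")
LISTENER_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/my-alb/abc/def"
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/my-app/0123456789abcdef"

# Rule 1 (evaluated first): let the whitelisted developer IPs through to the app.
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=1,
    Conditions=[{"Field": "source-ip",
                 "SourceIpConfig": {"Values": ["203.0.113.10/32", "203.0.113.11/32"]}}],
    Actions=[{"Type": "forward", "TargetGroupArn": TARGET_GROUP_ARN}],
)

# Rule 2: everyone else gets a static maintenance page with a 503 status code.
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=2,
    Conditions=[{"Field": "path-pattern",
                 "PathPatternConfig": {"Values": ["*"]}}],
    Actions=[{"Type": "fixed-response",
              "FixedResponseConfig": {"StatusCode": "503",
                                      "ContentType": "text/html",
                                      "MessageBody": "<h1>Down for maintenance</h1>"}}],
)
```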
You could implement an additional route in your app server, say /hcm (for "health check maintenance"), that always responds 200 OK. When it's time for maintenance, programmatically modify the ELB health check to use the /hcm target, which returns 200 OK, rather than / or /index.html, which both return 503 Service Unavailable. Revert these changes when exiting maintenance.
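A minimal sketch of the "programmatically modify the ELB health check" step with boto3, assuming a Classic ELB; the load balancer name and the other health check values are placeholders:

```python
import boto3

elb = boto3.client("elb")

def set_health_check_path(path):
    # configure_health_check replaces the whole health check config,
    # so the interval/timeout/thresholds are restated here.
    elb.configure_health_check(
        LoadBalancerName="my-load-balancer",
        HealthCheck={
            "Target": f"HTTP:80{path}",
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 2,
            "HealthyThreshold": 2,
        },
    )

set_health_check_path("/hcm")   # entering maintenance: probe the always-200 route
# ... perform the maintenance ...
set_health_check_path("/")      # exiting maintenance: probe the normal page again
```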
It might not meet your 503 requirement, but a good option for this is using S3 and DNS failover: https://aws.amazon.com/blogs/aws/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting/
The load balancer will serve a 503 for you when there is no longer any healthy server behind it, so you shouldn't need to do anything special.
If you return anything but a 200 on the health check, ELB will take the machine out of the load balancer after it fails the configured number of health checks.
So to recap, you can potentially serve 503 from your app when in maintenance, but you have to return 200 for health checks all the time. If you don't care about the content of the page, you can simply remove the machines from the load balancer (or fail health checks) and the LB will do the right thing for you.

Why does Elastic Load Balancing report 'Out of Service'?

I am trying to set up Elastic Load Balancing (ELB) in AWS to split the requests between multiple instances. I have created several images of my webserver based on the same AMI, and I am able to ssh into each individually and access the site via each distinct public DNS.
I have added each of my instances to the load balancer, but they all come back with the Status: Out of Service because they failed the health check. I'm mostly confused because I can access each instance from its public DNS, but I get a timeout whenever I visit the load balancer DNS name.
I've been trying to read through all the docs and googling it, but I'm stuck. Any pointers or links in the right direction would be greatly appreciated.
I contacted AWS support about this same issue. Apparently their system doesn't know how to handle cases where all of the instances behind the ELB are stopped for an extended amount of time. AWS support can manually refresh the statuses if you need them up immediately.
The suggested fix is to de-register the EC2 instances from the ELB instead of just stopping them, and re-register them when you start them again.
The health check is (by default) made by accessing index.html on each instance registered with the load balancer. If you don't have index.html in the instance's document root, the default health check will fail. You can set a custom protocol, port and path for the health check when creating the Elastic Load Balancer.
Finally I got this working. The issue was with the Amazon Security Groups: I had restricted access to port 80 to a few machines in my development area, and the load balancer could not reach the Apache server on the instance. Once the load balancer gained access to my instance, it went In Service.
I checked it with tail -f /var/log/apache2/access.log on my instance to verify that the load balancer was trying to access my server, and to see the response the server gave to the load balancer.
Hope this helps.
If your web server is running fine, then it means the health check hits a URL that doesn't return 200.
A trick that works for me: log on to the instance and run curl localhost:80/pathofyourhealthcheckurl
Then you can adjust your health check URL so it always returns a 200 response.
In my case, the rules on security groups assigned to the instance and the load balancer were not allowing traffic to pass between the two. This caused the health check to fail.
I too faced the same issue. I changed the Ping Protocol from HTTPS to SSL and it worked!
Go to Health Check --> click on Edit Health Check -- > change Ping protocol from HTTPS to SSL
Ping Target SSL:443
Timeout 5 seconds
Interval 30 seconds
Unhealthy Threshold 5
Healthy Threshold 10
For anyone else who finds this thread, since this isn't listed yet:
Check that the health check targets the port that the responding server is actually listening on (a scripted version is sketched after this answer).
E.g. Node.js running on port 3000 -> point the health check at port 3000,
not port 80 or 443; those are what your ALB will be using.
I spent a morning on this. Yes.
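A minimal boto3 sketch of that change for an ALB target group (the ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2")
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/node-app/0123456789abcdef",
    HealthCheckPort="3000",   # the node.js port, not 80/443 on the instance
    HealthCheckPath="/",      # or whatever lightweight route the app serves
)
```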
I would like to offer a general way to solve this problem. Once you have set up your web server (Apache or Nginx), read its access log to see what happened. In my case, it reported a 401 error because I had added basic auth in Nginx. Of course, as @ivankoni reminded, it may also be because the document being checked does not exist.
I was working on the AWS Tutorial on hosting a web app and ran into this problem. Step 7b states the following:
"Set Ping Path to /. This sends queries to your default page, whether
it is named index.html or something else."
They could have put the forward slash in quotation marks, like this: "/". Make sure you have that in your health check, and not this: "/.".
Adding this because I've spent hours trying to figure it out...
If you configured your health check endpoint but it still says Out of Service, it might be because your server is redirecting the request (i.e. returning a 301 or 302 response).
For example, if your endpoint is supposed to be /app/health/ but you only enter /app/health (no trailing slash) into the health check endpoint field on your ELB, you will not get a 200 response, so the health check will fail.
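A quick way to see whether the health check path is being redirected: Python's http.client does not follow redirects, so a 301/302 shows up directly. The host and path below are assumptions taken from the example above:

```python
import http.client

conn = http.client.HTTPConnection("localhost", 80, timeout=5)
conn.request("GET", "/app/health")              # note: no trailing slash
resp = conn.getresponse()
print(resp.status, resp.reason)                 # a 301/302 here means the ELB check will fail
print("Location:", resp.getheader("Location"))  # where the server wanted to send you
conn.close()
```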
I had a similar issue. The problem appears to have been caused by my using an HTTP health check while also using .htaccess to password protect the site.
I got the same error; in my case I had to copy the particular HTML file from the S3 bucket to the "/var/www/html" location, the same HTML file referenced in the load balancer's health check path.
The issue was resolved after copying the HTML file.
I had this issue too, and it was because both the inbound and outbound rules for the Load Balancer's Security Group only allowed HTTP traffic on port 80. I needed to add another rule for HTTPS traffic on port 443.
I was also facing the same issue,
where the ELB (Classic Load Balancer) requests /index.html, not / (root), during the health check.
If it is unable to find the /index.html resource it reports 'OutOfService'. Make sure index.html is available.