Amazon AWS EC2 instance status is unhealthy and website loading time is very high

I am using an AWS EC2 m1.medium instance as my web server. For the last two days the website has been loading very slowly, and the health check status in Amazon Route 53 shows Unhealthy. The health checkers report the following status:
Failure: Resolved IP: [my ip]. The endpoint did not respond to the health checker request within the timeout limit.
When I check with MxToolbox, I get:
Failure - response over threshold (12.21s/10s)
Can anybody help, please?

The AWS network itself almost never adds this much latency on the way to your servers; it is most likely your application responding slowly. To confirm, ping the node first, and then telnet/nc to the specific port your application is using:
telnet <ip> <port>
nc -vz <ip> <port>
If you find a significant difference, you need to troubleshoot your application, which may be having some kind of issue.
If not, you may have a faulty ELB sitting in between that is causing the high latency, and you might want to restart or replace it.


502 errors are due to healthcheck setup or resource exhaustion

My setup is a Bitnami WordPress instance hosted on a GCP N2-standard-2 VM. I'm using an HTTPS load balancer and CDN.
I have encountered the 502 errors a few times since I configured the load balancer. I was doing quite a bit of SEO and page-scanning tests when this happened.
I've checked that the VM is only using 8-12% of its disk capacity, and the log shows a maximum CPU usage of 9.62%. I had to restart the VM to resolve the error.
What are the causes of the 502 errors?
Could it be due to a traffic spike from third-party scanning sites?
Is it because of my health check configuration?
Do I have to change the machine type and increase the memory?
What should I look into to troubleshoot it?
This is my healthcheck setup
The server was down again, and this time I managed to collect the information you suggested.
The error is not from the load balancer.
The error is from the VM, and the error message is:
"Error watching metadata: Get http://169.254.169.254/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=ag92d16ff423b06: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
VM disk size is 100 GB; machine type is N2-standard-2.
It is a WordPress instance.
Everything is within quota.
Incidents happen on a few occasions:
When I use a third-party site to scan the website for dead links: after the scan completes, the server goes down shortly afterwards, and I have to reboot the instance to make it functional again.
It also happens randomly and recovers by itself after a while.
Thanks everyone for your help. I managed to figure out how to retrieve the other required information.
I was wrong that the load balancer didn't report any errors.
Below is from Logging:
From load balancer: Client disconnected before any response
From load balancer: 502 - failed_to_pick_backend
From unmanaged instance group: Timeout waiting for data, and HTTP response Internal Server Error
I tried increasing the load balancer timeout duration, but the VM still shut down and rebooted on its own. Sometimes it takes a few minutes to recover, and sometimes about an hour.
I provided some screenshots which recorded the recent incident from 8:47 to 8:54.
Below is from Monitoring

Service not responding to ping command

My service (myservice.com), which is hosted on EC2, is up and running. I can see the Java process running on the machine, but I am not able to reach the service from external machines. I tried the following:
dig +short myservice.com
ping myservice.com
(1) resolves and gives me an IP address; ping shows 100% packet loss, and I am not able to reach the service.
Not sure where to look. Some help debugging would be appreciated.
EDIT:
I had an issue with a previous deployment due to which the service was not starting. I fixed that and tried to update, but the deployment was blocked by the ongoing deployment (which might take ~3 hours to stabilize), so I enabled the Force deployment option from the console.
I also tried reducing the "Number of Tasks" count to 0 and reverting it back to 1 (reference: How do I deploy updated Docker images to Amazon ECS tasks?) to stop the ongoing deployment.
Can that be an issue?
You probably need to allow the ICMP protocol in the security group.
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-ping

Google cloud load balancer causing error 502 - failed_to_pick_backend

I get a 502 error when I use a Google Cloud load balancer with CDN. I am pretty sure I must have done something wrong setting up the load balancer, because when I remove the load balancer my website runs just fine.
This is how I configured my load balancer:
here
Should I use an HTTP or HTTPS health check? When I set up an HTTPS health check, my website was up for a bit and then went down again.
I have checked this link; they seem to have the same problem, but the fix is not working for me.
I followed a tutorial from the OpenLiteSpeed forum to set Keep-Alive Timeout (secs) = 60 in the server admin panel and configured the instance to accept long-lived connections, but it's still not working for me.
I added these two firewall rules following these Google Cloud links to allow the Google health check IP ranges, but it still didn't work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking the load balancer log messages, I see an error saying failed_to_pick_backend. I have tried reconfiguring the load balancer, but it didn't help.
I just started learning Google Cloud and my knowledge is really limited; I would greatly appreciate it if someone could show me step by step how to solve this issue. Thank you!
Posting an answer based on OP's findings, to improve user experience.
The solution to the 502 - failed_to_pick_backend error was changing the load balancer from HTTP to TCP and, at the same time, changing the health check from HTTP to TCP as well.
After that, the LB passed through all incoming connections as it should, and the error disappeared.
Here's some more info about the various types of health checks and how to choose the correct one.
The error message you're facing is failed_to_pick_backend.
This HTTP response code is generated when a GFE was not able to establish a connection to a backend instance, or was not able to identify a viable backend instance to connect to.
I noticed in the image that your health check failed, causing the aforementioned error messages. This health check failure could be due to:
Web server software not running on the backend instance
Web server software misconfigured on the backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high; process killed or can't malloc()
- Maximum number of workers spawned and all of them busy (think mpm_prefork in Apache)
- Maximum number of established TCP connections reached
Check whether the running services respond with a 200 (OK) to the health check probes, and verify your backend service timeout. The backend service timeout works together with the configured health check values to define the amount of time an instance has to respond before being considered unhealthy.
Additionally, You can see this troubleshooting guide to face some error messages (Including this).
Those experienced with Kubernetes on other platforms may be confused as to why their Ingresses call their backends "UNHEALTHY".
Health checks are not the same thing as readiness probes and liveness probes.
Health checks are an independent utility used by GCP's load balancers that performs the same function, but is defined elsewhere. Failures here lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks
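On GKE specifically, the load balancer health check created behind an Ingress can be customised with a BackendConfig instead of relying on the inferred defaults. A minimal sketch; the resource name, request path, port, and timings here are assumptions for illustration:

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig     # hypothetical name
spec:
  healthCheck:
    type: HTTP
    requestPath: /healthz    # assumed application health endpoint
    port: 8080               # assumed serving port
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 3
```

The BackendConfig is attached to the backing Service via the `cloud.google.com/backend-config` annotation, so the backend service the Ingress creates picks up these health check settings instead of the defaults.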

Health checking redis container with ALB

I have deployed a Redis container using Amazon ECS, behind an application load balancer. The health checks are failing, even though the container is running and ready to accept connections. It seems to be failing because the health check is HTTP, and Redis of course isn't an HTTP server.
# Possible SECURITY ATTACK detected. It looks like somebody is sending
POST or Host: commands to Redis. This is likely due to an attacker
attempting to use Cross Protocol Scripting to compromise your Redis
instance. Connection aborted.
Fair enough.
Classic load balancers, I figure, would be fine since I can explicitly ping TCP. Is it feasible to use Redis with an ALB?
Change your health check to protocol HTTPS. All Amazon load balancers support this. The closer your health check is to what the user accesses, the better: checking an HTML page is better than a TCP check, and checking a page that requires backend services to respond is better still. TCP will sometimes succeed even if your web server is not serving pages.
Deploy your container with nginx installed and direct the health check to the port nginx is listening on.
I encountered a similar problem recently: my Redis container was up and working correctly, but the # Possible SECURITY ATTACK detected message appeared in the logs once every minute. The health check was curl -fs http://localhost:6379 || exit 1; this was rejected by the Redis code (search the Redis source for "SECURITY ATTACK").
My solution was to use a non-curl health check: redis-cli ping || exit 1 (taken from this post). The health check status now shows "healthy", and the logs are clean.
I know the solution above will not be sufficient for all parties, but hopefully it is useful in forming your own solution.
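For reference, wired into an ECS task definition such a container-level health check might look like this. This is a sketch; the image tag and the interval/timeout values are assumptions:

```json
{
  "containerDefinitions": [
    {
      "name": "redis",
      "image": "redis:7",
      "healthCheck": {
        "command": ["CMD-SHELL", "redis-cli ping | grep -q PONG || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 10
      }
    }
  ]
}
```

Note this is the health check ECS itself runs inside the container, independent of any load balancer health check configured on a target group.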

How can I configure an automatic timeout for an Elastic Load Balancer?

Does anyone know of a way to make Amazon's Elastic Load Balancer time out if an HTTP response has not been received from upstream within a set timeframe?
Occasionally Amazon's Elastic Beanstalk fails an update, and any requests to the specified resource (running Nginx + Node, if that's any use) hang while the resource attempts to load.
I'd like to keep the request timeout under 2 seconds and, if the upstream server has no response by then, automatically fail over to a default 503 response.
Is this possible with ELB?
Cheers
You can configure health check settings for Elastic Load Balancing to achieve this:
Elastic Load Balancing routinely checks the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to that instance and reroutes traffic to healthy instances. For more information on configuring health checks, see Health Check.
For example, you simply need to specify an appropriate ping path for the HTTP health check, a response timeout of 2 seconds, and an UnhealthyThreshold of 1 to approximate your specification.
See my answer to What does the Amazon ELB automatic health check do and what does it expect? for more details on how the ELB health check system works.
TL;DR - set your timeout in Nginx.
Let's walk through the issues.
Problem:
The client should be presented with something quickly; it's okay if it's a 500 page. However, the ELB currently waits 60 seconds before giving up (https://forums.aws.amazon.com/thread.jspa?messageID=382182), which means it takes a minute before the user is shown anything.
Solutions:
Change the timeout of the ELB
It looks like AWS support will help increase the timeout (https://forums.aws.amazon.com/thread.jspa?messageID=382182), so I imagine you could ask for the reverse. That said, it's not user/API tunable and requires you to interact with support. This takes some lead time and, more importantly, seems like an odd dial to tune: future developers working on this project will be surprised by such a short timeout.
Change the timeout of the Nginx server
This seems like the right level for the change. You can use proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout) to do what you're looking for. Tune it to something small (in particular, you can set it for a specific location if you'd like).
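A minimal sketch of that Nginx change; the upstream address is an assumption, and the 2-second value matches what the question asks for:

```nginx
location / {
    proxy_pass http://127.0.0.1:3000;   # assumed Node upstream
    proxy_connect_timeout 2s;
    proxy_read_timeout 2s;              # give up quickly instead of hanging
    error_page 502 504 = @unavailable;  # map gateway errors to a default response
}

location @unavailable {
    return 503;
}
```

When the upstream exceeds proxy_read_timeout, Nginx itself generates a 504, which the error_page directive converts into the default 503 the question asks for.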
Change the way the request happens
It may be beneficial to change how your client code works. You could ship a really simple HTML/JS page that (1) pings to see whether the job is done and (2) keeps the user updated on the progress. This takes a bit more work than just throwing the 500 page.
Recently, AWS added a way to configure timeouts for ELB. See this blog post:
http://aws.amazon.com/blogs/aws/elb-idle-timeout-control/