AWS ELB 502 at the same time every day - amazon-web-services

First, some insight into my setup:
1 ELB
4 EC2 instances
2 web servers
1 to run the migrations, queue (beanstalkd) and scheduler
1 'services' server (socket.io instance etc etc)
MySQL on RDS
Redis on ElastiCache
S3 for user assets
Every day at 10:55 PM, users report getting white screens and 502 Bad Gateway errors. The ELB reports that both EC2 instances are OutOfService, yet I'm SSH'd into them and fully able to use the site by bypassing the ELB. The RDS and ElastiCache maintenance windows aren't during this period, and the two instances aren't under load either. I can't find anything in the ELB access logs, nothing in the nginx logs on the instance end, and nothing in the Laravel app logs. There's nothing in the Laravel scheduler that runs at this time either.
The only thing I've found is that in my CloudWatch metrics, the ELB latency spikes right up to about 5-10 seconds. All of this results in downtime of about 5-15 minutes at the same time every day. I can't seem to find anything that is causing the issue.
I'm 100% stumped as to what could be causing this to happen. Any help is appreciated.

What probably happens is that your web servers run out of connections, so the ELB cannot complete its health checks and takes them out of service. It's enough for just one of the machines to hit this and be taken out of service; the other then receives all of the traffic and fails as a cascading effect.
How many connections can the web servers hold at the same time?
Do you process a particularly "heavy request" at that point in time when this happens?
Does adding more web servers solve your problem?
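If you want to confirm this while it is happening, the Classic ELB API exposes both the health check settings and the per-instance health state. A minimal boto3 sketch (the load balancer name and region are placeholders, not from the question):

```python
import boto3

elb = boto3.client("elb", region_name="us-east-1")  # placeholder region

LB_NAME = "my-elb"  # placeholder load balancer name

# Health check configuration: interval, timeout and unhealthy threshold together
# determine how quickly a busy instance is marked OutOfService.
lb = elb.describe_load_balancers(LoadBalancerNames=[LB_NAME])["LoadBalancerDescriptions"][0]
print("Health check:", lb["HealthCheck"])

# Per-instance state; ReasonCode "ELB" means the ELB's own health check failed,
# which is what you would expect if the web server has no free connections.
health = elb.describe_instance_health(LoadBalancerName=LB_NAME)
for state in health["InstanceStates"]:
    print(state["InstanceId"], state["State"], state["ReasonCode"], state["Description"])
```

Watching this during the 10:55 PM window should tell you whether the health check is timing out at the ELB (ReasonCode "ELB") or the instance itself is reporting unhealthy.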

Related

Setting up Latency Routing in AWS

I've been digging in the AWS docs for ages and am at my wits' end trying to find examples that aren't from the official AWS documentation.
How do I decide whether I need failover routing, latency routing, or both? I currently have the site on Elastic Beanstalk with both a dev and a production version, but at least a couple of times a month I get 500 or 502 errors where, if you refresh the page, it eventually loads but the CSS is missing, or the page doesn't load at all, or it's simply slow to load even with caching. How am I supposed to know whether this calls for failover routing, latency routing, or both? The AWS notifications only say "Environment health has transitioned from Degraded to Severe". How do I log which AWS server Route 53 routed the request to?
Are you supposed to have multiple EC2 instances for latency-based routing? I'm confused why the docs say to create a latency record for each of my EC2 instances.
https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/TutorialTransitionToLBR.html
I currently have CodePipeline connected to my GitHub repo, so changes are automatically deployed to the dev site, and I then manually approve changes to production. If I have multiple EC2 instances, do I need to set up a pipeline for each instance, connected to my GitHub, and manually approve changes for all of them? In other words, would I just have multiple copies of the site hosted in different regions in this situation? How do people manage this? I'm assuming there's some way to approve a production release for all instances at once if that's how it's done, but I don't know what to google.
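On the "one latency record per instance" confusion: each latency record points at a single endpoint and is tagged with that endpoint's region, and Route 53 answers each query with the record whose region has the lowest latency to the caller. A hedged boto3 sketch of what such a pair of records might look like (hosted zone ID, domain name and IPs are all placeholders):

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000"  # placeholder hosted zone
endpoints = {
    "us-east-1": "203.0.113.10",   # placeholder instance IPs, one endpoint per region
    "eu-west-1": "203.0.113.20",
}

changes = []
for region, ip in endpoints.items():
    changes.append({
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "www.example.com",   # placeholder domain
            "Type": "A",
            "SetIdentifier": region,     # must be unique within the record set
            "Region": region,            # the Region key is what makes this a latency record
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        },
    })

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Comment": "latency-based routing", "Changes": changes},
)
```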

AWS EC2 instance fails consistently at 30 seconds on long page load

I am running an ECS instance on EC2 with an Application Load Balancer, a Route 53 domain, and an RDS database. This is an internal business application that I have restricted IP access to.
I have run this app for 3 weeks with no issues. Today, however, the data that the web app ingests is abnormally large. This is not a mistake. Because of it, one web page takes approximately 4 minutes to complete, which I verified it does on my local machine. Running the same operation on AWS, though, fails at precisely 30 seconds every time.
I have connected the app running on my local machine to my production RDS database and am able to download and upload the data with no issue, so there is no problem with RDS itself. In addition, this same functionality has worked previously and only failed today because of the large amount of data.
I spent hours with Amazon support trying to solve this, but we couldn't figure it out. I am assuming one of the AWS services I am using has a TTL or timeout set to 30 seconds, but I couldn't find it in any of them:
Route 53
RDS
ECS
ECR
EC2
Load Balancer
Target Group
You have a backend instance timeout, likely in the web server config.
Right now your ELB has a timeout of 60 seconds, but your requests are failing at 30.
There are only a couple of pieces on AWS with hardcoded timeouts like that. I'm thinking (because this is the first time it's happened) that you have one of the following:
Size limits in the upstream, or
Time limits on connection keep-alive
Look at your web server software (httpd/nginx). Nginx lets you set upstream/proxy timeouts (some setups keep these in an "upstream.conf" include); I'm not sure if httpd has an equivalent.
Resources:
https://serverfault.com/questions/414987/nginx-proxy-timeout-while-uploading-big-files
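Separately, even once the web server timeout is raised, a request that takes roughly 4 minutes will also exceed the load balancer's default 60-second idle timeout, so that likely needs raising too. A minimal boto3 sketch, assuming an Application Load Balancer (the ARN and region are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")  # placeholder region

LB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123"  # placeholder ARN

# Check the current idle timeout (defaults to 60 seconds).
attrs = elbv2.describe_load_balancer_attributes(LoadBalancerArn=LB_ARN)["Attributes"]
current = {a["Key"]: a["Value"] for a in attrs}
print("idle timeout:", current.get("idle_timeout.timeout_seconds"))

# Raise it above the longest expected request duration (~4 minutes here).
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=LB_ARN,
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "300"}],
)
```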
From the NLB documentation, maybe relevant
EC2 instances must respond to a new request within 30 seconds in order to establish a return path.
I don't actually know what a return path is, nor what a 'response' is in this context since NLB has no concept of requests or responses.
- https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
EDIT: Disregard, this must have to do with UDP NATing. 'Response' here is probably a packet going back from the EC2 instance to the client

Elastic Beanstalk reports 5xx errors even though instances are in perfect health

I need to set up an API application for gathering event data to be used in a recommendation engine. This is my setup:
Elastic Beanstalk env with a load balancer and autoscaling group.
I have 2x t2.medium instances running behind a load balancer.
The Elastic Beanstalk configuration is 64bit Amazon Linux 2016.03 v2.1.1 running Tomcat 8 Java 8.
Additionally, I have 8x t2.micro instances that I use for high-load testing, sending thousands of requests/sec to the API.
I'm using Locust (http://locust.io/) as my load testing tool.
Each t2.micro instance run by Locust can send up to about 500 req/sec.
Everything works fine while the request rate is below 1000, maybe 1200 req/sec. Once over that, my load balancer reports that some of the instances behind it are returning 5xx errors (attached). I've also tried with 4 instances behind the load balancer, and although things start out well at up to 3000 req/sec, soon after, both the Elastic Beanstalk health tool and Locust report 503s and 504s, while all of the instances are in perfect health according to the actual numbers in the Elastic Beanstalk Health Overview, showing only 10%-20% CPU utilization.
Is there something I'm missing in configuring the environment? It seems like no matter how many machines I have behind the load balancer, the environment handles no more than 1000-2000 requests per second.
EDIT:
Now I know for sure that it's the ELB that is causing the problems, not the instances.
I ran a load test with simulated users: each user sends about 1 req/sec, and the load increases by 10 users/sec up to 4000 users, which should equal about 4000 req/sec. Still, it doesn't seem to handle any request rate over 3.5k req/sec (attachment1).
As you can see from attachment2, the 4 instances behind the load balancer are in perfect health, but I still keep getting 503 errors. It's the load balancer itself causing the problems: look how SurgeQueueLength and SpilloverCount increase rapidly at some point (attachment3). I'm trying to figure out why.
I also completely removed the load balancer and tested with just one instance on its own. It can handle up to about 3k req/sec (attachment4 and attachment5), so it's definitely the load balancer.
Maybe I'm missing some crucial default limit that load balancers have, like the surge queue size of 1024? What request rate should a single load balancer normally handle? Should I be adding more load balancers? Could it be related to availability zones, with ELB listeners in one zone trying to route to instances in a different zone?
attachment1:
attachment2:
attachment3:
attachment4:
attachment5:
UPDATE:
Cross zone load balancing is enabled
UPDATE:
maybe this helps more:
The message says that "9.8% of the requests to the ELB are failing with HTTP 5xx (6 minutes ago)". This does not mean that your instances are returning HTTP 5xx responses; the requests are failing at the ELB itself. This can happen when your backend instances are at capacity (e.g. their connections are saturated and they are rejecting new connections from the ELB).
Your requests are spilling over at the ELB. They never make it to the instance. If they were failing at the EC2 instances then the cause would be different and data for the environment would match the data for the instances.
Also note that the cause says this was the state "6 minutes ago". Elastic Beanstalk uses multiple data sources: one is the data coming from the instances, which provides the requests per second and HTTP status codes in the table shown; another is the CloudWatch metrics for your ELB. Since CloudWatch metrics for ELB have one-minute granularity, this data is slightly delayed, and the cause tells you how old the information is.
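To quantify how much is spilling over, you can pull the Classic ELB's SurgeQueueLength and SpilloverCount metrics from CloudWatch for the load-test window. A minimal boto3 sketch (the load balancer name, region and time window are placeholders):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

LB_NAME = "awseb-e-x-AWSEBLoa-XXXXXXXXXXXX"  # placeholder Classic ELB name
end = datetime.utcnow()
start = end - timedelta(hours=1)  # the load-test window

for metric, stat in [("SurgeQueueLength", "Maximum"), ("SpilloverCount", "Sum")]:
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric,
        Dimensions=[{"Name": "LoadBalancerName", "Value": LB_NAME}],
        StartTime=start,
        EndTime=end,
        Period=60,  # ELB metrics are published at 1-minute granularity
        Statistics=[stat],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point[stat])
```

The surge queue holds at most 1024 requests per load balancer node; once it is full, further requests are rejected and counted as spillover, which lines up with the 503s reported during the test.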

AWS ELB Latency issue

I have two c3.2xlarge EC2 machines running Ubuntu, both in the us-west-2a AZ. Both contain the same code and use a MySQL database on AWS RDS (db.r3.2xlarge). Both instances are added to an ELB, and each has one cron job scheduled to run twice a day.
The ELB has been configured to raise an alarm once the threshold crosses 5.0. The CPU utilization of both instances averages 30-50%. At peak hours it hits 100% for a minute or two and then returns to normal. But the ELB raises the alarm three times a day. At that time, both instances have:
CPU: ~50%
Memory (MB): total 14979, used ~6000, free ~9000
RDS CPU: ~30%
RDS connections: 200 to 300 of 5,000
According to https://aws.amazon.com/premiumsupport/knowledge-center/elb-latency-troubleshooting/ I could find nothing wrong with the instances. But latency still hits its peak and both instances fail to respond.
Until now, my workaround has been to remove one of the instances from the load balancer, restart Apache, add it back, and then do the same for the other instance. This does the job, and the instances and ELB work fine for the next 6-10 hours. But it is not acceptable that someone has to tend to the servers and restart them two or three times every day.
I need to know if there is anything wrong, or what steps to take to resolve this problem.
From your question it's not clear what the ELB alarm is actually monitoring - 5.0 what, 500s?
What I guess happens is that when the CPU spikes to 100%, the service that sits behind the load balancer is slow to respond or not responding at all, and the alert is triggered.
Even worse, if just one of the instances fails (assuming the cron jobs don't run at the same time), the ELB will take that instance out of service and the other instance will take all of the traffic. If one instance cannot handle all of the traffic, the second instance will fail too and also trigger the alert.
Why do you need to run the cron job on the same machine as the service? Is moving it off these machines an option? Also, is increasing the ELB health check timeouts an option?
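On that last point, the Classic ELB health check thresholds can be relaxed so that a one-to-two-minute CPU spike does not immediately pull an instance out of service. A minimal boto3 sketch (the load balancer name, health check path and exact numbers are placeholders to adjust):

```python
import boto3

elb = boto3.client("elb", region_name="us-west-2")  # instances are in us-west-2a

# With these (placeholder) settings an instance is only marked unhealthy after
# 5 consecutive failures 30 seconds apart, i.e. roughly 2.5 minutes of trouble,
# which should ride out a short CPU spike from the cron job.
elb.configure_health_check(
    LoadBalancerName="my-elb",       # placeholder load balancer name
    HealthCheck={
        "Target": "HTTP:80/health",  # placeholder health check path
        "Interval": 30,
        "Timeout": 10,
        "UnhealthyThreshold": 5,
        "HealthyThreshold": 2,
    },
)
```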

AWS and ELB Network throughput limits

My site runs on AWS and uses an ELB.
I regularly see 2K concurrent users, and during these times requests through my stack become slow and take a long time to get a response (30s-50s).
None of my servers or my database show significant load at these times,
which leads me to believe my issue could be related to the ELB.
I have added some images from a busy day on my site, showing graphs of my main ELB. Can you perhaps spot something that would give me insight into my problem?
Thanks!
UPDATE
The ELB in the screengrabs is my main ELB, forwarding to multiple Varnish cache servers. In my Varnish VCL I was sending misses for a couple of URLs, but Varnish has a request-queuing behavior, so what I ended up doing was set a high TTL for these requests and return hit_for_pass for them. This lets Varnish know that these requests should be passed to the back-end immediately instead of being queued. Since doing this, the problem outlined above has been completely fixed.
Did you SSH into one of the servers? Maybe you are reaching a connection limit in Apache or whatever server you run. Also check the CloudWatch metrics for the EBS volumes attached to your instances; maybe they are causing an I/O bottleneck.
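For the EBS side, the relevant per-volume CloudWatch metrics are VolumeQueueLength and, for gp2 volumes, BurstBalance. A minimal boto3 sketch (the volume ID, region and time window are placeholders):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume ID
end = datetime.utcnow()
start = end - timedelta(hours=6)  # cover one of the slow periods

for metric in ("VolumeQueueLength", "BurstBalance"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric,
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], round(point["Average"], 2))
```

A sustained high queue length, or a burst balance that drains toward zero during the slow windows, would point to an I/O bottleneck on the instances rather than the ELB.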