AWS ELB Latency issue - amazon-web-services

I have two c3.2xlarge EC2 machines running Ubuntu, both in the us-west-2a AZ. Both contain the same code and use a MySQL database on AWS RDS (db.r3.2xlarge). Both instances are added to an ELB. Both have one cron job scheduled that runs twice a day.
The ELB has been configured to raise an alarm once the threshold crosses 5.0. The CPU utilization of both instances averages 30-50%. At peak hours it hits 100% for a minute or two and then returns to normal. But the ELB consistently raises the alarm three times a day. At that time, both instances have:
CPU - ~50%
Memory - total - 14979
used - ~6000
free - ~9000
RDS CPU - ~30%
Connections - 200 to 300 /5,000
According to https://aws.amazon.com/premiumsupport/knowledge-center/elb-latency-troubleshooting/ I could find nothing wrong with the instances. But latency still hits its peak and both instances fail to respond.
Until now, I have just been removing one of the instances from the load balancer, restarting Apache, adding it back, and then doing the same for the other instance. This works, and the instances and ELB run fine for the next 6-10 hours. But this is not acceptable, since two or three times every day someone has to look after the servers and restart them.
I need to know whether anything is wrong, or what steps to take to resolve this problem.

From your question, it's not clear what the ELB alarm is monitoring. Is the 5.0 threshold a count of 500s?
My guess is that when the CPU spikes to 100%, the service behind the load balancer is slow to respond or not responding at all, and the alert is triggered.
Even worse, if just one of the instances fails (assuming the cron jobs don't run at the same time), the ELB will take that instance out of service and the other instance will take all of the traffic. If one instance cannot handle all of the traffic, the second instance will fail as well and also trigger the alert.
Why do you need to run the cron job on the same machine as the service? Is moving this off these machines an option? Also: is increasing the ELB health check timeouts an option?
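The cascading failure described above can be made concrete with a small sketch. The traffic and capacity numbers are hypothetical (nothing in the question gives them), and the model is deliberately simple: it assumes the ELB spreads traffic evenly over in-service instances and that an instance pushed past its capacity fails its health check.

```python
def instances_left_after_failure(total_rps, capacity_rps, instances, failed=1):
    """Return how many instances stay in service after `failed` of them drop out.

    Simplified model (an assumption, not the ELB's real algorithm): traffic is
    split evenly over the remaining instances. If that share exceeds what one
    instance can serve, every remaining instance is overloaded and the whole
    pool goes out of service.
    """
    remaining = instances - failed
    if remaining <= 0:
        return 0
    share = total_rps / remaining
    return remaining if share <= capacity_rps else 0


# Hypothetical numbers: 1000 req/s in total, each instance can serve 600 req/s.
print(instances_left_after_failure(1000, 600, instances=2))  # 0
print(instances_left_after_failure(1000, 600, instances=3))  # 2
```

With two instances, losing one takes the other down too; with three, the pool survives a single failure. That is the reason for asking whether the cron job can move elsewhere or whether more capacity is an option.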

Related

Fargate deployment restarting multiple times before it comes online

I have an ECS service deployed on Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I see the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to an NLB, the new task registers itself with the NLB target group.
But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This is repeated multiple times (the number varies; today it took 7 hours for the rolling update to finish).
There are no changes to the infra after the deployment. The security group allows traffic within the VPC. The NLB and the ECS service are deployed into the same VPC and the same subnet.
The Fargate health check fails for tasks running the same Docker image N times, but after that it starts working.
The target group's healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port, and the interval is 30. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task comes up, I tried opening a security group rule for the VPN and accessing the task IP directly. I can reach the microservice directly via the task IP.
But why is the NLB health check failing?
I had the exact same issue.
I simulated it with different images (Go, Python), as I suspected CPU/memory utilization overhead, which turned out to be false.
A mitigation is to change the Fargate deployment parameter Minimum healthy percent to 50% (it was 100% before, which seemed to cause the issue).
After the change, the failures became rare, but they still occurred.
The real solution is still unknown; it seems to be related to the NLB configuration in Fargate.
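The health check settings quoted in the question allow a quick back-of-the-envelope timing check. The sketch below is a simplified model, not the NLB's documented behaviour: it assumes the first check fires at the moment the JVM starts, that checks repeat exactly every interval, and that the app accepts TCP connections once the "Started myapp" line is logged. The function names are made up for illustration.

```python
import math


def failed_checks_before_ready(startup_s, interval_s):
    """Number of checks that run before the app is ready (first check at t=0)."""
    return math.ceil(startup_s / interval_s)


def seconds_until_healthy(startup_s, interval_s, healthy_threshold):
    """Time of the check that completes `healthy_threshold` consecutive passes."""
    first_passing_check = math.ceil(startup_s / interval_s)  # index of first pass
    return (first_passing_check + healthy_threshold - 1) * interval_s


# Values from the question: JVM up after 45.734 s, interval 30 s, threshold 3.
print(failed_checks_before_ready(45.734, 30))  # 2
print(seconds_until_healthy(45.734, 30, 3))    # 120
```

Under these assumptions only two checks fail during startup, which is below the unhealthy threshold of 3, so startup time alone would not explain the repeated failures. The real registration timing may differ from this model.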

AWS Elastic Beanstalk restart docker if CPU is 100% for longer time

We have Elastic Beanstalk set to load balancing. When our app consumes 100% CPU for a long time (e.g. after some downtime, when we receive tons of webhooks), the load balancer restarts Docker inside the instance. Our app takes approx. 2 minutes to start, so we can never recover from downtime.
Is there any way to extend this restart period or even disable it?
Scaling using CPU threshold is not an option for us as our app consumes lots of CPU during higher load.
This seems like a case of a failed health check.
Go to your EC2 Dashboard => Load Balancers.
Find the load balancer that targets your EB environment. Under the Health Check tab, you can see and edit the threshold: the number of failed ping requests to your instance before it is considered unhealthy and terminated.
More information on health checks here and here
Increasing the instance size from small to medium actually solved my problem. It seems that the app could not handle this amount of load with the limited resources of the small instance type.
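As a rough illustration of the health check advice above, the sketch below derives an unhealthy threshold that tolerates an app taking about 2 minutes to start. It assumes a Classic Load Balancer, whose thresholds (to my understanding) accept values from 2 to 10; the helper name, the "HTTP:80/" target and the load balancer name are placeholders, not values from the question. Elastic Beanstalk may also manage these settings through its own environment configuration, so treat this as a sketch only.

```python
import math


def tolerant_health_check(startup_s, target="HTTP:80/", interval_s=30, timeout_s=5):
    """Build a health check dict that tolerates failed pings during startup."""
    # Enough failed pings to cover the startup window, plus one spare.
    unhealthy = math.ceil(startup_s / interval_s) + 1
    unhealthy = max(2, min(10, unhealthy))  # assumed valid range: 2..10
    return {
        "Target": target,
        "Interval": interval_s,
        "Timeout": timeout_s,
        "UnhealthyThreshold": unhealthy,
        "HealthyThreshold": 2,
    }


check = tolerant_health_check(startup_s=120)
print(check["UnhealthyThreshold"])  # 5

# Applying it would look roughly like this (needs boto3 and AWS credentials;
# "my-eb-load-balancer" is a placeholder name):
# import boto3
# boto3.client("elb").configure_health_check(
#     LoadBalancerName="my-eb-load-balancer", HealthCheck=check)
```

With a 30 second interval, five tolerated failures give the app roughly 150 seconds before the instance is marked unhealthy.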

Customizing/Architecting AWS ELB to have Zero Downtime

So the other day we faced an issue where one of the instances behind our Application Load Balancer failed both the Instance Status Check and the System Check. It took about 10 sec (the minimum we can get) for our ELB to detect this and mark the instance as "unhealthy"; however, we lost some traffic in those 10 seconds as the ELB kept routing traffic to the unhealthy instance. Is there a solution that avoids literally any downtime, or am I being unrealistic?
I'm sure this isn't the answer you want to hear, but in order to minimize traffic loss on your systems if 10s is not tolerable, you'll need to implement your own health check/load balancing solution. My organization has systems where packet loss is unacceptable as well, and that's what we needed to do.
This solution is twofold.
You need to implement your own load-balancing infrastructure. We chose to use Route53 weighted record sets (TTL of 1s, we'll get back to this) with equal weight for each server.
Launch an ECS container instance per load-balanced EC2 instance whose sole purpose is to health check. It runs both DNS and IP health checks (requests library in Python) and adds or removes the Route53 weighted record in real time as it sees an issue.
In our testing, however, we discovered that while the upstream DNS servers from Route53 honor the 1 second TTL upon removal of a DNS record, they "blacklist" that record (FQDN + IP combo) from coming back up again for up to 10 minutes (we get variance of resolution times from 1m-10m). So you'll be able to failover quickly, but you must take into account it will take up to 10 minutes for the re-addition of the record to be honored.
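A minimal sketch of what such a health checker could look like follows. It is not the answerer's actual code: it uses the standard library's urllib in place of requests, and the hostname, IP, set identifier and hosted zone ID are placeholders. The functions only probe a URL and build the Route 53 change; the boto3 call that would apply it is left as a comment.

```python
import http.client
import urllib.request


def is_healthy(url, timeout_s=1.0):
    """Return True if the URL answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and refused connections land here.
        return False


def weighted_record_change(action, fqdn, ip, set_id, weight=1, ttl=1):
    """Build a ChangeBatch that adds ("UPSERT") or removes ("DELETE") one record."""
    return {
        "Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": fqdn,
                "Type": "A",
                "SetIdentifier": set_id,  # distinguishes the weighted records
                "Weight": weight,         # equal weight for each server
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }]
    }


def plan_change(healthy, in_dns, fqdn, ip, set_id):
    """Decide which change, if any, brings DNS in line with the probe result."""
    if healthy and not in_dns:
        return weighted_record_change("UPSERT", fqdn, ip, set_id)
    if not healthy and in_dns:
        return weighted_record_change("DELETE", fqdn, ip, set_id)
    return None  # DNS already matches the server's state


# Placeholder values; a real checker would loop over is_healthy(...) results.
batch = plan_change(False, True, "app.example.com.", "203.0.113.10", "server-1")
print(batch["Changes"][0]["Action"])  # DELETE

# Applying the change would look roughly like this (needs boto3 and credentials):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZONE_ID_PLACEHOLDER", ChangeBatch=batch)
```

Keeping the decision in `plan_change` separate from the network calls makes the failover logic easy to test without touching DNS.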

Elastic Beanstalk reports 5xx errors even though instances are in perfect health

I need to set up an api application for gathering event data to be used in a recommendation engine. This is my setup:
Elastic Beanstalk env with a load balancer and autoscaling group.
I have 2x t2.medium instances running behind a load balancer.
The Elastic Beanstalk platform is 64bit Amazon Linux 2016.03 v2.1.1 running Tomcat 8 Java 8
Additionally I have 8x t2.micro instances that I use for high-load testing the api, sending thousands of requests/sec to be handled by the api.
I'm using Locust (http://locust.io/) as my load testing tool.
Each t2.micro instance that is run by Locust can send up to about 500req/sec
Everything works fine while the reqs/sec are below 1000, maybe 1200. Once over that, my load balancer reports that some of the instances behind it are returning 5xx errors (attached). I've also tried with 4 instances behind the load balancer, and although things start out well with up to 3000req/sec, soon after, the Elastic Beanstalk health tool and Locust both report 503s and 504s, while all of the instances are in perfect health according to the actual numbers in the Elastic Beanstalk Health Overview, showing only 10%-20% CPU utilization.
Is there something I'm missing in configuring the env? It seems like no matter how many machines I have behind the load balancer, the env handles no more than 1000-2000 requests per second.
EDIT:
Now I know for sure that it's the ELB that is causing the problems, not the instances.
I ran a load test starting with 10 simulated users. Each user sends about 1req/sec and the load increases by 10 users/sec to 4000 users, which should equal about 4000req/sec. Still, it doesn't seem to handle any request rate over 3.5k req/sec (attachment1).
As you can see from attachment2, the 4 instances behind the load balancer are in perfect health, but I still keep getting 503 errors. It's just the load balancer itself causing problems. Look how SurgeQueueLength and SpilloverCount increase rapidly at some point. (attachment3) I'm trying to figure out why.
Also I completely removed the load balancer and tested with just one instance alone. It can handle up to about 3k req/sec. (attachment4 and attachment5), so it's definitely the load balancer.
Maybe I'm missing some crucial limit that load balancers have by default, like the queue size of 1024? What is the normal request rate that one load balancer can handle? Should I be adding more load balancers? Could it be related to availability zones? Are ELB listeners in one zone trying to route to instances in a different zone?
attachment1 through attachment5: [images not available]
UPDATE:
Cross zone load balancing is enabled
UPDATE:
maybe this helps more:
The message says that "9.8 % of the requests to the ELB are failing with HTTP 5xx (6 minutes ago)". This does not mean that your instances are returning HTTP 5xx responses. The requests are failing at the ELB itself. This can happen when your backend instances are at capacity (e.g. connections are saturated and they are rejecting connections from the ELB).
Your requests are spilling over at the ELB. They never make it to the instance. If they were failing at the EC2 instances then the cause would be different and data for the environment would match the data for the instances.
Also note that the cause says that this was the state "6 minutes ago". Elastic Beanstalk uses multiple data sources: one is the data coming from the instances, which shows the requests per second and HTTP status codes in the table shown. Another is the CloudWatch metrics for your ELB. Since CloudWatch metrics for the ELB are reported at 1-minute intervals, this data is slightly delayed, and the cause tells you how old the information is.
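The spillover behaviour can be illustrated with a toy queue model. It assumes the surge queue of a Classic Load Balancer holds at most 1024 requests (the limit the question itself mentions) and that anything beyond that is rejected and counted as spillover. The arrival and service rates are hypothetical round numbers, and the one-second time step is a simplification.

```python
def simulate_surge_queue(arrivals_per_s, served_per_s, seconds, max_queue=1024):
    """Step a capped queue once per second; return (queue_length, spillover)."""
    queue = 0
    spillover = 0
    for _ in range(seconds):
        queue += arrivals_per_s              # new requests reach the load balancer
        queue -= min(queue, served_per_s)    # backends accept what they can
        if queue > max_queue:
            spillover += queue - max_queue   # rejected at the ELB (seen as 503s)
            queue = max_queue
    return queue, spillover


# Backends keep up: nothing queues, nothing spills.
print(simulate_surge_queue(3000, 3500, seconds=5))  # (0, 0)
# Arrivals exceed what the backends accept: the queue fills, then spills.
print(simulate_surge_queue(4000, 3500, seconds=5))  # (1024, 1476)
```

In this model the instances never see the rejected requests, which matches the picture of healthy instances with low CPU while the ELB reports 5xx errors.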

AWS ELB 502 at the same time every day

First, some insight into my setup:
1 ELB
4 EC2 instances
2 web servers
1 to run the migrations, queue (beanstalkd) and scheduler
1 'services' server (socket.io instance etc etc)
MySQL on RDS
Redis on Elasticache
S3 for user assets
Every day at 10:55PM, users report getting white screens and 502 Bad Gateway errors. The ELB reports that both EC2 instances are OutOfService, yet I'm SSH'd into them and fully able to use the site by bypassing the ELB. RDS and Elasticache maintenance windows aren't during this period, and the two instances aren't under load either. I can't find anything in the ELB access logs, nothing in the nginx logs on the instance end, and nothing in the Laravel app logs. There's nothing in the Laravel scheduler that runs at this time either.
The only thing I've found is that in my CloudWatch metrics, the ELB latency spikes right up to about 5-10 seconds. All this results in downtime of about 5-15 minutes at the same time every day. I can't seem to find anything that is causing the issue.
I'm 100% stumped as to what could be causing this to happen. Any help is appreciated.
What probably happens is that your web servers run out of connections, so the ELB cannot perform health checks and takes them out of service. It's enough for one of the machines to experience this and be taken out of service, and the other will be killed as a cascading effect.
How many connections can the web servers hold at the same time?
Do you process a particularly "heavy request" at that point in time when this happens?
Does adding more web servers solve your problem?
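To put a number on "how many connections", Little's law gives a rough estimate: the average number of in-flight requests equals the arrival rate multiplied by the time each request spends in the system. The sketch below uses a hypothetical 100 req/s; only the 10 second latency comes from the spike reported in the question, and the 200 ms baseline is an assumption.

```python
import math


def connections_needed(requests_per_s, latency_ms):
    """Little's law: in-flight requests = arrival rate x time in system."""
    return math.ceil(requests_per_s * latency_ms / 1000)


# Hypothetical traffic of 100 req/s.
print(connections_needed(100, 200))     # 20 at an assumed 200 ms baseline
print(connections_needed(100, 10_000))  # 1000 during a 10 s latency spike
```

If the web servers' connection limit sits between those two figures, a latency spike alone is enough to exhaust it, leaving no free connection for the ELB health check.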