Elastic Beanstalk reports 5xx errors even though instances are in perfect health

I need to set up an API application for gathering event data to be used in a recommendation engine. This is my setup:
Elastic Beanstalk env with a load balancer and autoscaling group.
I have 2x t2.medium instances running behind a load balancer.
The Elastic Beanstalk configuration is 64bit Amazon Linux 2016.03 v2.1.1 running Tomcat 8 Java 8.
Additionally, I have 8x t2.micro instances that I use for load testing the API, sending thousands of requests per second for it to handle.
I'm using Locust (http://locust.io/) as my load-testing tool.
Each t2.micro instance running Locust can send up to about 500 req/sec.
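For reference, the locustfile I run on each instance is roughly along these lines (simplified; the /events endpoint and payload are placeholders, and this uses the current Locust HttpUser API rather than the older HttpLocust one):

    from locust import HttpUser, task, constant

    class ApiUser(HttpUser):
        # each simulated user waits 1 second between requests, i.e. ~1 req/sec per user
        wait_time = constant(1)

        @task
        def post_event(self):
            # placeholder endpoint and payload for the event-collection API
            self.client.post("/events", json={"user_id": 123, "action": "click"})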
Everything works fine while the request rate is below 1,000, maybe 1,200 req/sec. Once over that, the load balancer reports that some of the instances behind it are returning 5xx errors (attached). I've also tried with 4 instances behind the load balancer, and although things start out well at up to 3,000 req/sec, soon after that the Elastic Beanstalk health tool and Locust both report 503s and 504s, while all of the instances are in perfect health according to the actual numbers in the Elastic Beanstalk Health Overview, showing only 10%-20% CPU utilization.
Is there something I'm missing in configuring the environment? It seems like no matter how many machines I have behind the load balancer, the environment handles no more than 1,000-2,000 requests per second.
EDIT:
Now I know for sure that it's the ELB that is causing the problems, not the instances.
I ran a load test starting with 10 simulated users. Each user sends about 1 req/sec and the load increases by 10 users/sec up to 4,000 users, which should come to about 4,000 req/sec. Still, the ELB doesn't seem to like any request rate over 3.5k req/sec (attachment1).
As you can see from attachment2, the 4 instances behind the load balancer are in perfect health, but I still keep getting 503 errors. It's just the load balancer itself causing problems. Look how SurgeQueueLength and SpilloverCount increase rapidly at some point. (attachment3) I'm trying to figure out why.
Also, I completely removed the load balancer and tested with just one instance on its own. It can handle up to about 3k req/sec (attachment4 and attachment5), so it's definitely the load balancer.
Maybe I'm missing some crucial limit that load balancers have by default, like the surge queue size of 1,024? What is the normal request rate one load balancer can handle? Should I be adding more load balancers? Could it be related to availability zones, with ELB listeners in one zone trying to route to instances in a different zone?
(Attachments 1-5, not included: the Locust results showing the ~3.5k req/sec ceiling, the health overview of the 4 instances, the SurgeQueueLength/SpilloverCount graphs, and the single-instance test results.)
UPDATE:
Cross zone load balancing is enabled
UPDATE:
Maybe this helps more:

The message says that "9.8 % of the requests to the ELB are failing with HTTP 5xx (6 minutes ago)". This does not mean that your instances are returning HTTP 5xx responses. The requests are failing at the ELB itself. This can happen when your backend instances are at capacity (e.g. their connections are saturated and they are rejecting new connections from the ELB).
Your requests are spilling over at the ELB. They never make it to the instance. If they were failing at the EC2 instances then the cause would be different and data for the environment would match the data for the instances.
Also note that the cause says that this was the state "6 minutes ago". Elastic Beanstalk uses multiple data sources: one is the data coming from the instances, which shows the requests per second and HTTP status codes in the table shown. Another data source is the CloudWatch metrics for your ELB. Since CloudWatch metrics for ELB are published at 1-minute intervals, this data is slightly delayed, and the cause tells you how old the information is.
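If you want to confirm the spillover from the raw metrics yourself, here is a rough boto3 sketch (the load balancer name and region are placeholders) that pulls the two classic-ELB metrics that show it, SurgeQueueLength and SpilloverCount:

    import boto3
    from datetime import datetime, timedelta

    cw = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    def elb_metric(metric_name, stat):
        # classic ELB metrics live in the AWS/ELB namespace at 1-minute resolution
        return cw.get_metric_statistics(
            Namespace="AWS/ELB",
            MetricName=metric_name,
            Dimensions=[{"Name": "LoadBalancerName", "Value": "awseb-my-elb"}],  # placeholder
            StartTime=datetime.utcnow() - timedelta(minutes=30),
            EndTime=datetime.utcnow(),
            Period=60,
            Statistics=[stat],
        )["Datapoints"]

    # SurgeQueueLength caps at 1024; SpilloverCount counts requests rejected with 503
    print(sorted(elb_metric("SurgeQueueLength", "Maximum"), key=lambda d: d["Timestamp"]))
    print(sorted(elb_metric("SpilloverCount", "Sum"), key=lambda d: d["Timestamp"]))

If SpilloverCount is non-zero at the same timestamps as the 503s, the requests were rejected by the ELB before they ever reached Tomcat.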

Related

Why does deploying a Fargate service behind a load balancer take more time than the service on its own?

I'm building a basic architecture with some backend and frontend Docker containers.
I started by deploying a backend container: I defined a task with AWS Fargate and configured its own service. Deployments took about 2 minutes, but now that I have a service behind a Network Load Balancer with a target group targeting my container, a deploy takes about 6-10 minutes. Is that normal?
This deploy time is not a big problem, but while I'm studying this technology it feels a little slow (compared to a Kubernetes cluster, for example). My plan is to build a bigger architecture with backend, frontend, DB and auto-scalers, just for learning.
Without your configuration (CloudFormation template) or some screenshots it's hard to say exactly what is causing this, but I can point you to some areas that commonly cause longer deployments.
You possibly have the default health check interval of 30 seconds. This means the target group waits 30 seconds before carrying out another check on the TCP port.
You can reduce this to 10 seconds to speed up health checks (see the sketch after these points).
You possibly have the default healthy threshold of 3. Coupled with the default check interval, this adds around 90 seconds before a container stabilises and returns a healthy status.
You may be deploying a single container without cross-zone load balancing enabled, while using the round-robin routing algorithm. This setup can fail a health check, after which the threshold count starts again and you need 3 successes in a row to return a healthy status.
Lastly, because Fargate has no image caching, the image pull time is a bit longer than on ECS backed by EC2 instances, which can also add to the deployment time.
With all of the above configured correctly you can usually shave a few minutes off a Network Load Balancer target group deployment, but they are still usually a bit slower than Application Load Balancer deployments.
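As a rough illustration of the interval and threshold points above, this is how the target group health check could be tightened with boto3 (the target group ARN is a placeholder, and depending on the target group and protocol some of these settings may only be adjustable at creation time):

    import boto3

    elbv2 = boto3.client("elbv2")

    # shorten the interval and lower the thresholds so new tasks are marked
    # healthy sooner (10s * 2 checks instead of the default 30s * 3 checks)
    elbv2.modify_target_group(
        TargetGroupArn="arn:aws:elasticloadbalancing:region:123456789012:targetgroup/my-tg/abc123",  # placeholder
        HealthCheckIntervalSeconds=10,
        HealthyThresholdCount=2,
        UnhealthyThresholdCount=2,
    )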
If your app does not require a Network Load Balancer, it's recommended to use an Application Load Balancer. An Application Load Balancer goes much deeper and can determine availability based not only on a successful HTTP GET of a particular page but also on verifying that the content is as expected based on the input parameters.

AWS Elastic Beanstalk restarts Docker if CPU is at 100% for a long time

We have Elastic Beanstalk set up with load balancing. When our app consumes 100% CPU for a longer time (e.g. after some downtime, when we receive tons of webhooks), the load balancer restarts Docker inside the instance. Our app takes approximately 2 minutes to start, so we can never recover from the downtime.
Is there any way to extend this restart period or even disable it?
Scaling using CPU threshold is not an option for us as our app consumes lots of CPU during higher load.
This seems like a case of a failed health check.
You can go to your EC2 Dashboard => Load Balancers.
Check the load balancer that targets your EB environment; under the Health Check tab you can see and edit the threshold of failed ping requests to your instance before it is considered unhealthy and terminated.
More information on health checks here and here
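If you prefer to do the same thing from code rather than the console, here is a hedged boto3 sketch (the load balancer name and health check target are placeholders; the values just illustrate making the check more tolerant):

    import boto3

    elb = boto3.client("elb")  # classic ELB, which this Beanstalk environment uses

    # raise UnhealthyThreshold and keep a generous interval so a long 100%-CPU
    # spell does not get the instance marked unhealthy and its Docker restarted
    elb.configure_health_check(
        LoadBalancerName="awseb-my-env",          # placeholder ELB name
        HealthCheck={
            "Target": "HTTP:80/health",           # placeholder health check URL
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 10,
            "HealthyThreshold": 3,
        },
    )

In Elastic Beanstalk it is usually cleaner to set these values through the aws:elb:healthcheck namespace in an .ebextensions config, so they survive environment rebuilds.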
Increasing the instance size from small to medium actually solved my problem. It seems the app could not handle this amount of load with the limited resources of the small instance type.

How can I get useful load testing data for my AWS server?

I have a system set up on AWS where I have a set of EC2 instances (as application servers in an Elastic Beanstalk environment) running in an auto-scaling, load-balanced environment. All this works fine.
I would like to load test this environment in order to obtain results that help me figure out what more needs to be done to the system for it to handle, potentially, millions of users. I have used a tool called Locust (http://locust.io) so far to do this. This allows me to send requests to my instance(s?) through a proxy as desired. However, I cannot tell whether the requests are being routed to multiple instances or to the same one constantly; and if they are being load balanced appropriately, I can't see how many requests each of the EC2 instances is receiving or their health under load. (I have a feeling that the requests are not being properly load balanced, as the failure rate always seems to increase drastically at a similar point every test run.)
Is there a way to get this information from the AWS EC2 or Elastic Beanstalk consoles, or is there a better distributed web-based load testing tool that can provide the data I need?
There are two ways to get this information:
1) Create an S3 bucket and enable ELB access logging so the logs are saved to it. You can filter these logs to check which instance served each request (a small parsing sketch follows below).
2) Retrieve application-level logs: if Apache/nginx is installed on your EC2 instances to serve the requests, filter the Apache/nginx logs on every machine.
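Once the ELB access logs are being delivered to the S3 bucket, option 1) might look roughly like this in Python (assuming the log files have already been downloaded locally; in a classic ELB access log line, field 3 is the backend ip:port):

    from collections import Counter
    import glob

    backend_hits = Counter()
    for path in glob.glob("elb-logs/*.log"):          # placeholder local directory
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) > 3:
                    backend_hits[fields[3]] += 1      # backend ip:port that served the request

    # shows how evenly the load balancer spread requests across instances
    for backend, count in backend_hits.most_common():
        print(backend, count)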
Hope it helps !!
There is a way to get this data from the AWS console.
Inside the Elastic Beanstalk console there is a tab titled Health. This tab (in the enhanced health overview) shows the number of requests per second, the response codes for those requests, the latency, the load average and the CPU utilisation for each EC2 instance run by the Elastic Beanstalk environment.
This data allows the system manager to see which of their back-end instances are receiving requests, and how many requests each is being sent through the load balancer and proxy.
This can also be attained from the AWS CLI using:
eb health environment_name
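Roughly the same data is also available programmatically through the enhanced-health API, for example with boto3 (the environment name is a placeholder, and enhanced health must be enabled on the environment):

    import boto3

    eb = boto3.client("elasticbeanstalk")

    # per-instance enhanced health: status, request counts, status codes, latency, CPU/load
    resp = eb.describe_instances_health(
        EnvironmentName="my-env",        # placeholder
        AttributeNames=["All"],
    )
    for inst in resp["InstanceHealthList"]:
        print(inst["InstanceId"], inst["HealthStatus"], inst.get("ApplicationMetrics", {}))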

AWS ELB 502 at the same time every day

First, some insight into my setup:
1 ELB
4 EC2 instances
2 web servers
1 to run the migrations, queue (beanstalkd) and scheduler
1 'services' server (socket.io instance etc etc)
MySQL on RDS
Redis on Elasticache
S3 for user assets
Every day at 10:55 PM, users report getting white screens and 502 Bad Gateway errors. The ELB reports that both web server instances are OutOfService, yet I'm SSH'd into them and fully able to use the site by bypassing the ELB. The RDS and ElastiCache maintenance windows aren't during this period, and the two instances aren't under load either. I can't find anything in the ELB access logs, nothing in the nginx logs on the instance end, and nothing in the Laravel app logs. There's nothing in the Laravel scheduler that runs at this time either.
The only thing I've found is that in my CloudWatch metrics the ELB latency spikes right up to about 5-10 seconds. All this results in downtime of about 5-15 minutes at the same time every day. I can't seem to find anything that is causing the issue.
I'm 100% stumped as to what could be causing this to happen. Any help is appreciated.
What probably happens is that your web servers run out of connections, so the ELB cannot perform its health checks and takes them out of service. It's actually enough for one of the machines to experience this and be taken out of service; the other will then go down as a cascading effect.
How many connections can the web servers hold at the same time?
Do you process a particularly "heavy request" at that point in time when this happens?
Does adding more web servers solve your problem?
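One hedged way to narrow it down is to pull the ELB metrics for the window around 22:55 and see whether HealthyHostCount drops before or after the latency spike (the load balancer name, region and date below are placeholders):

    import boto3
    from datetime import datetime

    cw = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region

    def metric(name, stat):
        # window around the nightly 22:55 incident, sorted by time
        return sorted(cw.get_metric_statistics(
            Namespace="AWS/ELB",
            MetricName=name,
            Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],  # placeholder
            StartTime=datetime(2016, 1, 1, 22, 45),   # placeholder date
            EndTime=datetime(2016, 1, 1, 23, 15),
            Period=60,
            Statistics=[stat],
        )["Datapoints"], key=lambda d: d["Timestamp"])

    for point in metric("HealthyHostCount", "Minimum"):
        print(point["Timestamp"], point["Minimum"])
    for point in metric("Latency", "Maximum"):
        print(point["Timestamp"], point["Maximum"])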

Load balancer for PHP application

Questions about load balancers if you have time.
So I've been using AWS for some time now. Super basic instances, using them to do some tasks whenever I needed something done.
I have a task that needs to be load balanced now. It's not a public service though. It's pretty much a giant cron job that I don't want running on the same servers as my website.
I set up an AWS load balancer, but it doesn't do what I expected it to do.
It gets stuck on one server and doesn't load balance at all. I've read why it does this, and that's all fine and well, but I need it to be a proper round-robin load balancer.
edit:
I've set up the instances in different zones, but no matter how many instances I add to the ELB, it just uses one. If I take that instance down, it switches to a different one, so I know it's working. But I would really like it to rotate between them under every circumstance.
I know there are alternatives. Here's my question(s):
Would a custom PHP load balancer be an OK option for now?
I.e.: have a list of servers and have PHP randomly select an EC2 instance. It wouldn't be scalable at all, but at least I could set this up in 2 minutes and it would work for now.
or
Should I take the time to learn how HAProxy works, and set that up in place of the AWS ELB?
or
Am I doing it wrong, and AWS's ELB does do round robin and I just have something configured wrong?
edit:
Structure:
1) Web server finds a task to do.
2) If it's too large it sends it off to AWS (to load balancer).
3) Do the job on EC2
4) Report back via curl to an API
5) Rinse and repeat
Everything works great. But because the connections always come from my server (one IP), they get stuck to a single EC2 machine.
ELB works well for sites whose load increases gradually. If you are expecting an unusually sudden increase in load, you can ask AWS to pre-warm it for you.
I can tell you I have used ELB in different scenarios and it has always worked well for me. As you didn't provide much information about your architecture, I would bet that ELB can work for you. About the case where all connections are hitting only one server, I would ask you (a quick check script follows the questions below):
1) Did you check the ELB to see how many instances are behind it?
2) Are all of the instances you have behind the ELB alive?
3) Are you accessing your application through the ELB DNS?
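For points 1) and 2), here is a quick boto3 check for a classic ELB (the load balancer name is a placeholder):

    import boto3

    elb = boto3.client("elb")

    # how many instances are registered behind the ELB
    lb = elb.describe_load_balancers(
        LoadBalancerNames=["my-load-balancer"]       # placeholder
    )["LoadBalancerDescriptions"][0]
    print("registered:", [i["InstanceId"] for i in lb["Instances"]])

    # and whether they are all actually InService
    for state in elb.describe_instance_health(
        LoadBalancerName="my-load-balancer"
    )["InstanceStates"]:
        print(state["InstanceId"], state["State"], state.get("Description", ""))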
Anyway, here is an excerpt from an excellent article that makes a very good comparison between ELB and HAProxy: http://harish11g.blogspot.com.br/2012/11/amazon-elb-vs-haproxy-ec2-analysis.html
ELB provides Round Robin and Session Sticky algorithms based on EC2 instance health status. HAProxy provides a variety of algorithms like Round Robin, Static-RR, Least Connection, source, uri, url_param etc.
Hope this helps.
This point comes as a surprise to many users using Amazon ELB. Amazon ELB behaves a little strangely when incoming traffic originates from single or specific IP ranges: it does not efficiently do round robin and sticks the requests. Amazon ELB starts favoring a single EC2, or EC2s in a single Availability Zone alone in Multi-AZ deployments, during such conditions. For example: if you have Application A (customer company) and Application B, and Application B is deployed inside AWS infrastructure with an ELB front end, and all the traffic generated from Application A (a single host) is sent to Application B in AWS, then in this case the ELB of Application B will not efficiently round robin the traffic to the web/app EC2 instances deployed under it. This is because the entire incoming traffic from Application A will be from a single firewall/NAT or a specific IP range of servers, and ELB will start unevenly sticking the requests to a single EC2 or EC2s in a single AZ.
Note: users usually encounter this during load tests, so it is ideal to load test AWS infrastructure from multiple distributed agents.
More info at point 9 in the following article: http://harish11g.blogspot.in/2012/07/aws-elastic-load-balancing-elb-amazon.html
HAProxy is not hard to learn and is tremendously lightweight yet flexible. I actually use HAProxy behind ELB for the best of both worlds: the hardened, managed, hands-off reliability of ELB facing the Internet and terminating SSL, and the flexible configuration of HAProxy that lets me fine-tune how requests hit my servers. I've never lost an HAProxy instance yet, but if I do, ELB will just take that one out of rotation... as I have seen happen when the back-end servers have all become inaccessible, which (because of the way it's configured) makes ELB think the HAProxy is unhealthy, but that's by design in my setup.