Application Load Balancer Behaving Badly - amazon-web-services

So last week we noticed a spike in our ELB 400 errors, and when we dove into the logs we discovered something strange that we can't figure out.
A request url formatted like below is hitting the Application Load Balancer 40 times a second consistently.
https://autodeploy-xxxxxxxx.us-east-1.elb.amazonaws.com:443
The IPs seem to come from Amazon themselves, which make sense since the above url is not how we send traffic to our ELB. and the ports the are coming over are not ports enabled on the ELB hence the 400 errors, but this is significantly increasing our costs as our Load Balancer Units have doubled because of it, and we haven't added any service that should act this way.
Sample IP: 34.207.214.113:21556
Any Ideas?

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using Route 53 domain and ssl on my url. I set the ALB redirect requests on port 80 to 443, and forward requests on port 443 to the target group (the EC2). However, the target group is returning 5xx error code sometimes when handling the request. Please see the screenshots for the metrics and configurations for the ALB.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteady, sometimes it returns a 502 or 503 service unavailable (seems like it's a connnection timeout).
I have set up the ALB idle timeout 4000 secs.
ALB configuration
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork worker Maxclient number as 1000, which should be enough to handle the requests on the application.
The EC2 is a t2.Large resource, the CPU and Memory look enough to handle the processing.
It seems like if I directly request the IP address but not the domain, the amount of 5xx errors significantly reduced (but still exists).
I also have Wordpress application host on this EC2 in a subdomain (CNAME). I have never encountered any 5xx errors on this subdomain site, which makes me guess there might be some errors in my application code but not on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 in the target group see if they can have at lease one healthy instance to handle the requests. However, the application is using a third-party API and has strict IP whitelist policy. I did some research that the Elastic IP I got from AWS cannot be attached to 2 different EC2s.
First of all, if your application is prone to stutters, increase healthcheck retries and timeouts, which will affect your initial question of flapping health.
To what I see from your screenshot, most of your 5xx are due to either server or application (you know obviously better what's the culprit since you have access to their logs).
To answer your question about 5xx errors coming from LB: this happens directly after LB kicks out unhealthy instance and if there's none to replace (which shouldn't be the case because you're supposed to have ASG if you enable evaluation of target health for LB), it can't produce meaningful output and thus crumbles with 5xx.
This should be enough information for you to make adjustments and logs investigation.

502 - failed_to_connect_to_backend from LB on HEALTHY target group

I have some global Load balancer on the GCP. This balancer should send requests to the instance group with two back services.
And when I try to send some requests, I randomly get 502 errors failed_to_connect_to_backend from my load balancer.
I can get a successful answer seven times, one by one, and then 2-3 times 502 error for the same request.
In the
Monitoring Dashboard I see this - my both services are healthy.
The Instanse groups overwiev shows 100% healthy status too.
URL map rules is default default
I also don`t see any problems with resource consumption
And, unfortunately, I couldn't get any logs from the back-end side for the 502 errors, have only logs from the Load Balancer
After hours of coffee and liters of manuals (I'm not very well versed in GCP yet) the "problem" is solved - at some point I noticed that the execution time of all failed requests is ~ 9 seconds.
Therefore, I tried to search for results with similar symptoms, as a result I found a answer on the Google Groups
In my case - we have trobles with port mapping(was used two ports in mapping - like 80, 6000. And 80 - was not listening from the backend side)
After removing closed - 502s gone away.
If port 80 was not allowed on the firewall rule applied on the backend instances?

AWS - Abnormal Data Transfer OUT

I´m consistently being charged for a surprisingly high amount of data transfer out (from Amazon to Internet).
I looked into the Usage Reports of the past few months and found out that the Data Transfer Out was coming out of an Application Load Balancer (ALB) between the Internet and multiple nodes of my application (internal IPs).
Also noticed that DataTransfer-Out-Bytes is very close to the DataTransfer-In-Bytes in the same load balancer, which is weird (coincidence?). I was expecting the response to each request to be way smaller than the request itself.
So, I enabled flow logs in the ALB for a few minutes and found out the following:
Requests coming from the Internet (public IPs) in to ALB = ~0.47 GB;
Requests coming from ALB to application servers in the same availability zone = ~0.47 GB - ALB simply passing requests through to application servers, as expected. So, about the same amount of traffic.
Responses from application servers back into the same ALB = ~0.04 GB – As expected, responses generate way less traffic back into ALB. Usually a 1K request gets a simple “HTTP 200 OK” response.
Responses from ALB back to the external IP addresses => ~0.43 GB – this was mind-blowing. I was expecting ~0.04GB, the same amount received from the application servers.
Unfortunately, ALB does not allow me to use packet sniffers (e.g. tcpdump) to see that is actually coming in and out. Is there anything I´m missing? Any help will be much appreciated. Thanks in advance!
Ricardo.
I believe the next step in your investigation would be to enable ALB access logs and see whether you can correlate the "sent_bytes" in the ALB access log to either your Flow log or your bill.
For information on ALB access logs see: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html
There is more than one way to analyze the ALB access logs, but I've always been happy to use Athena, please see: https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/

Fixing AWS Application Load Balancer connection timeouts under small loads

I'm currently using an application load balancer to split traffic between 3 instance.
Testing the individual instance (connection via IP), they are able to handle about ~200req/second without any connection timeout (timeout being set a 5 seconds).
As such, I'd assume that load balancing over all 3 of them would scale this up to ~600req/second (there's no bottleneck further down the pipe to stop this).
However, when sending the exact same type of test requests to an application load balancer, the connections start to time out before I even hit 100req/second.
I already eliminated the possibly of a low down due to HTTPS (by just sending the requests over HTTP), the instance themselves are healthy and not under heavy use an the load balancer reports no "errors".
I've also configured IP stickiness for ~20 minutes to try and improve the situation but it hasn't helped one bit.
What could be the cause of this problem ? I found no information about increasing the network capacity of LB on aws and no similar questions, so I'm bound to be doing something wrong, but I'm quite unsure what that something is.

Elastic Beanstalk reports 5xx errors even though instances are in perfect health

I need to set up an api application for gathering event data to be used in a recommendation engine. This is my setup:
Elastic Beanstalk env with a load balancer and autoscaling group.
I have 2x t2.medium instances running behind a load balancer.
EBS configuration is 64bit Amazon Linux 2016.03 v2.1.1 running Tomcat 8 Java 8
Additionally I have 8x t2.micro instances that I use for high-load testing the api, sending thousands of requests/sec to be handled by the api.
Im using Locust (http://locust.io/) as my load testing tool.
Each t2.micro instance that is run by Locust can send up to about 500req/sec
Everything works fine while the reqs/sec are below 1000, maybe 1200. Once over that, my load balancer reports that some of the instances behind it are reporting 5xx errors (attached). I've also tried with 4 instances behind the load balancer, and although things start out well with up to 3000req/sec, soon after, the ebs health tool and Locust both report 503s and 504s, while all of the instances are in perfect health according to the actual numbers in the ebs Health Overview, showing only 10%-20% CPU utilization.
Is there smth I'm missing in configuring the env? It seems like no matter how many machines I have behind the load balancer, the env handles no more than 1000-2000 requests per second.
EDIT:
Now I know for sure that it's the ELB that is causing the problems, not the instances.
I ran a load test with 10 simulated users. Each user sends about 1req/sec and the load increases by 10 users/sec to 4000 users, which should equal to about 4000req/sec. Still it doesn't seem to like any request rate over 3.5k req/sec (attachment1).
As you can see from attachment2, the 4 instances behind the load balancer are in perfect health, but I still keep getting 503 errors. It's just the load balancer itself causing problems. Look how SurgeQueueLength and SpilloverCount increase rapidly at some point. (attachment3) I'm trying to figure out why.
Also I completely removed the load balancer and tested with just one instance alone. It can handle up to about 3k req/sec. (attachment4 and attachment5), so it's definitely the load balancer.
Maybe I'm missing some crucial limit that load balancers have by default, like the queue size of 1024? What is normal handle rate for 1 load balancer? Should I be adding more load balancers? Could it be related to availability zones? ELB listeners from one zone are trying to route to instances from a different zone?
attachment1:
attachment2:
attachment3:
attachment4:
attachment5:
UPDATE:
Cross zone load balancing is enabled
UPDATE:
maybe this helps more:
The message says that "9.8 % of the requests to the ELB are failing with HTTP 5xx (6 minutes ago)". This does not mean that your instances are not returning HTTP 5xx responses. The requests are failing at the ELB itself. This can happen when your backend instances are at capacity (e.g. connections are saturated and they are rejecting connections to the ELB).
Your requests are spilling over at the ELB. They never make it to the instance. If they were failing at the EC2 instances then the cause would be different and data for the environment would match the data for the instances.
Also note that the cause says that this was the state "6 minutes ago". Elastic Beanstalk multiple data sources - one is the data coming from the instance which shows the requests per second and HTTP status codes in the table shown. Another data source is cloudwatch metrics for your ELB. Since cloudwatch metrics for ELB are 1 minute, this data is slightly delayed and the cause tells you how old the information is.