AWS - Abnormal Data Transfer OUT - amazon-web-services

I´m consistently being charged for a surprisingly high amount of data transfer out (from Amazon to Internet).
I looked into the Usage Reports of the past few months and found out that the Data Transfer Out was coming out of an Application Load Balancer (ALB) between the Internet and multiple nodes of my application (internal IPs).
Also noticed that DataTransfer-Out-Bytes is very close to the DataTransfer-In-Bytes in the same load balancer, which is weird (coincidence?). I was expecting the response to each request to be way smaller than the request itself.
So, I enabled flow logs in the ALB for a few minutes and found out the following:
Requests coming from the Internet (public IPs) in to ALB = ~0.47 GB;
Requests coming from ALB to application servers in the same availability zone = ~0.47 GB - ALB simply passing requests through to application servers, as expected. So, about the same amount of traffic.
Responses from application servers back into the same ALB = ~0.04 GB – As expected, responses generate way less traffic back into ALB. Usually a 1K request gets a simple “HTTP 200 OK” response.
Responses from ALB back to the external IP addresses => ~0.43 GB – this was mind-blowing. I was expecting ~0.04GB, the same amount received from the application servers.
Unfortunately, ALB does not allow me to use packet sniffers (e.g. tcpdump) to see that is actually coming in and out. Is there anything I´m missing? Any help will be much appreciated. Thanks in advance!
Ricardo.

I believe the next step in your investigation would be to enable ALB access logs and see whether you can correlate the "sent_bytes" in the ALB access log to either your Flow log or your bill.
For information on ALB access logs see: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html
There is more than one way to analyze the ALB access logs, but I've always been happy to use Athena, please see: https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using Route 53 domain and ssl on my url. I set the ALB redirect requests on port 80 to 443, and forward requests on port 443 to the target group (the EC2). However, the target group is returning 5xx error code sometimes when handling the request. Please see the screenshots for the metrics and configurations for the ALB.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteady, sometimes it returns a 502 or 503 service unavailable (seems like it's a connnection timeout).
I have set up the ALB idle timeout 4000 secs.
ALB configuration
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork worker Maxclient number as 1000, which should be enough to handle the requests on the application.
The EC2 is a t2.Large resource, the CPU and Memory look enough to handle the processing.
It seems like if I directly request the IP address but not the domain, the amount of 5xx errors significantly reduced (but still exists).
I also have Wordpress application host on this EC2 in a subdomain (CNAME). I have never encountered any 5xx errors on this subdomain site, which makes me guess there might be some errors in my application code but not on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 in the target group see if they can have at lease one healthy instance to handle the requests. However, the application is using a third-party API and has strict IP whitelist policy. I did some research that the Elastic IP I got from AWS cannot be attached to 2 different EC2s.
First of all, if your application is prone to stutters, increase healthcheck retries and timeouts, which will affect your initial question of flapping health.
To what I see from your screenshot, most of your 5xx are due to either server or application (you know obviously better what's the culprit since you have access to their logs).
To answer your question about 5xx errors coming from LB: this happens directly after LB kicks out unhealthy instance and if there's none to replace (which shouldn't be the case because you're supposed to have ASG if you enable evaluation of target health for LB), it can't produce meaningful output and thus crumbles with 5xx.
This should be enough information for you to make adjustments and logs investigation.

AWS Nat Gateway, wrong requests limits - high load - timeout

I did a load test for NAT Gateway in AWS. I reached a much lower requests-limit than described in docs. According to the docs the Nat is supposed to support ~900 requests per second, but with my configuration, I saw that ~0.04% of the requests are untreated when running ~300 requests per second.
I run node.js app using ECS cluster, and have the ability to configure requests per second. The NAT is working fine around 1 minute, and later my app starts to get timeouts for few requests.
AWS does not allow access to such machines, and the cloudwatch metrics seem fine.
In general, I am looking for a static ip solution that will withstand high loads. Does anyone here has experienced something similar?

ECS Fargate + Network Load Balancer Healthcheck

I'm experiencing an issue with the following setup:
API Gateway -> VPC Link -> Private NLB -> Target Group -> AWS ECS Fargate
If I setup the NLB's Health Check to be TCP/HTTP on a specified endpoint, that endpoint gets hammered to the death with internal request (no requests are coming through the API Gateway, I checked):
My problem with this behaviour, other than having the health's endpoint spammed by my own architecture is that the application's functionality is suffering (I keep getting slow responses 1 out of 4 get request to the API).
I tried to modify the Health Check's behaviour to only TCP, same slow responses.
I tried temporarily switching to a public ALB, I'm incurring in double health-checks, separated by 30 seconds but my application is responding with an average of 100 ms.
So, as an example of what I mean by "double health-checks":
Health Check 1.1 at 00:00:00
Health Check 2.1 at 00:00:10
Health Check 1.2 at 00:00:30
Health Check 2.2 at 00:00:40
Any ideas?
TL/DR;
Enable the "Cross-Zone Load Balancing" NLB flag.
The issue was the "cross-availability zone" not checked out.
It seems that when a request gets processed by a NLB-node which resides in a different AZ from the one that it is trying to be redirecting, it tries to internally resolve the IP in the AZ, if it fails, it redirects the request to another NLB-node in the appropriate AZ, which will be able to do so, hence reaching the target.

Application Load Balancer Behaving Badly

So last week we noticed a spike in our ELB 400 errors, and when we dove into the logs we discovered something strange that we can't figure out.
A request url formatted like below is hitting the Application Load Balancer 40 times a second consistently.
https://autodeploy-xxxxxxxx.us-east-1.elb.amazonaws.com:443
The IPs seem to come from Amazon themselves, which make sense since the above url is not how we send traffic to our ELB. and the ports the are coming over are not ports enabled on the ELB hence the 400 errors, but this is significantly increasing our costs as our Load Balancer Units have doubled because of it, and we haven't added any service that should act this way.
Sample IP: 34.207.214.113:21556
Any Ideas?

AWS CloudWatch Web Server Metrics

I have a few EC2 instances with NGINX installed using both ports 80 and 443. The instances are serving different applications so I'm not using an ELB.
I would like to create a CloudWatch alarm to make sure port 80 is always returning 200 HTTP status code. I realize there are several commercial solutions for this such as New Relic, etc, but this is the task I have at hand at the moment.
None of the EC2 metrics look to be able to accomplish this, and I cannot use any ELB metrics since I have no ELB.
What's the best way to resolve this?
You can definetly do this manually (send a request and update a metric directly sent to Cloudwatch). Monitor that metric.
Or you could look into Route53 health checks. You might get away with just configuring a health check there if you are already using Route53:
http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html
Create a Route53 Heath Check. Supported protocols are TCP, HTTP, and HTTPS.
The HTTP/S protocol supports matching the response payload against a user-defined string so you can not only react to connectivity problems but also to unexpected content being returned to users.
For a more advanced monitoring enable Latency metrics which collect TTFB (time to first byte) and SSL handshake times.
You can then create alarms to get alerts when one your apps becomes inaccessible.