AWS: Understanding Response Times and ELB

I'm attempting to profile some API requests for my application in an effort to isolate any potential bottlenecks. However, I'm seeing a slight discrepancy in one portion of my logging, so I'm trying to make sure I'm not missing anything.
The Basic Setup
AWS Classic ELB (cross-zone enabled), balancing to 2 AWS t2.micro instances (staging servers)
The EC2 boxes are running PHP 7.1 and Nginx.
There's an SSL certificate, but it terminates at the ELB, so the ELB connects to the backend instances via HTTP on port 80.
The instances are in a private subnet and are connected to the internet via a NAT Gateway
Gathering Data
I have request times being logged within the application itself, which are sent to Loggly (once the request hits the application code we get a start time, and once the response has been sent by the application we get an end time)
AWS ELB Access Logs
CURL requests to a given endpoint with total_time enabled
The Results (for a given endpoint)
Application Request Time
I'm seeing Application request times of 0.05-0.08 seconds. This is the data being sent to Loggly.
AWS ELB Access Logs
These logs provide 3 numbers:
Request Processing Time: 0.00004 seconds
Backend Processing Time: 0.05 - 0.09 seconds
Response Processing Time: 0.00004 seconds
CURL
Now, running a CURL command that looks roughly like this:
curl -s -w "%{time_total}\n" -o /dev/null www.google.com
I'm getting total_time results of anywhere from ~0.17 to 0.5 seconds
So, it seems that the Application logs and the ELB Access Logs show similar results. However, the overall request time via CURL can be ~3x - 6x slower.
If that's the case, what might cause this? My first thought is that it's taking time to connect to the ELB, or maybe we're getting held up on the DNS resolution.
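To separate those possibilities, curl can report each phase of the request. A minimal sketch (the endpoint URL is a placeholder) that breaks the total into DNS lookup, TCP connect, TLS handshake, time to first byte, and total time:

curl -s -o /dev/null \
  -w "dns=%{time_namelookup} tcp=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n" \
  https://api.example.com/endpoint

If dns= dominates, resolution is the culprit; if tcp= or tls= dominates, the time is going into connection setup with the ELB.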
I then tried some basic tests with Apache Benchmark, and that seems to confirm some sort of connection issue. The connection times are ~0.07 - 0.5 seconds.
However, because those times don't show up in the AWS ELB Access Logs, I'm led to believe this has something to do with DNS resolution. If so, how might I investigate this further?
Currently, my domain is registered with LiquidWeb and has a CNAME record that points to the ELB DNS name provided by AWS.
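As a quick check on the DNS leg, dig can show both the CNAME chain and the resolver's query time (the domain is a placeholder):

dig +noall +answer +stats www.example.com

The "Query time" line is the resolution cost; running the query twice should show it drop sharply once the record is cached, which would point away from DNS as the steady-state bottleneck.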

Related

EC2 Instance giving better output than ELB

We have a cluster of 3 EC2 instances. A single EC2 instance is able to serve around 500 concurrent users of the application, but when the same instance is put behind the ELB, it does not even serve 250 users. We drilled in further and applied the below configuration at the different ends:
Optimized code to respond in less time.
ELB is set with a 300 sec timeout for all responses & healthy/unhealthy checks.
Apache on EC2 is set with 600 as the timeout value & keep-alive is set to true.
ELB is routing requests with equal-distribution logic.
Every time we hit it with higher load (500 on the cluster), we end up getting some failures with 504 Gateway Timeout errors. Kindly help with a solution to get more optimal output.
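For reference, the Classic ELB idle timeout mentioned above can be verified and adjusted from the AWS CLI; a minimal sketch, assuming a load balancer named my-elb (hypothetical):

# show current attributes, including ConnectionSettings.IdleTimeout
aws elb describe-load-balancer-attributes --load-balancer-name my-elb
# set the idle timeout to 300 seconds to match the intended config
aws elb modify-load-balancer-attributes --load-balancer-name my-elb \
  --load-balancer-attributes "ConnectionSettings={IdleTimeout=300}"

The idle timeout only closes connections with no data in flight, so a 504 under load usually means the backend exhausted its workers or took longer than the timeout to start responding.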

AWS EC2 instance fails consistently at 30 seconds on long page load

I am running an ECS instance on EC2 with an application load balancer, a Route 53 domain, and an RDS db. This is an internal business application that I have restricted IP access to.
I have run this app for 3 weeks with no issues. However, today the data that the web app ingests is abnormally large. This is not a mistake. Because of this, though, a webpage takes approximately 4 minutes to complete, which I verified it does on my local machine. However, running the same operation on AWS fails at precisely 30 seconds every time.
I have connected the app running on my local machine to my production RDS db and am able to download and upload the data with no issue, so there is no issue with the RDS db. In addition, this same functionality has worked previously and only failed today due to the large amount of data.
I spent hours with Amazon support trying to solve this issue, but we couldn't figure it out. I am assuming it is a setting for one of the AWS services I am using that has a TTL or timeout set to 30 seconds, but I couldn't find it in any of the services I am using:
route53
RDS
ECS
ECR
EC2
Load Balancer
Target Group
You have a backend instance timeout, likely in the web server config.
Right now your ELB has a timeout of 60 seconds, but your requests are failing at 30.
There are only a couple of things on AWS with hardcoded timeouts like that. I'm thinking (because this is the first time it's happened) you have one of the following:
Size limits in the upstream, or
Time limits on connection keep-alive
Look at your web server software (httpd/nginx). Nginx has something called "upstream.conf" where you can set upstream timeouts; I'm not sure if httpd does as well.
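As an illustration of the nginx side (not a drop-in fix; the include path and values are assumptions), the standard proxy timeout directives can be raised and reloaded from the shell:

# write proxy timeouts into an nginx include, then test and reload
cat <<'EOF' | sudo tee /etc/nginx/conf.d/proxy-timeouts.conf
proxy_connect_timeout 60s;   # time allowed to establish the upstream connection
proxy_send_timeout    300s;  # max gap between two successive writes to the upstream
proxy_read_timeout    300s;  # max gap between two successive reads from the upstream
EOF
sudo nginx -t && sudo systemctl reload nginx

If the web server is the culprit, proxy_read_timeout (or an application-level limit) is the usual suspect rather than the load balancer.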
Resources:
https://serverfault.com/questions/414987/nginx-proxy-timeout-while-uploading-big-files
From the NLB documentation, maybe relevant
EC2 instances must respond to a new request within 30 seconds in order to establish a return path.
I don't actually know what a 'return path' is, nor what a 'response' is in this context, since NLB has no concept of requests or responses.
- https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
EDIT: Disregard, this must have to do with UDP NAT. 'Response' here is probably a packet going back from the EC2 instance to the client.

AWS - Abnormal Data Transfer OUT

I'm consistently being charged for a surprisingly high amount of data transfer out (from Amazon to the Internet).
I looked into the Usage Reports of the past few months and found out that the Data Transfer Out was coming out of an Application Load Balancer (ALB) between the Internet and multiple nodes of my application (internal IPs).
Also noticed that DataTransfer-Out-Bytes is very close to the DataTransfer-In-Bytes in the same load balancer, which is weird (coincidence?). I was expecting the response to each request to be way smaller than the request itself.
So, I enabled flow logs in the ALB for a few minutes and found out the following:
Requests coming from the Internet (public IPs) into the ALB = ~0.47 GB;
Requests coming from ALB to application servers in the same availability zone = ~0.47 GB - ALB simply passing requests through to application servers, as expected. So, about the same amount of traffic.
Responses from application servers back into the same ALB = ~0.04 GB – As expected, responses generate way less traffic back into ALB. Usually a 1K request gets a simple “HTTP 200 OK” response.
Responses from ALB back to the external IP addresses = ~0.43 GB – this was mind-blowing. I was expecting ~0.04 GB, the same amount received from the application servers.
Unfortunately, ALB does not allow me to use packet sniffers (e.g. tcpdump) to see what is actually coming in and out. Is there anything I'm missing? Any help will be much appreciated. Thanks in advance!
Ricardo.
I believe the next step in your investigation would be to enable ALB access logs and see whether you can correlate the "sent_bytes" in the ALB access log with either your flow log or your bill.
For information on ALB access logs see: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html
There is more than one way to analyze the ALB access logs, but I've always been happy using Athena; please see: https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/
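As a rough cross-check before setting up Athena, you can pull a slice of the access logs from S3 and total received_bytes versus sent_bytes with awk. The bucket, account ID, and date prefix below are placeholders; fields 11 and 12 are received_bytes and sent_bytes in the documented ALB access log format:

aws s3 cp s3://my-alb-logs/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2020/01/15/ . --recursive
gunzip -c *.log.gz | awk '{ rx += $11; tx += $12 } END { printf "received=%d bytes, sent=%d bytes\n", rx, tx }'

If sent_bytes in the access logs stays near the ~0.04 GB you saw from the application servers, the extra outbound traffic is happening below HTTP (TLS handshakes, TCP overhead, retransmits), which the flow logs would count but the access logs would not.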

ECS Fargate + Network Load Balancer Healthcheck

I'm experiencing an issue with the following setup:
API Gateway -> VPC Link -> Private NLB -> Target Group -> AWS ECS Fargate
If I set up the NLB's health check to be TCP/HTTP on a specified endpoint, that endpoint gets hammered to death with internal requests (no requests are coming through the API Gateway, I checked).
My problem with this behaviour, other than having the health endpoint spammed by my own architecture, is that the application's functionality is suffering (I keep getting slow responses on 1 out of 4 GET requests to the API).
I tried modifying the health check's behaviour to TCP only; same slow responses.
I tried temporarily switching to a public ALB; I incur double health checks, separated by 30 seconds, but my application responds in an average of 100 ms.
So, as an example of what I mean by "double health-checks":
Health Check 1.1 at 00:00:00
Health Check 2.1 at 00:00:10
Health Check 1.2 at 00:00:30
Health Check 2.2 at 00:00:40
Any ideas?
TL;DR:
Enable the "Cross-Zone Load Balancing" NLB flag.
The issue was that cross-zone load balancing was not enabled.
It seems that when a request is processed by an NLB node residing in a different AZ from the target it needs to reach, the node first tries to resolve the target IP within its own AZ; if that fails, it redirects the request to another NLB node in the appropriate AZ, which is able to resolve it, hence reaching the target.
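For reference, the same flag can be flipped from the CLI; a sketch, with the load balancer ARN as a placeholder:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/0123456789abcdef \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true

With cross-zone enabled, every NLB node can forward to targets in any AZ, so requests no longer depend on landing on the node in the right zone.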

HTTP 504 errors returned by ELB even when hosts are healthy and able to serve request

I have a service deployed on Amazon Web Services (AWS), specifically 2 instances behind an Elastic Load Balancer (ELB). All three availability zones (us-west-2a, b, c) are selected, but only 2 of the 3 zones have instances running in them.
The issue is that even though the traffic/load is not too high, I still get HTTP 504 errors from the ELB often enough.
The log lines read like this:
-1 -1 -1 504 0 0 0
In order: request_processing_time, backend_processing_time, response_processing_time, elb_status_code, backend_status_code, received_bytes, sent_bytes.
A description of what each field and response means can be found here
The ELB idle timeout is 60 seconds. KeepAlive is 'On' on the backend instances. The latency of requests through the ELB is in check. I have tried increasing KeepAliveTimeout, but to no avail.
Does anyone have any idea about how to proceed? I don't even know the root cause of this issue.
PS: More of a second question: there are a few cases (much rarer than the ELB returning 504 when the backend does not even accept the request) where the backend itself returns a 504 and the ELB forwards it to the client. To the best of my knowledge, HTTP 504 should be returned by a proxy only when the backend is timing out. How can a server itself return a 504?
So that it might assist others in the future, I am publishing my findings here:
1) These 504 errors (with backend status 0) were mainly caused by logrotate reloading Apache instead of restarting it gracefully.
The current AWS config does the following
/sbin/service httpd reload > /dev/null 2>/dev/null || true
so replace the service command with either apachectl -k graceful or /sbin/service httpd graceful
File location on my ec2 instance: /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
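A minimal sketch of the edit, based on the commands above (the surrounding logrotate stanza is omitted):

# before: /sbin/service httpd reload   > /dev/null 2>/dev/null || true
# after (a graceful restart lets in-flight requests finish instead of dropping them):
/sbin/service httpd graceful > /dev/null 2>/dev/null || true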
2) Because the default logrotate frequency in AWS was too high (once every hour), at least for my use case, and that in turn was reloading Apache every hour, I reduced the frequency as well.
When the backend connection times out, the ELB puts -1 in the backend_processing_time column of its access log. I think what's happening is that some of your requests take longer than 60 s for your backend to process. To confirm this, can you check your latency metrics? Please switch to Maximum when viewing this metric. It will confirm my guess if you see latency frequently reaching 60 s.
Once that's confirmed, you might want to increase the idle timeout of both your ELB and your backend.
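To view the metric the way described above, something like this CloudWatch CLI call would do (the load balancer name and time window are placeholders):

aws cloudwatch get-metric-statistics --namespace AWS/ELB --metric-name Latency \
  --dimensions Name=LoadBalancerName,Value=my-elb \
  --statistics Maximum --period 300 \
  --start-time 2017-01-01T00:00:00Z --end-time 2017-01-02T00:00:00Z

Maximum values pinned near 60 (seconds) would line up with the ELB idle timeout cutting off slow backend responses.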