I am running an ECS instance on EC2 with an application load balancer, a route53 domain, and a RDS db. This is an internal business application that I have restricted IP access to.
I have ran this app for 3 weeks with no issues. However, today the data that the web app ingests is an abnormally large size. This is not a mistake. Due to this though, a webpage is taking approximately 4 minutes to complete which I verified on my local machine it completes. However, running the same operation on AWS fails at precisely 30 seconds every time.
I have connected the app running on my local machine to my production RDS db and am able to download and upload the data with no issue. So there is no issue with the RDS db. In addition, this same functionality has worked previously and only failed today due to the large amount of data.
I spent hours with Amazon support to solve this issue but we couldn't figure it out. I am assuming it is a setting for one the AWS services I am using that has a TTL or timeout set to 30 seconds, but I couldn't find it in any of the services I am using:
route53
RDS
ECS
ECR
EC2
Load Balancer
Target Group
You have a backend instance timeout, likely in the web server config.
Right now your ELB has a timeout of 60 seconds, but your assets are failing at 30.
There are only a couple assets on AWS with hardcoded timeouts like that. I'm thinking (because this is the first time it's happened), you have one of the following:
Size limits in the upstream, or
Time limits on connection keep-alive
Look at your website server software (httpd/nginx). Nginx has something called "upstream.conf" where you can set upstream timeouts. I'm not sure of httpd does as well.
Resources:
https://serverfault.com/questions/414987/nginx-proxy-timeout-while-uploading-big-files
From the NLB documentation, maybe relevant
EC2 instances must respond to a new request within 30 seconds in order to establish a return path.
I don't actually know what a return path is, nor what a 'response' is in this context since NLB has no concept of requests or responses.
- https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
EDIT: Disregard, this must have to do with UDP NATing. 'Response' here is probably a packet going back from the EC2 instance to the client
Related
I'm currently in Sydney and I do have the following scenario:
1 RDS on N. Virginia.
1 EC2 on Sydney
1 EC2 on N. Virginia
I need this to redundation, and this is the simplified scenario.
When my app on EC2 sydney connection to RDS on N. Virgnia, it takes almost 2.5 seconds to give me the result. We can think: Ok, that's the latency.
BUT, when I send the request to EC2 N. Virginia, I get the result in less then 500ms.
Why there is a slow connection when you access RDS from outside the region?
I mean: I can experience this slow connection when I'm running the application on my computer too. But when the application is in the same region that RDS, works quickier that on my own computer.
Most likely you request to RDS requires multiple roundtrips to complete. I.e. at first your EC2 instance requests something to RDS, then something else based on the first request etc. Without seeing your database code, it's hard to say exactly what might be the cause of that.
You say then when you talk to the remote EC2 instance, instead, you get the response in less than 500 ms. That suggests that setting up a TCP connection and sending a single request with reply is 500 ms. Based on that, my guess is that your database connection requires at least 5x back and forth traffic.
There is no additional penalty with RDS in terms of using it out of region, but most database protocols are not optimized for high latency conditions. You might be much better off setting up a read replica in Sydney.
If you are trying to connect the RDS using public-facing network, then it might be slow. AWS launched cross region VPC peering, please peer all the region's VPC (make sure there will not be any IP conflict) and try to connect using private connections.
I'm attempting to profile some API requests for my application in an effort to isolate any potential bottlenecks. However, I'm seeing a slight discrepancy in one portion of my logging, so I'm trying to make sure I'm not missing anything.
The Basic Setup
AWS Classic ELB (cross zone enabled)
balances to 2 AWS t2.micro instances (Staging servers)
The ec2 boxes are running PHP 7.1 and Nginx.
There's an SSL Certificate but it terminates at the ELB. So the ELB is connecting to the backend instances via http on Port 80
The instances are in a private subnet and are connected to the internet via a NAT Gateway
Gathering Data
I have requests times being logged within the application itself, which are being sent to Loggly (So once the request hits the application code, we get a start time, and once the response has been sent the application, we get an end time)
AWS ELB Access Logs
CURL requests to a given endpoint with total_time enabled
The Results (for a given endpoint)
Application Request Time
I'm seeing Application request times of 0.05-0.08 seconds. This is the data being send to Loggly
AWS ELB Access Logs
These logs provide 3 numbers:
Request Processing Time: 0.00004 seconds
Backend Processing Time: 0.05 - 0.09 seconds
Response Processing Time: 0.00004 seconds
CURL
Now, running a CURL command that looks roughly like this:
curl -s -w "%{time_total}\n" -o /dev/null www.google.com
I'm getting total_time results of anywhere from: ~ 0.17 - .5 seconds
So, it seems that the Application logs and the ELB Access Logs show similar results. However, it seems that the overall request time via CURL, can be ~ 3x - 6x slower
If that's the case, what might cause this? My first thought is that it's taking time to connect to the ELB, or maybe we're getting held up on the DNS resolution.
I then tried some basic tests with Apache Benchmark, and that seems to confirm some sort of connection issue. The connection results are **~ 0.07 - 0.5 seconds **
However, because those times don't show up in the AWS ELB Access Logs, I'm led to believe this is something to do with DNS resolution? If so, how might I investigate this further?
Currently, my domain is registered on LiquidWeb and has a CNAME record which points to the ELB DNS name provided by AWS
First some insight into how my setup is:
1 ELB
4 EC2 instances
2 web servers
1 to run the migrations, queue (beanstalkd) and scheduler
1 'services' server (socket.io instance etc etc)
MySQL on RDS
Redis on Elasticache
S3 for user assets
Every day at 10:55PM, users report getting white screens and 502 Bad Gateway errors. The ELB reports that both EC2 instances are OutOfService, yet I'm SSH'd into them and fully able to use the site by bypassing the ELB. RDS and Elasticache maintenance windows aren't during this period, and the two instances aren't at load either. I can't find anything in the ELB access logs, nothing in nginx logs on the instance end, nothing in the Laravel app logs. There's nothing in the Laravel scheduler that runs at this time either.
The only thing I've found, is that in my CloudWatch metrics, the ELB latency spikes right up to about 5-10 seconds. All this results in downtime of about 5-15 minutes at the same time every day. I can't seem to find anything that is causing the issue.
I'm 100% stumped as to what could be causing this to happen. Any help is appreciated.
What probably happens is that your web servers run out of connections, ELB cannot perform health checks and takes them out of service. It's actually enough for one of the machines to experience this and be taken out of service and the other will be killed as a cascading effect.
How many connections can the web servers hold at the same time?
Do you process a particularly "heavy request" at that point in time when this happens?
Does adding more web servers solve your problem?
My Amazon EC2 small instance stopped responding, I looked at the AWS console and CPU use had gone through the roof. I tried rebooting instance but it didn't respond. So I stopped it and started it again (twice).
Now says the CPU usage is fine (was triggering an alarm when breaching 90%) but still can't login via SSH and Apache is not working (my sites are down).
Anyone give me any idea how I can sort this out? I'm out of my depth a bit as unfamiliar with the ins and outs of EC2.
EDIT: console log http://pastebin.com/JWFeG7NU shows Apache, SSH, etc starting up fine but I can't access via SSH and no response to pinging website hosted on server.
If you have stop/started your instance and you were not using an elastic IP address, your instance IP has changed.
If you were using an elastic IP address, it would have become disassociated.
If you do have applications that are causing you to exceed the allocated CPU, other applications such as ssh, may become slow to respond or not respond at all within the timeout.
Does anyone know of a way to make Amazon's Elastic Load Balancers timeout if an HTTP response has not been received from upstream in a set timeframe?
Occasionally Amazon's Elastic Beanstalk will fail an update and any requests to the specified resource (running Nginx + Node if tht's any use) will hang any request pages whilst the resource attempts to load.
I'd like to keep the request timeout under 2s, and if the upstream server has no response by then, to automatically fail over to a default 503 response.
Is this possible with ELB?
Cheers
You can Configure Health Check Settings for Elastic Load Balancing to achieve this:
Elastic Load Balancing routinely checks the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances. For more information on configuring health check, see Health Check.
For example, you simply need to specify an appropriate Ping Path for the HTTP health check, a Response Timeout of 2 seconds and an UnhealthyThreshold of 1 to approximate your specification.
See my answer to What does the Amazon ELB automatic health check do and what does it expect? for more details on how the ELB health check system work.
TLDR - Set your timeout in Nginx.
Let's see if we can walkthrough the issues.
Problem:
The client should be presented with something quickly. It's okay if it's a 500 page. However, the ELB currently waits 60 seconds until giving up (https://forums.aws.amazon.com/thread.jspa?messageID=382182) which means it takes a minute before the user is shown anything.
Solutions:
Change the timeout of the ELB
Looks like AWS support will help increase the timeout (https://forums.aws.amazon.com/thread.jspa?messageID=382182) so I imagine that you'll be able to ask for the reverse. Thus, we can see that it's not user/api tunable and requires you to interact with support. This takes a bit of lead time and more importantly, seems like an odd dial to tune when future developers working on this project will be surprised by such a short timeout.
Change the timeout of the nginx server
This seems like the right level of change. You can use proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout) to do what you're looking for. Tune it to something small (and in particular, you can set it for a particular location if you would like).
Change the way the request happens.
It may be beneficial to change how your client code works. You could imagine shipping a really simple html/js page that 1. pings to see if the job is done and 2. keeps the user updated on the progress. This takes a bit more work then just throwing the 500 page.
Recently, AWS added a way to configure timeouts for ELB. See this blog post:
http://aws.amazon.com/blogs/aws/elb-idle-timeout-control/