I have a MySQL server hosted on AWS RDS that is intermittently uncontactable. I've been doing a lot of development today on a page that, although hosted locally, executes a query against the remote server on each page load, and I've discovered that every few minutes I suddenly can't reach the server - it only lasts for a few seconds, but during that time any attempt to reach it fails with the error SQLSTATE[HY000] [2003] Can't connect to MySQL server.
Existing, already open connections (i.e. from the command line client) are not affected, and can run queries as normal. It's only establishing a new connection that is impossible.
Why does this happen? How can I track down the cause?
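One way to narrow it down is to leave a small probe running that opens a fresh connection every few seconds and logs the failures with timestamps, so the outages can be correlated with RDS events, backups, or failovers. This is just a sketch - it assumes PyMySQL is installed, and the endpoint and credentials are placeholders:

```python
# Minimal connection probe (a sketch, assuming PyMySQL is installed;
# the endpoint and credentials below are placeholders).
# It opens a fresh connection every few seconds and logs any failure with a
# timestamp, so outages can be correlated with RDS events or failovers.
import time
from datetime import datetime

import pymysql

HOST = "your-instance.abc123.us-east-1.rds.amazonaws.com"  # placeholder endpoint

while True:
    try:
        conn = pymysql.connect(host=HOST, user="app", password="secret",
                               database="mydb", connect_timeout=5)
        conn.close()
        print(f"{datetime.now().isoformat()} ok")
    except pymysql.err.OperationalError as exc:
        # The intermittent 2003 "Can't connect" error would show up here.
        print(f"{datetime.now().isoformat()} FAILED: {exc}")
    time.sleep(5)
```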
You might be using an instance type that's too small; t1.micro and t2 instances can show that behavior sometimes. What instance type are you using?
Your CPU capacity might be throttled, and that could be the reason you see intermittent connection failures.
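If it is a burstable (db.t2) class, one way to check whether the instance is running out of CPU credits is to pull CPUCreditBalance from CloudWatch. A rough sketch, assuming boto3 with working credentials and a placeholder DB instance identifier:

```python
# A sketch for checking CPU credit exhaustion on a burstable RDS instance
# (assumes boto3 with credentials; the DB identifier is a placeholder).
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUCreditBalance",  # drops toward 0 when the instance is being throttled
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydbinstance"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```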
Related
I am running an ECS instance on EC2 with an application load balancer, a route53 domain, and an RDS db. This is an internal business application that I have restricted IP access to.
I have run this app for 3 weeks with no issues. However, today the data that the web app ingests is abnormally large. This is not a mistake. Because of this, one webpage takes approximately 4 minutes to complete, which I have verified it does on my local machine. However, running the same operation on AWS fails at precisely 30 seconds every time.
I have connected the app running on my local machine to my production RDS db and am able to download and upload the data with no issue. So there is no issue with the RDS db. In addition, this same functionality has worked previously and only failed today due to the large amount of data.
I spent hours with Amazon support trying to solve this issue, but we couldn't figure it out. I am assuming one of the AWS services I am using has a TTL or timeout set to 30 seconds, but I couldn't find it in any of them:
route53
RDS
ECS
ECR
EC2
Load Balancer
Target Group
You have a backend instance timeout, likely in the web server config.
Right now your ELB has a timeout of 60 seconds, but your requests are failing at 30.
There are only a couple of places on AWS with hardcoded timeouts like that. I'm thinking (because this is the first time it's happened) you have one of the following:
Size limits in the upstream, or
Time limits on connection keep-alive
Look at your web server software (httpd/nginx). Nginx has something called "upstream.conf" where you can set upstream timeouts. I'm not sure if httpd does as well.
Resources:
https://serverfault.com/questions/414987/nginx-proxy-timeout-while-uploading-big-files
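If you want to confirm it's a fixed timeout somewhere in the chain rather than the application, you can time the request from outside. A rough sketch, assuming the `requests` library and a placeholder URL - run it once against the load balancer and once directly against the instance to see which hop enforces the 30 seconds:

```python
# A timing probe (a sketch; assumes the `requests` library and a placeholder
# URL). If the request is cut off at almost exactly 30s regardless of payload
# size, that points at a fixed proxy/web-server timeout rather than the app.
import time

import requests

URL = "https://app.example.internal/slow-report"  # placeholder slow endpoint

start = time.monotonic()
try:
    resp = requests.get(URL, timeout=600)  # client-side limit well above the suspected 30s
    print(f"completed in {time.monotonic() - start:.1f}s, status {resp.status_code}")
except requests.exceptions.RequestException as exc:
    print(f"failed after {time.monotonic() - start:.1f}s: {exc}")
```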
From the NLB documentation, maybe relevant
EC2 instances must respond to a new request within 30 seconds in order to establish a return path.
I don't actually know what a return path is, nor what a 'response' is in this context since NLB has no concept of requests or responses.
- https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
EDIT: Disregard, this must have to do with UDP NATing. 'Response' here is probably a packet going back from the EC2 instance to the client.
I am trying to connect to an instance (Elasticsearch) deployed on AWS from Azure, and randomly the connection times out.
I tried to curl the same instance (the AWS instance) from another AWS instance and even from my local machine, and it returns results just fine.
I am not sure what exactly the issue is in the case of Azure.
Earlier I thought it was an issue with heap memory, so I increased it. For some time it worked fine, but it has started troubling me again, and only from Azure.
I am running Bitnami WordPress on an AWS server. The website had been working for two days, but suddenly it stopped showing anything and a connection timeout appears. The EC2 instance is running perfectly fine, and I have also checked the IP logs, and nothing suspicious has come up.
Based on the above comments, I guess the issue is with the internal web server.
Make sure that the web server itself is running fine. I do not mean just checking the EC2 instance state, because it is possible that the EC2 instance is running but the web server is down, which would cause this issue.
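One quick way to tell the two apart from your own machine is to check which ports still answer. A small sketch, assuming Python 3 and a placeholder host - if SSH (22) answers but 80/443 don't, the instance is up and it's the web server (Apache in the Bitnami stack) that's down:

```python
# A port check (a sketch; assumes Python 3 and a placeholder hostname/IP).
import socket

HOST = "your-instance-public-ip"  # placeholder

for port in (22, 80, 443):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((HOST, port))
        print(f"port {port}: open")
    except OSError as exc:
        print(f"port {port}: closed or filtered ({exc})")
    finally:
        s.close()
```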
First, some insight into my setup:
1 ELB
4 EC2 instances
2 web servers
1 to run the migrations, queue (beanstalkd) and scheduler
1 'services' server (socket.io instance etc etc)
MySQL on RDS
Redis on Elasticache
S3 for user assets
Every day at 10:55PM, users report getting white screens and 502 Bad Gateway errors. The ELB reports that both EC2 instances are OutOfService, yet I'm SSH'd into them and fully able to use the site by bypassing the ELB. RDS and Elasticache maintenance windows aren't during this period, and the two instances aren't at load either. I can't find anything in the ELB access logs, nothing in nginx logs on the instance end, nothing in the Laravel app logs. There's nothing in the Laravel scheduler that runs at this time either.
The only thing I've found is that in my CloudWatch metrics, the ELB latency spikes right up to about 5-10 seconds. All this results in downtime of about 5-15 minutes at the same time every day. I can't seem to find anything that is causing the issue.
I'm 100% stumped as to what could be causing this to happen. Any help is appreciated.
What probably happens is that your web servers run out of connections, so the ELB cannot perform its health checks and takes them out of service. It's actually enough for one of the machines to experience this and be taken out of service; the other will then be killed as a cascading effect.
How many connections can the web servers hold at the same time?
Do you process a particularly "heavy request" at that point in time when this happens?
Does adding more web servers solve your problem?
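To confirm this, pull the ELB metrics around the incident window from CloudWatch and see whether HealthyHostCount drops while requests queue up. A rough sketch, assuming a classic ELB, boto3 with credentials, and placeholder load balancer name and time window:

```python
# A sketch for pulling classic-ELB metrics around the nightly incident
# (assumes boto3 credentials; "my-elb" and the time window are placeholders).
from datetime import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def series(metric):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric,
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],  # placeholder
        StartTime=datetime(2016, 3, 1, 22, 30),  # placeholder: shortly before 10:55PM (use UTC)
        EndTime=datetime(2016, 3, 1, 23, 30),    # placeholder: shortly after
        Period=60,
        Statistics=["Average", "Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])

for metric in ("HealthyHostCount", "Latency", "SurgeQueueLength"):
    print(metric)
    for p in series(metric):
        print(" ", p["Timestamp"], p.get("Average"), p.get("Maximum"))
```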
My Amazon EC2 small instance stopped responding. I looked at the AWS console and CPU use had gone through the roof. I tried rebooting the instance but it didn't respond, so I stopped it and started it again (twice).
Now it says the CPU usage is fine (it was triggering an alarm when breaching 90%), but I still can't log in via SSH and Apache is not working (my sites are down).
Can anyone give me any idea how I can sort this out? I'm out of my depth a bit, as I'm unfamiliar with the ins and outs of EC2.
EDIT: the console log http://pastebin.com/JWFeG7NU shows Apache, SSH, etc. starting up fine, but I can't access the instance via SSH and there's no response when pinging the website hosted on the server.
If you have stopped/started your instance and you were not using an Elastic IP address, your instance's public IP has changed.
If you were using an Elastic IP address, it would have become disassociated.
If you do have applications that are causing you to exceed the allocated CPU, other applications such as ssh may become slow to respond or not respond at all within the timeout.
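A quick way to confirm whether the public IP changed after the stop/start (and where any Elastic IPs are pointing now) is to ask the API. A rough sketch, assuming boto3 with credentials and a placeholder instance ID:

```python
# A sketch to check the instance's current public IP and Elastic IP
# associations (assumes boto3 credentials; the instance ID is a placeholder).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder ID
instance = resp["Reservations"][0]["Instances"][0]
print("current public IP:", instance.get("PublicIpAddress"))

# Any Elastic IPs in the account, and what each is attached to right now:
for addr in ec2.describe_addresses()["Addresses"]:
    print("EIP", addr["PublicIp"], "->", addr.get("InstanceId", "unassociated"))
```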