How can I debug an AWS EC2 instance randomly becoming unreachable?

We have an EC2 instance which becomes unreachable at random. This has only started recently and seems to happen only outside of business hours.
When it happens, the instance's websites, WHM, SSH, and even a ping from a terminal are all unreachable. However, the instance is running and its health checks are fine in the AWS console.
We used to have this with another instance, but that one simply stopped doing it at some point.
I have checked CPU usage over the last 2 weeks; it hit 100% four times, but those spikes do not coincide with the times the instance goes down, and I'm not sure they're even related.
The instance has WHM/cPanel installed and has not reached its disk usage limit or bandwidth usage limit. We have cPHulk Brute Force Protection installed and running, so surely it can't be a brute-force attack?
It is resolved by stopping and then starting the instance, but we have clients in different time zones viewing links, so the server going down outside of business hours is a real problem.

I recommend installing the CloudWatch agent on the EC2 instance so that you can collect additional metrics (memory, disk, and so on) and analyze them further.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
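If you go the CloudWatch agent route, the memory metrics it publishes are usually the most telling for a box that stops answering SSH and HTTP while its status checks stay green. Below is a minimal sketch (not part of the original answer) that pulls the agent's memory metric with boto3 so you can line the spikes up against the outage times; "CWAgent" and "mem_used_percent" are the agent's defaults, and the instance ID and region are placeholders.

```python
# A minimal sketch: pull the CloudWatch agent's memory metric with boto3 and
# compare the spikes with the outage times. "CWAgent" and "mem_used_percent"
# are the agent's default namespace and metric name; the instance ID and
# region are placeholders.
import datetime
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=14)

resp = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=start,
    EndTime=end,
    Period=3600,              # one datapoint per hour
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Maximum"], 1), "% memory used")
```

If memory sits near 100% just before each outage, the freeze is more likely an OOM problem inside the instance than anything on the AWS side.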

Related

GCP VM instance schedule randomly not starting VM

We've started using the GCP instance schedule for one of our VMs, which needs to be up for 3 hours every night. For some reason, about once per week the VM is not up and services can't access it.
Checking Logs Explorer, there are no errors or warnings, but on the days when it is not working, a few events that are normally logged are missing. These are the GCE Agent Started and OSConfig Agent Started events, which appear on days when everything is OK (09-11, 09-12, 09-14) but are missing on days when the instance is not up (09-13).
The VM is Windows Server 2012 R2.
There is no retry policy implemented in the GCP instance schedule feature.
We know there are other ways to schedule VMs but we'd prefer to use the instance schedule feature if possible and if it is stable.
Is there somewhere else we should look for understanding why the VM is not starting properly?
Instance schedules do not provide capacity guarantees, so if the resources required for a scheduled VM instance are not available at the scheduled time, your VM instance might not start when scheduled. Although you can reserve VM instances before starting them to provide capacity guarantees, reservations cannot be automatically scheduled. (This assumes the behaviour shows up on a random VM each week, not the same VM every week.)
If it is the same VM every time, high memory utilization can also leave a VM unresponsive. A manual reboot fixes this because it stops whatever is consuming the memory and restarts processes or services that may have been killed for being out of memory (OOM).
Consider monitoring the VM's memory usage by installing a monitoring agent, and increase the VM's memory based on the observed utilization.
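If you do install an agent, here is a minimal sketch (my own, not from the answer) of how you could read the memory metric it publishes with the Cloud Monitoring Python client and check whether the VM is running hot around the failed starts. It assumes an agent that publishes the agent.googleapis.com metrics (legacy monitoring agent or Ops Agent, whichever the OS supports); the project ID and lookback window are placeholders.

```python
# A minimal sketch, assuming an agent that publishes agent.googleapis.com
# metrics. It lists peak memory usage per instance over the last week;
# PROJECT_ID is a placeholder.
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 7 * 24 * 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "agent.googleapis.com/memory/percent_used" '
            'AND metric.labels.state = "used" '
            'AND resource.type = "gce_instance"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    instance_id = series.resource.labels["instance_id"]
    peak = max(point.value.double_value for point in series.points)
    print(instance_id, "peak memory used:", round(peak, 1), "%")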

AWS Autoscaling Group EC2 instances go down during cron jobs

I tried autoscaling groups and, alternatively, just a bunch of EC2 instances behind a load balancer. Both configurations work fine at first glance.
But when an EC2 instance is part of an autoscaling group, it sometimes goes down. Actually it happens very often, almost once a day, and the instances go down in a "hard reset" way: the EC2 monitoring graphs show CPU usage climbing to 100%, then the instance becomes unresponsive, and then it is terminated by the autoscaling group.
It has nothing to do with my processes on these instances.
When the instance is not part of an autoscaling group, it can run without the CPU usage spikes for years.
The "hard resets" on autoscaling group instances are breaking my cron jobs. As much as I like autoscaling groups, I cannot use them like this.
Is there a standard way to deal with the "hard resets"?
PS.
The cron jobs run PHP scripts on Ubuntu in my case. I managed to make only one instance run the job.
It sounds like you have a health check that is failing when your cron runs, and as a result the instance is being taken out of service.
If you look at the ASG, there should be a reason listed for why the instance was taken out. This will usually be a health check failure, but there could be other reasons as well.
There are a couple of things you can do to fix this.
First, determine why your cron job is taking 100% of the CPU, and how long it generally takes to run.
Review your health check settings. Are you using HTTP or TCP? What is the interval, and how many checks have to fail before the instance is taken out of service?
Between those two items, you should be able to adjust the health checks so that the instance is not taken out of service while the cron job is running (see the sketch below). It is also possible that the instance really is failing; typically this is because it runs out of memory. If that is the case, you may want to consider moving to a larger instance type and/or enabling swap.
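To make those two items concrete, here is a minimal sketch (my own, not from the answer) using boto3, assuming an autoscaling group behind an ALB/NLB target group: it lists the ASG's recent scaling activities, which include the stated cause of each termination, and then loosens the target group's health check so a short 100%-CPU cron run is less likely to get the instance marked unhealthy. The group name, target group ARN, and thresholds are placeholders.

```python
# A minimal sketch, assuming an ASG behind an ALB/NLB target group.
# Step 1 prints recent scaling activities (including the cause of each
# termination); step 2 relaxes the target group's health check. The
# names, ARN, and thresholds are placeholders.
import boto3

ASG_NAME = "my-asg"  # placeholder
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/my-targets/0123456789abcdef"  # placeholder
)

autoscaling = boto3.client("autoscaling")
elbv2 = boto3.client("elbv2")

# 1. Why were instances taken out of service?
activities = autoscaling.describe_scaling_activities(
    AutoScalingGroupName=ASG_NAME, MaxRecords=20
)
for activity in activities["Activities"]:
    print(activity["StartTime"], activity["Description"])
    print("   cause:", activity.get("Cause", ""))

# 2. Give the health check more slack during the cron window.
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckIntervalSeconds=30,
    HealthCheckTimeoutSeconds=10,
    UnhealthyThresholdCount=5,  # roughly 2.5 minutes of failures before removal
)
```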
I once had a similar issue; in that case it was the system auto-update running. The system (Windows Server) was downloading a big update and took 100% of the CPU for hours. My suggestion is to monitor which services are running at that moment (even if the OS is Linux), and also check for any scheduled tasks (since the problem looks like it happens periodically). Other than that, try to keep the task/process list open during the event and see what is going on.

Why do AWS ec2 instances get degraded?

I've been trying to get details on this but with no luck. I've observed that if an EC2 instance has been running for many days (say 30-40 days), it becomes degraded. Terminating that instance fixes it.
But,
Why do ec2 instances get degraded? Is it because of the hardware or the software that we are running on it?
Is there anyway to avoid it?
This is quite a rare occurrence.
It means that there is an issue with the hardware on which the instance is running.
You can change the hardware by Stopping the instance, and then Starting it again. This will provision the instance on a different host computer.
If you have experienced this multiple times, then there is probably an issue with the operating system or application running on the instance. You should contact AWS Support for assistance.
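As a quick way to confirm (or rule out) a degraded host before doing the stop/start, you can look at the instance's status checks and scheduled events. A minimal boto3 sketch, with the instance ID as a placeholder:

```python
# A minimal sketch: check the instance's status checks and any scheduled
# events before doing the stop/start. The instance ID is a placeholder.
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

ec2 = boto3.client("ec2")
resp = ec2.describe_instance_status(
    InstanceIds=[INSTANCE_ID], IncludeAllInstances=True
)

for status in resp["InstanceStatuses"]:
    print("System status:  ", status["SystemStatus"]["Status"])
    print("Instance status:", status["InstanceStatus"]["Status"])
    for event in status.get("Events", []):
        # e.g. "instance-retirement" or "system-reboot", with a scheduled window
        print("Scheduled event:", event["Code"], event.get("Description", ""))
```

A failing system status check or an instance-retirement event points at the underlying hardware; a failing instance status check with a healthy system check points back at the operating system or application.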

Amazon EC2 Servers getting frozen sporadically

I've been working with Amazon EC2 servers for 3+ years and I have noticed a recurring behaviour: some servers freeze sporadically (between 1 and 5 times per year).
When this occurs, I can't connect to the server (I tried HTTP, MySQL, and SSH connections) until the server is restarted.
The server goes back to work after a restart.
Sometimes the server stays online for 6+ months, sometimes it freezes about 1 month after a restart.
All the servers on which I noticed this behaviour were micro instances (North Virginia and Sao Paulo).
The servers have an ordinary Apache 2, MySQL 5, PHP 7 environment, with Ubuntu 16 or 18. The PHP/MySQL web application is not CPU intensive and is not accessed by more than 30 users/hour.
The same environment and application on DigitalOcean servers does NOT reproduce the behaviour (I have two DigitalOcean servers running uninterrupted for 2+ years).
I like Amazon EC2 servers, mainly because Amazon has a lot of useful additional services (like SES), but this behaviour is really frustrating. Sometimes I get customer calls complaining that systems are down, and I just need an instance restart to solve the problem.
Does anybody have a tip about solving this problem?
UPDATE 1
They are t2.micro instances (1Gb RAM, 1 vCPU).
MySQL SHOW GLOBAL VARIABLES: pastebin.com/m65ieAAb
UPDATE 2
There is a CPU utilization peak in the logs near the time the server went down. It was at 3 AM. At that time a daily crontab task runs to make a database backup. But considering this task runs every day, why would it only sometimes freeze the server?
I have not seen this exact issue, but on any cloud platform I assume any instance can fail at any time, so we design for failure. For example, we have autoscaling on all customer-facing instances; any time an instance fails, it is automatically replaced.
If a customer has to call to tell you a server is down, you may need to consider more automated methods of monitoring instance health and taking automated action to recover the instance.
CloudWatch also has recovery actions available that can be triggered when certain metric thresholds are reached.
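Such a recovery action can be wired up as a CloudWatch alarm on the system status check whose action is the built-in EC2 recover ARN (which moves the instance to healthy hardware). A minimal sketch, with the region, instance ID, and alarm name as placeholders:

```python
# A minimal sketch: an alarm on the system status check that triggers the
# built-in EC2 recover action. Region, instance ID, and alarm name are
# placeholders.
import boto3

REGION = "us-east-1"                 # placeholder
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

cloudwatch = boto3.client("cloudwatch", region_name=REGION)
cloudwatch.put_metric_alarm(
    AlarmName="ec2-auto-recover",  # placeholder
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,  # two consecutive minutes of failed system checks
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:recover"],
)
```

Note that the recover action only covers infrastructure-level failures (StatusCheckFailed_System); a freeze that happens inside the OS, such as memory exhaustion on a t2.micro, shows up as StatusCheckFailed_Instance, where a reboot alarm action is the usual choice.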

Why are my compute engine instances getting restarted?

I'm running a cloud load balancer to dispatch incoming requests to two GCE instances.
It runs fine on some days, on others the instances are getting restarted for no obvious reason so all processes (mainly tomcat) are getting terminated and users are receiving errors.
I'm not running preemptible VM instances (I've checked, following "Why do my google cloud compute instances always unexpectedly restart?").
How can I find out why the instances are getting restarted? The experience is getting more and more frustrating.
I used to run a cluster of cheap hosted servers for years before switching to GCP and never had any issues - and it was much (much) cheaper.
I thought I would get better performance and better scalability, but if the whole setup is not reliable, it does not make much sense.
How do I get any information on WHY the instances are getting restarted? I can't find anything in my logs (neither in the load balancer logs nor in the Compute Engine logs).
The instances are probably failing because of a health check or another event such as a live migration, termination, or automatic restart. That being said, I would recommend checking the Stackdriver (Cloud Logging) logs for the particular instance to see why it is being restarted. I would also recommend this article, which will help you understand and find the logs for live migration, termination, and automatic restart events.
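If it helps, here is a minimal sketch (my own, assuming the google-cloud-logging Python client) that pulls the Compute Engine system-event audit log for one instance; that log is where live migration, host error, and automatic restart entries land. The project ID and instance name are placeholders.

```python
# A minimal sketch, assuming the google-cloud-logging client library; the
# project ID and instance name are placeholders.
from google.cloud import logging

PROJECT_ID = "my-project"        # placeholder
INSTANCE_NAME = "my-instance-1"  # placeholder

client = logging.Client(project=PROJECT_ID)
log_filter = (
    'resource.type="gce_instance" '
    'AND log_id("cloudaudit.googleapis.com/system_event") '
    f'AND protoPayload.resourceName:"{INSTANCE_NAME}"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    # The payload's methodName field identifies the event
    # (e.g. a host error, live migration, or automatic restart).
    print(entry.timestamp, entry.payload)
```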