Why are my compute engine instances getting restarted? - google-cloud-platform

I'm running a cloud load balancer to dispatch incoming requests to two GCE (Compute Engine) instances.
It runs fine on some days; on others, the instances get restarted for no obvious reason, so all processes (mainly Tomcat) are terminated and users receive errors.
I'm not running preemptible VM instances (I've checked, per Why do my google cloud compute instances always unexpectedly restart? )
How can I find out why the instances are getting restarted? The experience is getting more and more frustrating.
I used to run a cluster of cheap hosted servers for years before switching to GCP and never had any issues - and it was much (much) cheaper.
I thought I would get better performance and better scalability, but if the whole setup is not reliable, it does not make much sense.
How do I get any info on WHY the instances are getting restarted? I can't find anything in my logs (neither the load balancer logs nor the Compute Engine logs).

The instances are probably being restarted because of a failing health check or a system event such as a live migration, termination, or automatic restart. That being said, I would recommend checking the Stackdriver logs for the particular instance to see why it was restarted. I would also recommend checking this article, which will help you understand and find the logs for live migration, termination, and automatic restart events.
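In case it helps, here is a minimal sketch of pulling those system events with the google-cloud-logging Python client. The project ID and timestamp are placeholders, and the filter assumes the standard Compute Engine system_event audit log; adjust both for your setup.

# Sketch: list Compute Engine system events (live migration, termination,
# automatic restart) via Cloud Logging. PROJECT_ID and the timestamp are
# placeholders for your own values.
from google.cloud import logging

PROJECT_ID = "my-project-id"  # placeholder

client = logging.Client(project=PROJECT_ID)

log_filter = (
    'logName="projects/{}/logs/cloudaudit.googleapis.com%2Fsystem_event" '
    'AND timestamp>="2021-01-01T00:00:00Z"'.format(PROJECT_ID)
)

# "timestamp desc" returns the most recent events first
for entry in client.list_entries(filter_=log_filter, order_by="timestamp desc"):
    print(entry.timestamp, entry.payload)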

Related

AWS ECS does not drain connections or remove tasks from Target Group before stopping them

I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering them from the target group and draining all http connections to the old tasks before eventually stopping them. However, lately ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This is resulting in 8-12 seconds of API down time for us while new http requests continue to be routed to the now-stopped tasks that are still in the target group. This happens now whether we trigger the service update via the CLI or the console - same behaviour. Shown here is a screenshot showing a sample sequence of Events from ECS demonstrating the issue, as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still seeing brief downtime spikes on our service at the load balancer level during deployments. Ultimately this turned out to be caused by the high startup overhead of the new task versions: spinning them up briefly pegged the CPU of the instances in the cluster at 100% when there was also sufficient traffic during the deployment, causing some requests to get dropped.
A good-enough-for-now solution was to raise our minimum healthy deployment percentage to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces the deployments to "slow down", only launching 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads the high task startup overhead over two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
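For anyone who wants to script that change rather than do it in the console, a rough boto3 sketch of the same deployment-configuration update; the cluster and service names are placeholders, not values from the original post.

# Sketch: slow down ECS rolling deployments by capping surge capacity.
# "my-cluster" and "my-service" are placeholders for your own names.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    deploymentConfiguration={
        "minimumHealthyPercent": 100,  # never drop below the desired count
        "maximumPercent": 150,         # launch at most 50% extra tasks at a time
    },
)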

AWS Autoscaling Group EC2 instances go down during cron jobs

I tried autoscaling groups and, alternatively, just a bunch of EC2 instances tied together by a load balancer. Both configs work fine at first glance.
But when the EC2 instance is part of an autoscaling group, it goes down sometimes. Actually, it happens very often, almost once a day. And they go down in a "hard reset" way: the EC2 monitoring graphs show CPU usage going up to 100%, then the instance becomes unresponsive, and then it is terminated by the autoscaling group.
And it has nothing to do with my processes on these instances.
When the instance is not a part of Autoscaling groups, it can work without the CPU usage spikes for years.
The "hard reset" on autoscaling group instances are braking my cron jobs. As much as I like the autoscaling groups I cannot use it.
It there a standard way to deal with the "hard resets"?
PS.
The cron jobs are running PHP scripts on Ubuntu in my case. I managed to set it up so that only one instance runs the job.
It sounds like you have a health check that is failing while your cron is running, and as a result the instance is being taken out of service.
If you look at the ASG, there should be a reason listed for why the instance was taken out. This will usually be a health check failure, but there could be other reasons as well.
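If you would rather pull that history from a script than the console, something like this boto3 sketch surfaces the cause strings; the ASG name is a placeholder.

# Sketch: print recent Auto Scaling activity, including the cause of each
# termination (e.g. a failed ELB/EC2 health check). ASG name is a placeholder.
import boto3

autoscaling = boto3.client("autoscaling")

activities = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="my-asg",
    MaxRecords=20,
)["Activities"]

for activity in activities:
    print(activity["StartTime"], activity["Description"])
    print("  cause:", activity["Cause"])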
There are a couple things you can do to fix this.
First, determine why your cron is taking 100% of CPU, and how long it generally takes.
Review your health check settings. Are you using HTTP or TCP? What is the interval, and how many checks have to fail before it is taken out of service?
Between those two items, you should be able to adjust the health checks so that the instance isn't taken out of service while the cron is running. It is also possible that the instance itself is failing; typically this would be because it runs out of memory. If that is the case, you may want to consider moving to a larger instance type and/or enabling swap.
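As an illustration of reviewing and relaxing those health check settings, here is a hedged boto3 sketch assuming an ALB/target-group setup (which may differ from your configuration); the target group ARN and the new values are placeholders.

# Sketch: inspect and relax ALB target-group health check settings so a
# short CPU spike from cron doesn't mark the instance unhealthy.
import boto3

elbv2 = boto3.client("elbv2")
tg_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123"  # placeholder

tg = elbv2.describe_target_groups(TargetGroupArns=[tg_arn])["TargetGroups"][0]
print(tg["HealthCheckProtocol"], tg["HealthCheckIntervalSeconds"],
      tg["UnhealthyThresholdCount"])

# Allow more consecutive failures before the target is taken out of service.
elbv2.modify_target_group(
    TargetGroupArn=tg_arn,
    HealthCheckIntervalSeconds=30,
    UnhealthyThresholdCount=5,
)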
I once had a similar issue; in that case it was the system auto-update running. The system (a Windows server) was downloading a big update and took 100% of the CPU for hours. My suggestion is to monitor which service is running at that moment (even if the OS is Linux), and also check for any scheduled tasks (since it looks like it happens periodically). Other than that, try to keep the task list open during the event and see what is going on.

AWS ECS Fargate: Application Load Balancer Returns 503 with a Weird Pattern

(NOTE: I've modified my original post since I've now collected a little more data.)
Something just started happening today after several weeks of no issues and I can't think of anything I changed that would've caused this.
I have a Spring Boot app living behind an NGINX proxy (all Dockerized), all of which is in an AWS ECS Fargate cluster.
After deployment, I'm noticing that--randomly (as in, sometimes this doesn't happen)--a call to the services being served up by Spring Boot will 503 (behind the NGINX proxy). It seems to do this on every second deployment, with a subsequent deployment fixing the matter, i.e. calls to the server will succeed for a while (maybe a few seconds; maybe a few minutes), but then stop.
I looked at "HealthyHostCount" and noticed that when I get the 503, my main target group says it has no registered/healthy hosts in either AZ. I'm not sure what would cause the TG to deregister a target, especially since a subsequent deployment seems to "fix" the issue.
Any insight/help would be greatly appreciated.
Thanks in advance.
UPDATE
It seems to happen right after I "terminate original task set" in the Blue/Green deployment from CodeDeploy. I'm wondering if it's an AZ issue, i.e. I haven't specified enough tasks to run on both of them.
I think it's likely that your services are failing the health check on the target group.
When the TargetGroup determines that an existing target is unhealthy, it will unregister it, which will cause ECS to then launch a task to replace it. In the meantime you'll typically see Gateway errors like this.
On the ECS page, click your service, then click on the Events tab... if I am correct, here you'd see messages about why the task was stopped like 'unhealthy' or 'health check timeout'.
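The same event messages are also available from the API, if that's more convenient; a quick boto3 sketch with placeholder cluster and service names.

# Sketch: dump recent ECS service events, which include messages such as
# "... has started 2 tasks" or reasons a task was stopped.
import boto3

ecs = boto3.client("ecs")

service = ecs.describe_services(
    cluster="my-cluster",      # placeholder
    services=["my-service"],   # placeholder
)["services"][0]

for event in service["events"][:20]:
    print(event["createdAt"], event["message"])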
The causes can be myriad: an overloaded cluster (probably not the case with Fargate), not enough resources, a runaway request that eats all the memory or CPU.
Your case smells like a slow-startup issue. ECS web services have a 'Health Check Grace Period'... that is, a time to wait before health checks start. If your container takes too long to get started when first launched, the first couple of health checks may fail and cause the tasks to 'cycle'. A new deployment may be slower if your images are particularly large, because the ECS host has to download the image. Fargate, in general, is a bit slower than EC2 hosts because of overhead in how it sets up networking. If a slow startup is your problem, you can try increasing this grace period, but you should also investigate how to speed up your container startup time (decrease the image size, other nginx tweaks to speed up initialization, etc.).
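If the grace period does turn out to be the culprit, bumping it is a small change; a hedged boto3 sketch, where the names and the 120-second value are placeholders rather than a recommendation.

# Sketch: give slow-starting containers more time before load balancer
# health checks count against them. Names and value are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    healthCheckGracePeriodSeconds=120,
)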
It turns out it was an issue with how I'd configured one of my target groups and/or the ALB, although what the exact issue was, I couldn't determine as I re-created the TGs and the service from scratch.
If I had to guess, I'd guess my TG wasn't configured with the right port, or the ALB had a bad rule/listener.
If anyone else has this issue after creating a new environment for an existing app, make sure you configure your security groups. My environment was timing out (and showing bad health) because my web servers were not covered by the database's inbound rule on AWS.
You can tell whether this is your issue if you are able to reach URLs in your app that do not touch those same web services/databases.
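A quick way to double-check those inbound rules from code, if that's easier than clicking through the console; the security group ID is a placeholder.

# Sketch: list a security group's inbound rules to confirm the web tier is
# actually allowed to reach the database. Group ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")

group = ec2.describe_security_groups(GroupIds=["sg-0123456789abcdef0"])["SecurityGroups"][0]

for perm in group["IpPermissions"]:
    print(perm.get("IpProtocol"), perm.get("FromPort"), perm.get("ToPort"),
          [r["CidrIp"] for r in perm.get("IpRanges", [])],
          [g["GroupId"] for g in perm.get("UserIdGroupPairs", [])])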

How can I debug an AWS EC2 instance randomly becoming unreachable

We have an EC2 instance which becomes unreachable randomly. It has only started recently, and seems to only happen outside of business hours.
We are finding that the instance's websites, WHM, SSH, and even a terminal ping are all unreachable. However, the instance is running and its health checks are fine in the AWS console.
We used to have this with another instance but that just randomly stopped doing it at some point.
I have checked the CPU usage over the last 2 weeks; it has hit 100% 4 times, but those times are not when the instance goes down, and I'm not sure they're even related.
The instance has WHM/cPanel installed and has not reached its disk usage limit, nor its bandwidth usage limit. We have cPHulk Brute Force Protection installed and running, so surely it can't be a brute-force attack?
It is resolved by stopping and then starting the instance, but with clients in different timezones viewing links outside of our business hours, the downtime is a real problem.
I recommend you try installing the CloudWatch Agent on the EC2 instance in order to get additional metrics and be able to analyze them further.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
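Once the agent is shipping data, you can pull the extra metrics it publishes (memory, disk) and line them up with the outage windows. A rough boto3 sketch, assuming the agent's default CWAgent namespace and mem_used_percent metric name on Linux; the instance ID and time window are placeholders.

# Sketch: fetch memory usage collected by the CloudWatch Agent around an
# outage window. Namespace/metric assume the agent's default Linux config.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=12),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])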

AWS/EC2 - Initially working instances become inaccessible, although still running

Issue in a nutshell:
Simple, single practice EC2 instances are unexpectedly just falling off the grid even though they are still running, and I have to keep recreating them; otherwise, SSH access and requests to the public DNS both result in a "Timeout".
Little More Details Outside the Nutshell :)
I've followed the setting up a LAMP server instructions to the "T" and successfully have served up basic HTML pages.
Everything initially works fine:
I can ssh into the instance no problem
When accessing the public DNS online - the expected html pages render just fine.
Problem:
But then, quite randomly, I can no longer access the instance through SSH, and even online, the public DNS is inaccessible.
In both cases they just "Timeout"
Config:
Basic Free Tier
Amazon Linux AMI 2015.09.1 (HVM), SSD Volume Type
t2.micro
Number of Instances - 1
Auto-assign Public IP(Enabled)
Ports - 22(My IP),80(0.0.0.0),443(0.0.0.0)
Using a key pair
Question:
What typically causes instances to freeze up like this?
LAMP stacks on EC2 are extremely common, and the guide you're following is extremely popular and has been used for years, so it's likely you've gone wrong somewhere or the problem is something more sinister.
If you can't access the instance by any means, it sounds like it has become overloaded - unless you've accidentally changed a firewall rule on the AWS side (e.g. security groups, NACLs) or something at the instance level (e.g. iptables).
Open up ICMP on your security group and try pinging the instance and see if you get a response.
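If you prefer to open ICMP from a script rather than the console, a hedged boto3 sketch; the group ID and CIDR are placeholders, and ideally you'd restrict the rule to your own IP.

# Sketch: allow ICMP echo (ping) from a single IP so you can test whether
# the instance responds at all. Group ID and CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "icmp",
        "FromPort": -1,   # -1 = all ICMP types
        "ToPort": -1,
        "IpRanges": [{"CidrIp": "203.0.113.10/32", "Description": "ping test"}],
    }],
)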
After you've verified all your firewalls and you've tried to connect to it through every means, check out the logs, they're your friend.
To check the logs, start at the AWS level. CloudWatch records lots of data about your instance - CPU Utilization, Network In & Out and more. Check all of these through the AWS Console ensuring you select the "Maximum" statistic and not "Average". Also, take a look at the "StatusCheckFailed_System" (Hardware problem) and "StatusCheckFailed_Instance" (Instance not responding to health check probes) metrics to see if they have any story to tell. See the docs here and here for more info.
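As a concrete illustration of that step, here is a rough boto3 sketch pulling CPUUtilization and the two status-check metrics with the "Maximum" statistic; the instance ID and time window are placeholders.

# Sketch: pull CPU and status-check metrics with the "Maximum" statistic
# for the window around a freeze. Instance ID is a placeholder.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")
instance_id = "i-0123456789abcdef0"  # placeholder

for metric in ("CPUUtilization", "StatusCheckFailed_System", "StatusCheckFailed_Instance"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(hours=6),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Maximum"],
    )
    print(metric)
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(" ", point["Timestamp"], point["Maximum"])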
Next, reboot the instance, or try stopping and starting it, and reconnect via SSH. Check your application logs (if any) and check your Apache logs and Linux system logs to see what happened.
But to answer your question, what typically causes an instance to freeze up like this:
Bad Application code that sucks up all the CPU overloading the instance
Too much traffic overloading the instance
Running more services on the instance than it is able to handle
AWS Hardware problem - Uncommon