I've been working with Amazon EC2 servers for 3+ years and I've noticed a recurring behaviour: some servers freeze sporadically (between one and five times per year).
When this happens, I can't connect to the server (I've tried HTTP, MySQL and SSH connections) until the server is restarted.
The server comes back to normal after a restart.
Sometimes the server stays online for 6+ months; sometimes it freezes about a month after a restart.
All the servers where I noticed this behaviour were micro instances (North Virginia and São Paulo).
The servers have an ordinary Apache 2, MySQL 5, PHP 7 environment on Ubuntu 16 or 18. The PHP/MySQL web application is not CPU intensive and is not accessed by more than 30 users/hour.
The same environment and application on DigitalOcean servers does NOT reproduce the behaviour (I have two DigitalOcean servers running uninterrupted for 2+ years).
I like Amazon EC2, mainly because Amazon has a lot of useful additional services (like SES), but this behaviour is really frustrating. Sometimes I get calls from customers complaining that the system is down, and all it takes to fix it is an instance restart.
Does anybody have a tip for solving this problem?
UPDATE 1
They are t2.micro instances (1 GB RAM, 1 vCPU).
MySQL SHOW GLOBAL VARIABLES: pastebin.com/m65ieAAb
UPDATE 2
There is a CPU utilization peak in the logs near the time the server went down. It was at 3 AM, when a daily crontab task runs a database backup. But considering this task runs every day, why would it freeze the server only sometimes?
I have not seen this exact issue, but on any cloud platform I assume any instance can fail at any time, so we design for failure. For example, we have autoscaling on all customer-facing instances. Any time an instance fails, it is automatically replaced.
If a customer is calling to tell you a server is down, you may need to consider more automated methods of monitoring instance health and taking automated action to recover the instance.
CloudWatch also has instance recovery actions available that can be triggered if certain metric thresholds are reached.
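If the freeze shows up as a failed instance status check, one concrete way to wire this up is a CloudWatch alarm with one of the built-in EC2 alarm actions (reboot, or recover for underlying host problems). A minimal sketch using boto3; the region and instance ID are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

# Reboot the instance automatically when the instance status check (the one
# that fails when the OS itself is hung) has been failing for three
# consecutive minutes. For host/hardware failures, the equivalent would be
# the ...ec2:recover action on StatusCheckFailed_System.
cloudwatch.put_metric_alarm(
    AlarmName="auto-reboot-frozen-instance",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_Instance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:reboot"],  # built-in EC2 action
)
```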
Related
We've started using GCP's instance schedule feature for one of our VMs, which needs to be up for 3 hours every night. For some reason, about once per week the VM is not up and services can't access it.
Checking in Logs Explorer, there are no errors or warnings, but on the days when it is not working, a few events that are normally logged are missing. These are the GCE Agent Started and OSConfig Agent Started events, which appear on days when everything is OK (09-11, 09-12, 09-14) but are missing on the day the instance did not come up (09-13).
The VM is Windows Server 2012 R2.
There is no retry policy implemented in the GCP instance schedule feature.
We know there are other ways to schedule VMs but we'd prefer to use the instance schedule feature if possible and if it is stable.
Is there somewhere else we should look for understanding why the VM is not starting properly?
(The original post included a screenshot of the log entries, not reproduced here.)
Instance schedules do not provide capacity guarantees, so if the resources required for a scheduled VM instance are not available at the scheduled time, your VM instance might not start when scheduled. You can reserve VM instances ahead of time to get a capacity guarantee, but reservations cannot be started automatically by a schedule. (This assumes the behaviour shows up on random VM instances each week, not on one particular VM every week.)
If it is the same VM every time, high memory utilization can also cause a VM to become unresponsive. A manual reboot fixes this because it closes whatever is consuming the memory and restarts any processes or services that were killed for being out of memory.
Please consider monitoring the VM's memory usage by installing a monitoring agent, and increase the VM's memory based on the observed utilization.
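If the root cause turns out to be the scheduled start silently not happening (rather than memory pressure), one workaround, given that instance schedules have no retry, is a small scheduled job (for example triggered by Cloud Scheduler) that checks the VM and starts it if it is still stopped. A rough sketch assuming the google-cloud-compute client library; project, zone and instance names are placeholders:

```python
from google.cloud import compute_v1

PROJECT = "my-project"          # placeholder
ZONE = "us-central1-a"          # placeholder
INSTANCE = "nightly-batch-vm"   # placeholder

def ensure_running():
    client = compute_v1.InstancesClient()
    vm = client.get(project=PROJECT, zone=ZONE, instance=INSTANCE)
    if vm.status != "RUNNING":
        # The scheduled start did not happen (or the VM stopped); start it now.
        operation = client.start(project=PROJECT, zone=ZONE, instance=INSTANCE)
        operation.result()  # block until the start operation completes

if __name__ == "__main__":
    ensure_running()
```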
We have an EC2 instance which becomes unreachable at random. This has only started recently, and it seems to happen only outside of business hours.
We find that the instance's websites, WHM, SSH, and even a plain ping are all unreachable. However, the instance is running and the health checks are fine in the AWS console.
We used to have this with another instance, but that one just randomly stopped doing it at some point.
I have checked the CPU usage over the last 2 weeks: it hit 100% four times, but those times do not match when the instance goes down, and I'm not sure they're even related.
The instance has WHM/cPanel installed and has not reached its disk usage limit or bandwidth usage limit. We have cPHulk Brute Force Protection installed and running, so surely it can't be a brute-force attack?
It is resolved by stopping and then starting the instance, but with clients in different timezones viewing links, the server going down outside of our business hours is a real problem.
I recommend installing the CloudWatch Agent on the EC2 instance in order to collect more metrics and be able to analyze them further.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
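Once the agent is publishing data, you can pull the metrics it adds (memory, disk, swap) and line them up with the outage times. A small sketch assuming boto3, the agent's default CWAgent namespace and mem_used_percent metric, and a placeholder instance ID; the exact dimensions depend on how the agent is configured:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

# Maximum memory usage per 5-minute interval over the last 24 hours.
resp = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Maximum"], 1))
```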
I have a web service running on several EC2 instances. Based on the CloudWatch latency metric, I'd like to scale up additional instances. But given that it takes several minutes to spin up an EC2 instance from an AMI (with startup code to download the latest application JAR and apply OS patches), is there a way to have a "cold" server that could be turned on/off almost instantly?
Not by using Auto Scaling, at least not instantly in the way you describe. You could make it much faster, however, by building your own modified AMI with the JAR and the latest OS patches already baked in. These AMIs can be generated as part of your build pipeline. In that case, your only real wait time is for the OS and services to start, similar to a "cold" server.
Packer is a tool commonly used for such use cases.
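If you don't want to bring in Packer, the bare-bones version of the same bake step can be driven directly against the EC2 API at the end of a build job. A minimal boto3 sketch with a placeholder instance ID and image name:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Bake an AMI from an instance that already has the JAR and OS patches
# applied, e.g. as the last step of a CI job.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",          # placeholder build instance
    Name="webservice-build-42",                # placeholder image name
    Description="Pre-baked app image for the launch template",
    NoReboot=False,  # reboot for a consistent filesystem snapshot
)
print("New AMI:", image["ImageId"])
```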
Alternatively, you can manage it yourself by keeping servers switched off and starting them with custom Lambda functions triggered by CloudWatch alarms. But since stopped servers aren't exactly free either, I would recommend against that for cost reasons.
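For completeness, the "start stopped servers from an alarm" idea is only a few lines of Lambda code. A hedged sketch assuming boto3 and placeholder instance IDs; the trigger wiring (SNS/EventBridge from the CloudWatch alarm) is left out:

```python
import boto3

ec2 = boto3.client("ec2")

STANDBY_INSTANCE_IDS = ["i-0aaaaaaaaaaaaaaaa"]  # placeholder stopped instances

def lambda_handler(event, context):
    # Start the pre-provisioned, stopped instances when the latency alarm fires.
    ec2.start_instances(InstanceIds=STANDBY_INSTANCE_IDS)
    return {"started": STANDBY_INSTANCE_IDS}
```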
Before you venture into the journey of auto scaling your infrastructure and spending the time and effort, perhaps you should do a bit of analysis of your traffic pattern (day over day, week over week, and month over month) and see if it's even necessary. Try answering some of these questions.
What was the highest traffic your app has ever handled? How did the servers fare under that load? How was the user response time?
When does your traffic ramp up or hit its peak? Some apps get traffic during business hours, others in the evening.
What is your current throughput? For example, say you can handle 1k requests/min and two EC2 hosts are averaging 20% CPU. If requests triple to 3k requests/min, do you see around 60-70% average CPU? That is a good indication that your app's resource usage is fairly predictable and can scale linearly by adding more hosts (a back-of-the-envelope sketch of this arithmetic follows these questions). But if you've never seen a traffic burst like that, there's no point over-provisioning.
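Here is that back-of-the-envelope arithmetic written out, using the illustrative numbers above and assuming CPU grows roughly linearly with request rate:

```python
import math

# Illustrative values from the example above.
current_rpm = 1_000       # requests/min handled today
current_hosts = 2
current_cpu = 0.20        # average CPU per host at that load

target_rpm = 3_000        # projected peak
target_cpu = 0.70         # CPU level you are comfortable running at

# Requests one host can absorb before hitting the target CPU level,
# assuming CPU scales linearly with request rate.
rpm_per_host_at_target = (current_rpm / current_hosts) * (target_cpu / current_cpu)
hosts_needed = math.ceil(target_rpm / rpm_per_host_at_target)

print(f"~{rpm_per_host_at_target:.0f} req/min per host at {target_cpu:.0%} CPU")
print(f"hosts needed for {target_rpm} req/min: {hosts_needed}")
```

With these numbers the answer is that the existing two hosts land at about 60% CPU at 3k requests/min, which is exactly the sanity check in the question above.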
Unless you have a Zynga-like application that can see a large amount of traffic all at once, better understanding your traffic pattern and throwing in an additional host as insurance may be all you need. I'm making these assumptions because I don't know the nature of your business.
If you do want to auto scale anyway, one solution would be to containerize your application with Docker or create your own AMI, as others have suggested. It will still take a few minutes to boot new instances. Another option is to keep hosts on standby and add them to your load balancers using scripts (or Lambda functions) that watch metrics you define (I'm assuming your app is running behind load balancers).
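For the standby-host option, the piece those scripts or Lambda functions would automate is registering the host with the load balancer once it is up. A minimal sketch assuming an Application Load Balancer and boto3's elbv2 client; the target group ARN and instance ID are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")  # placeholder region

# Attach a freshly started standby host to the ALB target group so it
# begins receiving traffic once its health checks pass.
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                   "targetgroup/web/0123456789abcdef",   # placeholder ARN
    Targets=[{"Id": "i-0aaaaaaaaaaaaaaaa", "Port": 80}],  # placeholder instance
)
```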
Good luck.
I am seeing sudden spikes of high disk IOPS on my EC2 instances. They are all running a Django 1.9.6 web application. The software installed on them is Apache, Celery, the New Relic agent and the Django WSGI application itself.
The application does not do any disk operations as such. Data is stored on RDS and Redis (another server). The static files are stored on S3 and served through CloudFront. So I am unable to determine the cause of this high disk IOPS.
What happens is that a normal request suddenly takes forever to respond. Checking CloudWatch and New Relic, I see the RAM usage shoot up. Then the instance becomes unresponsive: all requests time out and I can't SSH in. When I contacted AWS Support, they said the VolumeQueueLength was increasing significantly, and once it came down (15-20 minutes later) the instance was working fine.
Any ideas as to what could be the issue?
As I understand PCF's four levels of high availability, when an instance (process) fails, monit should recognize it and restart it, and then just report it to BOSH. But if the whole VM goes down, it is BOSH's responsibility to recognize that and recreate it.
With this belief I answered a question at: https://djitz.com/guides/pivotal-cloud-foundry-pcf-certification-exam-review-questions-and-answers-set-4-logging-scaling-and-high-availability/
Question and answer
In my view, the answer to this question should be option 3, but it says I'm wrong and the answer should be option 2. Now I'm confused, so please tell me if my understanding is wrong.
BOSH is responsible for creating a new instance for a failed VM.
I know there is not much information available on the internet about this, but if you get the chance, there is a tutorial on Pluralsight you can enroll in, where the instructor explains high availability very well.
But you can get a high-level idea from the PCF documentation as well:
Process Monitoring: PCF uses a BOSH agent, monit, to monitor the processes on the component VMs that work together to keep your applications running, such as nsync, BBS, and Cell Rep. If monit detects a failure, it restarts the process and notifies the BOSH agent on the VM. The BOSH agent notifies the BOSH Health Monitor, which triggers responders through plugins such as email notifications or paging.
Resurrection for VMs: BOSH detects if a VM is present by listening for heartbeat messages that are sent from the BOSH agent every 60 seconds. The BOSH Health Monitor listens for those heartbeats. When the Health Monitor finds that a VM is not responding, it passes an alert to the Resurrector component. If the Resurrector is enabled, it sends the IaaS a request to create a new VM instance to replace the one that failed.