I am running a process on an EC2 server that needs to run continuously in the background. When I try to log in to the server, I keep getting a "Network Error: Connection timed out" prompt. When I check the instance, I get the following message:
Instance reachability check failed at February 22, 2020 at 11:15:00 PM UTC-5 (1 days, 13 hours and 34 minutes ago)
To troubleshoot, I have tried rebooting the server but that did not correct the problem. How do I correct this and also prevent it from happening again?
An instance status check failure indicates a problem with the instance, such as:
Failure to boot the operating system
Failure to mount volumes correctly
File system issues
Incompatible drivers
Kernel panic
Severe memory pressures
See the following page for troubleshooting:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesStopping.html
For future reporting and automatic recovery, you can create a CloudWatch alarm.
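A minimal sketch of such an alarm, assuming Python with boto3. The alarm name, instance ID, and region are placeholders, and pairing the StatusCheckFailed_Instance metric with the EC2 reboot action is an assumption about what "auto recovery" should mean here; the recover action (arn:aws:automate:REGION:ec2:recover) is meant for StatusCheckFailed_System instead. The first function only builds the arguments, so it can be inspected without AWS credentials:

```python
def build_status_check_alarm(instance_id, region):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        # Hypothetical naming scheme; any unique alarm name works.
        "AlarmName": f"instance-status-check-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_Instance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # evaluate the metric once per minute
        "EvaluationPeriods": 3,   # assumed: act after 3 failing minutes in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Assumed action: reboot the instance when the alarm fires.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:reboot"],
    }


def create_alarm(instance_id, region):
    """Create the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_status_check_alarm(instance_id, region))
```

If a reboot does not clear the failure, as in the question, an SNS topic ARN could be listed in AlarmActions so the alarm at least notifies someone.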
For the second part of your question:
There is nothing you can do to stop it from occurring, but for uptime and availability you can create a second EC2 instance and put an ALB in front of both. The ALB checks the health of each instance, so your users, customers, or services remain available (served by the second instance) during recovery. You can add as many instances as you want for higher availability, though each one adds cost.
I've gone through the same problem, and looking at the EC2 dashboard I could see that something wasn't right with the instance. For me, rebooting and waiting 2-3 minutes solved it, and I was then able to SSH to the instance just fine.
If that becomes a recurrent problem, then I'll follow through with Jeremy Thompson's advice
... put the EC2s in an Auto Scaling Group. The ALB does a health check and, if it fails, will no longer route traffic to that EC2; then the ASG will run a status check and take the unresponsive server out of rotation.
Bit panicky here because I can't troubleshoot the error on a production site and it appears to be completely down.
GCP - Compute Engine VM - N1-standard on the US-West-3C zone running a Bitnami Multisite Wordpress deployment
About 2 hours ago my VM stopped responding (as far as I could tell with monitoring tools) and I was unable to SSH into it or connect in any way. I've experienced this occasionally in the past so my process was to grab a snapshot and restart the VM. I did manage to get the snapshot, however it stopped the VM by itself and I'm now stuck where I can't restart the VM.
The error I'm getting is:
Failed to start name-of-vm: A n1-standard-1 VM instance is currently unavailable in the us-west3-c zone. Alternatively, you can try your request again with a different VM hardware configuration or at a later time. For more information, see the troubleshooting documentation.
I tried changing my configuration (it used to be a custom VM) but that didn't do anything.
Searching for similar errors I've found threads about certain Zones running out of resources, but as far as I can tell this error doesn't specifically say 'run out of resources' and the status of the US-West-3C zone is fine. I can't imagine it would run out in a way where it can't even start a measly n1 vm.
Unfortunately due to some mismanagement this project isn't umbrella'd in our Google Workspace/Organization so I can't request technical support for it.
Any assistance or help pointing to some resources would be greatly appreciated.
"Currently unavailable" in a specific zone also means that the zone has run out of resources for that machine type.
You can try restoring the snapshot you created to a VM with a different machine type, such as an e2-standard or n2-standard configuration.
I have a t2.medium EC2 instance in my AWS account. Since yesterday my server has not been loading and I cannot SSH to the instance. When I look at the machine, it shows an instance status check failure. Stopping and starting the machine makes the problem go away, but how do I find the root cause of the issue?
Basically it's a hardware issue. Whenever you launch an instance, AWS continuously monitors the health of your EC2 instance. There can be multiple reasons for a status check failure:
Networking or startup configuration issues
Exhausted memory
File system issues
Failure to boot the operating system
Failure to mount volumes correctly
Incompatible drivers
CPU exhaustion
Whenever you stop and start the instance, AWS moves it to new hardware (behind the scenes it uses Xen virtualization), and the problem is solved.
To identify the root cause of the status check failure, you can retrieve the system log. To retrieve it:
Select the instance -> go to Actions -> Monitor and troubleshoot -> Get system log
For more information please check AWS Doc
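Once you have the system log text (copied from the console, or saved from the output of the AWS CLI's aws ec2 get-console-output command), a small script can flag suspicious lines. This is only a sketch in Python: the pattern list is hypothetical and far from exhaustive, and find_failures is an illustrative name, not part of any AWS tool.

```python
import re

# Hypothetical signature list; extend it with messages seen in your own logs.
FAILURE_PATTERNS = {
    "kernel panic": re.compile(r"kernel panic", re.IGNORECASE),
    "root filesystem not mounted": re.compile(r"unable to mount root fs", re.IGNORECASE),
    "out of memory": re.compile(r"out of memory", re.IGNORECASE),
    "filesystem errors": re.compile(r"EXT\d-fs error|I/O error", re.IGNORECASE),
}


def find_failures(log_text):
    """Return (line_number, label, line) for every log line matching a pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits
```

A line can match more than one pattern, in which case it is reported once per label.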
We have an EC2 instance which becomes unreachable randomly. It has only started recently, and seems to only happen outside of business hours.
We are finding that the instance's websites, WHM, SSH, and even a ping from a terminal are all unreachable. However, the instance is running and health checks are fine in the AWS console.
We used to have this with another instance but that just randomly stopped doing it at some point.
I have checked the CPU usage: over the last 2 weeks it has hit 100% 4 times, but those times do not coincide with when the instance goes down, and I'm not sure they're even related.
The instance has WHM/cPanel installed, has not reached disk usage limit, nor bandwidth usage limit. We have cPHulk Brute Force Protection installed and running so surely can't be brute force attack?
It is resolved by stopping and then starting the instance, but we have clients in different time zones viewing links, and the server goes down outside of business hours.
I recommend installing the CloudWatch agent on the EC2 instance so you can collect metrics and analyze them further.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
Newbie to Amazon Web Services here. I launched an instance from a Public AMI and found that I could not ssh into the instance - I received the error "Connection timed out." I checked the security groups to verify that Port 22 was associated with 0.0.0.0/0. Additionally, I checked the route tables to verify that 0.0.0.0/0 is associated with target gateway attached to the VPC.
I find that only 1/2 status checks have passed - the instance status check failed. I have tried stopping and starting the instance as well as terminating and launching a new instance, both to no avail. The error that I see in the system log is:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1).
From this previous question ("Ec2 1/2 checks passed"), it appears that this could be a virtualization issue, but I'm not sure if that was due to something I did on my end when launching the instance or something caused by the creators of the AMI.
Any help would be appreciated!
Can you share any more details about how you deployed the instance? Did you use the AWS Management Console, or one of the command line tools or SDKs to deploy it? Which public AMI did you use? Was it one of the ones provided by Amazon?
Depending on your needs, I would make sure that you use one of the AMIs provided by Amazon, such as Ubuntu, Amazon Linux, CentOS, etc. Here are the links to the docs on AMIs, but you can learn quite a bit by just searching for images. Since you mentioned virtualization types though, I'd suggest reading up briefly on the HVM vs. Paravirtual virtualization types on AWS. Each of the instance types / families uses a certain virtualization type, which is indicated in the chart on this page.
Instance Status Checks
This documentation page covers the instance status checks, which you'll probably want to familiarize yourself with. It's entirely possible that shutting down (not restart, but shutdown) and then starting the instance back up might resolve the instance status check.
Spot Instances - cost savings!
By the way, I'll just mention this since you indicated that you're new to AWS ... if you're just playing around right now, you can save a ton of cost by deploying EC2 Spot Instances, instead of paying the normal, on-demand rates. Depending on current rates, you can save more than 50%, and per-second billing still applies. Although there's the possibility that your EC2 instance could get "interrupted" based on market demand, you can configure your Spot Instance to just "Hibernate" or "Stop" instead of terminating and relaunching. That way, your instance state is saved for when it relaunches.
Hope this helps!
1) Use well-known images or contact the image developer. Perhaps it requires more than one drive or tricky partitioning.
2) Make sure you selected the proper HVM/PV image for the instance type.
3) (After the checks pass) Make sure the instance has a public IP.
I have a PHP application deployed to Amazon Elastic Beanstalk. But I've noticed that every time I push my code changes to Elastic Beanstalk via git aws.push, the deployed application doesn't pick up the changes. I checked the events log of my application's Beanstalk environment and noticed that every time Beanstalk issues:
Deploying new version to instance(s)
it's always followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own):
[i-d5xxxxx]
The same thing happens when I try to request snapshot logs. The Beanstalk issues:
requestEnvironmentInfo is starting
then after a few minutes it's again followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [i-d5xxxxx].
I had this problem a few times. It seems to affect only particular instances. So it can be solved by terminating the EC2 instance (done via the EC2 page on the Management Console). Thereafter, Elastic Beanstalk will detect that there are 0 healthy instances and automatically launch a new one.
If this is a production environment with only 1 instance and you want minimal downtime:
1) Configure minimum instances to 2, and Beanstalk will launch another instance for you.
2) Terminate the problematic instance via the EC2 tab; Beanstalk will launch another instance because the minimum is 2.
3) Configure minimum instances back to 1; Beanstalk will remove one of your two instances.
By default Elastic Beanstalk "throws a timeout exception" after 8 minutes (480 seconds defined in settings) if your commands did not complete in time.
You can set a higher timeout, up to 30 minutes (1800 seconds).
{
    "Namespace": "aws:elasticbeanstalk:command",
    "OptionName": "Timeout",
    "Value": "1800"
}
Read here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options.html
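One possible way to apply that option setting from a script, sketched in Python with boto3. The environment name is a placeholder and build_timeout_setting / apply_timeout are illustrative helper names; the builder does no upper-bound check because the allowed maximum may differ from the figure quoted above.

```python
def build_timeout_setting(seconds):
    """Build the OptionSettings list for the Beanstalk command timeout."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [
        {
            "Namespace": "aws:elasticbeanstalk:command",
            "OptionName": "Timeout",
            "Value": str(seconds),  # option values are passed as strings
        }
    ]


def apply_timeout(environment_name, seconds):
    """Update a running environment; needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("elasticbeanstalk")
    client.update_environment(
        EnvironmentName=environment_name,  # placeholder, e.g. "my-env"
        OptionSettings=build_timeout_setting(seconds),
    )
```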
I had the same issue here (single t1.micro instance).
I solved the problem by rebooting the EC2 instance via the EC2 page on the Management Console (and not from the EB page).
Beanstalk deployment (and other features like Get Logs) works by sending SQS commands to instances. An SQS client is deployed to the instances and checks the queue about every 20 seconds (see /var/log/cfn-hup.log):
2018-05-30 10:42:38,605 [DEBUG] Receiving messages for queue https://sqs.us-east-2.amazonaws.com/124386531466/93b60687a33e19...
If SQS Client crashes or has network problems on t1/t2 instances then it will not be able to receive commands from Beanstalk, and deployment would time out. Rebooting instance restarts SQS Client, and it can receive commands again.
An easier way to fix SQS Client is to restart cfn-hup service:
sudo service cfn-hup restart
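Before restarting, it may help to confirm that the client really has stopped polling. The sketch below, in Python, assumes the log line format shown above (a timestamp followed by "Receiving messages for queue") and treats a gap of more than 2 minutes as stale; both the threshold and the function names are illustrative choices, not anything defined by cfn-hup.

```python
from datetime import datetime, timedelta


def last_poll_time(log_lines):
    """Return the timestamp of the most recent queue poll, or None."""
    last = None
    for line in log_lines:
        if "Receiving messages for queue" not in line:
            continue
        try:
            # The first 23 characters look like "2018-05-30 10:42:38,605".
            last = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
    return last


def is_stale(log_lines, now, max_gap=timedelta(minutes=2)):
    """True if no poll was logged, or the last one is older than max_gap."""
    last = last_poll_time(log_lines)
    return last is None or now - last > max_gap
```

For example, is_stale(open("/var/log/cfn-hup.log"), datetime.now()) would report whether the last poll is recent, assuming the log timestamps use the same clock and time zone as the script.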
In the case of deployment, an alternative to shutting down the EC2 instances and waiting for Elastic Beanstalk to react, or messing about with minimum and maximum instances, is to simply perform a Rebuild environment on the target environment.
If a previous deployment failed due to timeout then the new version will still be registered against the environment, but due to the timeout it will not appear to be operational (in my experience the instance appears to still be running the old version).
Rebuilding the environment seems to reset things with the new version being used.
Obviously there's the downside with that of a period of downtime.
I think the correct way to deal with this is to figure out the cause of the timeout by doing what this answer suggests.
chongzixin's answer is what needs to be done if you need this fixed ASAP before investigating the reason for a timeout.
However, if you do need to increase timeout, see the following:
Add configuration files to your source code in a folder named .ebextensions and deploy it in your application source bundle.
Example:
option_settings:
  "aws:elasticbeanstalk:command":
    Timeout: 2400
*The Timeout value is the length of time before timeout, in seconds.
Reference: https://serverfault.com/a/747800/496353
"Restart App Server(s)" from the "Actions" menu in Elastic Beanstalk management dashboard followed by eb deploy fixes it for me.
After two days of checking random issues, I restarted both EC2 instances one after another to make sure there is no downtime. Site worked fine but after a while, website started throwing error 504.
When I checked the HTTP server, nginx was down and an "Out of HDD space" error was thrown. I increased the HDD size, Elastic Beanstalk created new instances, and the issue was fixed.
For me, the problem was my VPC security group rules. According to the docs, you need to allow outbound traffic on port 123 for NTP to work. I had the port closed, so the clock was drifting, and the EC2 instances were becoming unresponsive to commands from the Elastic Beanstalk environment: taking forever to deploy (only to time out), failing to get logs, etc.
Thank you @Logan Pickup for the hint in your comment.
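For anyone scripting the NTP fix described above, a rough sketch in Python with boto3 follows. NTP runs over UDP; the wide-open CIDR range, the rule description, and the helper names are assumptions for illustration, and the security group ID is a placeholder you would supply.

```python
def build_ntp_egress_rule(cidr="0.0.0.0/0"):
    """Build an IpPermissions list allowing outbound NTP (UDP port 123)."""
    return [
        {
            "IpProtocol": "udp",
            "FromPort": 123,
            "ToPort": 123,
            # Assumed destination range; narrow it if your NTP servers are known.
            "IpRanges": [{"CidrIp": cidr, "Description": "NTP clock sync"}],
        }
    ]


def allow_ntp(group_id):
    """Add the egress rule to a security group; needs boto3 and credentials."""
    import boto3  # imported here so the builder above works without boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_egress(
        GroupId=group_id,  # placeholder security group ID
        IpPermissions=build_ntp_egress_rule(),
    )
```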