AWS EC2 Instance Troubleshooting SSM Agent Ping

AWS EC2 Instance Troubleshooting SSM Agent Ping - amazon-web-services

all. We manage an EC2 instance (windows) that hosts our on-premises Power BI Gateway. The last few days, we have noticed that we're unable to rdp into the instance. When we check Fleet Manager, we see that "SSM Agent Ping Status" is at Connection Lost:
After an instance reboot, the issue resolves itself. This has happened daily for three days now. I have verified that the ssm agent version is not dated. The current version installed on instance is 3.1.1575, which per the releases doc (below) was released on June 6, 2022 (seems fairly recent). There is one version more recent released 7-14-2022:
https://github.com/aws/amazon-ssm-agent/releases
I looked into the "Connection Lost" status and saw the following:
If an instance fails a health check, AWS OpsWorks Stacks autoheals registered Amazon EC2 instances and changes the status of registered on-premises instances to connection lost. There's some confusion here, I don't recall registering the instance under AWS OpsWorks Stacks -- unless, this is automatic.
Is there anyway to troubleshoot and get to the bottom of this? We can temporarily reboot instance, but I'd like to understand what's causing the issue... Thank you!

Related

AWS EC2 keeps shutting down automatically

I am trying to create an API that runs on AWS EC2 t2.micro. The problem I'm having is that my instance keeps shut down automatically every ~3 hours, which could be because of the "session time" of my AWS Educate account (screenshot attached)
Is there any way to keep my instance running constantly (for days and even months)?
I am using "tmux", which does seem to keep my API and the EC2 instance running even after my ssh connection is terminated, but the EC2 instance itself still shuts down automatically.
EDIT: If it is not possible to keep an EC2 instance of an AWS Educate account running constantly. Is there a way to start a new session automatically when the old instance's session has expired?
(Maybe using a script/using some tools offered by AWS? I'm new to AWS so I don't know if this is possible)

Sadly you can't change that. It is explicitly stated in AWS educate docs:
When your session ends, your resources will be “stopped.” You will be required to re-start your resources when you start a new session.

Instead of Using AWS Educate you can Create Regular AWS account which provides some services for free for one year. It includes the EC2 instance as well so you don't have to pay anything and you can run for months and year it will never gone down until you manually stop it.

ec2 Instance Status Check Failed

I am currently running a process on an ec2 server that needs to run consistently in the background. I tried to login to the server and I continue to get a Network Error: Connection timed out prompt. When I check the instance, I get the following message:
Instance reachability check failed at February 22, 2020 at 11:15:00 PM UTC-5 (1 days, 13 hours and 34 minutes ago)
To troubleshoot, I have tried rebooting the server but that did not correct the problem. How do I correct this and also prevent it from happening again?

An instance status check failure indicates a problem with the
instance, such as:
Failure to boot the operating system
Failure to mount volumes correctly
File system issues
Incompatible drivers
Kernel panic
Severe memory pressures
You can check following for troubleshooting
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesStopping.html
For future reprting and auto recovery you can create a CloudWatch
Alarm
For second part
Nothing you can do to stop its occurrence, but for up-time and availability YES you can create another EC2 and add ALB on the top of both instances which checks the health of instance, so that your users/customers/service might be available during recovery time (from second instance). You can increase number of instances as more as you want for high availability (obviously it involves cost)

I've gone through the same problem
and then once looking at the EC2 dashboard could see that something wasn't right with it
but for me rebooting
and waiting for a 2-3 minutes solved it and then was able to SSH to the instance just fine
If that becomes a recurrent problem, then I'll follow through with Jeremy Thompson's advice
... put the EC2's in an Auto Scaling Group. The ALB does a health check and it fails will no longer route traffic to that EC2, then the ASG will send a Status check and take the unresponding server out of rotation.

EC2 persistent instance retirement scheduled

On the personal health dashboard in the AWS Console, I've got this notification
EC2 persistent instance retirement scheduled
yesterday which says that one of my ec2 instances is scheduled to retire on 13th March 2019. The status was 'upcoming' while the start and end times both were set to 14-Mar-2019.
The content of the notification starts with:
Hello,
EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-xxxxxxxxxx) associated with your AWS account (AWS Account ID: xxxxxxxxxx) in the xxxx region. Due to this degradation your instance could already be unreachable. We will stop your instance after 2019-03-13 00:00 UTC.
....
I've got yet another notification today for the same instance and with the same subject line but the status has been changed to 'ongoing' and the start time is 27-Feb-2019 while the end time is 14-Mar-2019.
I was planning to do a start-stop of the instance next week but does the second notification tell me to do is ASAP?

Yes, it is better to do stop/start ASAP. Even in your message it says:
Due to this degradation your instance could already be unreachable

Amazon EC2 instance passed 1/2 checks

Newbie to Amazon Web Services here. I launched an instance from a Public AMI and found that I could not ssh into the instance - I received the error "Connection timed out." I checked the security groups to verify that Port 22 was associated with 0.0.0.0/0. Additionally, I checked the route tables to verify that 0.0.0.0/0 is associated with target gateway attached to the VPC.
I find that only 1/2 status checks have passed - the instance status check failed. I have tried stopping and starting the instance as well as terminated and launching a new instance, both to no avail. The error that I see in the system log is:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1).
From this previous question, it appears that this could be a virtualization issue, but I'm not sure if that was due to something I did on my end when launching the instance or something that occurred from the creators of the AMI? Ec2 1/2 checks passed
Any help would be appreciated!

Can you share any more details about how you deployed the instance? Did you use the AWS Management Console, or one of the command line tools or SDKs to deploy it? Which public AMI did you use? Was it one of the ones provided by Amazon?
Depending on your needs, I would make sure that you use one of the AMIs provided by Amazon, such as Ubuntu, Amazon Linux, CentOS, etc. Here's the links to the docs on AMIs, but you can learn quite a bit by just searching for images. Since you mentioned virtualization types though, I'd suggest reading up briefly on the HVM vs. Paravirtual virtualization types on AWS. Each of the instance types / families uses a certain virtualization type, which is indicated in the chart on this page.
Instance Status Checks
This documentation page covers the instance status checks, which you'll probably want to familiarize yourself with. It's entirely possible that shutting down (not restart, but shutdown) and then starting the instance back up might resolve the instance status check.
Spot Instances - cost savings!
By the way, I'll just mention this since you indicated that you're new to AWS ... if you're just playing around right now, you can save a ton of cost by deploying EC2 Spot Instances, instead of paying the normal, on-demand rates. Depending on current rates, you can save more than 50%, and per-second billing still applies. Although there's the possibility that your EC2 instance could get "interrupted" based on market demand, you can configure your Spot Instance to just "Hibernate" or "Stop" instead of terminating and relaunching. That way, your work is instance state is saved for when it relaunches.
Hope this helps!

1) Use well-known images or contact with the image developer. Perhaps it requires more than one drive or tricky partitioning.
2) make sure you selected proper HVM/PV image according to the instance type.
3) (after checks are passed) make sure the instance has public ip

AWS: None of the Instances are sending data

I'm trying to set up an Elastic Beanstalk application with Amazon Web Services however I'm receiving a load of errors with the message None of the instances are sending data. I've tried deleting the Elastic Beanstalk Application and the EC2 instance several times with the sample application and trying again but I get the same error.
I also tried uploading a flask application with AWS Elastic Beanstalk command line tools but then I received the error below:
Environment health has transitioned from Pending to Severe. 100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago). ELB health is failing or not available for all instances. None of the instances are sending data
Why do I get this error and how do I fix it? Thanks.

You are using Enhanced Health Monitoring.
With enhanced health monitoring an agent installed on your EC2 instance monitors vital system and application level health metrics and sends them directly to Elastic Beanstalk.
When you see an error message like "None of the instances are sending data", it means either the agent on the instance has crashed or it is unable to post data to Elastic Beanstalk due to networking error or some other error.
For debugging this, I would recommend downloading "Full logs" from the AWS console. You can follow the instructions for getting logs in the section "Downloading Bundle Logs from Elastic Beanstalk Console" here.
If you are unable to download logs using the console for any reason you can also ssh to the instance and look at the logs in /var/log.
You will find logs for the health agent in /var/log/healthd/daemon.log.
Additional logs useful for this situation are /var/log/cfn-init.log, /var/log/eb-cfn-init.log and /var/log/eb-activity.log. Can you look at the logs and give more details of the errors you see?
This should hopefully give you more details regarding the error "None of the instances are sending data".
Regarding other health "causes" you are seeing:
Environment health has transitioned from Pending to Severe - This is because initially your environment health status is Pending. If the instances do not go healthy within grace period health status transitions to Severe. In your case since none of the instances is healthy / sending data, the health transitioned to Severe.
100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago).
Elastic Beanstalk monitors other resources in addition to your EC2 instances when using enhanced health monitoring. For example, it monitors cloudwatch metrics for your ELB. This error means that all requests sent to your environment CNAME/load balancer are failing with HTTP 5xx errors. At the same time the request rate is very low only 0.5 requests per minute, so this indicates that even though all requests are failing, the request rate is pretty low. "7 minutes ago" means that information about ELB metrics is slightly old. Because Elastic Beanstalk monitors cloudwatch metrics every few minutes, so the data can be slightly stale. This is as opposed to health data we get directly from the EC2 instances which is "near real time". In your case since the instances are not sending data the only available source for health is ELB metrics which is delayed by about 7 minutes.
ELB health is failing or not available for all instances
Elastic Beanstalk is looking at the health of your ELB, i.e. it is checking how many instances are in service behind ELB. In your case either all instances behind ELB are out of service or the health is not available for some other reason. You should double check that your service role is correctly configured. You can read how to configure service role correctly here or in the documentation. It is possible that your application failed to start.
In your case I would suggest focusing on the first error "None of the instances are sending data". For this you need to look at the logs as outlined above. Let me know what you see in the logs. The agent is started fairly early in the bootstrap process on the instance. So if you see an error like "None of the instances are sending data", it is very likely that bootstrap failed or the agent failed to start for some reason. The logs should tell you more.
Also make sure you are using an instance profile with your environment. Instance profile allows the health agent running on your EC2 instance to authenticate with Elastic Beanstalk. If instance profile is not associated with your environment then the agent will not be able to send data to Elastic Beanstalk. Read more about Instance Profiles with Elastic Beanstalk here.
Update
One common reason for the health cause "None of the instances are sending data" can be that your instance is in a VPC and your VPC does not allow NTP access. Typical indicator of this problem is the following message in /var/log/messages: ntpdate: Synchronizing with time server: [FAILED]. When this happens the clock on your EC2 instance can get out of sync and the data is considered invalid. You should also see a health cause on the instances on the health page on the AWS web console that tells you that instance clock is out-of-sync. The fix is to make sure that your VPC allows access to NTP.

There can be many reasons why the health agent is not able to send any data, so this may not be the answer to your problem, but it was to mine and hopefully can help somebody else:
I got the same error and looking into /var/log/healthd/daemon.log the following was repeatedly reported:
sending message(s) failed: (Aws::Healthd::Errors::GroupNotFoundException) Group 97c30ca2-5eb5-40af-8f9a-eb3074622172 does not exist
This was caused by me making and using an AMI image from an EC2 instance inside an Elastic Beanstalk environment. That is, I created a temporary environment with one instance the same configuration as my production environment, went into the EC2 console and created an image of the instance, terminated the temporary environment, and then created yet another environment using the new custom AMI.
Of course (in hindsight) this meant some settings of the temporary environment were still being used. In this case specifically /etc/healthd/config.yaml, resulting in the health agent trying to send messages to a no longer existing health group.
To fix this and make sure there was no other stale configuration around, I instead started a new EC2 instance by hand from the default AMI used in the production environment (find it under the 'Instances' configuration page of your environment), provision that, then create a new image from that and use that image in my new EB environment.

Check if your instance type's RAM is enough for app + os + amazon tooling. We suffered from this for a long time, when we discovered that t2.micro is barely enough for our use cases. The problem went away right after using t2.small (2GB).

I solved this by adding another security group (the default one for my Elastic Beanstalk).

It appears my problem was that I didn't associate a public ip address to my instance... after I set it it worked just fine.

I was running an app in elastic beanstalk environment with docker as platform. I got the same error that none of the instances are sending. And I was unable fetch logs as well.
Rebuilding the environment worked for me.

I just set the Path on load balancing to a URL that response with status code 200, for this only to study environment.
For my real app, I use actuator

If you see something like this where you don't get any enhanced metrics, check that you haven't accidentally removed the conf.d/elasticbeanstalk/healthd.conf include from your nginx config. This conf adds an machine-read log format that is responsible for reporting that data in EB (see Enhanced health log format - AWS).

My instance profile's IAM Role was lacking elasticbeanstalk:PutInstanceStatistics permission.
I found this by looking at /var/log/healthd/daemon.log as suggested in one of the other answers.
I had to SSH into the machine directly to discover this, as the Get Logs function itself was failing due to missing S3 Write permissions.

If you're running a Worker Tier EB, need to add this policy:
arn:aws:iam::aws:policy/AWSElasticBeanstalkWorkerTier

For anyone arriving here in 2022…
After launching a new environment that was identical to a current healthy environment and seeing no data, I raised an AWS Support ticket. I was informed:
Here, I would like to inform you that recently Elastic Beanstalk introduced new feature called EnhancedHealthAuthEnabled to increase security of your environment and help prevent health data spoofing on your behalf and this option will be enabled by default when you create new environment.
If you use managed policies for your instance profile, this feature is available for your new environment without any further configuration as Elastic Beanstalk instance profile managed policies contain permissions for the elasticbeanstalk:PutInstanceStatistics action. However, If you use a custom instance profile instead of a managed policy, your environment might display a No Data health status. This happens because custom instance profile doesn't PutInstanceStatistics permission by default and instances aren't authorised for the action that communicates enhanced health data to the service. Hence, your environment health shows Unknown/No data status.
The policy that I needed to attach to my existing EC2 role (as advised by AWS Support) looked like:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ElasticBeanstalkHealthAccess",
"Action": [
"elasticbeanstalk:PutInstanceStatistics"
],
"Effect": "Allow",
"Resource": [
"arn:aws:elasticbeanstalk:*:*:application/*",
"arn:aws:elasticbeanstalk:*:*:environment/*"
]
}
]
}
Adding this policy to my EC2 role solved the issue for me.

In My case when i increased my ram or instance type(t2.micro to c5.xlarge) it had resolved.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js