AWS Instance Scheduler schedule name unknown

The application I look after is only used during office hours, so to save costs I've tried to automate the overnight shutdown of servers using AWS Instance Scheduler. I've tagged a couple of EC2 instances and an RDS instance, setting ScheduledServices=Both. However, Instance Scheduler is only shutting down the EC2 instances, while the RDS instance continues to run. I've also tried creating a second Instance Scheduler CloudFormation stack with the same TagName and ScheduledServices=RDS, but this doesn't cause the RDS instance to be shut down either.
The schedule config is this:
Target all instances with the following tag key: ScheduleDowntime and value: Y.
Targeted instances will be started at 07:00:00 and stopped at 19:00:00 on the following days: Monday, Tuesday, Wednesday, Thursday, Friday.
Schedule timezone: (GMT +00:00) Europe/London
The Cloudwatch logs show the following message:
WARNING : Skipping instance xxxx in region eu-west-2 for account xxxxx, schedule name "Y" is unknown
I've Googled this message, but it appears that others were getting a similar message only when they attempted to use a stop time that is earlier than the start time, and that is not the case for me, as can be seen in the config above.
How can I resolve this issue?
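For what it's worth, Instance Scheduler treats the tag *value* as the name of a schedule defined in the solution's configuration, not as a yes/no flag, so a value of Y only works if a schedule literally named "Y" exists. A minimal sketch of that lookup (the names here are illustrative, not the solution's real internals):

```python
# Sketch of how Instance Scheduler resolves a tagged instance to a schedule.
# It reads the value of the configured tag key (TagName, e.g.
# "ScheduleDowntime") and looks it up as a *schedule name* in its
# configuration. "office-hours" below is a hypothetical schedule name.

def resolve_schedule(tag_value: str, configured_schedules: dict):
    """Return the schedule config for a tag value, or None if unknown."""
    schedule = configured_schedules.get(tag_value)
    if schedule is None:
        print(f'WARNING: schedule name "{tag_value}" is unknown')
    return schedule

# Schedules as defined in the scheduler's configuration (hypothetical):
schedules = {
    "office-hours": {"begintime": "07:00", "endtime": "19:00",
                     "weekdays": "mon-fri", "timezone": "Europe/London"},
}

resolve_schedule("Y", schedules)             # reproduces the warning
resolve_schedule("office-hours", schedules)  # tag value must name a schedule
```

So the likely fix is to define a schedule with your 07:00-19:00 weekday window and tag the instances with that schedule's name as the value, rather than Y.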

Related

GCP VM instances schedule is not starting the attached instance

Last Friday I updated the daily start/stop schedule for an instance (deleted the previous one and created a new one with different timing).
The instance itself was not changed. It's a preemptible e2-medium instance.
For some reason the schedule did not start the VM, and I don't see any logs from it either.
I did not change any permissions, but just to be sure I've confirmed that the Google APIs Service Agent still has the standard Editor role.
No other changes were made anywhere on this GCP project.
I've tried creating other schedules with cron expressions, different timezones, and different instances, and tried setting the initiation date. None of this worked.
The schedule's region is us-central, the instance's zone is us-central1-a.
I've tried waiting 15 minutes and more.
The problem was indeed caused by a missing permission. I had to grant the compute.instances.start permission to the right account:
service-<my-gcp-numeric-id>@compute-system.iam.gserviceaccount.com ← this one
<my-gcp-numeric-id>@cloudservices.gserviceaccount.com ← not this one
But what's interesting is:
Schedules created previously (a year ago) worked fine.
The above-mentioned account (service-<my-gcp-numeric-id>@...) is not displayed anywhere, even after I gave it permissions.
When I create a schedule on a brand-new project, it complains about that account missing the permission and doesn't let me attach instances; but in the original case there were no error messages.
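For anyone else hitting this: the two Google-managed accounts differ only slightly in name, and only the Compute Engine System service agent executes VM instance schedules, so that is the one that needs compute.instances.start. A small sketch (the project number is a placeholder) that constructs both addresses, to make the difference explicit:

```python
# Distinguishing the two similarly named Google-managed accounts.

def compute_system_agent(project_number: int) -> str:
    """The Compute Engine System service agent -- the account that runs
    VM instance schedules and needs compute.instances.start/stop."""
    return f"service-{project_number}@compute-system.iam.gserviceaccount.com"

def cloud_services_agent(project_number: int) -> str:
    """The Google APIs Service Agent -- similarly named, but NOT the one
    the schedule uses."""
    return f"{project_number}@cloudservices.gserviceaccount.com"

print(compute_system_agent(123456789012))
# service-123456789012@compute-system.iam.gserviceaccount.com
```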

AWS EC2 Instance Troubleshooting SSM Agent Ping

Hi all. We manage an EC2 instance (Windows) that hosts our on-premises Power BI Gateway. For the last few days, we have noticed that we're unable to RDP into the instance. When we check Fleet Manager, we see that the "SSM Agent Ping Status" is Connection Lost.
After an instance reboot, the issue resolves itself. This has happened daily for three days now. I have verified that the SSM Agent version is not outdated: the version currently installed on the instance is 3.1.1575, which per the releases doc (below) was released on June 6, 2022 (fairly recent). There is one more recent version, released July 14, 2022:
https://github.com/aws/amazon-ssm-agent/releases
I looked into the "Connection Lost" status and saw the following:
If an instance fails a health check, AWS OpsWorks Stacks autoheals registered Amazon EC2 instances and changes the status of registered on-premises instances to connection lost.
There's some confusion here: I don't recall registering the instance under AWS OpsWorks Stacks, unless this happens automatically.
Is there any way to troubleshoot and get to the bottom of this? We can temporarily reboot the instance, but I'd like to understand what's causing the issue... Thank you!
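One thing that may help narrow it down (the OpsWorks text only applies to instances actually registered with OpsWorks Stacks, so it is probably a red herring here): the ping status is derived from the agent's last check-in time (LastPingDateTime in `describe-instance-information`), so you can watch when the agent stops reporting and correlate that with the agent's own logs on the instance (%PROGRAMDATA%\Amazon\SSM\Logs on Windows). A rough sketch of the idea; the threshold here is illustrative, the real cutoff is internal to SSM:

```python
from datetime import datetime, timedelta, timezone

def ping_status(last_ping: datetime, now: datetime,
                threshold: timedelta = timedelta(minutes=10)) -> str:
    """Illustrative only: 'Connection Lost' means the agent has not
    checked in recently, not that the instance itself is unhealthy."""
    return "Online" if now - last_ping <= threshold else "Connection Lost"

now = datetime(2022, 7, 20, 12, 0, tzinfo=timezone.utc)
print(ping_status(now - timedelta(minutes=3), now))  # Online
print(ping_status(now - timedelta(hours=2), now))    # Connection Lost
```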

EC2 Instance Status Check Failed

I am currently running a process on an ec2 server that needs to run consistently in the background. I tried to login to the server and I continue to get a Network Error: Connection timed out prompt. When I check the instance, I get the following message:
Instance reachability check failed at February 22, 2020 at 11:15:00 PM UTC-5 (1 days, 13 hours and 34 minutes ago)
To troubleshoot, I have tried rebooting the server but that did not correct the problem. How do I correct this and also prevent it from happening again?
An instance status check failure indicates a problem with the instance, such as:
Failure to boot the operating system
Failure to mount volumes correctly
File system issues
Incompatible drivers
Kernel panic
Severe memory pressure
You can check the following for troubleshooting:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesStopping.html
For future reporting and auto-recovery, you can create a CloudWatch alarm.
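For the auto-recovery part, the alarm would watch StatusCheckFailed_System and use EC2's built-in recover action (recover only applies to *system* status check failures; the names and region below are examples, not your values). A sketch of the parameters you would pass to CloudWatch:

```python
# Hedged sketch: alarm parameters for auto-recovering an instance on a
# system status check failure. Pass the dict to boto3, e.g.
#   boto3.client("cloudwatch").put_metric_alarm(**params)

def recovery_alarm_params(instance_id: str, region: str = "us-east-1") -> dict:
    return {
        "AlarmName": f"autorecover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 recover action for this region:
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }
```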
For the second part:
There's nothing you can do to stop it from occurring, but for uptime and availability, yes: you can create another EC2 instance and add an ALB on top of both instances, which checks instance health, so that your users/customers/service remain available (served by the second instance) during the recovery time. You can increase the number of instances as much as you want for high availability (obviously this involves cost).
I've gone through the same problem, and once I looked at the EC2 dashboard I could see that something wasn't right with it. For me, though, rebooting and waiting 2-3 minutes solved it, and then I was able to SSH into the instance just fine.
If it becomes a recurring problem, I'll follow through with Jeremy Thompson's advice:
... put the EC2s in an Auto Scaling Group. The ALB does a health check, and if it fails, it will no longer route traffic to that EC2; then the ASG will send a status check and take the unresponsive server out of rotation.

CodeDeploy taking a long time

We trigger autoscaling on a TargetResponseTime threshold. Launching a new EC2 instance and its turning healthy takes close to 20 minutes. When we check CodeDeploy, we see two kinds of times: the Deployment history shows a start time of Aug 22, 2019 3:10 PM and an end time of Aug 22, 2019 3:28 PM, but drilling into that particular deployment, we see a duration of 2 minutes 21 seconds, from ApplicationStop to AfterAllowTraffic. Where is the rest of the time spent? Why does Deployment history show 18 minutes when the deployment itself takes 2 minutes 21 seconds?
How can we reduce this time?
Background: To launch EC2 instance by autoscaling we have a launch configuration that installs the codedeploy agent. The instance would be in Pending:Wait state in Auto Scaling Group Instances' lifecycle, with a hook CodeDeploy-managed-automatic-launch-deployment-hook-DGENSVPC1b-f51a955c-194e-4a51-ad9b-1489101325ba
autoscaling:EC2_INSTANCE_LAUNCHING,ABANDON,600
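That hook string is likely where the gap comes from: it encodes the lifecycle transition, the default result, and a heartbeat timeout in seconds. While the CodeDeploy-managed hook holds the instance in Pending:Wait, AMI boot, the user-data agent install, and health-check grace all happen, and the Deployment history clock keeps running, even though the lifecycle events themselves only take ~2 minutes. A sketch of what the spec means:

```python
# Parse an Auto Scaling lifecycle-hook spec of the form
#   <transition>,<default result>,<heartbeat timeout in seconds>

def parse_lifecycle_hook(spec: str) -> dict:
    transition, default_result, timeout = spec.split(",")
    return {
        "transition": transition,
        "default_result": default_result,     # ABANDON the launch on timeout
        "heartbeat_timeout_s": int(timeout),  # max wait before giving up
    }

hook = parse_lifecycle_hook("autoscaling:EC2_INSTANCE_LAUNCHING,ABANDON,600")
print(hook["heartbeat_timeout_s"] // 60, "minutes max in Pending:Wait")
# 10 minutes max in Pending:Wait
```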
Using an Amazon-provided AMI instead of a custom AMI helped reduce this time to ~5-6 minutes from the previous 20 minutes.
It's hard to say without more visibility into your system. The cause could range from tasks in your EC2 user data to your ELB health-check settings. Could you take another look at the different CodeDeploy lifecycle events and see where the time is aggregating? E.g., if you view the specific CodeDeploy action, you can "view events" to see a list of deployment lifecycle events and the time it took to complete each of them. After you find out what's taking the longest, you can begin to narrow down the root cause.

EC2 persistent instance retirement scheduled

Yesterday, on the Personal Health Dashboard in the AWS Console, I got this notification:
EC2 persistent instance retirement scheduled
It says that one of my EC2 instances is scheduled to retire on 13 March 2019. The status was 'upcoming', while the start and end times were both set to 14-Mar-2019.
The content of the notification starts with:
Hello,
EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-xxxxxxxxxx) associated with your AWS account (AWS Account ID: xxxxxxxxxx) in the xxxx region. Due to this degradation your instance could already be unreachable. We will stop your instance after 2019-03-13 00:00 UTC.
....
I've got yet another notification today for the same instance and with the same subject line but the status has been changed to 'ongoing' and the start time is 27-Feb-2019 while the end time is 14-Mar-2019.
I was planning to do a stop/start of the instance next week, but does the second notification mean I should do it ASAP?
Yes, it is better to do the stop/start ASAP; a stop/start migrates the instance to healthy hardware, whereas a reboot leaves it on the same degraded host. Even your message says:
Due to this degradation your instance could already be unreachable