AWS EC2 Instance starting time - amazon-web-services

Sometimes an instance takes more than 5 minutes to start. In this case, the Status Checks take more than 4 minutes.
How can I make the instance start in less than a minute, including the status checks?

You do not need to wait for the Instance Status Check to complete before using an Amazon EC2 instance.
Linux instances are frequently ready 60-90 seconds after launch. Windows instances take considerably longer because the AMI has been configured for sysprep, which involves a reboot.
New instances take longer to be ready than existing instances because they typically run code on first startup. So, if you Stop an instance and later Start it again, the instance will be available quite quickly (especially Linux instances).
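One way to act on this is to probe the service port you actually need (for example SSH on port 22) instead of waiting for the Status Checks. The following is a minimal sketch under that assumption; `PortProbe`, its method names, and the 1-second connect timeout are hypothetical choices for illustration, not an AWS API.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: polls a TCP port until it accepts connections or a deadline passes.
class PortProbe {

    /** Returns true if a single TCP connection to host:port succeeds within connectTimeoutMs. */
    static boolean isOpen(String host, int port, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, unresolved, or timed out
        }
    }

    /** Retries isOpen every pollMs until it succeeds or roughly maxWaitMs has elapsed. */
    static boolean waitForPort(String host, int port, long maxWaitMs, long pollMs) {
        long deadline = System.nanoTime() + maxWaitMs * 1_000_000L;
        do {
            if (isOpen(host, port, 1000)) {
                return true;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
        } while (System.nanoTime() < deadline);
        return false;
    }
}
```

A call such as `PortProbe.waitForPort(publicDnsName, 22, 120_000, 2_000)` would then return as soon as SSH answers, where `publicDnsName` stands for whatever address your launch tooling reports. An open port only shows that the daemon is listening; it says nothing about the health of the underlying host.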

I'm not sure that "You do not need to wait for the Instance Status Check to complete" is correct; if the status check fails for any reason, you (obviously) have a problem and should investigate before using the instance.
As a quick check, I used an AWS JDK script to create a "Nano" instance from a Linux image loaded with Ubuntu, Apache, Tomcat, Java, MySQL, etc. It took 45 seconds to reach "running" and 2 minutes 15 seconds to finish the Status Checks.
Starting an existing "stopped" instance ("Nano") took 18 seconds to reach "running" and 2 minutes 15 seconds to finish the Status Checks.

You can't change the instance status checks; they are managed by AWS. When a system status check fails, you can either wait for AWS to fix the issue or resolve it yourself by stopping and starting the instance, which in most cases migrates it to a new host computer.
The following are examples of problems that can cause system status checks to fail:
Loss of network connectivity
Loss of system power
Software issues on the physical host
Hardware issues on the physical host that impact network reachability
The instance is accessible once it has booted, which should not take 5 minutes. To see where the time goes, check the instance boot log or screen from EC2 --> Actions --> Instance settings, using `Get system log` and `Get instance screenshot`, and optimize the instance startup time from there.

Related

VMware Tanzu (former PCF) App Autoscaler force scale-down?

I am autoscaling my application based on the HTTP throughput.
My question is about scale-down. When throughput reaches the minimum threshold, the autoscaler tries to reduce the number of instances. But what happens if an instance that is being removed is still processing a previous HTTP request?
Does it wait until the processing completes, or does it forcibly reduce the instance count once the threshold is reached?
I have the same question. As far as I understand from the App Container Lifecycle documentation, it's up to your app to shut down gracefully, but that might not be possible in the given 10 seconds because some processes take longer.
Shutdown
CF requests a shutdown of your app instance in the following scenarios:
When a user runs cf scale, cf stop, cf push, cf delete, or cf restart-app-instance
As a result of a system event, such as the replacement procedure during Diego Cell evacuation or when an app instance stops because of a failed health check probe
To shut down the app, CF sends the app process in the container a SIGTERM. By default, the process has ten seconds to shut down gracefully. If the process has not exited after ten seconds, CF sends a SIGKILL.
By default, apps must finish their in-flight jobs within ten seconds of receiving the SIGTERM before CF terminates the app with a SIGKILL. For instance, a web app must finish processing existing requests and stop accepting new requests.
Note: One exception to the cases mentioned above is when monit restarts a crashed Diego Cell rep or Garden server. In this case, CF immediately stops the apps that are still running using SIGKILL.
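The shutdown contract quoted above (SIGTERM, then SIGKILL after ten seconds) can be handled in a JVM app with a shutdown hook, since the JVM runs its shutdown hooks when it receives SIGTERM. The following is a minimal sketch, not a Cloud Foundry API: `RequestDrainer` and its methods are hypothetical names, and the caller is assumed to wrap each request in `tryBegin()` / `end()`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: rejects new requests after shutdown starts and waits for in-flight ones.
class RequestDrainer {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicBoolean draining = new AtomicBoolean(false);

    /** Call when a request arrives; returns false once draining has begun. */
    boolean tryBegin() {
        if (draining.get()) {
            return false;
        }
        inFlight.incrementAndGet();
        if (draining.get()) {          // re-check: drain() may have started in between
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    /** Call when a request that was accepted by tryBegin() has finished. */
    void end() {
        inFlight.decrementAndGet();
    }

    /** Stops accepting requests and waits up to timeoutMs; returns true if all finished. */
    boolean drain(long timeoutMs) {
        draining.set(true);
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (inFlight.get() > 0) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Registers a JVM shutdown hook that drains when the process is asked to terminate. */
    void installShutdownHook(long timeoutMs) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> drain(timeoutMs)));
    }
}
```

With the default ten-second window, a timeout somewhat below that (for example `installShutdownHook(8_000)`) leaves a margin before the SIGKILL. Requests that need longer than the window still get cut off, which is the limitation raised in the answer above.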
"In this case it will wait till the processing completes or it
forcibly reduces the instance count when reached threshold."
Answer:
No, the App Autoscaler will not force anything. After the decision cycle, it prepares the instance to be scaled down (shut down); the intention is to avoid losing requests or data during this process.
Please take a look at the documentation below; it will help you better understand the App Autoscaler mechanism.
How App Autoscaler Determines When to Scale:
Every 35 seconds, App Autoscaler makes a decision about whether to
scale up, scale down, or keep the same number of instances.
To make a scaling decision, App Autoscaler averages the values of a
given metric for the most recent 120 seconds.
The following diagram provides an example of how App Autoscaler makes scaling decisions:
Reference:
VMware Tanzu App Autoscaler documentation
VMware Tanzu is the former Pivotal Cloud Foundry (PCF).
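The decision rule quoted above (average a metric over the most recent 120 seconds, then decide) can be sketched as follows. This is an illustration only: the `ScalingDecider` class, the threshold comparison, and the sample values are assumptions, and the actual App Autoscaler implementation is not described in the quoted text.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a windowed-average scaling decision.
class ScalingDecider {
    enum Decision { SCALE_UP, SCALE_DOWN, HOLD }

    private static final long WINDOW_MS = 120_000; // averaging window from the quoted documentation

    private static final class Sample {
        final long timeMs;
        final double value;
        Sample(long timeMs, double value) {
            this.timeMs = timeMs;
            this.value = value;
        }
    }

    private final double minThreshold;
    private final double maxThreshold;
    private final Deque<Sample> samples = new ArrayDeque<>();

    ScalingDecider(double minThreshold, double maxThreshold) {
        this.minThreshold = minThreshold;
        this.maxThreshold = maxThreshold;
    }

    /** Records one metric sample (for example HTTP requests per second); call in time order. */
    void record(long timeMs, double value) {
        samples.addLast(new Sample(timeMs, value));
    }

    /** Averages the samples from the last 120 seconds and compares against the thresholds. */
    Decision decide(long nowMs) {
        while (!samples.isEmpty() && samples.peekFirst().timeMs <= nowMs - WINDOW_MS) {
            samples.removeFirst(); // drop samples older than the window
        }
        if (samples.isEmpty()) {
            return Decision.HOLD;
        }
        double sum = 0;
        for (Sample s : samples) {
            sum += s.value;
        }
        double average = sum / samples.size();
        if (average > maxThreshold) {
            return Decision.SCALE_UP;
        }
        if (average < minThreshold) {
            return Decision.SCALE_DOWN;
        }
        return Decision.HOLD;
    }
}
```

In this model a caller would invoke `decide` on a 35-second timer. Because the decision uses a two-minute average, a brief dip below the minimum threshold does not immediately trigger a scale-down.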

AWS EC2: cannot get bare metal instance

I have tried several times in the last two weeks to log on to a c5.metal instance. Each time I get "Initializing" in the status checks field, but after 10 minutes it is still "Initializing" and I'm not able to log on. I have had success with c5.metal before, but not any more.
Today I also tried to get an m5.metal instance. This time the instance successfully initialized after 10 minutes but I was not able to log on with PuTTY. I stopped the instance, then after about 30 minutes I tried again and this time I did not get past "Initializing" in the status check field and I stopped it after 15 minutes.
I get billed for the 10 to 15 minute bare metal wait periods, even when initialization doesn't complete. I have no problems with AWS virtual instances.
Thanks for any ideas on what I can do to get the bare metal instances to work.
To reproduce your situation, I did the following:
Launched an Amazon EC2 instance in Ohio:
Instance Type: c5.metal
AMI: Ubuntu Server 18.04 LTS (HVM), SSD Volume Type
Network: In my Default VPC so that it uses a Public Subnet
Security Group: Default settings, which grants port 22 access from the Internet
Instance entered running state very quickly, Status Checks showed as Initializing
It took about 8 minutes until the status checks were showing 2/2 checks (it might have been faster, but I was testing other things in the meantime).
I was able to successfully login to the instance:
Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-1065-aws x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Sat Jun 6 23:21:18 UTC 2020
System load: 0.02 Processes: 924
Usage of /: 13.7% of 7.69GB Users logged in: 0
Memory usage: 0% IP address for enp125s0: 172.31.9.77
Swap usage: 0%
0 packages can be updated.
0 updates are security updates.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
ubuntu@ip-172-31-9-77:~$
(Actually, I first tried to login as ec2-user and it took me a while to realize this was an Ubuntu AMI, so I connected as ubuntu).
It is possible that the slow startup is due to the Operating System or hardware checking the 192GB of RAM that is allocated to the instance.
I booted another instance using an Amazon Linux 2 AMI and it required approximately 7 minutes before I could connect.
I also noticed that the c5.metal instances did not provide anything for "Get System Log" or "Get Instance Screenshot". This might be a result of using a bare-metal instance.
I joined John Rotenstein's twitch.tv channel and he showed how he got a c5.metal instance. What I learned is that if a metal instance does not work in the Availability Zone you chose, try launching a new instance in a different Availability Zone. For example, I had a c5.metal instance in us-east-2a. Following John's directions, I launched an instance in us-east-2c and after about 8 minutes the instance was ready for use.

Why is AWS EC2 CPU usage shooting up to 100% momentarily from IOWait?

I have a large web-based application running in AWS with numerous EC2 instances. Occasionally -- about twice or thrice per week -- I receive an alarm notification from my Sensu monitoring system notifying me that one of my instances has hit 100% CPU.
This is the notification:
CheckCPU TOTAL WARNING: total=100.0 user=0.0 nice=0.0 system=0.0 idle=25.0 iowait=100.0 irq=0.0 softirq=0.0 steal=0.0 guest=0.0
Host: my_host_name
Timestamp: 2016-09-28 13:38:57 +0000
Address: XX.XX.XX.XX
Check Name: check-cpu-usage
Command: /etc/sensu/plugins/check-cpu.rb -w 70 -c 90
Status: 1
Occurrences: 1
This seems to be a momentary occurrence and the CPU goes back down to normal levels within seconds. So it seems like something not to get too worried about. But I'm still curious why it is happening. Notice that the CPU is taken up with the 100% IOWaits.
FYI, Amazon's monitoring system doesn't notice this blip. See the images below showing the CPU and IO levels at 13:38.
Interestingly, AWS tells me that this instance will be retired soon. Might the two be related?
AWS is only displaying a 5 minute period, and it looks like your CPU check is set to send alarms after a single occurrence. If your CPU check's interval is less than 5 minutes, the AWS console may be rolling up the average to mask the actual CPU spike.
I'd recommend narrowing down the AWS monitoring console to a smaller period to see if you see the spike there.
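To see how averaging can hide a short spike, here is a small worked example. It assumes one CPU sample per second, a 5% baseline, and a 3-second burst at 100%; these numbers and the `SpikeAveraging` class are illustrative, not taken from the instance in the question.

```java
import java.util.Arrays;

// Hypothetical illustration of how a 5-minute average flattens a short spike.
class SpikeAveraging {

    /** Builds 300 one-second samples at baseline, with spikeSeconds samples set to spike. */
    static double[] fiveMinutesWithSpike(double baseline, double spike, int spikeSeconds) {
        double[] samples = new double[300];
        Arrays.fill(samples, baseline);
        // Place the burst in the middle of the window; spikeSeconds must be at most 200.
        for (int i = 0; i < spikeSeconds; i++) {
            samples[100 + i] = spike;
        }
        return samples;
    }

    /** Arithmetic mean, as a coarse monitoring period would report it. */
    static double average(double[] samples) {
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        return samples.length == 0 ? 0 : sum / samples.length;
    }

    /** Peak value, as a single-occurrence check would see it. */
    static double max(double[] samples) {
        double peak = Double.NEGATIVE_INFINITY;
        for (double s : samples) {
            peak = Math.max(peak, s);
        }
        return peak;
    }
}
```

With `fiveMinutesWithSpike(5.0, 100.0, 3)`, the peak is 100 but the average is (297 * 5 + 3 * 100) / 300 = 5.95, far below the warning threshold of 70 used by the check, so a 5-minute graph would show nothing unusual.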
I would add this as comment, but I have no reputation to do so.
I have noticed my EC2 instances doing this too, but for far longer, and after apt-get update + upgrade.
I thought it was an Apache thing, so I started using Nginx on a new instance to test, and it did the same: I ran apt-get a few hours ago, then came back to find the instance using full CPU, for hours! Good thing it is just a test machine, but I wonder what is wrong with Ubuntu/apt-get that might have caused this. From now on I guess I will have to reboot the machine after apt-get, as that seems to be the only way to get it back to normal.

How to determine that a jvm app does more GC than normal work?

We recently had a problem where our EC2 instances had 90-100 percent CPU load because of a bug in a library we include: it created too many objects instead of reusing them (which was easily solvable), so we spent too much time in GC.
Unfortunately the AWS health checks and instance status metrics didn't cause the overloaded instances to be stopped and new ones started, so after some time we hit the max autoscaling number and....died. Also, our own health checks inside the app, which are used for the ELB, are so simple that they answered often enough not to cause the instances to be terminated and restarted, which would have mitigated the problem for quite some time.
My idea now is to have our custom health check, which is already included in the ELB health checks, report a failure if we spend too much time in GC.
How would I do such a thing inside the app?
There are a number of JVM parameters that allow GC monitoring:
-Xloggc:<file> // logs GC activity to a file
-XX:+PrintGCDetails // tells you how different generations are impacted
You can either parse these logs yourself or use a specific tool such as GCViewer to analyse GC activity.
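Use GarbageCollectorMXBean:
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sum the accumulated collection time (ms) over all collectors
long gcTime = 0;
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
    gcTime += gcBean.getCollectionTime();
}
// Compare it with the time (ms) since the JVM started
long jvmUptime = ManagementFactory.getRuntimeMXBean().getUptime();
System.out.println("GC ratio: " + (100 * gcTime / jvmUptime) + "%");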
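The ratio above is measured since JVM start, so a long healthy period can dilute a recent GC storm. For the health-check use case in the question, one option is to measure GC time over the interval between checks instead. The following is a minimal sketch of that idea; `GcHealthCheck` and its threshold are hypothetical, and the reported collection time may include concurrent phases depending on the collector, so the threshold would need tuning per application.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical health check: unhealthy when GC time in the last interval exceeds a fraction.
class GcHealthCheck {
    private final double maxGcFraction;
    private long lastGcMs;
    private long lastUptimeMs;

    GcHealthCheck(double maxGcFraction) {
        this.maxGcFraction = maxGcFraction;
        this.lastGcMs = totalGcMillis();
        this.lastUptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
    }

    /** Accumulated GC time in ms across all collectors; skips collectors that report -1. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = bean.getCollectionTime();
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Fraction of the interval spent in GC; 0 when the interval is empty. */
    static double gcFraction(long gcDeltaMs, long uptimeDeltaMs) {
        if (uptimeDeltaMs <= 0) {
            return 0.0;
        }
        return (double) gcDeltaMs / uptimeDeltaMs;
    }

    /** Call from the health endpoint; compares GC time since the previous call to the limit. */
    synchronized boolean isHealthy() {
        long gcMs = totalGcMillis();
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        double fraction = gcFraction(gcMs - lastGcMs, uptimeMs - lastUptimeMs);
        lastGcMs = gcMs;
        lastUptimeMs = uptimeMs;
        return fraction <= maxGcFraction;
    }
}
```

A health endpoint could keep one `new GcHealthCheck(0.5)` instance and return a failure status whenever `isHealthy()` is false. Because the interval is defined by the calls themselves, the result depends on how often the load balancer polls; a fixed-rate background sampler would decouple the two.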
You can use VisualVM to monitor what happens inside the JVM, and you can monitor remote instances via JMX. You did not say which application container you are using (Apache Tomcat, GlassFish, etc.); in the case of Tomcat, you can set up a JMX connector like this.
Don't forget to adjust Security Groups in AWS to have the proper permission to access the JMX port.
The JVM flags PrintGCApplicationConcurrentTime and PrintGCApplicationStoppedTime will log how long the application was active or suspended. They're a bit of a misnomer, since they actually measure time spent in and out of safepoints, not just GCs.

several minute delay after ssh connects before AWS Status Checks report success

We're trying to reduce our AWS instance startup time. We're able to ssh to an instance about 90 seconds after it starts. But the Status Checks return "initializing" until the instance has been running for over four minutes. During that time I don't see the instance doing anything (top, vmstat, uptime all show the system basically idle).
Can anyone tell me how often the Status Check is run during instance start and what specifically it's testing? Thanks.
Status Checks are performed every 60 seconds. They run various network tests and also verify that the instance has started and is configured properly.
You can read more about that here:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html