Amazon AWS micro instance with 100% CPU and unresponsive - amazon-web-services

I've been having a problem with my AWS EC2 Ubuntu instances: they always reach 100% CPU utilization after a certain amount of time (around 8 hours) and stay there until I restart them.
The instance runs Ubuntu Server 13.04 with a basic LAMP stack, that's all.
I have a cron job that pings every couple of minutes to keep a VPN tunnel up, but that shouldn't be causing this.
When it's at 100% CPU utilization I can't ping it, SSH into it, or browse it, but it doesn't reject the connection; it just keeps "trying".
Any idea what the reason behind this is? I'm guessing it has something to do with Amazon throttling the instance, but it's weird that it sits at 100% CPU for over 8 hours.
This is the CPU log of the instance; every other indicator seems normal.
I can't attach images here, so I'm posting a link:
100% cpu utilization
EDIT
This has happened to me before with other instances, and right now I have an Amazon Linux AMI instance that has been running at 100% for 4 days straight, and that one only has Tomcat on it, with no apps deployed. I just realized it's unresponsive; I'm terminating it.

Author's note, 2019: this post was originally written in 2013, and is about the t1.micro instance type. The current EC2 free tier now allows you to choose either the t1.micro or t2.micro instance class. Unlike the t1.micro's intermittent hard-clamping behavior, the t2.micro runs continuously at full capacity until your CPU credit balance nears depletion, and degrades much more gracefully.
This is the expected behavior. See t1.micro Instances in the EC2 User Guide for Linux Instances.
Note the graphs that say "CPU level limited." I have measured this: if you consume 100% CPU on a micro instance for more than about 15 seconds, the throttling kicks in and your available cycles drop from 2 ECU to approximately 0.2 ECU (roughly 200 MHz) for the next 2-3 minutes, at which point the cycle repeats and you'll be throttled again within a few seconds if you are still pulling hard on the processor.
During throttle time, you only get ~1/10th of the cycles you get at peak performance, because the hypervisor "steals" the rest¹... so you are still going to see that you were using a solid 100%... because you were using all that was available. It doesn't take much to pin a micro to the ceiling. Or floor... so either you are asking too much of the instance class or you have something unexpectedly maxing out your CPU.
Establish an SSH connection while the machine is responsive, start top running, and then stay connected, so that when it starts to slow down you already have the tool going that you need to find out what the CPU hog is.
¹ hypervisor steals the rest: A common misconception at one time was that the time stolen from EC2 instances by the hypervisor (visible in top and similar utilities) was caused by "noisy neighbors" -- other instances on the same hardware competing for CPU cycles. That is not the cause of stolen cycles. For some older instance families, like the m1, stolen cycles would be seen if AWS had provisioned your instance on a host machine with faster processors than those specified for the instance class; the cycles were stolen so that the instance's performance matched what you were paying for, rather than the performance of the actual underlying hardware. EC2 instances don't share the physical resources underlying your virtualized CPU with other instances.
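If you'd rather watch the steal percentage from a script than eyeball top, here is a minimal sketch in Python that samples the aggregate counters in /proc/stat (the same source top uses for its st column). Nothing here is AWS-specific; the 5-second interval is an arbitrary choice.

    # steal_watch.py -- print the percentage of CPU time stolen by the
    # hypervisor, similar to the "st" column in top. Run it on the instance
    # while it is still responsive and leave it running.
    import time

    def read_cpu_times():
        """Return the aggregate CPU counters from the first line of /proc/stat."""
        with open("/proc/stat") as f:
            fields = f.readline().split()
        # fields[0] is the literal "cpu"; the rest are jiffies in this order:
        # user nice system idle iowait irq softirq steal guest guest_nice
        return [int(v) for v in fields[1:]]

    prev = read_cpu_times()
    while True:
        time.sleep(5)
        curr = read_cpu_times()
        deltas = [c - p for c, p in zip(curr, prev)]
        total = sum(deltas) or 1
        steal_pct = 100.0 * deltas[7] / total   # index 7 = "steal"
        idle_pct = 100.0 * deltas[3] / total    # index 3 = "idle"
        print(f"steal: {steal_pct:5.1f}%   idle: {idle_pct:5.1f}%")
        prev = curr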

Run top and see how high st (or steal) is. If st is at 97%, then you are being throttled and only have 3% of your CPU to work with. You don't need to be doing anything CPU-intensive for that to be slow!
If that is the case and you cannot reduce how much CPU you require, the only fix is to upgrade to a small instance. Small instances are not throttled as heavily.
http://theon.github.io/you-may-want-to-drop-that-ec2-micro-instance.html

Related

AWS load balancer latency stays high (80 s) forever after a CPU spike event

I recently set up an AWS Elastic Beanstalk load-balanced t2.medium environment. I noticed a CPU spike event (lasting a few seconds), followed by latency that stays high at around 80 s forever and never recovers on its own until a manual reboot (see chart).
The new environment is a clone of our old instance, which worked fine for years. The CPU spike might be caused by a batch job, but why does latency stay at around 80 s after CPU usage recovers? This has happened twice in 2 weeks.
The old instance (running the same code) never behaved like this. Can anyone give some hints?
I checked the logs and found no other suspicious events.
I expect the server to work as normal when the CPU recovers, not hang. The old t2.small instance runs the same jobs, and I never observed such high latency in any event.

Cloud Memorystore Redis high CPU utilisation

We are using a Cloud Memorystore Redis instance to add a caching layer to our mission-critical Internet-facing application. The total number of calls (including get, set and key expiry operations) to the Memorystore instance is around 10-15K per second. CPU utilisation has been consistently around 75-80%, and we expect the utilisation to go even higher.
Currently, we are using M4 capacity tier under Standard service tier.
https://cloud.google.com/memorystore/docs/redis/pricing
We need some clarity on the following points.
How many CPU cores does the M4 capacity tier correspond to?
Is it really alarming to have more than 100% CPU utilisation? Do we expect any noticeable performance issues?
What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)? Will switching to the M5 capacity tier address the high CPU consumption and the corresponding issues?
Our application is really CPU intensive and we don't see any way to further optimize our application. Looking forward to some helpful references.
Addressing your questions.
1. How many CPU cores does the M4 capacity tier correspond to?
Cloud Memorystore for Redis is a Google-managed service, which means that Google keeps the inner details (resources) of the virtual machine that runs the Redis service to itself. Still, it is expected that the higher the capacity tier, the more resources (CPU) the virtual machine will have. In your case in particular, adding CPUs will not solve issues around CPU usage, because the Redis service itself is single-threaded.
As you can see from the previous link:
To maximize CPU usage you can start multiple instances of Redis.
If you want to use multiple CPUs you can start thinking of some way to shard earlier.
2. Is it really alarming to have more than 100% CPU utilisation?
Yes, it is alarming to have high CPU utilization because it can result in connection errors or high latency.
CPU utilization is important, but so is whether the Redis instance is efficient enough to sustain your throughput at a given latency. You can check the Redis latency with the command redis-cli --latency while CPU % is high.
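redis-cli --latency is the quickest check; if you want the same measurement from application code, a rough equivalent using the redis-py client might look like the sketch below (the host address is a placeholder for your Memorystore endpoint):

    # Rough equivalent of `redis-cli --latency`: time a burst of PING round
    # trips and report min/avg/max. The host is a placeholder endpoint.
    import time
    import redis  # pip install redis

    r = redis.Redis(host="10.0.0.5", port=6379)

    samples = []
    for _ in range(1000):
        start = time.perf_counter()
        r.ping()
        samples.append((time.perf_counter() - start) * 1000)  # milliseconds

    print(f"min {min(samples):.2f} ms  "
          f"avg {sum(samples) / len(samples):.2f} ms  "
          f"max {max(samples):.2f} ms")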
3. Do we expect any noticeable performance issues?
This is really hard to say or predict because it depends on several factors (client services, commands run within a time frame, workload). Some of the most common causes of high latency and performance issues are:
Client VMs or services are overloaded and not consuming the messages from Redis: when a client opens a TCP connection to Redis, the Redis server keeps a buffer of messages to send on that connection. If a client service has its CPU maxed out, leaving the kernel no time to receive messages from Redis, the buffers fill up on the Redis server.
The commands executed are consuming a lot of CPU: the following commands are known to be potentially very expensive to process (see the sketch after this list for a cheaper alternative to KEYS):
EVAL/EVALSHA
KEYS
LRANGE
ZRANGE/ZREVRANGE
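As an illustration of the KEYS point above: KEYS walks the entire keyspace in one blocking call on the single Redis thread, while SCAN does the same job in small, interruptible batches. A sketch with redis-py, where the key pattern and TTL are made-up placeholders:

    # Avoid the blocking KEYS command by iterating with SCAN instead.
    # The "session:*" pattern and the TTL are illustrative placeholders.
    import redis  # pip install redis

    r = redis.Redis(host="10.0.0.5", port=6379)

    # Expensive: one blocking call that pins the single Redis thread
    # until the whole keyspace has been scanned.
    # keys = r.keys("session:*")

    # Cheaper: SCAN visits the keyspace in small batches, so other clients
    # can still be served between iterations.
    for key in r.scan_iter(match="session:*", count=500):
        r.expire(key, 3600)  # e.g. ensure every session key has a TTL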
4. What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)?
This question revolves mainly around the scaling design of your implementation. Since Redis is single-threaded, a better approach to reduce CPU % is to shard your data across multiple Redis instances and put a proxy in front of them to distribute the load. Please take a look at the graph under the Twemproxy section of this link.
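A proxy such as Twemproxy does the hashing for you; the idea itself is just consistent routing of each key to one shard. The following is only a client-side illustration of that routing (the shard addresses are placeholders), not a production setup:

    # Minimal illustration of key-hash sharding across several Redis
    # instances. The endpoint addresses are placeholders; Twemproxy performs
    # the same routing transparently in front of the shards.
    import binascii
    import redis  # pip install redis

    SHARDS = [
        redis.Redis(host="10.0.0.11", port=6379),
        redis.Redis(host="10.0.0.12", port=6379),
        redis.Redis(host="10.0.0.13", port=6379),
    ]

    def shard_for(key: str) -> redis.Redis:
        """Route a key to one shard using a stable hash of the key."""
        return SHARDS[binascii.crc32(key.encode()) % len(SHARDS)]

    # Each key always lands on the same shard, so reads find what writes stored.
    shard_for("user:42").set("user:42", "cached-profile-json")
    value = shard_for("user:42").get("user:42")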
5. Will switching to the M5 capacity tier address the high CPU consumption and the corresponding issues?
Switching to a higher capacity tier should help with the latency temporarily, but this is vertical scaling, which is limited to the tiers that Cloud Memorystore offers.
Redis Enterprise solves all the issues you are facing. Redis Enterprise can run in a clustered configuration and utilize all the resources of the machine, as well as scale out over multiple machines.
The Redis Enterprise Software is responsible for watching over the CPU utilization and other resource management tasks so you do not need to.
It is offered on GCP and the GCP Marketplace as well.
https://redis.com/redis-enterprise-cloud/pricing/

Why is my EC2 instance not accessible to others?

I successfully deployed a machine learning classification model on an AWS EC2 (Ubuntu) instance. I am able to access the instance at "http://ec2-18-191-31-0.us-east-2.compute.amazonaws.com" and predictions work fine, but only for a few minutes. After that, neither I nor my colleagues are able to access it; we get an error, "cannot connect to the server".
The security group that I created is attached.
t2.micro instances are not suitable for long-running calculations. They are burstable, which means their performance can be sustained only for short periods of time, e.g. sudden, short-lived spikes in CPU usage. On top of that, they have only 1 GB of RAM, which limits their usefulness in machine learning.
For calculations, you could consider Compute Optimized or Memory Optimized instances. Obviously, these instance types are not free, but they are suited for calculations.
You can change the instance type if you want and test with other, more powerful types. What you are describing indicates that your t2.micro exhausts all its RAM and/or CPU burst credits after a few minutes and freezes.
You can use CloudWatch Metrics for EC2 to monitor your instances and observe their CPU utilization and other metrics, which can help you determine what exactly is causing the backlog. You can also monitor RAM and disk usage, but this requires the CloudWatch Agent to be set up on the instance.
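If you prefer pulling those metrics from a script rather than the console, a minimal boto3 sketch along these lines should work (the instance ID and region are placeholders):

    # Fetch recent CPUUtilization and CPUCreditBalance for one instance.
    # The instance ID and region are placeholders.
    from datetime import datetime, timedelta, timezone
    import boto3  # pip install boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-2")
    dimensions = [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=3)

    for metric in ("CPUUtilization", "CPUCreditBalance"):
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName=metric,
            Dimensions=dimensions,
            StartTime=start,
            EndTime=end,
            Period=300,            # 5-minute buckets
            Statistics=["Average"],
        )
        for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
            print(metric, point["Timestamp"], round(point["Average"], 2))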

AWS EC2 - Do auto-scaled instances run for a minimum amount of time (CPU load average based)

I've been running a scheduler for my workload for a while now. Recently demand has become more inconsistent, and the workload has been backing up at what should be slow points of the week. I've started implementing auto scaling groups in two of my regions that scale based on CPU load.
I've got it set at an 80% CPU load average, my queued work is good at maximizing the CPU, and I opted for more, smaller instances that are cheaper to run. Everything appears to be operating ideally, but I have a concern about instances being started and stopped too often. I know on EC2 you pay for the full hour regardless of how long an instance runs during that hour, so...
Is the auto scaling taking this into account and leaving them running for at least a certain amount of time like ~30-45 minutes?
Do I have to instead work with the CPU average and the various timeouts to help prevent wasteful start/stops?
Depending on which AMI you're running, you might benefit from per-second billing. In this case, you'll only be charged a minimum of 60 seconds. From my understanding of your use case, this billing method would be ideal (cost-wise) for you, as you seem to frequently start and stop instances that live for short amounts of time.
To my knowledge, there's no built-in mechanism in Auto Scaling that will try to optimize your EC2 usage to minimise costs.
If, however, you're using an AMI that is not eligible for per-second billing, you could look into Spot Instances to further minimise your costs, if your workload fits this scheduling model.
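If you do still want to dampen scale-in yourself (your second question), simple scaling policies accept a cooldown. A boto3 sketch, with the group and policy names made up for illustration:

    # A simple scale-in policy with a long cooldown, so the group waits a
    # while before removing another instance. All names are placeholders.
    import boto3  # pip install boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="worker-queue-asg",  # placeholder group name
        PolicyName="scale-in-slowly",
        PolicyType="SimpleScaling",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=-1,   # remove one instance per trigger
        Cooldown=1800,          # wait 30 minutes before the next scaling action
    )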

Amazon EC2 Upgrade

We are considering upgrading from a t2.micro AWS server instance to an m3.medium instance, based on the recommendation here and some research offline. We feel the need to upgrade primarily because of speed issues and to ensure Google bots crawl our fast-growing site quickly enough. We have upwards of 8000 products (on Magento) and that number will grow.
While trying to understand what exactly could be the constraint of the current t2.micro instance, we ran through a lot of logs but couldn't find anything specific that would indicate a bottleneck in the current usage.
Could anyone help point out
1. What are the clues that can be found in logs which could show potential bottleneck issues (if any) with the current t2.micro instance?
2. How could we find out if Googlebot had issues while crawling and stopped crawling due to server-performance-related issues?
There are two things to note about t2.micro instances:
They have CPU limitations based upon a CPU credits system
They have limited network bandwidth
CPU credits
The T2 family is very powerful (see comparison between t2.medium and m3.medium), but there is a limit on the amount of CPU that can be used.
From the T2 documentation:
Each T2 instance starts with a healthy initial CPU credit balance and then continuously (at a millisecond-level resolution) receives a set rate of CPU credits per hour, depending on instance size. The accounting process for whether credits are accumulated or spent also happens at a millisecond-level resolution, so you don't have to worry about overspending CPU credits; a short burst of CPU takes a small fraction of a CPU credit.
Therefore, you should look at the CloudWatch CPUCreditBalance metric for the instance to determine whether it has consumed all available credits. If so, then the CPU will be limited to 10% of the time and you either need a larger T2 instance, or you should move away from the T2 family.
In general, T2 instances are great for bursty workloads, where the CPU only spikes at certain times. They are not good for sustained workloads.
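One way to catch the problem before the instance crawls is an alarm on that credit balance. A boto3 sketch, with the alarm name, instance ID and threshold chosen arbitrarily for illustration:

    # Alarm when CPUCreditBalance stays low, i.e. the T2 is about to be
    # throttled. Alarm name, instance ID and threshold are placeholders.
    import boto3  # pip install boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="t2-credit-balance-low",
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,                 # low for 15 minutes straight
        Threshold=25.0,
        ComparisonOperator="LessThanThreshold",
        # AlarmActions=["arn:aws:sns:..."],  # optionally notify an SNS topic
    )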
Network Bandwidth
Each Amazon EC2 instance type has a limited amount of network bandwidth. This is done to prevent noisy neighbour situations. While AWS only describes bandwidth as Low/Moderate/High, there are some better details at: EC2 Instance Types's EXACT Network Performance?
You can monitor network traffic of your EC2 instances using CloudWatch. Pay attention to NetworkIn and NetworkOut to determine whether the instances are hitting limits.
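A quick way to eyeball those two metrics is a single get_metric_data call; the sketch below sums bytes per 5-minute period for the last 6 hours (the instance ID and region are placeholders):

    # Fetch NetworkIn/NetworkOut (bytes per 5-minute period) for one instance
    # in a single call. The instance ID and region are placeholders.
    from datetime import datetime, timedelta, timezone
    import boto3  # pip install boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    dimensions = [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]

    response = cloudwatch.get_metric_data(
        StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
        EndTime=datetime.now(timezone.utc),
        MetricDataQueries=[
            {
                "Id": metric.lower(),
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": metric,
                        "Dimensions": dimensions,
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            }
            for metric in ("NetworkIn", "NetworkOut")
        ],
    )

    for result in response["MetricDataResults"]:
        peak = max(result["Values"], default=0)
        print(result["Id"], "peak bytes in a 5-minute period:", int(peak))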