AWS AutoScaling CPUUtilization metric not accurate? - amazon-web-services

I do heavy computation on incoming data traffic, similar to web server requests but not exactly. The computation is mainly CPU-bound; memory and disk reads/writes are hardly used at all. I deployed this application to an Auto Scaling group. Besides the default AWS Auto Scaling CPUUtilization metric, I also have some custom measurements of the system's performance.
The strange thing I found is that the default CPUUtilization metric can sometimes be as high as 95%, while my custom measurements show that the system works just fine, which my own visual check as a user also confirms.
I'm quite confident in my custom measurements and I believe they show the true performance. But should I consider the high CPUUtilization in this case abnormal, meaning the metric is simply not accurate?

Related

Cloudwatch Period time

In CloudWatch, the period for CPU metrics cannot be set below 1 minute. How can I lower this period so that Auto Scaling is triggered faster? I just need the Auto Scaling instances to be triggered within a short time. (By the way, the datapoints value is set to 1 out of 1.)
The minimum granularity for the metrics that EC2 provides is 1 minute.
Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html
I would also say that if you need to scale that quickly, wouldn't the startup time be an issue anyway?
You are correct -- basic monitoring of an Amazon EC2 instance provides metrics over 5-minute periods. If you activate EC2 Detailed Monitoring, metrics are provided over 1-minute periods. Extra charges apply for Detailed Monitoring.
When launching a new instance via Amazon EC2 Auto-Scaling, it can take a few minutes for the new instance to launch and for the User Data script (if any) to run. Linux instances are quite fast, but Windows instances take a while on their first boot due to sysprep operations.
You mention that you want to react to a metric in less than one minute. I would suggest that this would not be an ideal way to trigger Auto-scaling. Sometimes a computer can be busy for a while, then can drop down again. Reacting too quickly to a high CPU load would cause the Auto-Scaling group to flap between adding instances and terminating instances. It is better to provision enough capacity for a reasonable amount of extra load and then gradually add more capacity as it is required over time.
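The flapping risk described above can be illustrated with a small simulation. This is only a sketch and not how Auto Scaling is implemented internally: the function name `scaling_decisions`, the 80%/30% thresholds and the sample CPU readings are all hypothetical. It compares reacting to a single datapoint with waiting for several consecutive datapoints.

```python
from typing import List


def scaling_decisions(cpu_samples: List[float],
                      high: float = 80.0,
                      low: float = 30.0,
                      datapoints: int = 1) -> List[str]:
    """Return the scale actions a naive threshold policy would take.

    An action fires only after `datapoints` consecutive samples breach the
    same threshold (hypothetical thresholds, for illustration only).
    """
    actions: List[str] = []
    above = below = 0
    for cpu in cpu_samples:
        # Count consecutive breaches; any non-breaching sample resets the run.
        above = above + 1 if cpu > high else 0
        below = below + 1 if cpu < low else 0
        if above == datapoints:
            actions.append("scale-out")
            above = 0
        elif below == datapoints:
            actions.append("scale-in")
            below = 0
    return actions


# Hypothetical per-minute CPU readings: short spikes, then a sustained load.
samples = [90, 20, 95, 25, 90, 20, 85, 90, 92, 95]

print(scaling_decisions(samples, datapoints=1))  # reacts to every spike and dip
print(scaling_decisions(samples, datapoints=3))  # reacts only to sustained load
```

With one datapoint the policy alternates between scale-out and scale-in on every spike; with three consecutive datapoints it scales out once, when the load is actually sustained.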
If you have a need to react so quickly, then perhaps you should investigate using AWS Lambda to perform small amounts of work in a highly-parallel fashion rather than relying on Amazon EC2 instances.

AWS Serverless for Microservices and true "pay-as-you-use"

Premise
I'm trying to come up with the right choice of AWS construct for a containerized microservice (a set of microservices, in fact) deployment. The application will have an average load of 50% through the day and little to nothing during the night, and at very specific times in the day (which are not always predictable) there is a burst of high-volume requests. Also, it's not a super-busy set of microservices (in other words, 2 instances of 1 vCPU and 8 GB RAM will be just fine).
The Fargate compute option seems to be a better option for this type of setup, except of course that:
When my application has little or no load during the night, I will still be charged for the full 1 vCPU and 8 GB (which in my view is not true "pay as you use", as I might be using only 0.05 or 0.25 vCPU; hypothetical numbers).
The only way to get around this is to write some redefinition strategies myself: watch CloudWatch events and recreate the Fargate tasks with less vCPU. However, this will have some extra overhead in terms of deployment time (even if I ensure staggered deployments, it still means a 'lot of work' each time there is a material event). Is there a better way to do this, or is there a more 'truly' out-of-the-box pay-as-you-use arrangement that lets you consume resources in a range, continuously based on what you are actually using at that moment, without having to jump through hoops?
Lastly, the purist in me still cannot reconcile, in theory, the fact that a microservice isn't really a 'task', and using the Fargate compute option doesn't sound intuitively right to me, even if I could think of a microservice as an extreme case of a task running permanently. Cost-wise, am I better off using EC2, as some options seem to get me a lower cost than Fargate (I'm aware of the additional responsibility of maintaining/patching those EC2 instances)?

Why EC2 instance is not accessible to others

I successfully deployed a machine learning classification model on an AWS EC2 (Ubuntu) instance. I am able to access the instance at "http://ec2-18-191-31-0.us-east-2.compute.amazonaws.com" and predictions work fine, but only for a few minutes. After that, neither I nor my colleagues are able to access it. We get the error "cannot connected to the server".
The security group that I created is attached.
t2.micro instances are not suitable for any long-running calculations. They are burstable. This means that their full performance can be sustained only for short periods of time, e.g., sudden, short-lived spikes in CPU usage. On top of that, they have only 1 GB of RAM, which limits their usefulness in machine learning.
For calculations, you could consider Compute optimized or Memory optimized instances. Obviously, these instance types are not free, but they are suited for calculations.
You can change the instance type if you want and test with other, more powerful types. What you are describing indicates that your t2.micro exhausts all its RAM and/or CPU burst credits after a few minutes and freezes.
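To get a rough feel for how quickly burst credits can run out, here is a back-of-the-envelope sketch. It assumes the published t2.micro figures (6 CPU credits earned per hour, where one credit is one vCPU at 100% for one minute) and a starting balance of 30 credits; both numbers are assumptions to check against the current AWS documentation, and the function name is made up for this example.

```python
def minutes_until_credits_exhausted(credit_balance: float,
                                    utilization: float,
                                    credits_per_hour: float = 6.0,
                                    vcpus: int = 1) -> float:
    """Estimate minutes of sustained load before the credit balance hits zero.

    One CPU credit is one vCPU at 100% for one minute. The defaults are the
    assumed t2.micro figures, not values read from any API.
    """
    spend_per_minute = vcpus * utilization      # credits burned per minute
    earn_per_minute = credits_per_hour / 60.0   # credits earned per minute
    net_drain = spend_per_minute - earn_per_minute
    if net_drain <= 0:
        return float("inf")  # at or below baseline: the balance never runs out
    return credit_balance / net_drain


# Assumed starting balance of 30 credits, model pinning the CPU at 100%.
print(round(minutes_until_credits_exhausted(30, 1.0), 1))   # about 33 minutes
# Light load below the earn rate never drains the balance.
print(minutes_until_credits_exhausted(30, 0.05))
```

Under these assumptions a CPU-bound workload would be throttled to the baseline in roughly half an hour; if the instance freezes after only a few minutes, running out of RAM may be the more likely cause.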
You can use CloudWatch Metrics for EC2 to monitor your instances and observe their CPU utilization and other metrics, which can help you determine what exactly is causing the problem. You can also monitor RAM and disk usage, but this requires setting up the CloudWatch Agent on the instance.
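As a sketch of what pulling those metrics could look like from Python with boto3: the helper names `metric_query` and `fetch_metric` and the instance ID are hypothetical, and the actual call needs boto3 installed and AWS credentials configured, so it is wrapped in a function and not executed here.

```python
from datetime import datetime, timedelta, timezone


def metric_query(instance_id: str,
                 metric_name: str = "CPUUtilization",
                 hours: int = 1,
                 period: int = 300) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,  # seconds; 300 matches basic (5-minute) monitoring
        "Statistics": ["Average", "Maximum"],
    }


def fetch_metric(instance_id: str, region: str, metric_name: str) -> list:
    """Fetch datapoints sorted by time (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so metric_query works without boto3

    client = boto3.client("cloudwatch", region_name=region)
    response = client.get_metric_statistics(
        **metric_query(instance_id, metric_name))
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])


# Hypothetical instance ID, for illustration only.
query = metric_query("i-0123456789abcdef0")
print(query["MetricName"], query["Period"])
```

For a burstable instance, calling `fetch_metric` with `"CPUCreditBalance"` instead of `"CPUUtilization"` should show whether the credits are draining toward zero before the instance becomes unreachable.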

AWS Network out

Our web application has 5 pages (Signin, Dashboard, Map, Devices, Notification)
We have done the load test for this application, and load test script does the following:
Signin and go to Dashboard page
Click Map
Click Devices
Click Notification
We have a basic free plan in AWS.
While performing the load test, up to about 100 users, we didn’t get any errors. Please see the image below. NetworkIn and CPUUtilization seemed to be normal, but NetworkOut showed 846K.
But when we reached around 114 users, we started getting errors on the Map page (highlighted in red). During that time, only NetworkOut seemed to be high. Please see the image below.
We want to know what the optimal value for NetworkOut is. If this number is high, is there any way to reduce it?
Please let me know if you need more information. Thanks in advance for your help.
You are using a t2.micro instance.
This instance type has limitations on CPU that means it is good for bursty workloads, but sustained loads will consume all the available CPU credits. Thus, it might perform poorly under sustained loads over long periods.
The instance also has limited network bandwidth that might impact the throughput of the server. While all Amazon EC2 instances have limited allocations of bandwidth, the t2.micro and t2.nano have particularly low bandwidth allocations. You can see this when copying data to/from the instance and it might be impacting your workloads during testing.
The t2 family, especially at the low-end, is not a good choice for production workloads. It is great for workloads that are sometimes high, but not consistently high. It is also particularly low-cost, but please realise that there are trade-offs for such a low cost.
See:
Amazon EC2 T2 Instances – Amazon Web Services (AWS)
CPU Credits and Baseline Performance for Burstable Performance Instances - Amazon Elastic Compute Cloud
Unlimited Mode for Burstable Performance Instances - Amazon Elastic Compute Cloud
That said, the network throughput showing on the graphs is a result of your application. While the t2 might be limiting the throughput, it is not responsible for the spike on the graph. For that, you will need to investigate the resources being used by the application(s) themselves.
NetworkOut simply refers to the volume of outgoing traffic from the instance. To reduce NetworkOut, you reduce the traffic you are sending from this instance. So you may need to see which of Click Map, Click Devices and Click Notification is sending traffic out of the instance. It may not be related only to the number of users, but to a combination of the number of users and the application module.
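To put a NetworkOut reading into perspective, it helps to convert it to an average rate. The sketch below assumes that the "846K" in the question is a byte count summed over one 5-minute (300-second) period, as with basic monitoring; if the graph used a different statistic or period, the numbers change accordingly. The function name is made up for this example.

```python
def network_out_mbps(total_bytes: float, period_seconds: int = 300) -> float:
    """Convert a NetworkOut Sum datapoint (bytes per period) to megabits/s."""
    bytes_per_second = total_bytes / period_seconds
    return bytes_per_second * 8 / 1_000_000


# Assumed reading: 846,000 bytes sent during one 5-minute period.
print(round(network_out_mbps(846_000), 5))
```

Under that assumption the average works out to a few hundredths of a megabit per second, so the interesting question is less the absolute number and more which page produces the traffic as the user count grows.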

Amazon EC2 Instance Monitoring?

I am in need of a fairly short/simple script to monitor my EC2 instances for Memory and CPU (for now).
After using Get-EC2Instance -Region , it lists all of the instances. Where can I go from here?
CloudWatch is the monitoring tool for AWS instances. While it can support custom metrics, by default it only measures what the hypervisor can see for your instance.
CPU utilization is supported by default. This is often a more accurate way to see your true CPU utilization, since the value comes from the hypervisor.
Memory utilization however is not. This depends largely on your OS and is not visible to the hypervisor. However, you can set up a script that will report this metric to Cloudwatch. Some scripts to help you do this are here: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/mon-scripts-perl.html
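As an alternative to the linked scripts, the same idea can be sketched in Python for a Linux instance: read `/proc/meminfo` inside the OS, compute a percentage, and publish it as a custom metric. The namespace `Custom/System`, the metric name and the helper names are assumptions for this example, and the sample text stands in for the real file.

```python
def memory_used_percent(meminfo_text: str) -> float:
    """Compute used-memory percentage from /proc/meminfo content (Linux)."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total


def memory_metric(instance_id: str, percent: float) -> dict:
    """Build keyword arguments for CloudWatch put_metric_data."""
    return {
        "Namespace": "Custom/System",  # hypothetical custom namespace
        "MetricData": [{
            "MetricName": "MemoryUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": percent,
            "Unit": "Percent",
        }],
    }


# Sample text in /proc/meminfo format; on an instance, read the real file.
sample = (
    "MemTotal:        1000000 kB\n"
    "MemFree:          100000 kB\n"
    "MemAvailable:     250000 kB\n"
)
print(memory_used_percent(sample))  # 75.0
```

On the instance itself, something like `boto3.client("cloudwatch").put_metric_data(**memory_metric(instance_id, percent))` run from cron would publish the value; that call needs boto3 and an IAM role allowing `cloudwatch:PutMetricData`, and is not executed in this sketch.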
There are a few possibilities for monitoring EC2 instances.
Nagios - http://www.nagios.com/solutions/aws-monitoring
StackDriver - http://www.stackdriver.com/
CopperEgg - http://copperegg.com/aws/
But my favorite is Datadog - http://www.datadoghq.com/ - (not just because I work here, but it's important to disclose that I do work for Datadog). 5 hosts or fewer is free, and I bet you can be up and running in less than 5 minutes.
Depends what your requirements are for service availability of the monitoring solution itself, as well as how you want to be alerted about host/service notifications.
Nagios, Icinga etc... will allow you to customise an extremely large number of parameters that can be passed to your EC2 hosts, specifying exactly what you want to monitor or check up on. You can run any of the default (or custom) scripts which then feed data back to a central system, then handle those notifications however you want (i.e. send an email, SMS, execute an arbitrary script). Downside of this approach is that you need to self-manage your backend for all of the aggregated monitoring data.
The CloudWatch approach means your instances can push metric data into AWS, and you then define custom policies around thresholds. For example, 90% CPU usage for more than 5 minutes on an instance or ASG might push a message out to your email via SNS (Simple Notification Service). This method reduces the number of backend components to manage/maintain, but lacks the extreme customisation abilities of self-hosted monitoring platforms.
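The "90% CPU for 5 minutes, then notify via SNS" example could be expressed as a CloudWatch alarm along these lines. This is a sketch under assumptions: the helper name `cpu_alarm`, the alarm name, the instance ID and the SNS topic ARN are all hypothetical, and the alarm is only built as a parameter dictionary here, not created.

```python
def cpu_alarm(instance_id: str,
              topic_arn: str,
              threshold: float = 90.0,
              period: int = 300,
              evaluation_periods: int = 1) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period,                         # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,  # datapoints that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],              # SNS topic to notify
    }


# Hypothetical identifiers, for illustration only.
alarm = cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(alarm["AlarmName"], alarm["Threshold"])
```

Passing the result to `boto3.client("cloudwatch").put_metric_alarm(**alarm)` would create the alarm, given boto3, credentials and an existing SNS topic with an email subscription. For an Auto Scaling group, the dimension would be `AutoScalingGroupName` instead of `InstanceId`.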