Discrepancy between CPU utilization of ECS service and EC2 instance - amazon-web-services

I'm seeing some discrepancy between ECS service and EC2 in terms of CPU utilization metrics.
We have an EC2 instance of type t2.small with two ECS containers running on it. I have allocated 512 CPU units to one container and 128 CPU units to the other. The problem is that the service's CPU utilization goes above 90% (see the first screenshot),
while the CPU utilization of the underlying EC2 instance never even exceeds 40% (see the second screenshot).
What could be the reason for this discrepancy? What could have gone wrong?

Well, if you assign CPU units to your containers, CloudWatch will report the CPU usage relative to the allocated CPU capacity. Your container with 512 CPU units has access to 0.5 vCPUs and the one with 128 units has access to 0.125 vCPUs, which is not a lot, so high utilization of those allocations is easy to reach.
Since the CPU utilization of the t2.small, which has about 1 vCPU (ignoring the credit/bursting system for now) is hovering around 20%, my guess is that the first graph is from the smaller container.
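To make the arithmetic concrete, here is a minimal sketch (plain Python with illustrative numbers; the actual metrics are computed by CloudWatch) of how the same absolute CPU usage produces very different percentages depending on the denominator:

```python
# Illustrative numbers only: an absolute usage of ~0.12 vCPU,
# measured against two different denominators.
used_vcpu = 0.115           # hypothetical absolute usage by the small container

# ECS service metric: usage relative to the container's CPU reservation.
reserved_vcpu = 128 / 1024  # 128 CPU units = 0.125 vCPU
service_util = used_vcpu / reserved_vcpu * 100
print(f"ECS service CPU utilization: {service_util:.0f}%")    # 92%

# EC2 metric: the same usage relative to the whole instance (t2.small ~ 1 vCPU).
instance_vcpu = 1.0
instance_util = used_vcpu / instance_vcpu * 100
print(f"EC2 instance CPU utilization: {instance_util:.1f}%")  # 11.5%
```

The same 0.115 vCPU of work reads as >90% on the service graph but barely 12% on the instance graph, which matches the screenshots described above.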

Related

Get mem and cpu usage from AWS fargate task

What APIs are available for tasks running under ECS fargate service to get their own memory and CPU usage?
My use case is load shedding/adjusting: the task is an executor which retrieves work items from a queue and processes them in parallel. If load is low it should take on more tasks; if high, shed or take on fewer tasks.
You can look at Cloudwatch Container Insights. Container Insights reports CPU utilization relative to instance capacity. So if the container is using only 0.2 vCPU on an instance with 2 CPUs and nothing else is running on the instance, then the CPU utilization will only be reported as 10%.
The service's average CPU utilization, on the other hand, is based on the ratio of CPU usage to CPU reservation. So if the container reserves 0.25 vCPU and is actually using 0.2 vCPU, then the average CPU utilization (assuming a single task) is 80%. More details about the ECS metrics can be found here.
You can get those metrics in CloudWatch by enabling Container Insights. Note that there is an added cost for enabling that.
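The two ratios described above can be checked with the numbers from the example (a quick sketch, not CloudWatch's actual implementation):

```python
# Numbers from the example above: 0.2 vCPU actually used,
# on a 2-vCPU instance, with a 0.25 vCPU reservation.
used_vcpu = 0.2
instance_vcpus = 2.0
reserved_vcpu = 0.25

# Container Insights: utilization relative to instance capacity.
insights_util = used_vcpu / instance_vcpus * 100   # 10.0
# Service metric: utilization relative to the task's reservation.
average_util = used_vcpu / reserved_vcpu * 100     # 80.0
print(insights_util, average_util)
```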

AWS ECS Container Agent only registers 1 vCPU but 2 are available

In the AWS console I set up an ECS cluster. I registered an EC2 container instance on an m3.medium, which has 2 vCPUs. In the ECS console it says only 1024 CPU units are available.
Is this expected behavior?
Should the m3.medium instance not make 2048 CPU units available for the cluster?
I have been searching the documentation. I find a lot of explanation of how tasks consume and reserve CPU, but nothing about how the container agent contributes to the available CPU.
ECS Screenshot
tldr
In the AWS console I set up an ECS cluster. I registered an EC2 container instance on an m3.medium, which has 2 vCPUs. In the ECS console it says only 1024 CPU units are available. Is this expected behavior?
Yes.
Should the m3.medium instance not make 2048 CPU units available for the cluster?
No. m3.medium instances only have 1 vCPU (1024 CPU units).
I have been searching the documentation. I find a lot of explanation of how tasks consume and reserve CPU, but nothing about how the container agent contributes to the available CPU.
There probably isn't going to be much official documentation on the container agent's performance specifically, but my best recommendation is to pay attention to the issues, releases, changelogs, etc. in the github.com/aws/amazon-ecs-agent and github.com/aws/amazon-ecs-init projects.
long version
The m3 instance types are essentially deprecated at this point (they're not even listed on the main instance details page), but you can see that the m3.medium has only 1 vCPU in the details table on the somewhat hard-to-find ec2/previous-generations page:
Instance Family:         General purpose
Instance Type:           m3.medium
Processor Arch:          64-bit
vCPU:                    1
Memory (GiB):            3.75
Instance Storage (GB):   1 x 4
EBS-optimized Available: -
Network Performance:     Moderate
According to this knowledge-center/ecs-cpu-allocation article, 1 vCPU is equivalent to 1024 CPU units, as described in the documentation:
Amazon ECS uses a standard unit of measure for CPU resources called CPU units. 1024 CPU units is the equivalent of 1 vCPU. For example, 2048 CPU units is equal to 2 vCPU.
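Since 1024 CPU units equal 1 vCPU, the conversion is simple enough to express as a one-line helper (just a sketch, not part of any AWS SDK):

```python
def cpu_units_to_vcpus(units: int) -> float:
    """Convert ECS CPU units to vCPUs (1024 units == 1 vCPU)."""
    return units / 1024

print(cpu_units_to_vcpus(1024))  # 1.0
print(cpu_units_to_vcpus(2048))  # 2.0
print(cpu_units_to_vcpus(512))   # 0.5
```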
Capacity planning for ECS cluster on an EC2 can be... a journey... and will be highly dependent on your specific workload. You're unlikely to find a "one size" fits all documentation/how-to source but I can recommend the following starting points:
The capacity section of the ECS bestpractices guide
The cluster capacity providers section of the ECS Developers guide
Running on an m3.medium is probably a problem in and of itself, since the smallest instance types I've seen in the documentation are c5.large, r5.large, and m5.large, which all have 2 vCPUs.

Will upgrading my AWS EC2 t3.small instance to t3.medium give better CPU performance?

I'm currently using an EC2 t3.small instance (unlimited mode) for a service that is CPU bound and usually runs at 60-90% of max CPU utilization, but occasionally hits 100%. I want to upgrade to an instance with greater CPU performance so I don't hit max CPU utilization.
My question is: will upgrading to a larger t3 instance (e.g. t3.medium) give more CPU performance? From what I've been able to discern, the t3.medium still has 2 vCPUs just like t3.small. So will it give the same performance, just with more RAM and a higher baseline utilization/more CPU credits? In that case, the t3.xlarge with 4 vCPUs seems to be the next upgrade in the t3 family that gives more CPU capacity. But if that is the case, it seems like a c5.large would be the cheaper upgrade step.
Is my thinking correct here or am I missing something?
After experimenting with different instance types, I came to the following conclusions:
Tried the c5a.large instance. It had very similar CPU performance to a t3.small instance. Still hitting the 100% utilization cap.
c5.large instance: Slightly better performance than c5a.large, but still hitting the CPU cap.
t3a.xlarge instance: with 4 vCPUs, I had much more CPU capacity and did not hit 100% CPU. I was using about 30 CPU credits per hour above the baseline, so I ended up with...
t3.xlarge: slightly better performance than t3a.xlarge and therefore using fewer CPU credits (about 15 per hour) which made this more cost effective than a t3a.xlarge.
So, in conclusion, a t3.medium was indeed not a suitable upgrade over a t3.small for increased CPU capacity. A c5.large was marginally better, but the real upgrade solution was going to a t3a.xlarge or t3.xlarge.

AWS EC2 instance cost far above estimate, why?

I have a script that I run 24/7 that uses 90-100% CPU constantly. I am running this script in multiple virtual machines from Google Cloud Platform. I run one script per VM.
I am trying to reduce cost by using AWS EC2. I looked at the price per hour of t3.micro (2 vCPU) instances and it says the cost is around $0.01/h, which is cheaper than GCP's equivalent instance with 2 vCPUs.
Now, I tried to run the script on one t3.micro instance, just to get a real estimate of how much each instance running my script will cost. I was expecting the monthly cost per instance to be ~$7.20 (720 h/month * $0.01/h). The thing is that I have been running the script for 2-3 days, and the cost reports already show a cost of more than $4.
I am trying to understand why the cost is so far from my estimate (and from the AWS monthly calculator's estimate). All this extra cost seems to come from "EC2 Other" and "CPU Credit", but I don't understand these charges.
I suspect these come from my 24-7 full CPU usage, but could someone explain what are these costs and if there is a way to reduce them?
A t3 instance allows a certain baseline CPU usage: 10% per vCPU for a t3.micro. When the instance is operating below that threshold, it accumulates CPU credits, which are applied to usage above the threshold. A t3.micro can accumulate up to 12 credits an hour (with one credit being equal to 100% utilisation of one vCPU for 1 minute). If you are regularly using more CPU credits than the instance earns, you will be charged at a higher rate, which I understand to be 5c per vCPU-hour.
It may be that t3.micro is not your best choice for that type of workload and you may need to select a different instance type or a bigger instance.
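For a rough sense of scale, here is a back-of-the-envelope sketch for the scenario in the question, using the figures from this answer (12 credits earned per hour, 5c per surplus vCPU-hour) as assumptions; check current AWS pricing for your region:

```python
# Rough estimate of t3.micro surplus CPU-credit charges under sustained 100% CPU.
vcpus = 2
credits_earned_per_hour = 12   # t3.micro earning rate (assumed, per the answer above)
surplus_rate_usd = 0.05        # per vCPU-hour of surplus usage (assumed)

# 1 credit = one vCPU at 100% for one minute, so full load on both vCPUs
# spends 2 credits per minute = 120 per hour.
credits_spent_per_hour = vcpus * 60
surplus_credits = credits_spent_per_hour - credits_earned_per_hour   # 108
surplus_vcpu_hours = surplus_credits / 60                            # 1.8
hourly_surplus_cost = surplus_vcpu_hours * surplus_rate_usd          # ~$0.09
print(f"~${hourly_surplus_cost:.2f}/h surplus, ~${hourly_surplus_cost * 24:.2f}/day")
```

At roughly $0.09/hour, two days of sustained 100% CPU adds about $4.30 of surplus charges on top of the instance price, which is consistent with the bill the asker is seeing.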
The purple in your chart is CPU credits, not instance usage.
It looks like you enabled "T2/T3 Unlimited" when launching your instance, and your script is causing it to burst beyond the provided capacity. When you burst beyond the baseline capacity, you're charged for that usage at the prevailing rate. You can read more about T2/T3 Unlimited and burstable performance here.
To bring these costs down, disable T2/T3 unlimited by following instructions here.

When should I use a t2.medium vs. a m3.medium instance type within AWS?

They appear to be approximately the same in terms of performance.
Model      vCPU  Mem (GiB)  SSD Storage (GB)
m3.medium  1     3.75       1 x 4

Model      vCPU  CPU Credits/hour  Mem (GiB)  Storage
t2.medium  2     24                4          EBS-only
t2.medium allows for burstable performance whereas m3.medium doesn't. The t2.medium even has more vCPUs (2 vs. 1) and memory (4 vs. 3.75 GiB) than the m3.medium. The only performance gain with the m3.medium is the instance-store SSD, which I recognize could be significant if I'm doing heavy I/O.
Would this be the only scenario where I would choose an m3.medium over a t2.medium?
I'd like to run a web server that gets 20-30k hits a month so I suspect either is okay for my needs, but what's the better option?
30000 hits per month is on average a visitor every 90 seconds. Unless your site is highly atypical, load on the server is likely to be invisibly small. Bursting will handle spikes up to hundreds (or thousands, with some optimizations) of visitors.
With appropriate caching, a VPS server of comparable specs to a t2.micro can serve a Wordpress blog with 30000 hits PER MINUTE. If you were saturating that continuously, you couldn't rely on burst performance for the t2.micro, of course. A t2.medium is roughly 4x as powerful in all regards as a micro, and a m3.medium has similar RAM and bandwidth but less peak CPU.
The instance storage will be a few times faster than a large EBS GP2 (SSD) volume on the m3.medium, of course. The t2 & c3 medium instances will both have roughly 300-400 Mbit/s network bandwidth, t2.micro gets ~60-70 Mbit.
One benchmark shows that t2.medium in bursting mode actually beats a c3.large (let alone the m3.medium, which is less than half as powerful, at 3 ECU vs 7).
But as noted, you can probably save money by using something less powerful than either of your suggestions and still have excellent performance.
If you don't need the power to completely configure your server, shared hosting or a platform-as-a-service solution will be easier. I recommend OpenShift, because they explicitly suggest a single small gear for up to 50k hits a month. You get 3 of those for free.
If you do need to configure the server, you really only need enough memory to run your server and/or DB. A t2.nano has 512 MB, and a t2.micro has 1 GB. The real performance bottlenecks will probably be disk I/O and network bandwidth. The first can be improved with a larger general-purpose SSD volume (more IOPS), the second by using multiple instances and an ELB.
Make sure you host all static assets in S3 and use caching well, and even the smaller AWS instances can handle hundreds of requests per second.
Basically: "don't worry about it, use the cheapest and easiest thing that will run it."
Although the "hardware" specs look similar for the T2.medium instance and the M3.medium instance, the difference is when you consider Burstable vs. Fixed Performance. See this link from Amazon Web Services:
http://aws.amazon.com/ec2/faqs/#burst
The following quote comes from that link:
Q: When should I choose a Burstable Performance Instance, such as T2?
Workloads ideal for Burstable Performance Instances (e.g. web servers, developer environments, and small databases) don’t use the full CPU often or consistently, but occasionally need to burst. If your application requires sustained high CPU performance, we recommend our Fixed Performance Instances, such as M3, C3, and R3.
A T2 instance accrues CPU credits, but only as long as it runs. If it is stopped or terminated, the credits accrued are gone.
There is an important piece of information further down the page concerning the CPU credits for the T2 instances:
Q: What happens to CPU performance if my T2 instance is running low on credits (CPU Credit balance is near zero)?
If your T2 instance has a zero CPU Credit balance, performance will remain at baseline CPU performance. For example, the t2.micro provides baseline CPU performance of 10% of a physical CPU core. If your instance’s CPU Credit balance is approaching zero, CPU performance will be lowered to baseline performance over a 15-minute interval.
This means if you run out of burstable credits, your performance will be limited to a fixed percentage of a single core until you accrue more; 10% for T2.micro, 20% for T2.small, and 40% for T2.medium.
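As a worked example of what that means in practice, here is a rough sketch (assuming, per the spec table above, that a t2.medium has 2 vCPUs and earns 24 credits per hour, and that the balance caps at 24 hours' worth of accrual; check the AWS docs for current values):

```python
# How long a full CPU-credit balance lasts at sustained 100% CPU on a t2.medium.
vcpus = 2
earn_per_hour = 24
max_balance = earn_per_hour * 24             # 576 credits banked at most

spend_per_hour = vcpus * 60                  # 120 credits/hour at 100% on both cores
net_drain = spend_per_hour - earn_per_hour   # 96 credits/hour
hours_until_baseline = max_balance / net_drain
print(f"{hours_until_baseline:.0f} hours")   # 6 hours
```

So even starting from a full balance, a fully loaded t2.medium drops to its 40% baseline after about six hours, which is why sustained workloads point toward the Fixed Performance families.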
Another important difference that the OP mentions is the M3.medium instance can be provisioned with 4GB of ephemeral storage, which has much greater I/O capacity than persistent, Elastic Block Storage (EBS). T2 instances do not have this option.
Finally, it depends on what a "hit" is. In my opinion, if a hit means a few static page downloads that are less than 64k or small dynamic pages, then I'd explore the T2 option. For longer sessions, more data traffic, or higher numbers of concurrent users, I'd consider the M3. And if performance over an extended time period is a key issue, I think you're definitely in M3 land.
Look at the logs for your present site or a site similar to what you're setting up and determine which situation you're in.
Benchmark your application on both and determine the right fit for you. That's the only way to know for sure. The "better option" is dependent on how your application runs and your cost requirements.
Alternatively, you could simply choose one, based on cost or other criteria, and if it's insufficient, or overly sufficient, then change the instance type to the other.