AWS ECS - Cluster Utilization

According to the AWS documentation, ECS cluster CPU utilization is calculated as follows:
Cluster CPU utilization = (Total CPU units used by tasks in cluster) x 100 / (Total CPU units registered by container instances in cluster)
[ https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html ]
There are currently four container instances connected to one ECS cluster.
The registered CPU for each container instance is 8192 (8 vCPUs). In this case, is the following CPU calculation correct?
Cluster CPU utilization = (Total CPU units used by tasks on the four container instances) x 100 / (8192 x 4)
Please answer my question.

There are CPU reservation and CPU utilization metrics; don't confuse the two. You're reserving 32 vCPUs across your 4 container instances. For example, assume there are 64 vCPUs across 4 container instances in the entire cluster; if each instance is utilizing 4 vCPUs, then your cluster CPU utilization will be 25%.
Here is the calculation:
4 container instances utilizing 4 vCPUs each = 16 vCPUs used
Total cluster vCPUs = 64, so 16 / 64 = 25%
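To make the arithmetic concrete, here is a minimal Python sketch of that formula, using the illustrative numbers from this example (not measured values):

```python
# Minimal sketch of the cluster CPU utilization formula from the AWS docs.
def cluster_cpu_utilization(used_cpu_units, registered_cpu_units):
    """(Total CPU units used by tasks) x 100 / (Total CPU units registered)."""
    return used_cpu_units * 100 / registered_cpu_units

registered = 64 * 1024  # hypothetical cluster: 64 vCPUs across 4 instances
used = 4 * 4 * 1024     # each of the 4 instances utilizing 4 vCPUs
print(cluster_cpu_utilization(used, registered))  # -> 25.0
```

So yes: for the cluster in the question, the denominator is 8192 x 4, exactly as written.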

Related

Get mem and cpu usage from AWS fargate task

What APIs are available for tasks running under an ECS Fargate service to get their own memory and CPU usage?
My use case is load shedding / adjusting: the task is an executor which retrieves work items from a queue and processes them in parallel. If load is low it should take on more tasks; if load is high, it should shed or take on fewer tasks.
You can look at CloudWatch Container Insights. Container Insights reports CPU utilization relative to instance capacity. So if the container is using only 0.2 vCPU on an instance with 2 vCPUs and nothing else is running on the instance, then the CPU utilization will be reported as only 10%.
Average CPU utilization, by contrast, is based on the ratio of CPU utilization to the CPU reservation. So if the container reserves 0.25 vCPU and is actually using 0.2 vCPU, then the average CPU utilization (assuming a single task) is 80%. More details about the ECS metrics can be found here.
You can get those metrics in CloudWatch by enabling Container Insights. Note that there is an added cost for enabling it.
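Once Container Insights is enabled, you can also read those metrics programmatically. A hedged boto3 sketch (the cluster and service names are placeholders) pulling the CpuUtilized metric from the ECS/ContainerInsights namespace:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# "my-cluster" / "my-service" are placeholders for your own resources.
resp = cw.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="CpuUtilized",  # CPU units actually consumed by the tasks
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},
        {"Name": "ServiceName", "Value": "my-service"},
    ],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```

For the load-shedding use case, the executor could poll this periodically and compare CpuUtilized against CpuReserved before deciding whether to take on more work.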

AWS ECS Container Agent only registers 1 vCPU but 2 are available

In the AWS console I set up an ECS cluster. I registered an EC2 container instance on an m3.medium, which has 2 vCPUs. In the ECS console it says only 1024 CPU units are available.
Is this expected behavior?
Should the m3.medium instance not make 2048 CPU units available for the cluster?
I have been searching the documentation. I find a lot of explanation of how tasks consume and reserve CPU, but nothing about how the container agent contributes to the available CPU.
[ECS screenshot]
TL;DR

> In the AWS console I set up an ECS cluster. I registered an EC2 container instance on an m3.medium, which has 2 vCPUs. In the ECS console it says only 1024 CPU units are available.
>
> Is this expected behavior?

Yes.

> Should the m3.medium instance not make 2048 CPU units available for the cluster?

No. m3.medium instances only have 1 vCPU (1024 CPU units).
> I have been searching the documentation. I find a lot of explanation of how tasks consume and reserve CPU, but nothing about how the container agent contributes to the available CPU.

There probably isn't going to be much official documentation on the container agent specifically, but my best recommendation is to keep an eye on the issues, releases, changelogs, etc. in the github.com/aws/amazon-ecs-agent and github.com/aws/amazon-ecs-init projects.
Long version
The m3 instance types are essentially deprecated at this point (they aren't even listed on the main instance details page), but you can see that the m3.medium only has 1 vCPU in the details table on the somewhat hard-to-find ec2/previous-generations page:
| Instance Family | Instance Type | Processor Arch | vCPU | Memory (GiB) | Instance Storage (GB) | EBS-optimized Available | Network Performance |
|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... | ... | ... |
| General purpose | m3.medium | 64-bit | 1 | 3.75 | 1 x 4 | - | Moderate |
| ... | ... | ... | ... | ... | ... | ... | ... |
According to this knowledge-center/ecs-cpu-allocation article, 1 vCPU is equivalent to 1024 CPU units, as described in:

> Amazon ECS uses a standard unit of measure for CPU resources called CPU units. 1024 CPU units is the equivalent of 1 vCPU. For example, 2048 CPU units is equal to 2 vCPUs.
Capacity planning for an ECS cluster on EC2 can be... a journey... and will be highly dependent on your specific workload. You're unlikely to find a one-size-fits-all documentation/how-to source, but I can recommend the following starting points:

- The capacity section of the ECS Best Practices guide
- The cluster capacity providers section of the ECS Developer Guide

Running on an m3.medium is probably a problem in and of itself, since the smallest instance types I've seen in the documentation are almost all c5.large, r5.large, and m5.large, which have 2 vCPUs.
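If you want to double-check what an instance actually registered, a small boto3 sketch like the following (the cluster name is a placeholder) prints the CPU units each container instance contributed:

```python
import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"  # placeholder for your cluster name

# List the container instances in the cluster, then inspect what each
# one registered with ECS when its agent connected.
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
detail = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
for ci in detail["containerInstances"]:
    cpu = next(r for r in ci["registeredResources"] if r["name"] == "CPU")
    print(ci["ec2InstanceId"], cpu["integerValue"], "CPU units")
```

On an m3.medium this should print 1024, matching what the console shows.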

Discrepancy between CPU utilization of ECS service and EC2 instance

I'm seeing some discrepancy between an ECS service and its EC2 instance in terms of CPU utilization metrics.
We have an EC2 instance of type t2.small with two different ECS containers running inside it. I have allocated 512 CPU units to one container and 128 CPU units to the other. Here, the problem is that the container's CPU utilization goes above 90%, as in the following screenshot,
while the CPU utilization of the underlying EC2 instance is not even greater than 40%, as in the following screenshot.
What could be the reason for this discrepancy? What could have gone wrong?
Well, if you assign CPU units to your containers, CloudWatch reports their CPU usage relative to the capacity you reserved for them. Your container with 512 CPU units has access to 0.5 vCPU and the one with 128 units has access to 0.125 of a vCPU, which is not a lot, so high utilization of those reservations is easy to reach.
Since the CPU utilization of the t2.small, which has about 1 vCPU (ignoring the credit/bursting system for now), is hovering around 20%, my guess is that the first graph is from the smaller container.
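A quick back-of-the-envelope check (the usage figure below is illustrative, not read off the screenshots) shows how the same absolute usage yields very different percentages depending on the denominator:

```python
# Illustrative numbers only: the same absolute CPU usage looks very
# different against the reservation than against the whole instance.
used_vcpu = 0.12              # what the small container actually burns
reserved_vcpu = 128 / 1024    # its 128-CPU-unit reservation = 0.125 vCPU
instance_vcpu = 1.0           # t2.small, ignoring CPU credits

print(f"vs reservation: {used_vcpu / reserved_vcpu:.0%}")  # -> 96%
print(f"vs instance:    {used_vcpu / instance_vcpu:.0%}")  # -> 12%
```

So a container pinned near its small reservation barely registers on the instance-level graph.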

AWS BATCH - how to run more concurrent jobs

I have just started working with AWS BATCH for my deep learning workload. I have created a compute environment with the following config:
min vCPUs: 0
max vCPUs: 16
Instance type: g4dn family, g3s family, g3 family, p3 family
allocation strategy: BEST_FIT_PROGRESSIVE
The maximum vCPU limit for my account is 16, and each of my jobs requires 16 GB of memory. I observe that a maximum of 2 jobs can run concurrently at any point in time. I was using allocation strategy BEST_FIT before and changed it to BEST_FIT_PROGRESSIVE, but I still see that only 2 jobs can run concurrently. This limits the amount of experimentation I can do in a given time. What can I do to increase the number of jobs that can run concurrently?
I figured it out myself just now. I'm posting an answer here in case anyone finds it helpful in the future. It turns out that the instances assigned to each of my jobs were g4dn.2xlarge. Each of these instances takes up 8 vCPUs, and as my vCPU limit is 16, only 2 jobs can run concurrently. One solution is to ask AWS to increase the vCPU limit by creating a support case. Another solution is to modify the compute environment to use GPU instances that consume 4 vCPUs (the lowest possible on AWS), in which case a maximum of 4 jobs can run concurrently.
There are two kinds of solutions:

1. Configure your compute environment with EC2 instances whose vCPU counts are a multiple of your job definitions' vCPU requirements. For example, a compute environment with an 8-vCPU EC2 instance type and a limit of 128 vCPUs, combined with a job definition that requests 8 vCPUs, will let you execute up to 16 concurrent jobs, because 16 concurrent jobs x 8 vCPUs = 128 vCPUs (take into account the allocation strategy, and the memory of your instances if your jobs consume significant memory too). See the sketch after this list for the arithmetic.
2. Multi-node parallel jobs. This is a very interesting solution because in this scenario the instance vCPU count does not need to be a multiple of the vCPUs used in your job definition, and jobs can be spanned across multiple Amazon EC2 instances.
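A back-of-the-envelope sketch of that arithmetic, using the numbers from the question and answers above:

```python
# How many jobs fit under an account-level vCPU quota, assuming every
# job lands on its own instance of the given size (as in the question).
def max_concurrent_jobs(account_vcpu_limit, vcpus_per_job_instance):
    return account_vcpu_limit // vcpus_per_job_instance

print(max_concurrent_jobs(16, 8))    # g4dn.2xlarge (8 vCPUs)    -> 2 jobs
print(max_concurrent_jobs(16, 4))    # a 4-vCPU GPU instance     -> 4 jobs
print(max_concurrent_jobs(128, 8))   # raised quota, 8-vCPU jobs -> 16 jobs
```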

AWS AutoScaling Policy on CPU Utilization on the entire group or on individual EC2

We are looking at the threshold behaviour of scaling policies. Suppose we set up the thresholds like this, for example:
An Auto Scaling group set under an ELB, with min and max instances of 4 and 8.
Average CPU utilization < 65% for 2 consecutive periods of 300 seconds: spin down an instance.
Average CPU utilization > 80% for 2 consecutive periods of 300 seconds: spin up an instance.
What is the behaviour here?
Is the CPU utilization checked across all the instances in the group, or is it the utilization of individual instances?
Or is it the utilization of the ELB that is placed on the Auto Scaling group?
Thanks in advance.
Regards,
Atchut.
The CPU utilization evaluated is the average CPU utilization across all the EC2 instances in the group. So if there are, say, 5 EC2 instances attached, the Auto Scaling group will compute the average CPU utilization across those 5 instances and trigger the scaling policy based on that calculated average.
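If you want that "average across the group" behaviour without maintaining a pair of step thresholds yourself, a hedged boto3 sketch (the group and policy names are placeholders) of a target-tracking policy on the predefined ASGAverageCPUUtilization metric looks like this:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# "my-asg" / "avg-cpu-target" are placeholders. Target tracking adds and
# removes instances to keep the group-wide *average* CPU near the target,
# which matches the behaviour described above: the metric is averaged
# over all instances in the group, not evaluated per instance.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",
    PolicyName="avg-cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 65.0,
    },
)
```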