AWS ECS Deployment: insufficient memory

I have configured an AWS ECS cluster with 3 instances (m5.large), with one instance in each availability zone (A, B, and C). The Service is configured as follows:
Service type: REPLICA
Number of Tasks: 3
Minimum Healthy Percent: 30
Maximum Percent: 100
Placement Templates: AZ Balanced Spread
Service AutoScaling: No.
In the Task Definition, I have used the following:
Network Mode: awsvpc
Task Memory: --
Task CPU: --
At the container level, I have configured only Memory Soft Limit:
Soft Limit: 2048 MB
Hard Limit: --
I have used awslogs for logging. The above configuration works: when I start the service, there is one container running on each of the instances. 'docker stats' on one of the instances shows the following:
MEM USAGE / LIMIT
230MiB / 7.501GiB
And the container instance (ECS Console) shows the following:
Resources Registered Available
CPU 2048 2048
Memory 7680 5632
Ports 5 ports
The above results are the same across all 3 instances: 2 GB of memory has been reserved (the soft limit) and the upper memory limit is the instance memory of nearly 8 GB (no hard limit set). Everything works as expected so far.
But when I re-deploy the code (using force deploy) from Jenkins, I get the following error in the Jenkins Log:
"message": "(service App-V1-Service) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance 90d4ba21-4b19-4e31-c42d-d7223b34f17b) has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.
In Jenkins, the job shows up as 'Success', but it is the old version of the code that is running. There is sufficient memory available on all three instances. Also, I have changed the Minimum Healthy Percent to 30, hoping that ECS can stop a container and redeploy the new one. Any solution or pointers to debug this further will be of great help.
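One way to see exactly what the scheduler sees is to compare the registered and remaining resources on each container instance while the deployment is failing. A minimal sketch using the AWS CLI (the cluster name and container instance ARN are placeholders):

# List the container instances in the cluster
aws ecs list-container-instances --cluster my-cluster

# Show registered vs. remaining CPU and memory as the ECS scheduler sees them
aws ecs describe-container-instances \
  --cluster my-cluster \
  --container-instances <container-instance-arn> \
  --query 'containerInstances[].{registered:registeredResources,remaining:remainingResources}'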

During deployment, the ECS scheduler reserves memory based on the soft limit for each container, which can add up to
2048 * 3 = 6144 MB
This is more than the memory available on the instance:
5632 MB (available) < 6144 MB (required)
If you are running replicas on the same ECS container instances, I would recommend keeping the soft limit small, at or below 1 GB, which is also what ECS suggests.
With that configuration you can run blue/green style deployments as well. There is no harm in keeping the soft limit low: a container can still use more memory when it needs it, so reserving a large soft limit does not improve performance.
I would not recommend lowering the Minimum Healthy Percent to 0, since decreasing the soft limit to 1 GB will resolve the issue.
Alternatively, if you want to keep the same memory reservation, decrease the Minimum Healthy Percent so ECS can stop enough running tasks to free memory for the new ones.
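A minimal sketch of that first option, assuming the family, container name, image, and log settings below are placeholders for your own values:

# Register a new revision with a 1 GB soft limit (memoryReservation)
cat > taskdef.json <<'EOF'
{
  "family": "app-v1",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest",
      "essential": true,
      "memoryReservation": 1024,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/app-v1",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app"
        }
      }
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file://taskdef.json

# Point the service at the new revision and redeploy
aws ecs update-service --cluster my-cluster --service App-V1-Service \
  --task-definition app-v1 --force-new-deployment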

Related

AWS ECS Container Agent only registers 1 vCPU but 2 are available

In the AWS console I set up an ECS cluster. I registered an EC2 container instance on an m3.medium, which has 2 vCPUs. In the ECS console it says only 1024 CPU units are available.
Is this expected behavior?
Should the m3.medium instance not make 2048 CPU units available for the cluster?
I have been searching the documentation. I find a lot of explanation of how tasks consume and reserve CPU, but nothing about how the container agent contributes to the available CPU.
ECS Screenshot
tldr
> In the AWS console I set up an ECS cluster. I registered an EC2 container instance on an m3.medium, which has 2 vCPUs. In the ECS console it says only 1024 CPU units are available.
>
> Is this expected behavior?
yes
> Should the m3.medium instance not make 2048 CPU units available for the cluster?
No. m3.medium instances have only 1 vCPU (1024 CPU units).
> I have been searching the documentation. I find a lot of explanation of how tasks consume and reserve CPU, but nothing about how the container agent contributes to the available CPU.
There probably isn't going to be much official documentation on the container agent specifically, but my best recommendation is to pay attention to the issues, releases, changelogs, etc. in the github.com/aws/amazon-ecs-agent and github.com/aws/amazon-ecs-init projects.
long version
The m3 instance types are essentially deprecated at this point (they are not even listed on the main instance details page), but you can see that the m3.medium has only 1 vCPU in the details table on the somewhat hard to find ec2/previous-generations page:
Instance Family | Instance Type | Processor Arch | vCPU | Memory (GiB) | Instance Storage (GB) | EBS-optimized Available | Network Performance
...
General purpose | m3.medium | 64-bit | 1 | 3.75 | 1 x 4 | - | Moderate
...
According to the knowledge-center/ecs-cpu-allocation article, 1 vCPU is equivalent to 1024 CPU units, as described in:
> Amazon ECS uses a standard unit of measure for CPU resources called CPU units. 1024 CPU units is the equivalent of 1 vCPU. For example, 2048 CPU units is equal to 2 vCPU.
Capacity planning for an ECS cluster on EC2 can be... a journey... and will be highly dependent on your specific workload. You're unlikely to find a "one size fits all" documentation/how-to source, but I can recommend the following starting points:
The capacity section of the ECS best practices guide
The cluster capacity providers section of the ECS Developer Guide
Running on an m3.medium is probably a problem in and of itself, since the smallest instance types I've seen in the documentation are c5.large, r5.large and m5.large, which have 2 vCPUs.
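If in doubt about what a given instance type actually provides, you can check the vCPU count directly; a small sketch (region and credentials assumed to be configured, and the instance types are just examples):

# Ask EC2 how many vCPUs an instance type has; each vCPU registers as 1024 CPU units in ECS
aws ec2 describe-instance-types --instance-types m3.medium m5.large \
  --query 'InstanceTypes[].{type:InstanceType,vcpus:VCpuInfo.DefaultVCpus}'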

ECS Cluster with soft and hard limit

We have an ECS cluster with one container, A (which has a soft limit of 412 MB at the container level), plus the ECS agent and another container (run via user data) which is hard-limited to 512 MB, and our instance type is t3a.micro. My doubt is: how much memory can the ECS agent use?
From the documentation, the soft limit is the memory reservation. In this case, does that mean 412 MB of memory is reserved only for container A?

service ecs-service was unable to place a task because no container instance met all of its requirements

This error happens sporadically, but when it happens, it blocks other events in the service, like scaling.
ERROR:
service ecs-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance xxx has insufficient CPU units available. For more information, see the Troubleshooting section.
I checked AWS docs:
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-container-instance-requirement-error/
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html#service-event-messages-1
They don't provide enough information.
This is the most relevant info from my Task Definition and Task:
Network mode: bridge
Compatibilities: EXTERNAL, EC2
Task size:
Memory: 1024
CPU: 1024
Container definitions:
Hard/Soft memory limits: 1024/--
CPU units: 1024
Is there a way to fix this and avoid these intermittent failures?
Thanks
The error is very clearly saying that none of the EC2 instances in your cluster has enough CPU available to run this task, given the tasks already running on those EC2 servers.
You could either configure a capacity provider on your ECS cluster that would add more EC2 instances to your cluster as needed, or you could switch from EC2 deployments to Fargate, which doesn't have this issue.
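A rough sketch of the capacity provider route (the capacity provider name, cluster name, and Auto Scaling group ARN are placeholders), which lets ECS add instances when tasks cannot be placed for lack of CPU or memory:

# Create a capacity provider backed by an existing Auto Scaling group
aws ecs create-capacity-provider --name my-capacity-provider \
  --auto-scaling-group-provider "autoScalingGroupArn=<asg-arn>,managedScaling={status=ENABLED,targetCapacity=100},managedTerminationProtection=DISABLED"

# Attach it to the cluster and make it the default placement strategy
aws ecs put-cluster-capacity-providers --cluster my-cluster \
  --capacity-providers my-capacity-provider \
  --default-capacity-provider-strategy capacityProvider=my-capacity-provider,weight=1

Note that put-cluster-capacity-providers replaces the cluster's existing capacity provider list, so include any providers that are already attached.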

AWS BATCH - how to run more concurrent jobs

I have just started working with AWS BATCH for my deep learning workload. I have created a compute environment with the following config:
min vCPUs: 0
max vCPUs: 16
Instance type: g4dn family, g3s family, g3 family, p3 family
allocation strategy: BEST_FIT_PROGRESSIVE
The maximum vCPU limit for my account is 16 and each of my jobs requires 16 GB of memory. I observe that a maximum of 2 jobs can run concurrently at any point in time. I was using allocation strategy BEST_FIT before and changed it to BEST_FIT_PROGRESSIVE, but I still see that only 2 jobs can run concurrently. This limits the amount of experimentation I can do in a given time. What can I do to increase the number of jobs that can run concurrently?
I figured it out myself just now. I'm posting an answer here in case anyone finds it helpful in the future. It turns out that the instances assigned to each of my jobs are g4dn.2xlarge. Each of these instances takes up 8 vCPUs, and as my vCPU limit is 16, only 2 jobs can run concurrently. One solution is to ask AWS to increase the vCPU limit by creating a new support case. Another solution could be to modify the compute environment to use GPU instances that consume 4 vCPUs (the smallest possible on AWS), in which case a maximum of 4 jobs can run concurrently.
There are 2 kinds of solutions:
1. Configure your compute environment with EC2 instances whose vCPU count is a multiple of your job definition's. For example, a compute environment with an 8-vCPU EC2 instance type and a limit of 128 vCPUs will let you execute up to 16 concurrent jobs with an 8-vCPU job definition, because 16 concurrent jobs x 8 vCPUs = 128 vCPUs (also take the allocation strategy and the memory of your instance into account, which is important if your jobs consume a lot of memory too).
2. Multi-node parallel jobs. This is a very interesting solution because in this scenario the EC2 instance vCPU count does not need to be a multiple of the vCPUs used in your job definition, and jobs can be spanned across multiple Amazon EC2 instances.
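A hedged sketch of checking the current limits before choosing either route (the compute environment name is a placeholder, and quota names and codes vary by account and region, so look them up rather than hard-coding them):

# Inspect the compute environment's instance types and maxvCpus
aws batch describe-compute-environments --compute-environments my-gpu-env

# Find the EC2 vCPU quota that applies to the GPU instance families ...
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'G and VT')].{name:QuotaName,code:QuotaCode,value:Value}"

# ... and request an increase so more jobs can run concurrently
aws service-quotas request-service-quota-increase --service-code ec2 \
  --quota-code <quota-code-from-previous-command> --desired-value 64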

Starting VM instance 'instance-1' failed. Error: Quota 'CPUS' exceeded. Limit: 72.0 in region us-central1

I'm trying to set up a Compute Engine instance on GCP with 96 vCPUs as a test. So I started with the smallest instance possible, prepared my environment for the processing and then edited the instance to this:
n1-highcpu-96 (96 vCPUs, 86.4 GB memory)
But then, when I started the instance, I got this error:
Starting VM instance 'instance-1' failed. Error: Quota 'CPUS' exceeded. Limit: 72.0 in region us-central1.
So, my first question is: is this quota specific to me, or will anyone else who tries the same configuration face the same error? I'm asking because I'm using the promotion that GCP provides for new users and I thought this might be a limitation of the promotional period.
After doing some research I came across this command to show the quota for different regions:
gcloud compute regions describe us-central1
And I ran it for the complete list of regions. Some regions were limited to 24, and the maximum number I got was 72.
quotas:
- limit: 72.0
  metric: CPUS
  usage: 0.0
So, how can I have an instance with 96 cores?
Those vCPU limits are the soft limits; the hard limits are specified here.
To be granted an increase, you must submit a quota increase request for the number of vCPUs you want in that location, justifying in the process why you need it.
If the request is approved, you will be allowed to run a 96 vCPU GCE instance in the region. Keep in mind that currently, the only exception is the region southamerica-east1, in which you can only have up to 64 vCPUs.
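As a small sketch, one way to see the CPUS quota and usage for every region before deciding where to file the request (the --flatten/--filter combination is one way to pull just that metric out of the describe output; flag behavior can differ slightly between gcloud versions):

# Print the CPUS quota limit and current usage for each region
for region in $(gcloud compute regions list --format="value(name)"); do
  echo "== $region =="
  gcloud compute regions describe "$region" \
    --flatten="quotas[]" \
    --filter="quotas.metric=CPUS" \
    --format="value(quotas.limit, quotas.usage)"
done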