Simulating and Testing Fargate Spot with ECS

Recently I've been looking into Fargate Spot with ECS in more detail and trying to understand ECS capacity providers a little better. I'm struggling to understand some of the details and to test certain scenarios.
I'm trying to understand what would happen with a capacity provider strategy like the one below if Fargate Spot capacity is unavailable.
My understanding is that it will launch 6 tasks using regular Fargate and then allocate additional tasks using Fargate Spot.
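For concreteness, a strategy matching that description would look roughly like the following, expressed with boto3 (the cluster, service, and task definition names, the desired count, and the weights are assumptions for illustration):

    import boto3

    ecs = boto3.client("ecs")

    # First 6 tasks on regular Fargate (base=6, weight=0); everything beyond
    # the base is allocated to Fargate Spot (weight=1).
    ecs.create_service(
        cluster="my-cluster",
        serviceName="my-service",
        taskDefinition="my-task:1",
        desiredCount=10,
        capacityProviderStrategy=[
            {"capacityProvider": "FARGATE", "base": 6, "weight": 0},
            {"capacityProvider": "FARGATE_SPOT", "weight": 1},
        ],
    )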
What if there is no Fargate Spot capacity available? What would happen?
From what I can see online, there is no failover between capacity providers. Is this correct?
Is there a way to simulate spot not being available?

There isn't really a way to simulate Spot unavailability. There is also no fallback mechanism to on-demand, and that is by design: Spot isn't just cheaper on-demand capacity, it is capacity with specific behaviors, tailored to specific types of workloads (those that can survive a shortage of capacity for extended periods without impacting the outcome, etc.).

Related

AWS: Migration to Fargate Spot

I am currently on AWS with ECS services running on Fargate.
I would like to migrate to Fargate Spot because of the pricing, but I've noticed that my containers can have downtime. If I schedule, for example, 2 tasks for each service, can I be sure there will be no such downtime? If not, is there a way to get rid of that downtime with Fargate Spot?
Thanks
Ultimately, you can never guarantee that AWS won't want the capacity back.
For this reason, depending on your use case, you can combine on-demand and Spot capacity for Fargate to account for a sudden loss of Spot tasks. This helps mitigate the impact if you lose your Fargate Spot capacity at short notice.
Some AWS recommendations for Fargate Spot are:
Fargate Spot is great for stateless, fault-tolerant workloads, but don't rely solely on Spot Tasks for critical workloads; configure a mix of regular Fargate Tasks as well
Applications running on Fargate Spot should be fault-tolerant
Handle interruptions gracefully by catching SIGTERM signals (see the sketch below)
If you’re trying to save money but need a consistent workload then a compute savings plan might be more appropriate (or combine with spot).
For more information take a look at the Deep dive into Fargate Spot to run your ECS Tasks for up to 70% less article.
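For the SIGTERM recommendation above, here is a minimal sketch of a graceful shutdown handler in Python (the cleanup steps are placeholders; Fargate Spot sends SIGTERM roughly two minutes before reclaiming the task):

    import signal
    import sys
    import time

    def handle_sigterm(signum, frame):
        # Finish or hand off in-flight work, flush buffers, close connections,
        # then exit cleanly before the task is reclaimed.
        print("SIGTERM received, shutting down gracefully")
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)

    while True:
        # main work loop
        time.sleep(1)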

Problems with Memory and CPU limits in AWS ECS cluster running on reserved EC2 instance

I am running an ECS cluster that currently has 3 services on a t3.medium instance. Each of those services runs only one task, which has a soft memory limit of 1GB; the hard limit is different for each (but that should not be the problem). I will always have enough memory to run one newly deployed task (the new one will also take 1GB, and the t3.medium can handle it since it has 4GB total). After the new task is up and running, the old one is stopped and I again have 1GB free for the next deployment. I did something similar with the CPU (2048 CPU units, each task has 512, and 512 free for new deployments).
So everything runs fine now, but I am not completely satisfied with this setup for the future. What will happen if I need to add another service with another task? I would need to redeploy all existing tasks and modify their task definitions to use less CPU and memory in order to fit the new task (and its deployments). I am planning to get a reserved EC2 instance, so it will not be easy to swap the current EC2 instance for a larger one.
Is there a way to spin up another EC2 instance in the same ECS cluster to handle bursts in my tasks? The same applies to deployments: it's not a perfect scenario to only be able to deploy one task at a time and then wait for the old one to be killed before deploying the next, just to avoid downtime.
And my biggest concern: what if I need a new service and task? I would again have to adjust all the others in order to run the new one and redeploy them, which is not very maintainable. And what if I cannot lower the CPU and memory any further because I have already reached the lowest values at which the tasks run smoothly?
I was thinking about having another EC2 instance in the same cluster to handle bursts, deployments, and new services/tasks, but I am not sure if that's possible or if it's the best way of doing this. I was also thinking about Fargate, but it is much more expensive and I cannot afford it for now. What do you think? Any ideas, suggestions, and hints will be helpful, since I am desperate to find the best way to avoid the problems mentioned above.
Thanks in advance!
So unfortunately, there is no out-of-the-box solution to ensure that all your tasks run on the minimum possible number of instances (i.e. one). You can use our new feature called Capacity Providers (CP), which allows you to ensure the minimum number of EC2 instances required to run all your tasks. The major difference between a CP and a plain ASG is that a CP gives more weight to task placement (whereas an ASG scales in/out based on resource utilization, which isn't ideal in your case).
However, it's not an ideal solution. Just as you said in your comment, when the service needs to scale out during a deployment, the CP will spin up another instance, the new task will be placed on it, and once it reaches the RUNNING state, the old task will be stopped.
But now you have an "extra" EC2 instance, because there is no way to replace a running task. The only way I can think of would be to use a Lambda function that drains the new instance, which moves all the service tasks to the other instance. The CP will then, after about 15 minutes, terminate this instance, as no tasks are running on it.
A couple of caveats:
CPs are new and a little rough around the edges; you can't delete or modify them, only create or deactivate them.
A CP needs an underlying ASG, and they must have a 1:1 relationship
Make sure to enable managed scaling when creating CP
Choose 100% capacity target
Don't forget to add a default capacity strategy for the cluster
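Putting those caveats together, a rough boto3 sketch of the setup (the capacity provider, cluster, and ASG names/ARN are placeholders):

    import boto3

    ecs = boto3.client("ecs")

    # 1:1 with an existing Auto Scaling group, managed scaling enabled,
    # and a 100% target capacity.
    ecs.create_capacity_provider(
        name="my-cp",
        autoScalingGroupProvider={
            "autoScalingGroupArn": "arn:aws:autoscaling:eu-west-1:123456789012:autoScalingGroup:uuid:autoScalingGroupName/my-asg",
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": 100,
            },
        },
    )

    # Attach it to the cluster and set the default capacity provider strategy.
    ecs.put_cluster_capacity_providers(
        cluster="my-cluster",
        capacityProviders=["my-cp"],
        defaultCapacityProviderStrategy=[
            {"capacityProvider": "my-cp", "weight": 1},
        ],
    )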
Minimizing EC2 instances used:
If you're using a capacity provider, the 'binpack' placement strategy minimises the number of EC2 hosts that are used.
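As a minimal sketch, binpack on memory can be set in the service's placement strategy (the names here are placeholders):

    import boto3

    ecs = boto3.client("ecs")

    # Place each task on the instance with the least remaining memory that
    # still fits it, keeping the number of instances in use to a minimum.
    ecs.create_service(
        cluster="my-cluster",
        serviceName="my-service",
        taskDefinition="my-task:1",
        desiredCount=3,
        placementStrategy=[{"type": "binpack", "field": "memory"}],
    )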
However, there are some scale-in scenarios where you can end up with a single task running on its own EC2 instance. As Ali mentions in their answer; ECS will not replace this running task, but depending on your setup, it may be fairly easy for you to replace it yourself by configuring your task to voluntarily 'quit'.
In my case; I always have at least 2 tasks running per service. So I just added some logic to my tasks' healthchecks, so they report as unhealthy after ~6 hours. ECS will spot the 'unhealthy' task, remove it from the load balancer, and spin up a replacement (according to the binpack strategy).
Note: If you take this approach; add some variation to your timeout so you're less likely to have all of your tasks expire at the same time. Something like: expiry = now + timedelta(hours=random.uniform(5.5,6.5))
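A minimal sketch of that expiry logic, assuming your health endpoint calls something like the function below (the names and the exact window are illustrative):

    import random
    from datetime import datetime, timedelta

    # Pick an expiry 5.5-6.5 hours after startup so that tasks started
    # together don't all expire at the same time.
    EXPIRY = datetime.utcnow() + timedelta(hours=random.uniform(5.5, 6.5))

    def healthcheck() -> bool:
        # Report unhealthy once the expiry has passed; ECS / the load balancer
        # then drains the task and starts a replacement per the binpack strategy.
        return datetime.utcnow() < EXPIRY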
Sharing memory 'headspace' with soft-limits:
If you set both soft and hard memory limits, ECS will place your tasks based on the soft limit. If your tasks' memory usage varies with load, it's fairly easy for your EC2 instance to start swapping.
For example: say you have a task defined with a soft limit of 900MB and a hard limit of 1800MB. You spin up a service with 4 running tasks, and ECS places all 4 of them on a single t3.medium. Notice that each task thinks it can safely use up to 1800MB, when in fact there's very little free memory on the host server. When you hit your service with some traffic, each task tries to use some more memory, and your t3.medium is incapacitated as it starts swapping memory to disk. ECS does not recover from this type of failure very well. It notices that the tasks are no longer available and will attempt to provision replacements, but the capacity provider is very slow to replace the swapping t3.medium.
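For reference, the soft and hard limits in that example correspond to memoryReservation and memory (both in MiB) in the container definition; a rough sketch (the family and image names are placeholders):

    import boto3

    ecs = boto3.client("ecs")

    ecs.register_task_definition(
        family="my-task",
        containerDefinitions=[
            {
                "name": "app",
                "image": "my-repo/my-app:latest",
                "memoryReservation": 900,   # soft limit, used for task placement
                "memory": 1800,             # hard limit the container may grow to
            }
        ],
    )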
My suggestion:
Configure your service to auto-scale based on memory usage (this will be a percentage of your soft-limit), for example: a target memory usage of 70%
Configure your tasks' healthchecks so that they report as unhealthy when they are nearing their soft-limit. This way, your tasks still have some headroom for quick spikes of memory usage, while giving your load balancer a chance to drain and gracefully replace tasks that are getting greedy. This is fairly easy to do by reading the value within /sys/fs/cgroup/memory/memory.usage_in_bytes.
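A rough sketch of such a check, assuming the cgroup v1 path referenced above (the soft limit and threshold values are illustrative; paths differ under cgroup v2):

    SOFT_LIMIT_BYTES = 900 * 1024 * 1024   # match the task's memoryReservation
    UNHEALTHY_THRESHOLD = 0.9              # report unhealthy at 90% of the soft limit

    def memory_healthcheck() -> bool:
        with open("/sys/fs/cgroup/memory/memory.usage_in_bytes") as f:
            used = int(f.read())
        return used < SOFT_LIMIT_BYTES * UNHEALTHY_THRESHOLD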

AWS Binpack placement strategy resulting in issues during instance autoscaling

Here is the scenario:
We are running Docker containers in an AWS ECS cluster. Previously, we were not using any placement strategy for containers. To minimize the number of instances within the cluster, we tried introducing the binpack placement strategy. After that, whenever we try to deploy multiple containers at a time (in parallel), the instances do not autoscale and stay at the minimum limit set for them. We are not sure what went wrong. Most of the services are not reaching a steady state because of this. For now, we have removed binpack; everything works perfectly again and we are able to deploy in parallel.
There is no issue, though, when we deploy one service at a time; everything seems fine.
We are using t2.large type instances in our case.
Instance auto-scaling is happening based on memory reservation (>80% for 1 minute).
Looking at the graph, we can see that the memory threshold is not being sustained. It crosses the 80% threshold only for a few seconds and then goes down again, which seems like strange behaviour to me.
Does binpack not support t2 instance types? Or is there some other case I am missing?

How to speed up deployments on AWS Fargate?

After migrating from EC2 cluster instances to AWS Fargate, I realized that deployments take a lot longer. Before, they would take 1-2 minutes; now some deployments take up to 5 minutes. This post claims that their deployments on Fargate can even take up to 10 minutes.
Does anybody know of a way to speed them up? I can't find any documentation on this topic.
Through further googling I found this Reddit thread. An AWS employee wrote:
With regard to time to provision and start a container it is definitely longer when using Fargate. We may reduce the length of the provisioning state in the future, but Fargate is doing much more under the hood than ECS on your own self-managed hosts. When you self-manage hosts they are already up and running, and may even already have your docker image downloaded and cached locally, so ECS is able to launch the container very quickly. That's not the case with Fargate.
So shrinking the image should help a little. But in general I guess I'll have to live with it and hope for optimizations on AWS' side.
Here's the breakdown of tasks and possible improvements that I've found while researching options to improve my deployment times with ECS Fargate:
Fargate Deployment Overview
Here's a breakdown of what's going on behind the scenes that contributes to the deployment duration:
Provision the Fargate worker instance
Provision/attach the ENI
Download the Docker image
Here you have opportunities for improvement:
reduce the size of your Docker image
Networking throughput is based on the CPU allocated to the Fargate Task - if you allocate more CPU, you get more network throughput and the image will download faster
Application Startup time
Becomes a factor if your application requires a health check grace period, again affected by CPU allocation
If your task is associated with a load balancer the deployment will also need to pass health checks, and you'll need to account for:
Load balancer deregistration delay
Pass health checks: (Health Check Interval * Threshold)
How to deploy Fargate Task updates faster
Over-allocate the CPU
Reduce the deregistration delay
Set the health check threshold to 2 and interval to 5 seconds
don't forget to account for a health check grace period if your app needs it
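As a rough sketch of the load balancer side of those tweaks with boto3 (the target group ARN and exact values are placeholders); with a 5-second interval and a healthy threshold of 2, a new task can pass health checks in roughly 10 seconds:

    import boto3

    elbv2 = boto3.client("elbv2")
    tg_arn = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/my-tg/0123456789abcdef"

    # Faster health checks: 5s interval x 2 healthy checks before the new task counts as healthy.
    elbv2.modify_target_group(
        TargetGroupArn=tg_arn,
        HealthCheckIntervalSeconds=5,
        HealthyThresholdCount=2,
    )

    # Shorter deregistration delay so the old task doesn't linger (default is 300s).
    elbv2.modify_target_group_attributes(
        TargetGroupArn=tg_arn,
        Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "30"}],
    )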
My Results
During my testing, I was able to deploy my application that typically takes about 8 minutes w/1024 CPU (1vCPU) in under 4 minutes w/4096 CPU (4vCPU)
Disclaimer
Your tasks likely require considerably less CPU, and you don't want to always be paying for over-allocated CPU. So, run your deployment with over-allocated resources, and then run another deployment right after with the original CPU allocation.
Probably not a solution you want to use for every deployment, but could be a solution for hotfix deployments.
Additional Reading
Highly recommend reading Scaling containers on AWS in 2022
Two reasons they're slower, in my experience:
awsvpc network mode attaches an ENI to the task. When the same thing has to be done for a Lambda running in a VPC, it is known to dramatically increase the initial spin-up time.
Docker image size also affects startup time, since the image will usually need to be downloaded to whichever hidden host the task launches on. I've done some benchmarking with a small 200MB container and a 2.5GB container; the former did start up quicker.
You can't do much about awsvpc, since Fargate requires it. Shrinking down that image would be your next biggest impact.

One big EC2 or multiple small EC2 or one ECS - which is cost-effective?

We are running 6 Java Spring Boot based microservice projects on one AWS t2.large (2 CPU & 8GB RAM) EC2 machine. When we look at the CPU and RAM utilization of this EC2 machine, the CPU is underutilized, hardly hitting 1%, but the RAM usage is always above 85%.
Now we want to add 2 more microservices. We definitely can't deploy these 2 services on the EC2 instance we already have, as the RAM would be exhausted. So we need to add one more EC2 machine, or upgrade the t2.large to a t2.xlarge (4 CPU & 16GB RAM) to deploy the 2 new services, but we also want to be cost-effective.
We have been studying some AWS articles, and they say that running ECS services will be cost-effective as the billing is based on how much CPU and RAM is used (reference - ECS Fargate Pricing). As our CPU is always underutilized, we are thinking that the ECS Fargate pricing model will reduce our cost.
So, 3 questions with respect to cost saving for the above case,
If we prefer ECS, will that be really cost-effective than adding one more EC2 instance as ECS pricing is based on how much resources are utilized?
If we need to deploy 8 projects in ECS, will that create 8 EC2 instances or we can still use 1 or 2 EC2 machines for all 8 projects deployment?
Should we completely re-think about our deployment process in a different way?
If we prefer ECS, will that be really cost-effective than adding one more EC2 instance as ECS pricing is based on how much resources are utilized?
This is worded a little awkwardly, but I think what you're asking is whether Fargate will be more cost-effective than standard EC2 instances. That, I think, is pretty difficult for us to determine for you, though you should be able to do some quick estimations based on your real-world usage and compare that with the cost of the smallest necessary instance type.
To throw another option in the mix, consider Reserved Instances, as they come at a pretty steep discount if you're willing to reserve them for some period of time. From the linked doc, you'd be looking at somewhere around a 40% discount (compared to on-demand) for a 1 year reservation, and 60% for a 3 year reservation.
Additionally, consider different instance types, specifically the memory-optimized r4 and x1 families. These let you purchase more memory and less CPU.
If we need to deploy 8 projects in ECS, will that create 8 EC2 instances or we can still use 1 or 2 EC2 machines for all 8 projects deployment
Assuming you're not referring to Fargate here, the answer is ... it depends; it depends on how you design your system. You can host as many containers on an instance as you want (assuming you have the resources) -- but whether or not that's what you want from a high-availability / disaster scenario is for you to decide.
Should we completely re-think about our deployment process in a different way?
This will be answered when you come to a conclusion on your first question.