AWS: Migration to Fargate Spot - amazon-web-services

I am currently on AWS with ECS services running on Fargate.
I would like to migrate to Fargate Spot because of the pricing, but I understand that my containers can have downtime. If I schedule, for example, 2 tasks for each service, can I be sure that there will not be such downtime? If not, is there a way to get rid of that downtime with Fargate Spot?
Thanks

Ultimately you can never guarantee that AWS will not reclaim the capacity; Fargate Spot tasks can be interrupted with a two-minute warning.
For this reason, depending on your use case, you can combine on-demand and Spot capacity for Fargate to account for a sudden loss of Spot tasks. This helps mitigate the impact if your Fargate Spot capacity is reclaimed at short notice (see the capacity provider sketch at the end of this answer).
Some AWS recommendations for Fargate Spot are:
Fargate Spot is great for stateless, fault-tolerant workloads, but don't rely solely on Spot Tasks for critical workloads; configure a mix of regular Fargate Tasks as well
Applications running on Fargate Spot should be fault-tolerant
Handle interruptions gracefully by catching SIGTERM signals
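For example, here is a minimal Python sketch of that last point (the worker loop is illustrative and not tied to any particular framework): trap SIGTERM, stop picking up new work, and exit cleanly within the stop-timeout window.

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Fargate Spot sends SIGTERM ahead of the forced stop when capacity is reclaimed;
    # use the remaining window (bounded by the task's stopTimeout) to wind down.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # ... process one unit of work, checkpointing progress as you go ...
    time.sleep(1)

# Flush state / close connections here, then exit so ECS sees a clean stop.
sys.exit(0)
```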
If you're trying to save money but your workload needs to run consistently, then a Compute Savings Plan might be more appropriate (or combine it with Spot).
For more information, take a look at the "Deep dive into Fargate Spot to run your ECS Tasks for up to 70% less" article.
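As a concrete illustration of mixing regular and Spot capacity, here is a hedged boto3 sketch; the cluster, service, task definition, and subnet names are hypothetical, and the cluster is assumed to already have the FARGATE and FARGATE_SPOT capacity providers enabled.

```python
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",
    serviceName="my-service",
    taskDefinition="my-task:1",
    desiredCount=4,
    capacityProviderStrategy=[
        # "base" guarantees this many tasks always run on regular Fargate.
        {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
        # Tasks beyond the base are split by weight; here most go to Fargate Spot.
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```

The base of 2 keeps a floor of on-demand tasks even if all Spot capacity is reclaimed; the weights only govern how tasks beyond that base are distributed.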

Related

Simulating and Testing Fargate Spot with ECS

Recently I've been looking into Fargate Spot with ECS in more detail and trying to understand the capacity providers in ECS a little better. I'm struggling to understand some of the details and to test some scenarios.
I'm trying to understand what would happen if you have a capacity provider strategy like the one below when Fargate Spot capacity is unavailable.
I understand that it will launch 6 tasks using Fargate and then allocate additional tasks using Fargate Spot.
What if there is no Fargate Spot capacity available? What would happen?
From what I can see online, there is no failover between capacity providers. Is this correct?
Is there a way to simulate spot not being available?
There isn't really a way to simulate Spot unavailability. Also, there is no fallback mechanism to on-demand (by design). This is done on purpose because Spot isn't just cheaper on-demand capacity; it is capacity with specific behaviors tailored to specific types of workloads (those that can survive a shortage of capacity for extended periods without impacting the outcome, etc.).
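While you can't trigger a Spot shortage on demand, you can at least observe which capacity provider each task actually landed on. A small boto3 sketch (the cluster and service names are hypothetical):

```python
import boto3

ecs = boto3.client("ecs")

task_arns = ecs.list_tasks(cluster="my-cluster", serviceName="my-service")["taskArns"]
if task_arns:
    for task in ecs.describe_tasks(cluster="my-cluster", tasks=task_arns)["tasks"]:
        # capacityProviderName is FARGATE or FARGATE_SPOT for Fargate launch types.
        print(task["taskArn"], task.get("capacityProviderName"))
```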

HPA on EKS-Fargate

This is not a question about how to implement HPA on an EKS cluster running Fargate pods... It's about whether it is necessary to implement HPA along with Fargate, because as far as I know, Fargate is a "serverless" solution from AWS: "Fargate allocates the right amount of compute, eliminating the need to choose instances and scale cluster capacity. You only pay for the resources required to run your containers, so there is no over-provisioning and paying for additional servers."
So I'm not sure in which cases I would want to implement HPA on an EKS cluster running Fargate, but the option is there. I would like to know if someone could give more information.
Thank you in advance
EKS/Fargate allows you NOT to run the Cluster Autoscaler (CA), because there are no nodes you need to manage to run your pods. This is what is referred to by "no over-provisioning and paying for additional servers."
HOWEVER, you could (and usually would) use HPA, because Fargate does not provide a resource scaling mechanism for your pods. You can configure the size of your Fargate pods via Kubernetes requests, but at that point it is a regular pod with finite resources. You can use HPA to determine the number of pods (on Fargate) you need to run at any point in time for your deployment.
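For illustration, here is a minimal HPA created with the official Kubernetes Python client; this assumes metrics-server is running in the cluster and targets a hypothetical Deployment named my-app.

```python
from kubernetes import client, config

# Assumes kubectl is already configured against the EKS cluster.
config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="my-app-hpa", namespace="default"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="my-app"
        ),
        min_replicas=2,
        max_replicas=10,
        # Scale out when average CPU utilization across pods exceeds 70%.
        target_cpu_utilization_percentage=70,
    ),
)
autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="default", body=hpa)
```

On EKS/Fargate each new replica simply gets its own Fargate pod, so HPA answers the "how many pods" question while Fargate answers the "where do they run" question.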

How can I understand `Nodes` in EKS Fargate?

I deployed an EKS cluster and a Fargate profile, then deployed a few applications to this cluster. I can see that Fargate instances are launched.
When I click each of these instances, it shows me some information like OS, image, etc., but it doesn't tell me the CPU and memory. When I look at the Fargate pricing (https://aws.amazon.com/fargate/pricing/), it is calculated based on CPU and memory.
I have used ECS, and there it is very clear that I need to provision CPU/memory at the service/task level, but I can't find anything similar in EKS.
How do I know how many resources they are consuming?
With Fargate you don't have to provision, configure, or scale virtual machines to run your containers; the containers themselves become the fundamental compute primitive.
This model is called serverless: you are charged only for the compute resources and storage needed to execute your code. It does not mean that there are no servers involved; it's just that you don't need to care about them.
To monitor those resources you can use CloudWatch. The documents below describe how this can be achieved:
How do I troubleshoot high CPU utilization on an Amazon ECS task on Fargate?
How can I monitor high memory utilization for Amazon ECS tasks on Fargate?
It is worth mentioning that Fargate is just a launch type for ECS (another one is EC2). Please have a look at the diagram in this document for a clear picture of how those are connected. The CloudWatch metrics are collected automatically for Fargate. If you are using EKS with Fargate, you can monitor the pods with a metrics add-on or Prometheus inside your Kubernetes cluster.
Here's an example of monitoring Fargate with Prometheus. Notice that it scrapes the metrics from CloudWatch.
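Since EKS on Fargate sizes (and bills) each pod based on its CPU/memory requests, rounded up to the nearest supported Fargate configuration plus some overhead for Kubernetes components, a quick way to see what you asked for is to read the requests off the pod specs. A sketch with the Kubernetes Python client, assuming a hypothetical namespace my-app:

```python
from kubernetes import client, config

# Assumes kubectl is already configured against the EKS cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="my-app").items:
    for container in pod.spec.containers:
        requests = container.resources.requests or {}
        print(pod.metadata.name, container.name,
              "cpu:", requests.get("cpu", "none"),
              "memory:", requests.get("memory", "none"))
```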

Is AWS Fargate true serverless like Lambda? Does it automatically shut down when it finishes the task?

This really wasn't clear to me in the docs, and the console configuration is very confusing.
Will a Docker Cluster running in Fargate mode behind a Load Balancer shutdown and not charge me while it's not being used?
What about cold starts? Do I need to care about this in Fargate like in Lambda?
Is it less horizontal than Lambda? A lambda hooked to API Gateway will spawn a new function for every concurrent request, will Fargate do this too? Or will the load balancer decide it?
I've been running Flask/Django applications in Lambda for some time (Using Serverless/Zappa), are there any benefits in migrating them to Fargate?
It seems that Fargate is more expensive than Lambda, but if the Lambda limitations are not a problem, then Lambda should always be the better choice, right?
Will a Docker Cluster running in Fargate mode behind a Load Balancer shutdown and not charge me while it's not being used?
This will depend on how you configure scaling for your service. With Fargate there is no EC2 Auto Scaling group; scaling acts on the service's desired task count (Service Auto Scaling). If you allow it to scale down to 0, then yes.
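Scaling a Fargate service to zero isn't automatic, though. One way to do it (a sketch with hypothetical cluster and service names, using scheduled scaling; you could also pair this with a metric-based policy) is via Application Auto Scaling:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Allow the service's desired count to go all the way down to 0 tasks.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Scale in to 0 tasks every evening...
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="scale-in-evening",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 20 * * ? *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 0},
)

# ...and back out to at least 2 tasks every morning.
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="scale-out-morning",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 6 * * ? *)",
    ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 4},
)
```

Note that with zero running tasks there is nothing to serve requests, so scale-out has to come from a schedule, a metric-based policy, or a manual action.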
What about cold starts? Do I need to care about this in Fargate like in Lambda?
Some good research has been done on this here: https://blog.cribl.io/2018/05/29/analyzing-aws-fargate/
But the takeaway is that for smaller task sizes you shouldn't see much more than ~40 seconds to get to a running state. For bigger ones this will take longer.
Is it less horizontal than Lambda? A lambda hooked to API Gateway will spawn a new function for every concurrent request, will Fargate do this too? Or will the load balancer decide it?
ECS will not create a new instance for every concurrent request; any scaling is driven by your auto scaling configuration. The load balancer doesn't have any control over scaling; it exclusively balances load. However, the metrics it provides can be used to help determine whether scaling is needed.
I've been running Flask/Django applications in Lambda for some time (Using Serverless/Zappa), are there any benefits in migrating them to Fargate?
I haven't used Flask or Django, but the main reason people tend to migrate to serverless is to remove the need to maintain and scale servers; this includes managing instance types, cluster scheduling, and optimizing cluster utilization.
#abdullahkhawer: I agree with his view on sticking to Lambdas. Unless you require something to always be running and always in use, 99% of the time Lambdas will be cheaper than running a VM.
For a pricing example:
1 t2.medium on demand EC2 instance = ~$36/month
2 Million invocations of a 256MB 3 second running lambda = $0.42/month
With AWS Fargate, you pay only for the amount of vCPU and memory resources that your containerized application requests, from the time your container images are pulled until the AWS ECS Task (running in Fargate mode) terminates. A minimum charge of 1 minute applies. So, you pay for as long as your Task (a group of containers) is running, more like AWS EC2 but billed per second (with the one-minute minimum noted above), and unlike AWS Lambda where you pay per request/invocation.
AWS Fargate doesn't spawn containers on every request as in AWS Lambda. AWS Fargate works by simply running containers on a fleet of AWS EC2 instances internally managed by AWS.
AWS Fargate now supports the ability to run tasks on a scheduled basis and in response to AWS CloudWatch Events. This makes it easier to launch and stop container services that you need to run only at a certain time to save money.
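For example, a scheduled one-off Fargate task can be wired up with a hedged boto3 sketch like the one below; every ARN, subnet, and name here is a placeholder, and the IAM role must allow events.amazonaws.com to call ecs:RunTask on your behalf.

```python
import boto3

events = boto3.client("events")

# Fire the rule every day at 02:00 UTC.
events.put_rule(
    Name="nightly-batch",
    ScheduleExpression="cron(0 2 * * ? *)",
)

# Point the rule at an ECS task definition launched on Fargate.
events.put_targets(
    Rule="nightly-batch",
    Targets=[{
        "Id": "nightly-batch-task",
        "Arn": "arn:aws:ecs:eu-west-1:123456789012:cluster/my-cluster",
        "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:eu-west-1:123456789012:task-definition/my-task:1",
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],
                    "AssignPublicIp": "ENABLED",
                }
            },
        },
    }],
)
```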
Keeping your use case in mind, if your applications are not causing any problems in production due to AWS Lambda limitations, then AWS Lambda is the better choice. If AWS Lambda is being invoked very heavily in production (e.g., more than 1K concurrent invocations at any point in time), then go for AWS EKS or AWS Fargate, as AWS Lambda might cost you more.

Auto Scaling for EC2 Celery workers with ASG or Spot Fleets?

I want to set up auto scaling of EC2 instances for my Celery cluster. So far I've been doing this manually, spinning up new EC2 instances whenever I see (manually) that the SQS queue experiences low throughput.
Searching around, I've come across two seemingly similar solutions:
An Auto Scaling Group with EC2 instances, using Launch Configurations configured to use Spot Instances
A Spot Fleet with direct response actions to SQS metrics
Most questions on SO are dated (6 months is not much, but it's basically the release date of the Spot Fleet auto scaling feature) and mention that Spot Fleet lacks ASG features.
One particular concern: running the Celery worker requires me to run some setup (install packages, download code, run some scripts). I'm not particularly worried about cost (both are Spot Instances, close enough) or reliability (workers can die without a problem, and the exact size of the cluster is not that important either).
Which option would be best practice to get this done?
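To make the first option concrete, here is a hedged boto3 sketch of attaching a target-tracking policy on queue depth to an existing Auto Scaling group; the group name celery-workers and queue name celery-tasks are hypothetical, and the worker setup itself (packages, code, scripts) would still live in the launch configuration's user data.

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="celery-workers",
    PolicyName="scale-on-queue-depth",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # Track visible messages in the queue. AWS generally recommends a
        # "backlog per instance" custom metric instead, but the raw queue
        # length keeps this sketch short.
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Namespace": "AWS/SQS",
            "Dimensions": [{"Name": "QueueName", "Value": "celery-tasks"}],
            "Statistic": "Average",
        },
        "TargetValue": 100.0,  # aim for roughly 100 queued messages across the group
    },
)
```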