AWS Batch EC2 Provision Time - amazon-web-services

I'm relatively new to using AWS Batch, and have been noticing it takes a LONG time to spin up EC2 instances in a managed compute environment.
My jobs will go from Submitted > Pending > Runnable within 1 minute.
But sometimes they will sit in Runnable anywhere from 15 minutes to 1 hour before an EC2 instance finally gets around to spinning up.
Any tips and tricks on getting AWS Batch to spin up instances more quickly?
Ideally I'd like an instance the moment something is in the Runnable state.
For some more context, I am using AWS Batch essentially like Lambda, but with my own choice of instance type and disk. I can't use Lambda because the jobs need a lot more resources (GPUs) and time to process.

The scheduler appears to take its time depending on load at the data center, which is not visible to users.
It would be great if creating a Batch job returned an estimated wait time.
In practice, sometimes I get machines instantly, sometimes it takes up to 15 minutes, and sometimes it takes an hour or more for newer GPU instance types because none are available.
There doesn't appear to be any way to control the scheduler. Oh well.

Note: The settings below might help reduce provisioning time, but will incur additional costs.
Compute environments -> Compute resources -> Minimum vCPUs
Setting this to 1 (or more) keeps at least one instance running all the time.
Compute environments -> Compute resources -> Allocation strategy
Changing this from "BEST_FIT" to "BEST_FIT_PROGRESSIVE" may also help.
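As a rough sketch of how those two settings look outside the console, the snippet below builds the computeResources dictionary that boto3's batch.create_compute_environment accepts. The instance types, subnet, security group, role name and maxvCpus value are placeholders I made up, not values from the question.

```python
def build_compute_resources(min_vcpus=1, max_vcpus=64):
    """Return a computeResources dict for a managed EC2 compute environment."""
    return {
        "type": "EC2",
        # BEST_FIT_PROGRESSIVE lets Batch pick other listed instance types
        # when the preferred type has no capacity available.
        "allocationStrategy": "BEST_FIT_PROGRESSIVE",
        # A non-zero minimum keeps capacity running (and billed) while idle.
        "minvCpus": min_vcpus,
        "maxvCpus": max_vcpus,
        "instanceTypes": ["p3.2xlarge", "g4dn.xlarge"],  # placeholder GPU types
        "subnets": ["subnet-EXAMPLE"],                   # placeholder
        "securityGroupIds": ["sg-EXAMPLE"],              # placeholder
        "instanceRole": "ecsInstanceRole",               # placeholder
    }


resources = build_compute_resources(min_vcpus=1)
print(resources["allocationStrategy"], resources["minvCpus"])
```

The dictionary would then be passed as computeResources=resources, together with computeEnvironmentName and type="MANAGED", to create_compute_environment on a boto3 Batch client.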

Related

Cloudwatch Period time

CPU metrics cannot be selected with a period below 1 minute in the CloudWatch service. How can I lower this period so that Auto Scaling triggers faster? I just need to trigger Auto Scaling of instances within a short time. (By the way, the alarm is set to 1 out of 1 datapoints.)
The minimum granularity for the metrics that EC2 provides is 1 minute.
Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html
Would also say that if you need to scale that quickly, wouldn't the startup time be an issue anyway?
You are correct -- basic monitoring of an Amazon EC2 instance provides metrics over 5-minute periods. If you activate EC2 Detailed Monitoring, metrics are provided over 1-minute periods. Extra charges apply for Detailed Monitoring.
When launching a new instance via Amazon EC2 Auto-Scaling, it can take a few minutes for the new instance to launch and for the User Data script (if any) to run. Linux instances are quite fast, but Windows instances take a while on their first boot due to sysprep operations.
You mention that you want to react to a metric in less than one minute. I would suggest that this would not be an ideal way to trigger Auto-scaling. Sometimes a computer can be busy for a while, then can drop down again. Reacting too quickly to a high CPU load would cause the Auto-Scaling group to flap between adding instances and terminating instances. It is better to provision enough capacity for a reasonable amount of extra load and then gradually add more capacity as it is required over time.
If you have a need to react so quickly, then perhaps you should investigate using AWS Lambda to perform small amounts of work in a highly-parallel fashion rather than relying on Amazon EC2 instances.
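To make the flapping point above a bit more concrete, here is a small simulation (not CloudWatch code) that counts how often an alarm would fire on one-minute CPU samples when it needs 1 consecutive breaching datapoint versus 3. The sample values and the 80% threshold are invented for illustration.

```python
def count_scale_outs(cpu_samples, threshold=80, datapoints_required=1):
    """Count how often an alarm would fire for one-minute CPU samples.

    The alarm fires when `datapoints_required` consecutive samples exceed
    `threshold`; the streak is reset after each firing.
    """
    fires = 0
    streak = 0
    for cpu in cpu_samples:
        streak = streak + 1 if cpu > threshold else 0
        if streak >= datapoints_required:
            fires += 1
            streak = 0
    return fires


# Hypothetical readings: two brief spikes, then a few minutes of real load.
samples = [20, 95, 30, 25, 90, 35, 85, 90, 92, 88, 30]
print(count_scale_outs(samples, datapoints_required=1))  # 6
print(count_scale_outs(samples, datapoints_required=3))  # 1
```

With a single datapoint every short spike triggers a scale-out, while requiring three consecutive datapoints only reacts to the sustained load.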

Increase and decrease AWS instances CPUs automatically

Is there a way in AWS to increase and decrease an instance's CPUs depending on load? I have been paying a lot of money for AWS, manually increasing and decreasing the instance cores, even when no clients are using it.
To be more specific, clients can upload an Excel file and the software will do some calculations whose duration depends on the number of instance cores. With 2 cores it takes 30 minutes to complete, and with 96 cores it takes only a couple of minutes.
Is there a way to automatically increase the cores to 96 when clients are using the website and uploading files, and automatically decrease the cores to 2 when nothing is happening, that is, when clients are either not using the website or just browsing current data without taking a new action?
If not, can I add a schedule in AWS to change the instance type? For example, run the instance on a 2-core type (e.g. t2.large), change it to a 96-core type (e.g. c5a.24xlarge) only from 1pm to 6pm, and after that switch it back to 2 cores?
I'm very new to AWS and devops in general, and I have been reading about AWS Autoscaling groups, but I'm not sure if this is the answer for my problem.
No, it is not possible to "scale CPU cores" on a running instance. (This is commonly known as vertical scaling.)
Instead, the recommended method is to add/remove parallel capacity based upon demand.
If you are using Amazon EC2, then you can launch more instances or terminate existing instances. This can be automated through Amazon EC2 Auto Scaling, which can monitor metrics (eg CPU Utilization) and then launch/terminate instances automatically. You would typically put a Load Balancer in front of these instances if they are web servers, or the instances might be 'worker nodes' that pull work from a queue.
If you are using containers (Docker, Kubernetes) then Amazon ECS/Amazon EKS can automatically add/remove tasks to meet demand for your application.
If you are using AWS Lambda functions, then they 'scale' by allowing multiple functions to run in parallel. Lambda functions typically exit when they have finished processing, so there is no charge when there is nothing to process.
These are all examples of Horizontal scaling, where capacity is added/removed in parallel.
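As a toy illustration of the 'worker nodes pulling from a queue' idea, the function below works out how many workers would be wanted for a given backlog. This is not how EC2 Auto Scaling computes capacity internally; the function name, the one-job-per-worker default and the limits are assumptions of mine.

```python
import math


def desired_workers(pending_jobs, jobs_per_worker=1, min_workers=0, max_workers=10):
    """Return how many workers to run for the current backlog, within limits."""
    needed = math.ceil(pending_jobs / jobs_per_worker)
    # Clamp to the configured floor and ceiling, like an Auto Scaling group's
    # minimum and maximum size.
    return max(min_workers, min(max_workers, needed))


print(desired_workers(0))   # nothing uploaded, nothing running
print(desired_workers(3))   # three uploads, three workers
print(desired_workers(25))  # capped at max_workers
```

With min_workers=0 nothing runs (and nothing is billed) while no files are waiting, which is the behaviour the question is after.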

How to spin up all nodes in my EMR cluster before running my spark job

I have an EMR cluster that can scale up to a maximum of 10 SPOT nodes. When not being used it defaults to 1 CORE node (and 1 MASTER) to save costs obviously. So in total it can scale up to a maximum of 11 nodes 1 CORE + 10 SPOT.
When I run my spark job it takes a while to spin up the 10 SPOT nodes and my job ends up taking about 4hrs to complete.
I tried waiting until all the nodes were spun up, then canceled my job and immediately restarted it so that it can start using the max resources immediately, and my job took only around 3hrs to complete.
I have 2 questions:
1. Is there a way to make YARN spin up all the necessary resources before starting my job? I already specify the spark-submit parameters such as num-executors, executor-memory, executor-cores etc. during job submit.
2. I haven't done the cost analysis yet, but is it even worthwhile to do number 1 mentioned above? Does AWS charge for spin up time, even when a job is not being run?
Would love to know your insights and suggestions.
Thank You
I am assuming you are using AWS managed scaling for this. If you can switch to custom scaling, you can set more aggressive scaling rules. You can also set the number of nodes to add or remove on each scale-up and scale-down, which will help you converge faster on the required number of nodes.
The only downside to custom scaling is that it will take 5 minutes to trigger.
Is there a way to make YARN spin up all the necessary resources before starting my job?
I do not know how to achieve this, but in my opinion it is not worth doing. Spark is intelligent enough to handle this for us.
It knows how to redistribute tasks when instances join or leave the cluster. There is a Spark configuration you should be aware of to achieve this.
Set spark.dynamicAllocation.enabled to true. There are some other related settings that you can change or leave at their defaults.
For more detail, refer to the documentation for spark.dynamicAllocation.enabled.
Please use the documentation that matches your Spark version; this link is for Spark version 2.4.0.
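For illustration, this is roughly how the dynamic allocation settings could be passed on the spark-submit command line. The helper name, the executor bounds and my_job.py are placeholders of mine; check the property names against the documentation for your Spark version.

```python
def spark_submit_args(app_path, min_executors=1, max_executors=10):
    """Build a spark-submit argument list with dynamic allocation enabled."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # In Spark 2.4, dynamic allocation needs the external shuffle service.
        "spark.shuffle.service.enabled": "true",
    }
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app_path]


# my_job.py is a hypothetical application path.
print(" ".join(spark_submit_args("my_job.py")))
```

The list can be joined into a shell command as shown, or handed to subprocess.run as is.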
Does AWS charge for spin up time, even when a job is not being run?
You get charged for every second that an instance is running, with a one-minute minimum. It does not matter whether your job is running or not. Even if the instances are idle in the cluster, you will have to pay for them.
Refer to these links for more detail:
EMR FAQ
EMR PRICING
Hope this gives you some idea about the EMR pricing and certain spark configuration related to the dynamic allocation.
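To put the per-second billing with a one-minute minimum mentioned above into numbers, here is a small sketch. The hourly rate is a made-up figure, not an actual EMR or EC2 price.

```python
def billed_seconds(runtime_seconds, minimum_seconds=60):
    """Per-second billing with a one-minute minimum."""
    return max(runtime_seconds, minimum_seconds)


def instance_cost(runtime_seconds, hourly_rate, node_count=1):
    """Cost of running node_count identical nodes for runtime_seconds each."""
    return billed_seconds(runtime_seconds) / 3600 * hourly_rate * node_count


# 10 nodes sitting idle for 15 minutes at a hypothetical $0.10 per node-hour.
print(round(instance_cost(15 * 60, hourly_rate=0.10, node_count=10), 2))
```

So waiting for all nodes before starting the job is paid for like any other running time.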

Problems with Memory and CPU limits in AWS ECS cluster running on reserved EC2 instance

I am running an ECS cluster that currently has 3 services running on a t3.medium instance. Each of those services runs only one task with a soft memory limit of 1 GB; the hard limit is different for each (but that should not be the problem). I will always have enough memory to run one newly deployed task (the new one will also take 1 GB, and the t3.medium can handle it since it has 4 GB in total). After the new task is up and running, the old one is stopped and I again have 1 GB free for the next deployment. I did something similar for the CPU (2048 CPU units in total, each task has 512, and 512 are free for new deployments).
So everything runs fine now, but I am not completely satisfied with this setup for the future. What will happen if I need to add another service with another task? I would need to modify the task definitions of all existing tasks to use less CPU and memory, and redeploy them, in order to run the new task (and new deployments). I am planning to get a reserved EC2 instance, so it will not be easy to swap the current EC2 instance for a larger one.
Is there a way to spin up another EC2 instance for the same ECS cluster to handle bursts in my tasks? As for deployments, being able to deploy only one task at a time and then wait for the old one to be killed before deploying the next, without downtime, is not ideal.
And my biggest concern: what if I need a new service and task? I would again need to adjust all the others in order to run the new one and deploy the others, which is not very maintainable. And what if I cannot lower CPU and memory any further because I have already reached the minimum needed to run the tasks smoothly?
I was thinking about having another EC2 instance for the same cluster, that will handle bursts, deployments, and new services/tasks. But not sure if that's possible and if that's the best way of doing this. I was also thinking about Fargate, but this is much more expensive and I cannot afford it for now. What do you think? Any ideas, suggestions, and hints will be helpful since I am desperate to find the best way to avoid the problems mentioned above.
Thanks in advance!
So unfortunately, there is no out-of-the-box solution to ensure that all your tasks run on the minimum possible number of instances (i.e. one). You can use our new feature called Capacity Providers (CP), which will allow you to ensure the minimum number of EC2 instances required to run all your tasks. The major difference between CP and ASG is that CP gives more weight to task placement (whereas ASG will scale in/out based on resource utilization, which isn't ideal in your case).
However, it's not an ideal solution. Just as you said in your comment, when the service needs to scale out during a deployment, CP will spin up another instance, the new task will be placed on it and once it gets to Running state, the old task will be stopped.
But now you have an "extra" EC2 instance because there is no way to replace a running task. The only way I can think of would be to use a Lambda function that drains the new instance, which will move all the service tasks to the other instance. CP will, after about 15 minutes, terminate this instance as there are no tasks running on it.
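A minimal sketch of the draining step, assuming you already know the cluster name and the container instance ARN. The function name is mine; the client is passed in so the same code can run inside a Lambda handler with boto3.client("ecs").

```python
def drain_container_instance(ecs_client, cluster, container_instance_arn):
    """Set one ECS container instance to DRAINING so that ECS reschedules
    its service tasks onto other instances in the cluster."""
    response = ecs_client.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    # Return the status ECS reports back for each updated instance.
    return [ci["status"] for ci in response.get("containerInstances", [])]


# Example (not executed here):
#   import boto3
#   drain_container_instance(boto3.client("ecs"), "my-cluster", "<instance ARN>")
```

Keep in mind that draining only moves tasks if the remaining instance has enough free CPU and memory to take them.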
A couple of caveats:
CP are new, a little rough around the edges, and you can't delete/modify them. You can only create or deactivate them.
CP needs an underlying ASG and they must have a 1-1 relationship
Make sure to enable managed scaling when creating CP
Choose 100% capacity target
Don't forget to add a default capacity strategy for the cluster
Minimizing EC2 instances used:
If you're using a capacity provider, the 'binpack' placement strategy minimises the number of EC2 hosts that are used.
However, there are some scale-in scenarios where you can end up with a single task running on its own EC2 instance. As Ali mentions in their answer, ECS will not replace this running task, but depending on your setup, it may be fairly easy for you to replace it yourself by configuring your task to voluntarily 'quit'.
In my case, I always have at least 2 tasks running per service. So I just added some logic to my tasks' healthchecks so that they report as unhealthy after ~6 hours. ECS will spot the 'unhealthy' task, remove it from the load balancer, and spin up a replacement (according to the binpack strategy).
Note: If you take this approach, add some variation to your timeout so you're less likely to have all of your tasks expire at the same time. Something like: expiry = now + timedelta(hours=random.uniform(5.5,6.5))
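Spelled out a little more, the expiry logic could look like the sketch below. The function names and the way the healthcheck endpoint would call is_healthy are assumptions; only the 5.5 to 6.5 hour window comes from the line above.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_expiry(started_at, base_hours=6.0, jitter_hours=0.5):
    """Pick a randomised expiry so tasks do not all expire together."""
    hours = random.uniform(base_hours - jitter_hours, base_hours + jitter_hours)
    return started_at + timedelta(hours=hours)


# Computed once when the task starts.
STARTED_AT = datetime.now(timezone.utc)
EXPIRY = pick_expiry(STARTED_AT)


def is_healthy(now=None):
    """Report healthy until the expiry passes; the healthcheck handler would
    return a failing status once this turns False."""
    now = now or datetime.now(timezone.utc)
    return now < EXPIRY


print(is_healthy())
```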
Sharing memory 'headspace' with soft-limits:
If you set both soft and hard memory limits, ECS will place your tasks based on the soft limit. If your tasks' memory usage varies with load, it's fairly easy to get your EC2 instance to start swapping.
For example: Say you have a task defined with a soft limit of 900 MB and a hard limit of 1800 MB. You spin up a service with 4 running tasks. ECS places all 4 of these tasks on a single t3.medium. Notice here that each task thinks it can safely use up to 1800 MB, when in fact there's very little free memory on the host server. When you hit your service with some traffic, each task tries to use some more memory, and your t3.medium is incapacitated as it starts swapping memory to disk. ECS does not recover from this type of failure very well. It notices that the tasks are no longer available and will attempt to provision replacements, but the capacity provider is very slow to replace the swapping t3.medium.
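The arithmetic behind that example, as a quick sketch. I'm using the nominal 4 GiB (4096 MB) of a t3.medium as the host memory; the amount ECS actually registers for the instance is somewhat lower.

```python
def memory_commitment(task_count, soft_limit_mb, hard_limit_mb, host_memory_mb):
    """Compare what ECS reserves (soft limits) with what tasks may use (hard limits)."""
    reserved = task_count * soft_limit_mb
    worst_case = task_count * hard_limit_mb
    return {
        "reserved_mb": reserved,
        "worst_case_mb": worst_case,
        # Placement only looks at the soft limit, so this is what ECS checks.
        "fits_by_soft_limit": reserved <= host_memory_mb,
        # How far the tasks could exceed the host if all reach their hard limit.
        "overcommit_mb": max(0, worst_case - host_memory_mb),
    }


print(memory_commitment(4, 900, 1800, host_memory_mb=4096))
```

The four tasks reserve 3600 MB, so placement succeeds, but together they are allowed to grow to 7200 MB, roughly 3 GB more than the host has.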
My suggestion:
Configure your service to auto-scale based on memory usage (this will be a percentage of your soft-limit), for example: a target memory usage of 70%
Configure your tasks' healthchecks so that they report as unhealthy when they are nearing their soft-limit. This way, your tasks still have some headroom for quick spikes of memory usage, while giving your load balancer a chance to drain and gracefully replace tasks that are getting greedy. This is fairly easy to do by reading the value within /sys/fs/cgroup/memory/memory.usage_in_bytes.
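A sketch of that memory check, assuming cgroup v1 (on cgroup v2 hosts the equivalent file is /sys/fs/cgroup/memory.current). The 90% headroom factor and the function names are my own choices.

```python
from pathlib import Path

# cgroup v1 location, as mentioned above.
CGROUP_V1_USAGE = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")


def memory_usage_bytes(path=CGROUP_V1_USAGE):
    """Read the container's current memory usage in bytes."""
    return int(Path(path).read_text().strip())


def within_soft_limit(usage_bytes, soft_limit_mb, headroom=0.9):
    """True while usage stays below `headroom` times the soft limit."""
    return usage_bytes < soft_limit_mb * 1024 * 1024 * headroom


# In the healthcheck handler (runs inside the container):
#   healthy = within_soft_limit(memory_usage_bytes(), soft_limit_mb=900)
print(within_soft_limit(500 * 1024 * 1024, soft_limit_mb=900))
```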

AWS run instance for exact amount of full billing cycles

I'm trying to optimize costs for my project. For some valid reasons, we're running it on very expensive instances.
To the best of my knowledge, Amazon charges by the hour. For instance, if I run my EC2 instance for 1 hour and 4 minutes, I'll be charged for 2 hours.
What would be the best way to shut down an instance as close as possible to the end of the current billing cycle without exceeding it?
I tried to do this based on uptime, but there is some difference between AWS billing and the uptime value.
I'm looking to use a watchdog running on the instance itself, so that I can pass parameters during provisioning and it will shut itself down after, say, 2 full billing cycles.
You can get the time at which Amazon starts billing from the EC2 instance metadata (this assumes you have jq installed):
curl -s http://169.254.169.254/latest/dynamic/instance-identity/document/ | jq .pendingTime
Then you could run a shell script once a minute to shut down after, say, 58 minutes.
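The once-a-minute check could be sketched like this, under the hourly billing the question assumes. It presumes pendingTime looks like "2016-11-19T16:32:11Z"; the function names and the 2-minute safety margin are mine.

```python
from datetime import datetime, timezone


def parse_pending_time(value):
    """Parse a pendingTime string such as "2016-11-19T16:32:11Z" as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def should_shut_down(pending_time, now, cycles=1, margin_minutes=2):
    """True once `cycles` billing hours are nearly used up."""
    elapsed_minutes = (now - pending_time).total_seconds() / 60
    return elapsed_minutes >= cycles * 60 - margin_minutes


start = parse_pending_time("2016-11-19T16:32:11Z")
print(should_shut_down(start, parse_pending_time("2016-11-19T17:29:00Z")))
print(should_shut_down(start, parse_pending_time("2016-11-19T17:30:30Z")))
```

A cron job would call should_shut_down with the current UTC time and trigger the shutdown when it returns True; cycles=2 gives the "2 full billing cycles" behaviour from the question.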
But this is a pain. If your processing is able to handle interruptions of an instance running then you should look at using spot instances perhaps with a fixed duration. This allows you to run at a reduced price for a known period of time without any additional costs because of running over.
If your workload is complete before the full hour, stop/terminate your instance right away when the work is complete. No need to keep the instance idle for the remainder of the hour.
The only time this may not be efficient is if you may have more work coming in before the full hour, and then you want to keep it running to process that new work. But that will only be the case if work is sporadic. And if it is sporadic, then it just may be better to keep it running.