I've been struggling with a weird problem for a few days.
I'm implementing the ECS logic to drain instances on termination (specifically on Spot interruption notice) using the ECS_ENABLE_SPOT_INSTANCE_DRAINING=true env var on the ecs-agent.
The process works fine: when an interruption notice arrives, ECS drains the instance and moves the containers to another one. But here is the problem: if the instance has never run that image before, the task takes too long to start (about 3 minutes, while the Spot interruption notice only gives 2 minutes), causing availability issues. If the image has run on that instance before, it only takes about 20 seconds to spin up the task!
Have you experienced this problem before using ECS?
PS: The images are about 500 MB. Is that large for an image?
There are some strategies available to you:
Reduce the size of the image by optimising the Dockerfile. A smaller image is quicker to pull from the repository.
Bake the large image into the AMI used in the cluster. Then every new Spot machine will already have the image. Depending on how the Dockerfile is structured, a significant number of layers can be reused, resulting in quicker image pulls (see the sketch below).
Once the image has been pulled to the machine it is cached, and subsequent pulls will be almost instantaneous.
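To illustrate the second strategy, here is a minimal boto3 sketch (Python) that bakes an AMI from an instance on which the image has already been pulled. The instance ID, AMI name, and region are placeholders, and in practice a tool like Packer or EC2 Image Builder may be a better fit:

```python
import boto3

# Assumes an ECS container instance where `docker pull` of the ~500 MB image
# has already been run, so its layers are present on the volume.
ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",        # placeholder: instance with the image cached
    Name="ecs-worker-with-prepulled-image",  # placeholder AMI name
    Description="ECS AMI with the application image pre-pulled",
    NoReboot=True,                           # snapshot without stopping the instance
)
print("New AMI:", response["ImageId"])

# Reference this AMI in the cluster's launch template / Auto Scaling group so
# every new Spot instance starts with the Docker layers already on disk.
```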
Related
I have an issue with long starting times for AWS Batch jobs. It's random: sometimes it takes seconds to transition from STARTING to RUNNING, but sometimes it takes more than 10 minutes. According to the documentation, container initiation operations are done in the STARTING state, so I understand it can take some time to download and run the container on a newly created machine in the compute environment, but it also happens on machines that were used just before and should already have the container prepared.
Is there any way I can optimise the job's STARTING time?
The duration of the STARTING state depends on how fast the environment can pull and start the container.
You can speed things up by using a smaller Docker image (the smaller the image, the faster the container is pulled and started) and by allocating more vCPU and RAM. All of these can be configured in the Job Definition.
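As a hedged illustration, these knobs live in the Job Definition's container properties; a boto3 sketch with placeholder names and values (not your actual setup):

```python
import boto3

batch = boto3.client("batch")

# Placeholder job definition: a slimmer image plus more vCPU/memory,
# which shortens the pull and container start during STARTING.
batch.register_job_definition(
    jobDefinitionName="image-transform",  # placeholder name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/transform:slim",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},  # MiB
        ],
    },
)
```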
"it happens also on machines that were used just before and should have the container already prepared"
You don't control how AWS manages the compute environment; the machines you saw earlier may already have been replaced or recycled, so a warm image cache is never guaranteed.
I'm building a backend service for an image-processing app. User-created images are uploaded in groups of 20-40 images to Firebase Cloud Storage, and need to be processed by GPU-accelerated hardware, for which I plan to use Amazon EC2, before going back to storage to be downloaded by the user application.
Each group of images has a "due date" which is generated server-side, and may be anywhere from 8 hours to 72 hours from submission. This metadata will be stored in Cloud Firestore where the groups will be indexed by their due date for easy queuing. On a G4ad or G4dn EC2 instance, I estimate that one group of images should take less than one minute to process.
I want to minimise the total server cost by taking advantage of EC2 spot prices; however, it is fairly critical that jobs are finished before their "due date" (e.g. when 1 hour remains on the next due job, I should forgo the spot prices and just pay the on-demand prices). I don't anticipate enough volume initially to justify a dedicated instance or commit to higher usage plans.
How could I architect a solution that minimises cost while respecting the due date of jobs?
What I've already considered
1. AWS Batch
AWS Batch seems to be used commonly for queuing jobs. I could enqueue jobs specifying the IDs of the images to be processed, and the job would then fetch these images when running.
However, AWS Batch only seems to support an approximate FIFO ordering of jobs (while I want them to be ordered by their due date). I also don't see a mechanism to switch between spot/on-demand instances based on the time in the queue.
2. Manual management
I could allocate a Spot instance and an On-Demand instance, and switch between the two depending on the due date of the next job. I'd probably set up a cloud function on AWS Lambda or Google Cloud to poll the due date of the next job in Firestore at regular intervals (e.g. every 5 minutes); a rough sketch follows the list below:
If there is no job, do nothing;
If there is a job due in the next hour, and the Spot instance is not running, start the On-Demand instance. This instance will process the next due jobs (from Firestore) until the queue is empty or at least an hour ahead;
Otherwise (there is a job due in more than an hour), attempt to start the Spot instance. If there is capacity available, the instance will run continuously but terminate itself if it finishes the entire queue.
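Here is a rough Python sketch of that poller, assuming a hypothetical Firestore collection named jobs ordered by a due_date field and two pre-created instances whose IDs are passed via environment variables; Spot capacity errors, instance state checks, and Firestore auth are omitted:

```python
import datetime
import os

import boto3
from google.cloud import firestore  # assumes Firestore credentials are configured

ec2 = boto3.client("ec2")
db = firestore.Client()

# Hypothetical pre-created instances; the Spot one would need a persistent
# Spot request that supports stop/start.
SPOT_INSTANCE_ID = os.environ["SPOT_INSTANCE_ID"]
ON_DEMAND_INSTANCE_ID = os.environ["ON_DEMAND_INSTANCE_ID"]


def handler(event, context):
    # Next due job, if any (hypothetical 'jobs' collection ordered by due date).
    docs = list(db.collection("jobs").order_by("due_date").limit(1).stream())
    if not docs:
        return "no jobs"

    remaining = docs[0].get("due_date") - datetime.datetime.now(datetime.timezone.utc)

    if remaining <= datetime.timedelta(hours=1):
        # Due soon: make sure the on-demand instance is up.
        ec2.start_instances(InstanceIds=[ON_DEMAND_INSTANCE_ID])
    else:
        # Plenty of time: try the Spot instance (this can fail if capacity is unavailable).
        ec2.start_instances(InstanceIds=[SPOT_INSTANCE_ID])
    return "ok"
```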
I'm unsure if this is a good approach so I am open to any feedback or alternatives.
I don't think using only Spot instances is a good choice for you. Yes, Spot instances are perfect for batch image processing, on the condition that you can tolerate their unpredictable nature: they can be started/terminated at any time.
Since your batch jobs are time critical, I think it would be better to have on-demand instances (you can save a lot with reserved instances) which you can always rely on, without worrying about them being terminated just before your due dates.
As a compromise you could use Spot Fleets. A Spot Fleet lets you launch a mixture of on-demand and Spot instances. In your case the on-demand portion of the fleet should be large enough to handle the entire workload in case there is no Spot capacity. If there is Spot capacity, that's good: your batch processing finishes earlier and you save money. In the worst case, every image is processed by the on-demand instances by the due date.
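A hedged boto3 sketch of such a mix using EC2 Fleet (the launch template ID and capacities are placeholders; an Auto Scaling group with a mixed instances policy would work just as well):

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder fleet: enough on-demand capacity to meet the due dates on its own,
# plus Spot capacity on top to finish earlier and cheaper when it is available.
ec2.create_fleet(
    Type="maintain",
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder
                "Version": "$Latest",
            }
        }
    ],
    TargetCapacitySpecification={
        "TotalTargetCapacity": 4,
        "OnDemandTargetCapacity": 2,  # sized to cover the worst case alone
        "SpotTargetCapacity": 2,
        "DefaultTargetCapacityType": "spot",
    },
)
```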
SQS FIFO queues do preserve the order of messages, so you don't have to worry about images being processed in a different order than they were submitted to the queue. But FIFO queues pose some issues for parallel processing, as you always have to keep the order of messages. You would have to use different message groups in a FIFO SQS queue to be able to process messages in parallel.
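To make that concrete, a small sketch: order is preserved within each message group, while different groups can be consumed in parallel (the queue URL is a placeholder, one group per image batch):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs.fifo"  # placeholder


def enqueue_group(group_id: str, image_ids: list) -> None:
    # One message group per image group: ordering is kept inside the group,
    # but separate groups can be processed by consumers concurrently.
    for image_id in image_ids:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=image_id,
            MessageGroupId=group_id,
            MessageDeduplicationId=f"{group_id}-{image_id}",
        )
```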
Running time per node type is variable/controllable, which gives the following cost model:
Cost_1 = (node 1 price per hour) x (running time 1)
Cost_2 = (node 2 price per hour) x (running time 2)
...
Cost_n = (node n price per hour) x (running time n)
Total cost = Cost_1 + ... + Cost_n
Fitness = 1.0 / (total cost x normalized due-date breach ratio)
This is a problem solvable by a genetic algorithm, if you can express the total-time (due date) constraint by predicting the image-processing performance of the GPUs.
Since you already have GPUs at hand, it would take milliseconds, or at most seconds, to solve.
Even if you mis-predict the static GPU performance, it would converge to a solution if you solve it repeatedly during the whole computation time and adjust the server allocation dynamically. (You can measure the current performance of each node by measuring the work given to it and its response time, and adjust the work distribution accordingly for the next iteration of the genetic algorithm.)
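A minimal sketch of that fitness calculation in Python, assuming a hypothetical plan structure in which each node is described by its hourly price, predicted running time, and how far it would overrun the due date:

```python
from dataclasses import dataclass


@dataclass
class NodePlan:
    price_per_hour: float   # $/hour for this node type (Spot or on-demand)
    running_time_h: float   # predicted running time on this node, in hours
    breach_h: float         # hours past the due date; 0 if the deadline is met


def fitness(plan: list, total_window_h: float) -> float:
    total_cost = sum(n.price_per_hour * n.running_time_h for n in plan)
    # Normalized breach ratio; the small epsilon keeps fitness finite when nothing is late.
    breach_ratio = sum(n.breach_h for n in plan) / total_window_h + 1e-6
    return 1.0 / (total_cost * breach_ratio)
```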
Another way could be:
start 1 server
if due time is 50% approached and tasks are less than 50% complete
then start 2 more servers
if due time is 75% approached and tasks are less than 75% complete
then start 4 more servers
if due time is 87% approached and tasks are less than 87% complete
then start 8 more servers
if due time is 94% approached and tasks are less than 94% complete
then start 16 more servers
if due time is 97% approached and tasks are less than 97% complete
then start 32 more servers
This may not be as efficient as solving the minimization problem.
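The doubling heuristic above could look roughly like this as a periodic check (a sketch; progress, time_fraction, and start_servers are hypothetical inputs and callbacks you would supply from your own monitoring):

```python
# Thresholds roughly halve the remaining time window each step: 50%, 75%, 87.5%, ...
THRESHOLDS = [(0.50, 2), (0.75, 4), (0.875, 8), (0.9375, 16), (0.96875, 32)]
fired = set()


def check(progress: float, time_fraction: float, start_servers) -> None:
    """progress and time_fraction are fractions in [0, 1];
    start_servers(n) is a hypothetical callback that launches n more servers."""
    for threshold, extra in THRESHOLDS:
        if time_fraction >= threshold and progress < threshold and threshold not in fired:
            start_servers(extra)
            fired.add(threshold)
```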
So I have a large dataset (1.5 billion points) on which I need to perform an I/O-bound transform task (the same task for each point) and place the result into a store that allows fuzzy searching on the transformed fields.
What I currently have is a Step Functions + Batch job pipeline feeding into RDS. It works like so:
A Lambda splits the input data into X even partitions
An array Batch job is created with X array elements matching the X partitions
The Batch jobs (1 vCPU, 2048 MB RAM) run on a number of EC2 Spot instances, transform the data and place it into RDS.
This current solution (with X=1600 workers) runs in about 20-40 minutes, mostly depending on how long it takes to spin up the Spot instance jobs. The actual jobs themselves average about 15 minutes of run time. As for total cost, with Spot savings the workers cost ~40 bucks, but the real kicker is the RDS Postgres DB: to be able to handle 1600 concurrent writes you need at least an r5.xlarge, which is $500 a month!
Therein lies my problem. It seems I could run the actual workers quicker and cheaper (due to per-second pricing) by having, say, 10,000 workers, but then I would need an RDS setup that could somehow handle 10,000 concurrent DB connections.
I've looked high and low and can't find a good solution to this scaling wall I am hitting. Below I'll detail some things I've tried and why they haven't worked for me or don't seem like a good fit.
RDS proxies - I tried creating 2 proxies, each set to a 50% connection pool, and giving even-numbered jobs one proxy and odd-numbered jobs the other, but that didn't help.
DynamoDB - Off the bat this seems to solve the concurrency side of my problem and can definitely handle the write load, but it doesn't allow fuzzy searching like select * where field LIKE Y, which is a key part of my workflow with the batch job results.
(Theory) - Have the jobs write their results to S3, then trigger a Lambda on new bucket entries to insert them into the DB (sketched below). This might be a terrible idea, I'm not sure.
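If you do try the S3-then-Lambda idea, one hedged sketch (bucket, table, and connection details are placeholders): each worker writes a file of rows to S3, and a Lambda with a capped reserved concurrency inserts each file in a single batch, so Postgres only ever sees a few dozen connections instead of thousands:

```python
import csv
import io
import os

import boto3
import psycopg2
from psycopg2.extras import execute_values

s3 = boto3.client("s3")
# Created outside the handler so warm invocations reuse one connection.
conn = psycopg2.connect(os.environ["DATABASE_URL"])  # placeholder connection string


def handler(event, context):
    for record in event["Records"]:  # S3 ObjectCreated notifications
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.reader(io.StringIO(body)))

        with conn.cursor() as cur:
            # Placeholder table/columns; one batched insert per S3 object.
            execute_values(cur, "INSERT INTO results (id, value) VALUES %s", rows)
        conn.commit()
```

Capping the Lambda's reserved concurrency (and/or putting an SQS queue between S3 and the Lambda) is what keeps the connection count bounded regardless of how many Batch workers you run.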
Anyways, what I'm after is improving the cost of running this batch pipeline (mainly the DB), improving the time to run (to save on Spot costs), or both! I am open to any feedback or suggestions!
Let me know if there's some key piece of info you need that I missed.
So I have a set of long running tasks that have to be run on Compute Engine and have to scale. Each task takes approximately 3 hours. So in order to handle this I thought about using:
https://cloud.google.com/solutions/using-cloud-pub-sub-long-running-tasks
architecture. And while it works fine, there is one huge problem: on scale-down, I'd really like to avoid it scaling down a VM whose task is currently running! I'd potentially lose 3 hours' worth of processing.
Is there a way to ensure that the autoscaler doesn't scale down a VM with a long-running task / long uptime?
EDIT: A few people have asked me to elaborate on my task. It's similar to what's described in the link above: many long-running tasks that need to be run on a GPU. There is a chunk of data that needs to be processed; it takes around 4 hours (video encoding) and once completed it outputs to a bucket. Well, it can take anywhere from 1 to 6 hours depending on the length of the video. Just like the architecture above, it would be nice to have the cluster scale up based on queue size. But when scaling down I'd like to ensure that it's not scaling down currently running tasks, which is what is currently happening. Being GPU-bound doesn't allow me to use the CPU metric.
I think you should probably add more details about what kind of task you are running. However, as Jhon Hanley suggested, it is worth taking a look at Cloud Tasks, and also at the following documentation that talks about the scaling risks.
I've got an application that is built in Node.js and is primarily used to post photos to (up to 25 MB each). The app resizes each photo to thumbnail size and moves both the thumbnail and the full-size image to S3. The uploads usually come in bursts of 10-15 pictures, rinse, wash, repeat, at roughly 5-minute intervals. I'm seeing a lot of scaling, and the trigger is the default 6MB NetworkOut trigger. My question is: is moving the photos to S3 considered NetworkOut? Or should I consider a different scaling trigger? So far the app hasn't stuttered, so I'm hesitant to fix what ain't broken, but I am seeing quite a bit of scaling so I thought I would investigate. Thanks for any help!
The short answer - scale whenever a resource is constrained, e.g. if your instances can't keep up with network IO or CPU is above 80%, then scale. And yes, sending any data from your EC2 instance is NetworkOut traffic. You've got to get that data from point A to B somehow :)
As you go up in size on EC2 instances you get more memory and CPU along with more network IO. If you don't see issues with transfers, you may want to switch the auto-scaling trigger over to watch CPU or memory instead. In an app I'm working on, users can start jobs which require a fair bit of CPU, so I have my auto-scaling set to scale when CPU is over 80%. But you might have a process that consumes a lot of memory and not much CPU...
On a side note - you may want to think about having your uploads go directly to your S3 bucket and using an S3-triggered Lambda to run the resize routine. This has several advantages over your current design. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
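A hedged sketch of that direct-to-S3 approach, shown with boto3 for brevity (the same calls exist in the AWS SDK for JavaScript, and the bucket and key are placeholders): the Node app only hands out a short-lived presigned URL, the client uploads straight to S3, and an S3 event then invokes the resize Lambda, so the upload bytes never flow through your instances as NetworkOut.

```python
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-photo-uploads"  # placeholder bucket


def get_upload_url(content_type: str = "image/jpeg") -> dict:
    # The client PUTs the original photo directly to S3 with this URL,
    # bypassing the application servers entirely.
    key = f"uploads/{uuid.uuid4()}.jpg"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key, "ContentType": content_type},
        ExpiresIn=300,  # 5 minutes
    )
    return {"key": key, "url": url}


# An S3 ObjectCreated notification on the uploads/ prefix can then trigger the
# Lambda that creates the thumbnail, as in the AWS example linked above.
```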
I suggest getting familiar with the instance metrics. You can then recognize your app-specific bottlenecks on the current instance type and count.
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced-metrics.html