AWS Batch taking a long time to launch instances - aws-batch

I am observing that if a new batch job is submitted shortly after the last instance in a compute environment has shut down, it takes over 10 minutes for Batch to add a new instance to the compute environment, even if the CE has available resources.
Can anyone let me know whether this is expected behavior on the AWS side, or whether there is a way to fix it?
Thanks
Chaitanya

This is expected behaviour. There is a thread here which includes a response from the AWS team.
Further, it is important to note that the AWS Batch resource scaling decisions occur on a different frequency. Upon receiving your first job submission, AWS Batch will launch an initial set of compute resources. After this point Batch re-evaluates resource needs approximately every 10 minutes. By making scaling decisions less frequently, we avoid scenarios where AWS Batch would scale up too quickly and complete all RUNNABLE jobs, leaving a large number of unused instances with partially consumed billing hours.
(P.S. It looks like yours was the last comment in the linked thread, but I am posting this anyway for the benefit of others.)

Related

Calling lambda functions programmatically every minute at scale

While I have worked with AWS for a bit, I'm stuck on how to correctly approach the following use case.
We want to design an uptime monitor for up to 10K websites.
The monitor should run from multiple AWS regions, ping websites to check whether they are available, and measure the response time. With a Lambda function, I can ping the site, pass the result to an SQS queue, and process it. So far, so good.
However, I want to run this function every minute. I also want the ability to add and delete monitors: for example, stop monitoring website "A" from region "us-west-1", or, the other way round, add a website to a region.
Ideally, all of this would run serverless and be deployable to arbitrary regions with CloudFormation.
What services should I go with?
I have been thinking about EventBridge, where I would create custom events for every website in every region and then send the results over SNS to a central processing Lambda, but I'm not sure this is the way to go.
Alternatively, I considered building a scheduler Lambda that fetches the websites it has to schedule from a DB and then invokes the fetcher Lambda, but I was not sure about the delay, since I want the functions triggered every minute. The architecture should handle 10K websites, and more if possible.
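For reference, a minimal sketch of that second idea, assuming a hypothetical DynamoDB table of monitors and a hypothetical fetcher function name, could look like this:

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    lambda_client = boto3.client("lambda")

    MONITOR_TABLE = "uptime-monitors"    # hypothetical table: one item per (website, region)
    FETCHER_FUNCTION = "uptime-fetcher"  # hypothetical fetcher Lambda

    def handler(event, context):
        """Runs once a minute (e.g. from an EventBridge schedule) and fans out."""
        table = dynamodb.Table(MONITOR_TABLE)
        monitors = table.scan()["Items"]  # fine for a few thousand items; paginate beyond that
        for monitor in monitors:
            # For 10K sites you would batch these or push to SQS instead of invoking one by one.
            lambda_client.invoke(
                FunctionName=FETCHER_FUNCTION,
                InvocationType="Event",  # async: the scheduler does not wait for the fetcher
                Payload=json.dumps({"url": monitor["url"], "region": monitor["region"]}),
            )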
Feel free to give me any advice you have :)
Kind regards.
In my opinion Lambda is not the correct solution for this problem. Your costs will be very high and it may not scale to what you want to ultimately do.
A c5.9xlarge EC2 instance costs about USD $1.53/hour and has a 10 Gbit network. With 36 vCPUs, a threaded program could take care of a large percentage - maybe all 10k - of your load. It could still be run in multiple regions on demand and push to an SQS queue. That's around $1,100/month/region without pre-purchasing EC2 time.
A Lambda running 10,000 times per minute, for 5 seconds each time and with only 128 MB of memory, would be around USD $4,600/month/region.
Coupled with the management interface you're alluding to, the EC2 instance could handle pretty much everything you want to do. Of course, you'd want to scale and likely have at least two instances for failover, but with two of them you're still at less than half the cost of the Lambda. If you later scale to 100,000 websites, it's just a matter of adding machines.
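A rough back-of-the-envelope check of that figure, assuming the standard published Lambda prices of about $0.0000166667 per GB-second and $0.20 per million requests (exact prices vary slightly by region):

    # Rough Lambda cost estimate: 10,000 sites pinged every minute,
    # 5 s per invocation at 128 MB, one region, 30-day month.
    invocations = 10_000 * 60 * 24 * 30          # ~432 million invocations per month
    gb_seconds = invocations * (128 / 1024) * 5  # memory (GB) x duration (s)
    compute = gb_seconds * 0.0000166667          # ~USD 4,500
    requests = invocations / 1_000_000 * 0.20    # ~USD 86
    print(round(compute + requests))             # ~4,586 -> roughly the $4,600 figure above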
There are a ton of other choices, but understand that serverless does not mean cost-efficient in all use cases.
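A minimal sketch of that threaded approach: a plain thread pool pinging sites and pushing results to SQS. The queue URL is a placeholder and the third-party requests library is assumed.

    import json
    import time
    import boto3
    import requests
    from concurrent.futures import ThreadPoolExecutor

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-west-1.amazonaws.com/123456789012/uptime-results"  # placeholder

    def check(url):
        """Ping one site and push status code and response time to SQS."""
        start = time.time()
        try:
            status = requests.get(url, timeout=10).status_code
        except requests.RequestException:
            status = None
        elapsed_ms = int((time.time() - start) * 1000)
        sqs.send_message(QueueUrl=QUEUE_URL,
                         MessageBody=json.dumps({"url": url, "status": status, "ms": elapsed_ms}))

    def run_once(urls):
        """One sweep over all monitored sites; run this every minute (e.g. from cron)."""
        with ThreadPoolExecutor(max_workers=200) as pool:
            list(pool.map(check, urls))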

Running thousands of scheduled jobs in AWS on a regular cadence?

I'm architecting an application solution in AWS and am looking into the options AWS has for running one-off jobs on a regular schedule.
For example, we have a task that needs to run every 5 minutes that makes a call to an external API, interprets the data, and then possibly stores some new information in a database. This particular task is expected to run for 30 seconds or so and will need to run every 5 minutes. Where this gets a little more complex is that we're running a multi-tenant application, and this task needs to be performed for each tenant individually. It doesn't satisfy the user's requirements to have a single process do the specified task for each tenant in sequence. The task must be performed every x minutes (sometimes as low as every minute), and for each tenant it must complete in the time it takes to perform the task exactly once. In other words, all 200 (let's say) tenants must have the task run for them at midnight, and each task must complete in the time it takes to query the API and update the database for one tenant.
To add to the complexity a bit, this is not the only task we will be running on a regular schedule for our tenants. In the end we could have dozens of unique tasks, each running for hundreds of tenants, resulting in thousands or tens of thousands of unique concurrent tasks.
I've looked into ECS Scheduled Tasks, which use CloudWatch Events (now EventBridge), but EventBridge has a limit of 300 rules per event bus. I think that means we're going to be out of luck if we need 10,000 rules (one for each task times the number of tenants), but I'm honestly not sure whether each account gets its own event bus or whether that's divided up differently.
In any case, even if this did work, it's still not a very attractive option to me to have 10,000 different rules set up in EventBridge. At the least, it feels like it might be difficult to manage. To that end, I'm now leaning more toward creating a single EventBridge rule per event type that kicks off a parent task, which in turn asynchronously kicks off as many instances of a child task as needed, one per tenant. This would limit our EventBridge rules to somewhere around a few dozen. Each one of these, when triggered, would asynchronously spawn a task for each tenant, and those could all run together. I'm not 100% sure what kind of compute this would spawn; it wouldn't be a Lambda, since that would easily cause us to hit the 1,000 concurrent Lambda function limit, but it might be something like a Fargate ECS task that executes for a few seconds and then goes away when it's completed.
I'd love to hear others' thoughts on these options, my current direction, and any other options I'm missing.
You don't necessarily need to look at ECS for this, because 1,000 invocations of a Lambda at a time is only the default concurrency limit. That is something you can request an increase for in the Service Quotas console:
There is no maximum concurrency limit for Lambda functions. However, limit increases are granted only if the increase is required for your use case.
Source: AWS Support article.
Same goes for the 300 rules per event bus limit. That is also a default limit and can be increased upon request in the Service Quotas console.
Since you mentioned branching logic, I wonder if you've looked into AWS Step Functions? In particular, Express Workflows within Step Functions may suit the duration and rates of your executions.
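If Step Functions ends up being the route, a Map state gives you the per-tenant fan-out with a single EventBridge rule per task type instead of one rule per tenant. A minimal sketch under those assumptions (the worker Lambda ARN, IAM role ARN, state machine name, and tenant list are all placeholders):

    import json
    import boto3

    # ASL definition: one Map state that runs the per-tenant worker for every tenant in the input.
    DEFINITION = {
        "StartAt": "FanOutPerTenant",
        "States": {
            "FanOutPerTenant": {
                "Type": "Map",
                "ItemsPath": "$.tenants",
                "MaxConcurrency": 0,  # no limit imposed here; the service applies its own cap
                "Iterator": {
                    "StartAt": "RunTenantTask",
                    "States": {
                        "RunTenantTask": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:tenant-task",
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }

    if __name__ == "__main__":
        sfn = boto3.client("stepfunctions")
        machine = sfn.create_state_machine(
            name="tenant-task-fanout",
            definition=json.dumps(DEFINITION),
            roleArn="arn:aws:iam::123456789012:role/sfn-tenant-task",  # placeholder
            type="EXPRESS",
        )
        # A single EventBridge schedule per task type can then start one execution per cadence:
        sfn.start_execution(
            stateMachineArn=machine["stateMachineArn"],
            input=json.dumps({"tenants": ["tenant-001", "tenant-002"]}),
        )

Express Workflow executions are capped at five minutes, which fits the 30-second-per-tenant task described above.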

Fargate autoscaling, possible to choose which task to drop?

As far as I can tell, when using any of the normal scaling methods such as step scaling with Fargate, scaling in selects a random task to stop. However, I'd like to scale in the task with the lowest CPU usage.
My use case is that each task works on processing jobs; when a task finishes all of its jobs it sits idle, and at that point I'd like to take that one down. Other tasks will still be working on jobs, and I don't want to kill those halfway through and make them lose their progress.
My current solution is to have each task report its CPU usage to CloudWatch via cron, then have a Lambda function scale in any tasks that have had low CPU for multiple data points. But this feels like overkill for a seemingly simple problem.
Just adding an update. A year on, we're still using a solution similar to what I described in the final paragraph of my original post. We have a database table listing how many tasks we want and what we want each of them working on; our central master service updates that as needed. These numbers are sent from our DB to CloudWatch, and a Lambda reads from CloudWatch and either spins up new tasks as needed or kills the tasks with the lowest CPU if the task count is too high. It's very hacky, but it works, and since there's no official way to do this with Fargate, the only other alternative would be to switch to something like K8s.
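For anyone copying this approach, a minimal sketch of the scale-in Lambda under those assumptions. The cluster name, task family, metric namespace, and metric name are hypothetical, and it assumes the tasks were started standalone with run_task, so ECS will not replace one that gets stopped.

    from datetime import datetime, timedelta
    import boto3

    ecs = boto3.client("ecs")
    cloudwatch = boto3.client("cloudwatch")

    CLUSTER = "processing-cluster"     # hypothetical
    TASK_FAMILY = "processing-worker"  # hypothetical; tasks launched standalone via run_task

    def recent_cpu(task_arn):
        """Average of the custom CPU metric each task pushes for itself (as described above)."""
        datapoints = cloudwatch.get_metric_statistics(
            Namespace="Custom/Processing",                       # hypothetical namespace
            MetricName="TaskCpuPercent",                         # hypothetical metric name
            Dimensions=[{"Name": "TaskArn", "Value": task_arn}],
            StartTime=datetime.utcnow() - timedelta(minutes=15),
            EndTime=datetime.utcnow(),
            Period=60,
            Statistics=["Average"],
        )["Datapoints"]
        return sum(d["Average"] for d in datapoints) / max(len(datapoints), 1)

    def handler(event, context):
        """If more tasks are running than wanted, stop the one with the lowest recent CPU."""
        desired = int(event["desired_count"])  # in the setup above this comes from the DB table
        task_arns = ecs.list_tasks(cluster=CLUSTER, family=TASK_FAMILY)["taskArns"]
        if len(task_arns) > desired:
            idle_task = min(task_arns, key=recent_cpu)
            ecs.stop_task(cluster=CLUSTER, task=idle_task, reason="idle, scaling in")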

For loop in AWS step functions

We have 20 AWS accounts and we create resources in 10 regions in each account. We want to ensure that AWS resources - ELBs, AMIs and EBS snapshots - are properly tagged. We want a service that runs periodically to scan the accounts and delete any of the above-mentioned resources that are not properly tagged. We want this to be serverless, so we were looking at using Lambda. However, there are 2 issues with Lambda:
Lambda timeout - currently it is 5 mins.
Throttling errors
We need to ensure that we process the next account only after the first account's processing is complete (we could put in a hard sleep for a few minutes and then start processing the next account).
Has anyone faced a similar scenario, and if so, how did you achieve it?
Worst case scenario: we will use ECS.
First, can your innermost task reliably complete in under 5 minutes? If so, Lambda is a good fit, and your situation looks like one.
Next, the throttling limit is easily raised by requesting a higher concurrency limit through a support ticket.
Finally, try breaking this up into several smaller functions. Maybe something like this:
delete-resource -- Deletes a single untagged resource
get-untagged-resources -- gets untagged resources in an account and invokes "delete-resource" in an async.each loop
get-accounts -- gets list of accounts and invokes "get-untagged-resources" in an async.each loop
I actually prefer having my functions triggered by SNS rather than invoking them directly, but you get the idea. Hope this helps.
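A rough sketch of the "get-untagged-resources" piece under those assumptions, checking AMIs and EBS snapshots for a placeholder required tag and async-invoking the "delete-resource" function for each hit. Cross-account assume-role credentials and pagination are omitted for brevity.

    import json
    import boto3

    REQUIRED_TAG = "owner"               # placeholder for whatever "properly tagged" means
    DELETE_FUNCTION = "delete-resource"  # the single-resource deleter from the list above

    def missing_required_tag(tags):
        return REQUIRED_TAG not in {t["Key"] for t in (tags or [])}

    def handler(event, context):
        """Scan one account/region for untagged EBS snapshots and AMIs, fan out deletes."""
        ec2 = boto3.client("ec2", region_name=event["region"])
        lambda_client = boto3.client("lambda")

        candidates = []
        for snap in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
            if missing_required_tag(snap.get("Tags")):
                candidates.append({"type": "snapshot", "id": snap["SnapshotId"]})
        for image in ec2.describe_images(Owners=["self"])["Images"]:
            if missing_required_tag(image.get("Tags")):
                candidates.append({"type": "ami", "id": image["ImageId"]})
        # ELBs would follow the same pattern via elbv2.describe_load_balancers / describe_tags.

        for resource in candidates:
            lambda_client.invoke(
                FunctionName=DELETE_FUNCTION,
                InvocationType="Event",  # async, keeps this function well under the timeout
                Payload=json.dumps(resource),
            )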

Scheduling long-running tasks using AWS services

My application relies heavily on AWS services, and I am looking for an optimal solution based on them. The web application triggers a scheduled job (assume it repeats indefinitely) which requires a certain amount of resources to perform. A single run of the task normally takes at most 1 minute.
The current idea is to pass jobs via SQS and spawn workers on EC2 instances depending on the queue size (this part is more or less clear).
But I struggle to find a proper solution for actually triggering the jobs at set intervals. Assume we are dealing with 10,000 jobs. Having a scheduler run 10k cron jobs (each job is quite simple, just passing a job description via SQS) at the same time seems like a crazy idea. So the actual question is: how do I autoscale the scheduler itself (covering the scenarios where the scheduler is restarted, a new instance is created, etc.)?
Or is the scheduler redundant as an app, and is it wiser to rely on AWS Lambda functions (or other services providing scheduling)? The problem with Lambda functions is the limits, and the 128 MB of memory provided by a single function is actually more than we need (20 MB seems like more than enough).
Alternatively, the worker itself can wait for a certain amount of time and notify the scheduler that it should trigger the job one more time. Let's say if the frequency is 1 hour:
1. Scheduler sends job to worker 1
2. Worker 1 performs the job and after one hour sends it back to Scheduler
3. Scheduler sends the job again
The issue here, however, is the possibility that the worker will be scaled in.
Bottom line: I am trying to achieve a lightweight scheduler that does not require autoscaling and serves as a hub whose sole purpose is transmitting job descriptions. It certainly should not get throttled on a service restart.
Lambda is perfect for this. You have a lot of short-running processes (~1 minute), and Lambda is for short processes (up to five minutes nowadays). It is very important to know that CPU speed is coupled linearly to RAM. A 1 GB Lambda function is roughly equivalent to a t2.micro instance, if I recall correctly, and 1.5 GB of RAM means 1.5x the CPU speed. The cost of these functions is so low that you can just run them. A 128 MB function has 1/8 of the CPU speed of a micro instance, so I do not actually recommend using those.
As a queueing mechanism you can use S3 (yes, you read that right). Create a bucket and let the Lambda worker trigger when an object is created. When you want to schedule a job, put a file in the bucket; Lambda starts and processes it immediately.
Now you have to respect some limits: this way you can only have 100 workers at the same time (the default total number of concurrently active Lambda instances), but you can ask AWS to increase this.
The costs are as follows:
$0.005 per 1,000 PUT requests, so $5 per million job requests (this is more expensive than SQS).
The Lambda runtime. Assuming t2.micro-like CPU speed (1 GB of RAM), this costs about $0.001 per job (60 GB-seconds at roughly $0.0000167 per GB-second; the first 400,000 GB-seconds per month are free, which covers about 6,600 such jobs).
The Lambda requests. $0.20 per million triggers (the first million is free).
This setup does not require any servers on your part, and it cannot go down (unless AWS itself does).
(don't forget to delete the job out of S3 when you're done)
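A minimal sketch of the worker side of that S3-as-a-queue setup, assuming each object is a small JSON job description; the handler deletes the object when it finishes, as noted above.

    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Triggered by S3 object-created events; each object is one job description."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]  # note: URL-encoded if it has special characters

            job = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
            process(job)  # placeholder for the actual ~1 minute of work

            # Delete the job object so the "queue" does not fill up.
            s3.delete_object(Bucket=bucket, Key=key)

    def process(job):
        print("processing", job)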