Scheduling long-running tasks using AWS services

Scheduling long-running tasks using AWS services - amazon-web-services

My application heavily relies on AWS services, and I am looking for an optimal solution based on them. Web Application triggers a scheduled job (assume repeated infinitely) which requires certain amount of resources to be performed. Single run of the task normally will take maximum 1 min.
Current idea is to pass jobs via SQS and spawn workers on EC2 instances depending on the queue size. (this part is more or less clear)
But I struggle to find a proper solution for actually triggering the jobs at certain intervals. Assume we are dealing with 10000 jobs. So for a scheduler to run 10k cronjobs (the job itself is quite simple, just passing job description via SQS) at the same time seems like a crazy idea. So the actual question would be, how to autoscale the scheduler itself (given the scenarios when scheduler is restarted, new instance is created etc. )?
Or the scheduler is redundant as an app and it is wiser to rely on AWS Lambda functions (or other services providing scheduling)? The problem with using Lambda functions is the certain limitation and the memory provided 128mb provided by single function is actually too much (20mb seems like more than enough)
Alternatively, the worker itself can wait for a certain amount of time and notify the scheduler that it should trigger the job one more time. Let's say if the frequency is 1 hour:
1. Scheduler sends job to worker 1
2. Worker 1 performs the job and after one hour sends it back to Scheduler
3. Scheduler sends the job again
The issue here however is the possibility of that worker will be get scaled in.
Bottom Line I am trying to achieve a lightweight scheduler which would not require autoscaling and serve as a hub with sole purpose of transmitting job descriptions. And certainly should not get throttled on service restart.

Lambda is perfect for this. You have a lot of short running processes (~1 minute) and Lambda is for short processes (up until five minutes nowadays). It is very important to know that CPU speed is coupled to RAM linearly. A 1GB Lambda function is equivalent to a t2.micro instance if I recall correctly, and 1.5GB RAM means 1.5x more CPU speed. The cost of these functions is so low that you can just execute this. The 128MB RAM has 1/8 CPU speed of a micro instance so I do not recommend using those actually.
As a queueing mechanism you can use S3 (yes you read that right). Create a bucket and let the Lambda worker trigger when an object is created. When you want to schedule a job, put a file inside the bucket. Lambda starts and processes it immediately.
Now you have to respect some limits. This way you can only have 100 workers at the same time (the total amount of active Lambda instances), but you can ask AWS to increase this.
The costs are as follows:
0.005 per 1000 PUT requests, so $5 per million job requests (this is more expensive than SQS).
The Lambda runtime. Assuming normal t2.micro CPU speed (1GB RAM), this costs $0.0001 per job (60 seconds, first 300.000 seconds are free = 5000 jobs)
The Lambda requests. $0.20 per million triggers (first million is free)
This setup does not require any servers on your part. This cannot go down (only if AWS itself does).
(don't forget to delete the job out of S3 when you're done)

Related

Software or managed service for AWS Lambda job scheduling

I have a relatively large number of tasks that need to be executed at certain intervals, hourly, daily, weekly etc. These tasks are easily defined as AWS Lambda functions and I can schedule them easily enough with AWS Eventbridge.
However, in many cases jobs can fail due to delayed or missing data or other micro services going down. Take, for example, a function that is configured to run every hour and process data from hour X to hour X+1 and serialize to some data store (the ETL use case). Suppose at 1am some service becomes unavailable and the job fails until engineering is able to address the issue at 10am, at which point the code for the lambda is updated.
The desired behavior would be for that job to pick up where it left off and quickly catch up and process data from 1am to 10am (sequentially).
It would be relatively straightforward to implement some state-tracking service manually, where interval success/fails are tracked and can be checked and registered via simple API calls. My question is whether there is existing software for this sort of application/service, as far as I can tell Apache Airflow can do this but it also comes with significantly more complexity and overhead than is needed.

Two options come to mind:
Track state of your application with AWS Step Functions. You can implement coordination between Lambda functions, add parallel or sequential processing etc. Step Functions also support error handling and have built-in retry mechanisms.
Depending on the volume and velocity of data you ingest, you could go with Amazon SQS or Amazon Kinesis to stream the data to Lambda functions. With SQS, you could use retry for every message. If the message couldn't be processed, you can put it into Dead-Letter Queue (DLQ) for further investigation. Also, this approach is highly scalable and allows parallel execution of jobs.

Calling lambda functions programatically every minute at scale

While I have worked with AWS for a bit, I'm stuck on how to correctly approach the following use case.
We want to design an uptime monitor for up to 10K websites.
The monitor should run from multiple AWS regions and ping websites if they are available and measure the response time. With a lambda function, I can ping the site, pass the result to a sqs queue and process it. So far, so good.
However, I want to run this function every minute. I also want to have the ability to add and delete monitors. So if I don't want to monitor website "A" from region "us-west-1" I would like to do that. Or the other way round, add a website to a region.
Ideally, all this would run serverless and deployable to custom regions with cloud formation.
What services should I go with?
I have been thinking about Eventbridge, where I wanted to make custom events for every website in every region and then send the result over SNS to a central processing Lambda. But I'm not sure this is the way to go.
Alternatively, I wanted to build a scheduler lambda that fetches the websites it has to schedule from a DB and then invokes the fetcher lambda. But I was not sure about the delay since I want to have the functions triggered every minute. The architecture should monitor 10K websites and even more if possible.
Feel free to give me any advise you have :)
Kind regards.

In my opinion Lambda is not the correct solution for this problem. Your costs will be very high and it may not scale to what you want to ultimately do.
A c5.9xlarge EC2 costs about USD $1.53/hour and has a 10gbit network. With 36 CPU's a threaded program could take care of a large percentage - maybe all 10k - of your load. It could still be run in multiple regions on demand and push to an SQS queue. That's around $1100/month/region without pre-purchasing EC2 time.
A Lambda, running 10000 times / minute and running 5 seconds every time and taking only 128MB would be around USD $4600/month/region.
Coupled with the management interface you're alluding to the EC2 could handle pretty much everything you're wanting to do. Of course, you'd want to scale and likely have at least two EC2's for failover but with 2 of them you're still less than half the cost of the Lambda. As you scale now to 100,000 web sites it's a matter of adding machines.
There are a ton of other choices but understand that serverless does not mean cost efficient in all use cases.

Running multiple jobs in SageMaker

I was wondering if it is possible to run large number of "jobs" (or "pipeline" or whatever is the right way) to execute some modelling tasks in parallel.
So what I planned to do is to do a ETL process and EDA done and after that when the data is ready, I would like to fire 2000 modelling jobs. We have 2000 products and each job can start with a data (SELECT * FROM DATA WHERE PROD_ID='xxxxxxxxx') and my idea is to run these training jobs in parallel (there is no dependency between them - so it makes sense to me).
First of all - 1) Can it be done in AWS SageMaker? 2) What would be the right approach? 3) Any special considerations I need to be aware of?
Thanks a lot in advance!

it's possible to run this on SageMaker, with SageMaker pipelines that will orchestrate a SageMaker Processing job, followed by a Training job. You can define the PROD_ID as a String parameter to the SageMaker Pipeline, then run multiple pipelines executions concurrently (default soft limit is 200 concurrent executions).
As you have a very high numbers of jobs (2K) which you want to run in parallel, and perhaps optimize compute usage, you might also want to look at AWS Batch, which allows you to queue up tasks, for a fleet of instances that starts containers to perform these jobs. AWS Batch also support Spot instances which could reduce your instance cost by 70%-90%. Another advantage of AWS Batch is that jobs reuse the same running instance (only container stop/start), while in SageMaker there's a ~2 minute overhead to start the instance per job. Additionally, AWS Batch also takes care of retries and allowing you to chain all 2,000 jobs together and run a "finisher" job when all jobs have completed.
Limits increase - For any service, you'll need to increase your service quota limits. It can be done from the console "Quotas" for most services, or by contacting AWS support. Some services has hard limits.

Running thousands of scheduled jobs in AWS on a regular cadence?

I'm architecting an application solution in AWS and am looking into options AWS has for running one-off jobs to run on a regular schedule.
For example, we have a task that needs to run every 5 minutes that does an API call to an external API, interprets the data and then possibly stores some new information in a database. This particular task is expected to run for 30 seconds or so and will need to be run every 5 minutes. Where this gets a little more complex is we're running a multi-tenant application and this task needs to be performed for each tenant individually. It doesn't satisfy the user's requirements to have a single process do the specified task for each tenant in sequence. The task must be performed every x minutes (sometimes as low as every minute) and it must complete for each tenant as quickly as it takes to perform the task exactly 1 time. In other words, all 200, let's say, tenants must have a task run for them at midnight that each have their task complete in the time it takes to query the API and update the database for one tenant.
To add to the complexity a bit, this is not the only task we will be running on a regular schedule for our tenants. In the end we could have dozens of unique tasks, each running for hundreds of tenants, resulting in thousands or tens of thousands of unique concurrent tasks.
I've looked into ECS Scheduled Tasks which uses CloudWatch Events (which is now the EventBridge) but the EventBridge has a limit of 300 rules per event bus. I think that means we're going to be out of luck if we need to have 10,000 rules (one for each task * the number of tenants), but I'm honestly not sure whether each account gets its own event bus or if that's divided up differently.
In any case, even if this did work, it's still not a very attractive option to me to have 10,000 different rules set up in the EventBridge. At least, it feels like it might be difficult to manage. To that end I'm now more so looking into just creating a single EventBridge rule per event type that will kick off a parent task, that in turn asynchronously kicks off as many asynchronous instances of a child task that is needed, one per tenant. This would limit our EventBridge rules to somewhere around a few dozen. Each one of these, when triggered, would asynchronously spawn a task for each tenant that can all run together. I'm not 100% sure on what type of object this will spawn, it wouldn't be a Lambda since that would easily cause us to hit the 1,000 concurrent Lambda function limit but it might be something like a Fargate ECS task that executes for a few seconds then goes away when it's completed.
I'd love to hear others thoughts on these options, my current direction and any other options I'm currently missing.

You don't necessarily need to look at ECS for this, because 1,000 invocations of a Lambda at a time is only the default concurrency limit. That is something you can request an increase for in the Service Quotas console:
There is no maximum concurrency limit for Lambda functions. However, limit increases are granted only if the increase is required for your use case.
Source: AWS Support article.
Same goes for the 300 rules per event bus limit. That is also a default limit and can be increased upon request in the Service Quotas console.
Since you mentioned branching logic, I wonder if you've looked into AWS Step Functions? In particular, Express Workflows within Step Functions may suit the duration and rates of your executions.

AWS Lambda temporal load balancing

I have a AWS Lambda function that I invoke with every 1 minute with >1000 SNS events. This is a problem because my account concurrency is set at 3000, so if I start adding more jobs then eventually I'm going to have >3000 concurrent Lambda instances.
Each job takes around 2-5 seconds to complete which means that within each 1 minute window the concurrency limit will only be threatened within the first 5 seconds and I'll have 0 concurrency for the remaining 55 seconds.
If I set a concurrency limit (e.g. 1000) for the lambda will it handle the first 1000 SNS events and then automatically pick up the remainder once the concurrency frees up? And will I only be charged for the actual runtime rather than time spent waiting for concurrency to reduce?
Otherwise, is there a way that AWS will allow me to spread the load of jobs throughout the 1 minute window so that I can invoke the lambda every ~5 seconds with a subset of the total number of jobs?

If I set a concurrency limit (e.g. 1000) for the lambda will it handle the first 1000 SNS events and then automatically pick up the remainder once the concurrency frees up? And will I only be charged for the actual runtime rather than time spent waiting for concurrency to reduce?
Yes. Setting the concurrency limit definitely comes in handy on your use case and is the way to go. This is one of the reasons why concurrency limit actually exists :)
Unfortunately you can't take advantage of batching with SNS because it always sends one and only event. What you could do is to hook up a SQS queue with your SNS topic and have the Lambda function subscribe to the SQS queue instead, then you can take advantage of batching (max batch size is 10), greatly reducing the amount of concurrent Lambda executions, but still, you'd need to set a concurrency limit to make sure you don't use up all the available concurrency.
Otherwise, is there a way that AWS will allow me to spread the load of jobs throughout the 1 minute window so that I can invoke the lambda every ~5 seconds with a subset of the total number of jobs?
No, but this is unnecessary because of the above.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js