Running thousands of scheduled jobs in AWS on a regular cadence?

I'm architecting an application solution in AWS and am looking into the options AWS has for running short, one-off jobs on a regular schedule.
For example, we have a task that needs to run every 5 minutes: it calls an external API, interprets the data, and then possibly stores some new information in a database. This particular task is expected to run for 30 seconds or so. Where this gets a little more complex is that we're running a multi-tenant application, and this task needs to be performed for each tenant individually. It doesn't satisfy the users' requirements to have a single process do the specified task for each tenant in sequence. The task must be performed every x minutes (sometimes as low as every minute), and it must complete for every tenant in the time it takes to perform the task exactly once. In other words, if all 200 tenants, say, have a task run for them at midnight, every tenant's task should finish in the time it takes to query the API and update the database for one tenant.
To add to the complexity a bit, this is not the only task we will be running on a regular schedule for our tenants. In the end we could have dozens of unique tasks, each running for hundreds of tenants, resulting in thousands or tens of thousands of unique concurrent tasks.
I've looked into ECS Scheduled Tasks, which use CloudWatch Events (now EventBridge), but EventBridge has a limit of 300 rules per event bus. I think that means we're going to be out of luck if we need 10,000 rules (one for each task times the number of tenants), but I'm honestly not sure whether each account gets its own event bus or if that's divided up differently.
In any case, even if this did work, having 10,000 different rules set up in EventBridge is still not a very attractive option to me. At the least, it feels like it might be difficult to manage. To that end, I'm now leaning toward creating a single EventBridge rule per event type that kicks off a parent task, which in turn asynchronously kicks off as many instances of a child task as are needed, one per tenant. This would limit our EventBridge rules to somewhere around a few dozen. Each one of these, when triggered, would asynchronously spawn a task for each tenant, and all of those can run together. I'm not 100% sure what type of object this would spawn; it wouldn't be a Lambda, since that would easily cause us to hit the 1,000 concurrent Lambda function limit, but it might be something like a Fargate ECS task that executes for a few seconds and then goes away when it's completed.
I'd love to hear others' thoughts on these options, my current direction, and any other options I'm currently missing.

You don't necessarily need to look at ECS for this, because 1,000 invocations of a Lambda at a time is only the default concurrency limit. That is something you can request an increase for in the Service Quotas console:
There is no maximum concurrency limit for Lambda functions. However, limit increases are granted only if the increase is required for your use case.
Source: AWS Support article.
Same goes for the 300 rules per event bus limit. That is also a default limit and can be increased upon request in the Service Quotas console.
Since you mentioned branching logic, I wonder if you've looked into AWS Step Functions? In particular, Express Workflows within Step Functions may suit the duration and rates of your executions.
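To make the fan-out idea from the question concrete, here's a minimal sketch of a parent Lambda that a single EventBridge rule could trigger; the child function name "process-tenant" and the tenant lookup are illustrative assumptions, not a prescribed implementation:

import json
import boto3

lambda_client = boto3.client("lambda")

def get_tenant_ids():
    # Placeholder: in practice this would come from a database or config.
    return ["tenant-1", "tenant-2", "tenant-3"]

def handler(event, context):
    for tenant_id in get_tenant_ids():
        # InvocationType="Event" is fire-and-forget, so all child
        # invocations run concurrently instead of in sequence.
        lambda_client.invoke(
            FunctionName="process-tenant",
            InvocationType="Event",
            Payload=json.dumps({"tenantId": tenant_id}),
        )

With a raised concurrency quota, 200 tenants fan out in roughly the time one child invocation takes.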

Related

Software or managed service for AWS Lambda job scheduling

I have a relatively large number of tasks that need to be executed at certain intervals: hourly, daily, weekly, etc. These tasks are easily defined as AWS Lambda functions, and I can schedule them easily enough with AWS EventBridge.
However, in many cases jobs can fail due to delayed or missing data or other micro services going down. Take, for example, a function that is configured to run every hour and process data from hour X to hour X+1 and serialize to some data store (the ETL use case). Suppose at 1am some service becomes unavailable and the job fails until engineering is able to address the issue at 10am, at which point the code for the lambda is updated.
The desired behavior would be for that job to pick up where it left off and quickly catch up and process data from 1am to 10am (sequentially).
It would be relatively straightforward to implement some state-tracking service manually, where interval successes/failures are tracked and can be checked and registered via simple API calls. My question is whether there is existing software for this sort of application/service. As far as I can tell, Apache Airflow can do this, but it also comes with significantly more complexity and overhead than is needed.
Two options come to mind:
1. Track the state of your application with AWS Step Functions. You can implement coordination between Lambda functions, add parallel or sequential processing, etc. Step Functions also supports error handling and has built-in retry mechanisms.
2. Depending on the volume and velocity of the data you ingest, you could go with Amazon SQS or Amazon Kinesis to stream the data to Lambda functions. With SQS, you can retry every message, and if a message can't be processed you can put it into a Dead-Letter Queue (DLQ) for further investigation. This approach is also highly scalable and allows parallel execution of jobs.
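To make the manual state-tracking idea from the question concrete, here's a rough sketch assuming a DynamoDB table named job_state keyed by a string attribute "job"; process_interval stands in for the actual ETL work:

from datetime import datetime, timedelta, timezone
import boto3

table = boto3.resource("dynamodb").Table("job_state")  # assumed table name

def process_interval(start, end):
    # The actual ETL work for [start, end) would go here.
    pass

def handler(event, context):
    now = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
    item = table.get_item(Key={"job": "hourly-etl"}).get("Item")
    last = (datetime.fromisoformat(item["last_processed"])
            if item else now - timedelta(hours=1))
    # Replay every missed hour in order; if an interval fails, the checkpoint
    # stays put and the next scheduled run resumes from the same spot.
    while last < now:
        process_interval(last, last + timedelta(hours=1))
        last += timedelta(hours=1)
        table.put_item(Item={"job": "hourly-etl",
                             "last_processed": last.isoformat()})

Run hourly, this catches up from 1am to 10am sequentially once the outage is fixed, with no Airflow-sized dependency.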

Serverless Task Scheduling on AWS

So our project was using Hangfire to dynamically schedule tasks, but with auto-scaling of server instances in mind we decided to do away with it. I was looking for a cloud-native serverless solution and decided to use CloudWatch Events with Lambda. I discovered later on that there is an upper limit on the number of rules that can be created (100 per account) and that wouldn't scale automatically. So now I'm stuck, and any suggestions would be great!
As per CloudWatch Events documentation you can request a limit increase.
100 per region per account. You can request a limit increase. For instructions, see AWS Service Limits.
Before requesting a limit increase, examine your rules. You may have multiple rules each matching to very specific events. Consider broadening their scope by using fewer identifiers in your Event Patterns in CloudWatch Events. In addition, a rule can invoke several targets each time it matches an event. Consider adding more targets to your rules.
If you're trying to create a serverless task scheduler, one possible way could be:
1. A CloudWatch Event rule that triggers a Lambda function every minute.
2. The Lambda function reads a DynamoDB table and decides which actions need to be executed at that time.
3. The Lambda function could dispatch the execution to other functions or services.
So I decided to do as Diego suggested: use CloudWatch Events to trigger a Lambda every minute, which queries DynamoDB to check for the tasks that need to be executed.
I had some concerns regarding the data that would be fetched from DynamoDB (duplicate items in case an execution ran longer than 1 minute), so I decided to set the concurrency to 1 for that Lambda.
I also had some concerns regarding executing those tasks directly from that Lambda itself (timeouts, and tasks at the end of a long list), so what I'm doing is pushing each task to SQS separately, and another Lambda is triggered by SQS to execute those tasks in parallel. So far the results look good; I'll keep updating this thread if anything comes up.
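For reference, a minimal sketch of that dispatcher pattern; the table name, the "due_at" attribute, and the queue URL are illustrative assumptions:

import json
import time
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("scheduled_tasks")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/task-queue"

def handler(event, context):
    # A scan is enough for a sketch; a real table would be keyed or
    # indexed on the due time instead.
    resp = table.scan(FilterExpression=Attr("due_at").lte(int(time.time())))
    for task in resp.get("Items", []):
        # Each task goes onto SQS individually so the worker Lambda can
        # run them in parallel, independent of this function's timeout.
        sqs.send_message(QueueUrl=QUEUE_URL,
                         MessageBody=json.dumps(task, default=str))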

Scheduled jobs using AWS Lambda

I have several AWS Lambda functions triggered by events from other applications, e.g. via Kinesis. Some of these events should trigger something happening at another time. As an example, consider the case of sending a reminder/notification e-mail to a user about something when 24 hours have passed since event X happened.
I have previously worked with Lambda functions that schedule other Lambda functions by dynamically creating CloudWatch "cron" rules at runtime, but I'm now revisiting my old design and considering whether this is the best approach. It was a bit tedious to set up lambdas that schedule other lambdas, because in addition to submitting CW rules with the new lambda as the target, I also had to deal with granting the invoked lambda permission at runtime to be triggered by the new CW rule.
So another approach I'm considering is to submit jobs by adding them to a database table with a given execution time, and then have one single CW cron rule running every x minutes that checks the database for due jobs. This reduces the complexity of the CW rules (only one static rule needed), lambda permissions (also static), etc., but adds the complexity of an additional database table. Another difference is that while the old design only executed one "job" per invocation, this design would potentially execute 100 pending jobs in the same invocation, and I'm not sure if this could cause timeout issues.
Did anyone successfully implement something similar? What approach did you choose?
I know there are also other services such as AWS Batch, but this seems like overkill for scheduling simple tasks such as sending an e-mail when time t has passed since event e happened, since to my knowledge it doesn't support simple Lambda jobs. SQS also supports delayed messages, but only up to 15 minutes, so it doesn't seem useful for scheduling something in 24 hours.
An interesting alternative is to use AWS Step Functions to trigger the AWS Lambda function after a given delay.
Step Functions has a Wait state that can schedule or delay execution, so you can implement a fairly simple Step Functions state machine that puts a delay in front of calling a Lambda function. No database required!
For an example of the concept (slightly different, but close enough), see:
Using AWS Step Functions To Schedule Or Delay SNS Message Publication - Alestic.com
Task Timer - AWS Step Functions
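A minimal sketch of such a state machine, created via boto3; the ARNs are placeholders, and the machine expects a dueTime timestamp in its execution input:

import json
import boto3

definition = {
    "Comment": "Wait until the supplied timestamp, then invoke a Lambda.",
    "StartAt": "WaitUntilDue",
    "States": {
        "WaitUntilDue": {
            "Type": "Wait",
            "TimestampPath": "$.dueTime",  # e.g. "2020-01-02T09:00:00Z"
            "Next": "SendReminder",
        },
        "SendReminder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-reminder",
            "End": True,
        },
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="delayed-reminder",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-execution-role",
)

Each reminder then becomes one execution started with the desired dueTime; no database table and no dynamically created CW rules are needed.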

How to handle backpressure using Google Cloud Functions

Using Google Cloud Functions, is there a way to manage execution concurrency the way AWS Lambda does? (https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html)
My intent is to design a function that consumes a file of tasks and publishes those tasks to a work queue (Pub/Sub). I want to have a function that consumes tasks from the work queue (Pub/Sub) and executes each task.
The above could result in a large number of almost-concurrent executions. My downstream consumer service is slow and cannot consume many concurrent requests at a time. In all likelihood, it would return HTTP 429 responses to try to slow down the producer.
Is there a way to limit the concurrency of a given Google Cloud Function the way it is possible in AWS?
This functionality is not available for Google Cloud Functions. Instead, since you are asking to control the pace at which the system opens concurrent tasks, Task Queues are the solution.
Push queues dispatch requests at a reliable, steady rate. They guarantee reliable task execution. Because you can control the rate at which tasks are sent from the queue, you can control the workers' scaling behavior and hence your costs.
In your case, you can control the rate at which the downstream consumer service is called.
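As a rough illustration with the Cloud Tasks API (the successor to Task Queues); the project, location, queue name, and limits below are placeholder assumptions:

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = "projects/my-project/locations/us-central1"

# Cap both the dispatch rate and the number of in-flight tasks so the
# slow downstream consumer is never overwhelmed.
queue = tasks_v2.Queue(
    name=f"{parent}/queues/slow-consumer",
    rate_limits=tasks_v2.RateLimits(
        max_dispatches_per_second=5,
        max_concurrent_dispatches=10,
    ),
)
client.create_queue(parent=parent, queue=queue)
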
This is now possible with the current gcloud beta! You can set a maximum number of instances that can run at once:
gcloud beta functions deploy FUNCTION_NAME --max-instances 10 FLAGS...
See docs https://cloud.google.com/functions/docs/max-instances
You can set the number of "Function invocations per second" with quotas. It's documented here:
https://cloud.google.com/functions/quotas#rate_limits
The documentation tells you how to increase it, but you can also decrease it to achieve the kind of throttling that you are looking for.
You can control the pace at which cloud functions are triggered by controlling the triggers themselves. For example, if you have set "new file creation in a bucket" as the trigger for your cloud function, then by controlling how many new files are created in that bucket you can manage concurrent execution.
Such solutions are not perfect though, because sometimes cloud functions fail and get restarted automatically (if you've configured your cloud function that way) without you having any control over it. In effect, the number of active instances of cloud functions will sometimes be more than you planned.
What AWS is offering is a neat feature though.

Scheduling long-running tasks using AWS services

My application relies heavily on AWS services, and I am looking for an optimal solution based on them. The web application triggers a scheduled job (assume it repeats indefinitely) which requires a certain amount of resources to be performed. A single run of the task will normally take 1 minute at most.
Current idea is to pass jobs via SQS and spawn workers on EC2 instances depending on the queue size. (this part is more or less clear)
But I struggle to find a proper solution for actually triggering the jobs at certain intervals. Assume we are dealing with 10,000 jobs. For a scheduler to run 10k cron jobs (the job itself is quite simple, just passing a job description via SQS) at the same time seems like a crazy idea. So the actual question is: how do I autoscale the scheduler itself (given scenarios where the scheduler is restarted, a new instance is created, etc.)?
Or is the scheduler redundant as an app, and is it wiser to rely on AWS Lambda functions (or other services that provide scheduling)? The problem with using Lambda functions is their certain limitations, and that the 128 MB of memory provided by a single function is actually too much (20 MB seems like more than enough).
Alternatively, the worker itself can wait for a certain amount of time and notify the scheduler that it should trigger the job one more time. Let's say the frequency is 1 hour:
1. Scheduler sends job to worker 1
2. Worker 1 performs the job and after one hour sends it back to Scheduler
3. Scheduler sends the job again
The issue here, however, is the possibility that the worker will get scaled in.
Bottom line: I am trying to achieve a lightweight scheduler which would not require autoscaling and would serve as a hub with the sole purpose of transmitting job descriptions. And it certainly should not get throttled on service restart.
Lambda is perfect for this. You have a lot of short-running processes (~1 minute), and Lambda is made for short processes (up to five minutes nowadays). It is very important to know that CPU speed is coupled linearly to RAM. A 1 GB Lambda function is equivalent to a t2.micro instance if I recall correctly, and 1.5 GB of RAM means 1.5x more CPU speed. The cost of these functions is so low that you can just execute this. The 128 MB option has 1/8 the CPU speed of a micro instance, so I do not recommend using it.
As a queueing mechanism you can use S3 (yes you read that right). Create a bucket and let the Lambda worker trigger when an object is created. When you want to schedule a job, put a file inside the bucket. Lambda starts and processes it immediately.
Now you have to respect some limits. This way you can only have 100 workers at the same time (the total number of active Lambda instances), but you can ask AWS to increase this.
The costs are as follows:
$0.005 per 1,000 PUT requests, so $5 per million job requests (this is more expensive than SQS).
The Lambda runtime. Assuming normal t2.micro CPU speed (1 GB RAM), this costs about $0.0001 per 60-second job (the first 300,000 seconds are free, i.e. 5,000 jobs).
The Lambda requests. $0.20 per million triggers (the first million is free).
This setup does not require any servers on your part. It cannot go down (unless AWS itself does).
(Don't forget to delete the job from S3 when you're done.)
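A minimal sketch of that S3-as-queue worker, triggered by object-created events; the job format (a JSON file) is an assumption:

import json
from urllib.parse import unquote_plus
import boto3

s3 = boto3.client("s3")

def run_job(job):
    # Placeholder for the ~1 minute unit of work described above.
    pass

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        job = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        run_job(job)
        # Delete the object afterwards so the "queue" doesn't fill up.
        s3.delete_object(Bucket=bucket, Key=key)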