I have a relatively large number of tasks that need to be executed at certain intervals (hourly, daily, weekly, etc.). These tasks are easily defined as AWS Lambda functions, and I can schedule them easily enough with Amazon EventBridge.
However, in many cases jobs can fail due to delayed or missing data, or because other microservices go down. Take, for example, a function that is configured to run every hour, process data from hour X to hour X+1, and serialize it to some data store (the ETL use case). Suppose at 1am some service becomes unavailable and the job fails until engineering is able to address the issue at 10am, at which point the code for the Lambda is updated.
The desired behavior would be for that job to pick up where it left off and quickly catch up and process data from 1am to 10am (sequentially).
It would be relatively straightforward to implement some state-tracking service manually, where interval successes/failures are tracked and can be checked and registered via simple API calls. My question is whether there is existing software for this sort of application/service. As far as I can tell, Apache Airflow can do this, but it also comes with significantly more complexity and overhead than is needed.
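For context, here is roughly the kind of manual state tracking I have in mind. This is only a sketch; the DynamoDB table name, key, and default start date are placeholders:

```python
import datetime

import boto3

# Hypothetical DynamoDB table keyed by job name, storing the last successfully processed hour.
table = boto3.resource("dynamodb").Table("etl-watermarks")


def process_interval(start, end):
    # Placeholder for the real ETL step: read data for [start, end) and serialize it.
    pass


def handler(event, context):
    """Catch up sequentially from the last successful hour to the current hour."""
    item = table.get_item(Key={"job": "hourly-etl"}).get("Item") or {}
    last_success = datetime.datetime.fromisoformat(
        item.get("last_success", "2024-01-01T00:00:00")  # arbitrary default start
    )
    now = datetime.datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    hour = last_success + datetime.timedelta(hours=1)
    while hour <= now:
        process_interval(hour, hour + datetime.timedelta(hours=1))
        table.put_item(Item={"job": "hourly-etl", "last_success": hour.isoformat()})
        hour += datetime.timedelta(hours=1)
```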
Two options come to mind:
Track the state of your application with AWS Step Functions. You can implement coordination between Lambda functions, add parallel or sequential processing, etc. Step Functions also supports error handling and has built-in retry mechanisms.
Depending on the volume and velocity of the data you ingest, you could go with Amazon SQS or Amazon Kinesis to stream the data to Lambda functions. With SQS, you can retry every message; if a message can't be processed, you can move it to a dead-letter queue (DLQ) for further investigation. This approach is also highly scalable and allows parallel execution of jobs.
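As a rough illustration of the SQS option, a queue with a redrive policy hands failed messages to the DLQ automatically once they have been received too many times; the queue names and receive count here are placeholders:

```python
import json

import boto3

sqs = boto3.client("sqs")

# Create the dead-letter queue first so we can reference its ARN in the redrive policy.
dlq_url = sqs.create_queue(QueueName="etl-jobs-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Messages that fail processing 5 times are moved to the DLQ for investigation.
sqs.create_queue(
    QueueName="etl-jobs",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```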
I'm architecting an application solution in AWS and am looking into the options AWS has for running one-off jobs on a regular schedule.
For example, we have a task that needs to run every 5 minutes that makes an API call to an external API, interprets the data, and then possibly stores some new information in a database. This particular task is expected to run for 30 seconds or so and will need to be run every 5 minutes. Where this gets a little more complex is that we're running a multi-tenant application, and this task needs to be performed for each tenant individually. It doesn't satisfy the users' requirements to have a single process do the specified task for each tenant in sequence. The task must be performed every x minutes (sometimes as low as every minute), and it must complete for each tenant in the time it takes to perform the task exactly once. In other words, all 200 tenants, let's say, must have a task run for them at midnight, and each task must complete in the time it takes to query the API and update the database for one tenant.
To add to the complexity a bit, this is not the only task we will be running on a regular schedule for our tenants. In the end we could have dozens of unique tasks, each running for hundreds of tenants, resulting in thousands or tens of thousands of unique concurrent tasks.
I've looked into ECS Scheduled Tasks, which use CloudWatch Events (now EventBridge), but EventBridge has a limit of 300 rules per event bus. I think that means we're going to be out of luck if we need 10,000 rules (one per task times the number of tenants), but I'm honestly not sure whether each account gets its own event bus or if that's divided up differently.
In any case, even if this did work, it's still not a very attractive option to me to have 10,000 different rules set up in EventBridge. At the very least, it feels like it might be difficult to manage. To that end, I'm now leaning toward creating a single EventBridge rule per event type that kicks off a parent task, which in turn asynchronously kicks off as many instances of a child task as needed, one per tenant. This would limit our EventBridge rules to somewhere around a few dozen. Each one of these, when triggered, would asynchronously spawn a task for each tenant, and those could all run together. I'm not 100% sure what type of object this would spawn; it wouldn't be a Lambda, since that would easily cause us to hit the 1,000 concurrent Lambda function limit, but it might be something like a Fargate ECS task that executes for a few seconds and then goes away when it's completed.
I'd love to hear others' thoughts on these options, my current direction, and any other options I'm currently missing.
You don't necessarily need to look at ECS for this, because 1,000 invocations of a Lambda at a time is only the default concurrency limit. That is something you can request an increase for in the Service Quotas console:
There is no maximum concurrency limit for Lambda functions. However, limit increases are granted only if the increase is required for your use case.
Source: AWS Support article.
Same goes for the 300 rules per event bus limit. That is also a default limit and can be increased upon request in the Service Quotas console.
Since you mentioned branching logic, I wonder if you've looked into AWS Step Functions? In particular, Express Workflows within Step Functions may suit the duration and rates of your executions.
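With the concurrency quota raised, the fan-out described in the question can also stay entirely in Lambda. A minimal sketch, where the worker function name and the tenant lookup are hypothetical:

```python
import json

import boto3

lambda_client = boto3.client("lambda")


def load_tenant_ids():
    # Placeholder: in practice, read the tenant list from your database or config store.
    return ["tenant-1", "tenant-2"]


def handler(event, context):
    """Parent task, triggered by one EventBridge rule per task type.

    Fans out one asynchronous invocation of a per-tenant worker function,
    so all tenants are processed in parallel rather than in sequence.
    """
    for tenant_id in load_tenant_ids():
        lambda_client.invoke(
            FunctionName="per-tenant-worker",  # hypothetical worker function
            InvocationType="Event",            # asynchronous: returns immediately
            Payload=json.dumps({"tenant_id": tenant_id, "task": event.get("task")}),
        )
```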
In an AWS Glue job, we can write a script and execute it as a job.
In AWS Lambda, too, we can write the same script and execute the same logic as in the job above.
So my query is not what the difference is between an AWS Glue job and AWS Lambda; rather, I am trying to understand when an AWS Glue job should be preferred over AWS Lambda, especially when both can do the same job. If both do the same job, then ideally I would blindly prefer using AWS Lambda itself, right?
Please try to understand my query.
Additional points:
Per this source, the Lambda FAQ, and the Glue FAQ:
Lambda can use a number of different languages (Node.js, Python, Go, Java, etc.), whereas Glue can only execute jobs written in Scala or Python.
Lambda can execute code from triggers from other services (SQS, Kafka, DynamoDB, Kinesis, CloudWatch, etc.), whereas Glue can be triggered by Lambda events, other Glue jobs, manually, or on a schedule.
Lambda runs much faster for smaller tasks, whereas Glue jobs take longer to initialize because they use distributed processing. That said, Glue leverages that parallel processing to run large workloads faster than Lambda.
Lambda tends to require more complexity/code to integrate with data sources (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.), while Glue can easily integrate with these. However, with the addition of Step Functions, multiple Lambda functions can be written and ordered sequentially to reduce complexity and improve modularity, where each function integrates with one AWS service (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.).
Glue also has a number of additional components, such as the Data Catalog, a central metadata repository for viewing your data; a flexible scheduler that handles dependency resolution, job monitoring, and retries; AWS Glue DataBrew for cleaning and normalizing data with a visual interface; AWS Glue Elastic Views for combining and replicating data across multiple data stores; and the AWS Glue Schema Registry for validating streaming data schemas.
There are other examples I am missing, so feel free to comment and I can update.
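For a feel of the difference in shape, here is a minimal Glue (PySpark) job skeleton; the database, table, and bucket names are placeholders, and the Lambda equivalent would simply be a handler function with whatever SDK calls it needs:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve arguments and initialize the job context.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a catalog table as a DynamicFrame and write it back out as Parquet.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table"
)
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)

job.commit()
```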
Lambda has a maximum execution time of fifteen minutes. It can be used to trigger a Glue job as an event-based activity: when a file lands in S3, for example, an event trigger can run a Glue job. Glue is a managed service for all data processing.
If the data volume is very low, maybe you can do it in Lambda, but if for some reason the process runs beyond fifteen minutes, the data processing would fail.
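A small sketch of that event-based hand-off, assuming the Lambda is wired to an S3 put event; the Glue job name and argument keys are hypothetical:

```python
import boto3

glue = boto3.client("glue")


def handler(event, context):
    """Triggered by an S3 put event; starts a Glue job run for each new object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="my-etl-job",  # hypothetical Glue job name
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
```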
The answer to this can involve some foundational design decisions. What is this job doing? What kind of data are you dealing with? Is there a decision to be made about whether the task should be executed in a batch-oriented or event-oriented paradigm?
Batch
This may be necessary or desirable because the task:
Is being done over large, monolithic data (e.g., binary).
Relies on the context of multiple records in a dataset, such that they must be loaded into a single job.
Requires records to be processed in order.
I feel like just as often I see batch handling chosen by default because "this is the way we've always done it," but breaking from this approach could be worth considering.
Glue is built for batch operations. With a current maximum execution time of 15 minutes and maximum memory of 10 GB, Lambda has become capable of processing fairly large datasets in a single execution as well. It can be difficult to pin down a direct cost comparison without specifics of the workload. When it comes to development, I feel that Lambda has the edge as far as tooling to build, test, and deploy.
Event
In the case where your data consists of a set of records, it might behoove you to parse and "stream" them into Lambda. Consider a flow like:
CSV lands in S3.
S3 event triggers Lambda.
Lambda reads and parses the CSV into discrete events and submits them to another Lambda or publishes them to SNS for downstream processing. Concurrent instances of this Lambda can be employed to speed up ingest, with each instance responsible for certain lines of the S3 object.
This pushes all logic and error handling, as well as the resources required, down to the individual event/record level. Often, mechanisms such as dead-letter queues are employed for remediation. While the context of a given container persists across invocations - assuming the container has not been idle and torn down - Lambda should generally be considered stateless, such that the processing of an event/record is thought of as occurring within its own scope, outside that of the others in the dataset.
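A minimal sketch of the parsing Lambda in that flow, publishing one SNS message per CSV row; the topic ARN is a placeholder:

```python
import csv
import json

import boto3

s3 = boto3.resource("s3")
sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:csv-records"  # placeholder topic


def handler(event, context):
    """Triggered by the S3 event; parses the CSV and publishes one message per record."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.Object(bucket, key).get()["Body"].read().decode("utf-8")
        for row in csv.DictReader(body.splitlines()):
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(row))
```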
I have an AWS Lambda function that uses an AWS SQS trigger to pull messages, processes them with an AWS Comprehend endpoint, and puts the output in AWS S3. The AWS Comprehend endpoint has a rate limit that goes up and down throughout the day based on something I can control. The fastest way to process my data, which also optimizes the cost I am paying for keeping the AWS Comprehend endpoint up, is to set concurrency high enough that I get throttling errors returned from the API. This, however, comes with the caveat that I am paying for more AWS Lambda invocations; the flip side is that to optimize the cost I am paying for AWS Lambda, I want zero throttling errors.
Is it possible to set up autoscaling for the concurrency limit of the Lambda such that it will increase if it isn't getting any throttling errors, but decrease if it is getting too many?
Very interesting use case.
Let me start by pointing out something that I found out the hard way in an almost four-hour-long call with AWS Tech Support, after being puzzled for a couple of days.
With SQS acting as a trigger for AWS Lambda, the concurrency cannot go beyond 1K, even if the Lambda's concurrency is set to a higher limit.
There is now a detailed post on this over at Knowledge Center.
With that out of the way, and assuming you are under the 1K limit at any given point in time and so only have to use one SQS queue, here is what I feel can be explored:
Either use an existing CloudWatch metric (via Comprehend) or publish a new metric that indicates the load you can handle at any given point in time. You can then use this to set an appropriate concurrency limit for the Lambda function. This would ensure that even if the SQS queue is flooded with messages to be processed, Lambda picks them up at the rate at which they can actually be processed.
Please note: this comes out of my own philosophy of being proactive rather than reactive. I would not wait for something to fail (e.g., invocation errors in this case) in order to trigger other processes that adjust concurrency. System failures should be rare and should actually raise an alarm (if not panic!), rather than being something normal that occurs a couple of times a day!
To build on that, if possible I would suggest that you approach this the other way around, i.e., scale the Comprehend processing limit and the AWS Lambda concurrency based on the messages in the SQS queue (the backlog), or on a combination of this backlog, the time of day, etc. This way, if every part of your pipeline is a function of the amount of backlog in the queue, you can rest assured that you are not spending more than you have to at any given point in time.
More importantly, you always have capacity in place should the need arise or something out of the ordinary happen.
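As a rough sketch of driving everything from the backlog, a small scheduled Lambda could size the worker function's reserved concurrency from the queue depth; the queue URL, function name, and backlog-to-concurrency mapping below are all assumptions to be tuned against your Comprehend limits:

```python
import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/comprehend-jobs"  # placeholder
FUNCTION_NAME = "comprehend-worker"  # placeholder


def handler(event, context):
    """Runs on a schedule; sizes the worker's reserved concurrency from the queue backlog."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )["Attributes"]
    backlog = int(attrs["ApproximateNumberOfMessages"])

    # Assumed mapping from backlog to concurrency; tune it so Lambda never exceeds
    # what the Comprehend endpoint can absorb at its current rate limit.
    concurrency = max(1, min(900, backlog // 10))

    lambda_client.put_function_concurrency(
        FunctionName=FUNCTION_NAME, ReservedConcurrentExecutions=concurrency
    )
```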
The situation
Currently, I am using an Amazon SQS queue that triggers a Lambda function to process new messages upon arrival in the queue. Messages whose processing fails are moved to a DLQ (dead-letter queue).
In order to seed the SQS queue, I am using a cron job that runs every day and inserts the available jobs into the queue.
I want to issue a summary alert/email once all the new jobs the cron has inserted for the day have been processed, along with details about how many jobs succeeded, how many failed, and how many were issued in total that day.
The problem:
Since the Lambda functions run separately, and I want to keep it that way, I was wondering what would be the best service to use to store the temporary count values (at least two of the three counts are needed: total, succeeded, and failed)?
I was thinking about DynamoDB, but any database seems like overkill for this and wouldn't be cost-effective either. S3 also doesn't seem to be the most practical/preferred option for this type of solution. I could also use SQS (as its "storage" is somewhat suited to cases with relatively small amounts of data such as this) with a "count" identifier that every Lambda function updates, but knowing which Lambda function was the last one requires checking the whole queue, which seems like over-complicating things.
Any other AWS service that comes up to mind?
Here is a good listing of Storage Options in the AWS Cloud (from 2013, but it covers some of the options available today as well).
AWS Systems Manager Parameter Store can be used as a 'mini-database'.
It requires AWS credentials to access (which would be available to the Lambda functions or whatever code you are running to perform this check) but has no operational cost.
From PutParameter - AWS Systems Manager:
Parameter Store offers a standard tier and an advanced tier for parameters. Standard parameters have a content size limit of 4 KB and can't be configured to use parameter policies. You can create a maximum of 10,000 standard parameters for each Region in an AWS account. Standard parameters are offered at no additional cost.
You could run into problems if multiple processes try to update the parameters simultaneously, but hopefully your use-case is pretty simple.
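A minimal sketch of using a parameter as a counter; the parameter name is hypothetical, and note that the read-modify-write is not atomic, which is exactly the simultaneous-update caveat above:

```python
import boto3

ssm = boto3.client("ssm")


def increment_counter(name):
    """Read-modify-write a counter parameter; only suitable for low-contention use."""
    try:
        current = int(ssm.get_parameter(Name=name)["Parameter"]["Value"])
    except ssm.exceptions.ParameterNotFound:
        current = 0
    ssm.put_parameter(Name=name, Value=str(current + 1), Type="String", Overwrite=True)
    return current + 1


# For example, at the end of each worker Lambda (parameter name is a placeholder):
# succeeded_so_far = increment_counter("/daily-jobs/succeeded")
```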
I have several AWS Lambda functions triggered by events from other applications, e.g., via Kinesis. Some of these events should trigger something happening at a later time. As an example, consider the case of sending a reminder/notification e-mail to a user about something once 24 hours have passed since event X happened.
I have previously worked with Lambda functions that schedule other Lambda functions by dynamically creating CloudWatch "cron" rules at runtime, but I'm now revisiting my old design and considering whether this is the best approach. It was a bit tedious to set up Lambdas that schedule other Lambdas, because in addition to submitting CloudWatch rules with the new Lambda as the target, I also had to deal with granting the invoked Lambda permission, at runtime, to be triggered by the new CloudWatch rule.
So another approach I'm considering is to submit jobs to be done by adding them to a database table, with a given execution time, and then have one single CloudWatch cron rule running every x minutes that checks the database for due jobs. This reduces the complexity of the CloudWatch rules (only one static rule needed), the Lambda permissions (also static), etc., but adds the complexity of an additional database table. Another difference is that while the old design only executed one "job" per invocation, this design would potentially execute 100 pending jobs in the same invocation, and I'm not sure whether that could cause timeout issues.
Did anyone successfully implement something similar? What approach did you choose?
I know there are also other services such as AWS Batch, but this seems like overkill for scheduling simple tasks such as sending an e-mail once time t has passed since event e happened, since to my knowledge it doesn't support simple Lambda jobs. SQS also supports delayed messages, but only up to 15 minutes, so it doesn't seem useful for scheduling something in 24 hours.
An interesting alternative is to use AWS Step Functions to trigger the AWS Lambda function after a given delay.
Step Functions has a Wait state that can schedule or delay execution, so you can implement a fairly simple Step Functions state machine that puts a delay in front of calling a Lambda function. No database required!
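A minimal sketch of starting one of these executions from the event-handling Lambda; the state machine ARN is a placeholder and is assumed to have a Wait state reading $.fire_at, followed by a Task state that invokes the reminder Lambda:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN of a state machine with roughly this shape:
#   WaitUntil (Type: Wait, TimestampPath: $.fire_at) -> SendReminder (Task: reminder Lambda)
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:reminder"


def schedule_reminder(user_id, fire_at_iso8601):
    """Start an execution that sleeps until fire_at, then calls the reminder Lambda."""
    return sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"user_id": user_id, "fire_at": fire_at_iso8601}),
    )
```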
For an example of the concept (slightly different, but close enough), see:
Using AWS Step Functions To Schedule Or Delay SNS Message Publication - Alestic.com
Task Timer - AWS Step Functions