Scheduling one-time tasks with AWS

I have to implement functionality that requires delayed sending of a message to a user, once, on a specific date, which can be any time from tomorrow to a few months from now.
All our code is so far implemented as lambda functions.
I'm considering three options for how to implement this:
1. Create an entry in DynamoDB with the hash key being the date and the range key being a unique ID. Schedule a Lambda to run once a day and pick up all entries/tasks scheduled for that day, sending a message for each of them.
2. Using the SDK, create a CloudWatch Events rule with a cron expression indicating a single execution, and make it invoke a Lambda function (target) with the ID of the user/message. The Lambda would be invoked on the specific schedule with the specific user/message to be delivered.
3. Create a Step Functions instance and configure it to sleep, then invoke a step with the logic to send the message when the right moment comes.
Do you perhaps have any recommendation on what would be best practice for implementing this kind of business requirement? Or perhaps an entirely different approach?

It largely depends on scale. If you'll only have a few messages scheduled at any point in time, then I'd use the CloudWatch Events approach. It's very low overhead and doesn't involve running code that does nothing.
If you expect a LOT of schedules, then the DynamoDB approach is very possibly the best. Run the Lambda on a fixed schedule and see which records have not yet been run and are at or past the current time. In this model you'll want to delete the records you've already processed (or mark them in some way) so that you don't process them again. Don't rely on the schedule running at certain intervals and checking for records between the last run time and the current time unless you are actually recording when the last run was (i.e. don't assume you ran a minute ago just because you scheduled it to run every minute).
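For illustration, a minimal sketch of that polling Lambda, assuming a table keyed by date (hash) and id (range) and an SNS topic for delivery; the table name, topic ARN, and attribute names are all hypothetical:

import datetime

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

table = dynamodb.Table("scheduled_messages")              # hypothetical table
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:notify"   # hypothetical topic

def handler(event, context):
    today = datetime.date.today().isoformat()

    # Pick up every task scheduled for today (hash key = date).
    response = table.query(KeyConditionExpression=Key("date").eq(today))

    for item in response["Items"]:
        sns.publish(TopicArn=TOPIC_ARN, Message=item["message"])
        # Delete the task so a re-run doesn't send the message twice.
        table.delete_item(Key={"date": item["date"], "id": item["id"]})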
Step functions could work if the time isn't too far out. You can include a delay in the step that causes it to just sit and wait. The delays in step functions are just that, delays, not scheduled times, so you'd have to figure out that delay yourself, and hope it fires close enough to the time you expect it. This one isn't a bad option for mid to low volume.
Edit:
Step Functions Wait states now support waiting until an absolute time (the Timestamp / TimestampPath fields). This is a really good option for what you are describing.
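A minimal sketch of that pattern created with boto3; the state machine name, IAM role, and Lambda ARN are placeholder assumptions, and the execution input is expected to carry a sendAt ISO 8601 timestamp that the Wait state reads via TimestampPath:

import json

import boto3

sfn = boto3.client("stepfunctions")

# Wait until the timestamp supplied in the execution input, then
# invoke the message-sending Lambda (ARN is a placeholder).
definition = {
    "StartAt": "WaitUntilSendTime",
    "States": {
        "WaitUntilSendTime": {
            "Type": "Wait",
            "TimestampPath": "$.sendAt",
            "Next": "SendMessage",
        },
        "SendMessage": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-message",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="one-time-message",                            # placeholder name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-role",  # placeholder role
)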

As of November 2022, the cleanest approach would be to use EventBridge Scheduler's one-time schedule.
A one-time schedule invokes a target only once, at the date and time that you specify using a valid date and timestamp. EventBridge Scheduler supports scheduling in Coordinated Universal Time (UTC) or in the time zone that you specify when you create your schedule. You configure a one-time schedule using an at expression.
Here is an example using the AWS CLI:
aws scheduler create-schedule --schedule-expression "at(2022-11-30T13:00:00)" --name schedule-name \
    --target '{"RoleArn": "role-arn", "Arn": "QUEUE_ARN", "Input": "TEST_PAYLOAD"}' \
    --schedule-expression-timezone "America/Los_Angeles" \
    --flexible-time-window '{"Mode": "OFF"}'
Reference: Schedule types on EventBridge Scheduler - EventBridge Scheduler User Guide
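Since the question mentions everything is built with the SDK, here is a rough boto3 equivalent of the same one-time schedule (names and ARNs are placeholders):

import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="schedule-name",                          # placeholder
    ScheduleExpression="at(2022-11-30T13:00:00)",  # one-time 'at' expression
    ScheduleExpressionTimezone="America/Los_Angeles",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-role",  # placeholder
        "Arn": "arn:aws:sqs:us-east-1:123456789012:my-queue",        # placeholder
        "Input": "TEST_PAYLOAD",
    },
)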

Instead of using DynamoDB, I would suggest using S3. Store the message and the time to trigger as key-value pairs.
Use an S3 Lambda trigger to create the CloudWatch rules that would target the specific Lambdas, etc.
You could even schedule a cron for a Lambda that reads the files from S3 and updates the required cron for the message to be sent.
Hope this is in line with your requirements.

Related

Scheduling a reminder email in AWS step function (through events/SES) based on Dynamo DB attributes

I have a step function with 3 Lambdas; the last Lambda basically writes an entry in DynamoDB with a timestamp, status = "unpaid" (this is updated to "paid" automatically for some entries based on another workflow), and an email, and then closes the execution. Now I want to schedule a reminder via email for any entry in DynamoDB which is unpaid and over 7 days old, a second reminder if any entry is unpaid for over 14 days, and a third and last reminder on the 19th day. So the questions are:
Is there any way to do this scheduling per Step Function execution (one that can monitor that particular entry in DDB for 7, 14, and 19 days and send reminders accordingly as long as the status is "unpaid")?
If yes, would it be too much overhead, since there could be millions of transactions?
The second way I was thinking of was to build another scheduler Lambda sequence: the first Lambda parses through the whole DDB table searching for entries valid for a reminder (either 7, 14, or 19 days); the second Lambda gets the list from the first Lambda and prepares the reminder based on whether it's the first, second, or third (in a loop); and the third Lambda sends the reminder through SES.
Is there a better or easier way to do this?
I know we can trigger step functions or Lambdas through cloud events, and we also have crons that we can use, but they were not suiting the use case much.
Any help here is appreciated!
DynamoDB does not have functionality for a delayed notification based on logic; you would need to design this flow yourself. Luckily, AWS has all the tools you need to build it.
I believe the best option would probably be to create a CloudWatch Events/EventBridge rule when the item is written to DynamoDB (either via your application or via a Lambda triggered by DynamoDB Streams).
This event would be scheduled for 7 days' time. When it fires, the Lambda checks whether the entry has been paid: if it has not, you send out the notification and schedule the next event; if it has, you simply exit the Lambda function. This then continues for the next two time periods.
You could further enhance this by using DynamoDB Streams so that, whenever the table is updated, a Lambda is triggered to detect whether the status has changed from unpaid. If it has, simply remove the event trigger so nothing needs to be processed at all.
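One possible wiring, borrowing the one-time EventBridge Scheduler schedules from earlier in this thread (this answer suggests CloudWatch Events rules, which would work similarly): a Lambda on the table's DynamoDB stream schedules the first 7-day reminder for each new entry. All names and ARNs are placeholders, and a string "id" key is assumed:

import datetime
import json

import boto3

scheduler = boto3.client("scheduler")

REMINDER_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:send-reminder"  # placeholder
SCHEDULER_ROLE_ARN = "arn:aws:iam::123456789012:role/scheduler-role"                  # placeholder

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # only new entries get a reminder scheduled
        entry_id = record["dynamodb"]["Keys"]["id"]["S"]

        # First reminder in 7 days' time; the reminder Lambda re-checks
        # the status and schedules the 14- and 19-day follow-ups itself.
        run_at = datetime.datetime.utcnow() + datetime.timedelta(days=7)
        scheduler.create_schedule(
            Name=f"reminder-7d-{entry_id}",
            ScheduleExpression=f"at({run_at:%Y-%m-%dT%H:%M:%S})",
            FlexibleTimeWindow={"Mode": "OFF"},
            Target={
                "Arn": REMINDER_LAMBDA_ARN,
                "RoleArn": SCHEDULER_ROLE_ARN,
                "Input": json.dumps({"entryId": entry_id, "reminder": 1}),
            },
        )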

How to trigger an AWS Lambda at a specific time or date for n number of tasks?

I have a scenario where I will have a set of tasks to be executed at specific times.
For example:
task1: 28-06-2020 1:00 AM
task2: 30-06-2020 2:00 AM
task3: 01-07-2020 12:00 PM
.
.
.
n
I want to trigger my Lambda (where my logic is defined) at these specified times.
I would probably be storing my timings to execute in a database.
Can someone tell me a way to execute a Lambda at a specified time?
I know we have the TTL mechanism in DynamoDB which can trigger a Lambda, but it can delay the execution by up to 48 hours. I want my Lambda to execute at the precise timings.
You can use CloudWatch Events cron expressions for specific dates to execute only once. You would have to create a rule for each date in question. This is based on the assumption that there is no regular pattern to the repeatability of the dates.
The rules would trigger your Lambda on these specific dates.
For example, for the dates in the question, you could use the following expressions (a boto3 sketch follows):
cron(0 1 28 6 ? 2020) for task1 (28-06-2020 1:00 AM)
cron(0 2 30 6 ? 2020) for task2 (30-06-2020 2:00 AM)
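A sketch of creating one such single-shot rule and attaching the Lambda as its target; the rule name and ARNs are placeholders, and granting Events permission to invoke the function (lambda add-permission) is omitted:

import boto3

events = boto3.client("events")

# One-shot rule: this cron expression only ever matches once.
events.put_rule(
    Name="task1-2020-06-28",                     # placeholder rule name
    ScheduleExpression="cron(0 1 28 6 ? 2020)",  # 28-06-2020 1:00 AM UTC
    State="ENABLED",
)

events.put_targets(
    Rule="task1-2020-06-28",
    Targets=[{
        "Id": "task1",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-task",  # placeholder
        "Input": '{"taskId": "task1"}',  # payload passed to the Lambda
    }],
)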
Given that you will potentially have 1000+ events at various times of day, you will need to implement your own solution. I would recommend:
Store events in a database (e.g. date, time, repetition pattern)
Use AWS CloudWatch Events to trigger an AWS Lambda function every minute
Code the Lambda function to (sketched below):
Query the database for unprocessed events that are due (or past due)
Invoke the appropriate Lambda function
Delete the event from the database, or mark it as processed (for repeating events, store a 'last processed' time)
Functions will potentially be invoked a few seconds late due to these processing steps, but that should be fine unless you need high-precision timing.
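A sketch of that per-minute Lambda, assuming a DynamoDB table scheduled_tasks with due_at (epoch seconds), processed (flag), target (function name), and id attributes, all hypothetical:

import json
import time

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
lambda_client = boto3.client("lambda")

table = dynamodb.Table("scheduled_tasks")  # hypothetical table

def handler(event, context):
    now = int(time.time())

    # Find events that are due (or past due) and not yet processed.
    # A scan is fine at low volume; at scale you'd key the table by due time.
    response = table.scan(
        FilterExpression=Attr("due_at").lte(now) & Attr("processed").eq(False)
    )

    for item in response["Items"]:
        # Invoke the target Lambda asynchronously with the task payload.
        lambda_client.invoke(
            FunctionName=item["target"],
            InvocationType="Event",
            Payload=json.dumps({"taskId": str(item["id"])}),
        )
        # Mark as processed so the next run skips it.
        table.update_item(
            Key={"id": item["id"]},
            UpdateExpression="SET processed = :t",
            ExpressionAttributeValues={":t": True},
        )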
Step Functions as an ad-hoc scheduler could be a good option for this use case:
Query the database and schedule an execution for the specific date/time in a Step Functions state machine
In the Step Functions execution, map the Lambda that needs to be triggered
Once the Lambda is triggered at the desired time, the required business functionality can be implemented
References:
https://medium.com/serverless-transformation/serverless-event-scheduling-using-aws-step-functions-b4f24997c8e2
https://meetrix.io/blog/aws/06-using-step-functions-to-schedule-your-lambda.html
https://blog.smirnov.la/step-functions-as-an-ad-hoc-scheduling-mechanism-ed1787e44bb1

How to process files serially in cloud function?

I have written a Cloud Storage trigger-based Cloud Function. I have 10-15 files landing at 5-second intervals in a cloud bucket, and each loads data into a BigQuery table (truncate and load).
While there are 10 files in the bucket, I want the Cloud Function to process them in a sequential manner, i.e. 1 file at a time, as all the files access the same table.
Currently the Cloud Function is getting triggered for multiple files at a time, and it fails on the BigQuery operation as multiple files try to access the same table.
Is there any way to configure this in a Cloud Function?
Thanks in advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
First, use the notification capability of Google Cloud Storage and sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (object finalize), you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. This way, only 1 instance of the function can execute at the same time. So, no concurrency!
Finally, create a Pub/Sub subscription on the topic, with a filter or not, to call your function over HTTP.
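For illustration, a sketch of that last step with the google-cloud-pubsub client library; the project ID, topic, subscription name, and function URL are placeholders, and the filter assumes the GCS notification sets the eventType attribute to OBJECT_FINALIZE:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

project = "my-project"  # placeholder project ID
topic = subscriber.topic_path(project, "gcs-events")             # placeholder topic
subscription = subscriber.subscription_path(project, "load-bq")  # placeholder name

# Push subscription that only delivers object-creation events to the
# max-instances=1 HTTP function.
subscriber.create_subscription(
    request={
        "name": subscription,
        "topic": topic,
        "filter": 'attributes.eventType = "OBJECT_FINALIZE"',
        "push_config": {
            "push_endpoint": "https://REGION-my-project.cloudfunctions.net/load-bq"
        },
    }
)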
EDIT
Thanks to your code, I understood what happens. In fact, BigQuery is a declarative system. When you perform a request or a load job, a job is created and runs in the background.
In Python you can explicitly wait for the end of the job, but with pandas I didn't find how!
I just found a Google Cloud page explaining how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end:
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did this correctly in the _insert_into_bigquery_dwh function, but it's not the case in the staging _insert_into_bigquery_staging one. This can lead to 2 issues:
The dwh function works on the old data, because the staging isn't yet finished when you trigger this job.
If the staging takes, let's say, 10 seconds and runs in the "background" (you don't explicitly wait for the end in your code) and the dwh takes 1 second, the next file is processed at the end of the dwh function, even though the staging one continues to run in the background. And that leads to your issue. A corrected pattern for the staging load is sketched below.
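A minimal pattern for the staging load that waits explicitly; the table name is a placeholder, and _insert_into_bigquery_staging mirrors the function name from the question:

from google.cloud import bigquery

client = bigquery.Client()

def _insert_into_bigquery_staging(dataframe):
    # WRITE_TRUNCATE reproduces the truncate-and-load behaviour.
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
    job = client.load_table_from_dataframe(
        dataframe, "my_dataset.staging_table", job_config=job_config  # placeholder table
    )
    job.result()  # Block until the load job completes before returning.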
The architecture you describe isn't the same as the one from the documentation you linked. Note that in the flow diagram and the code samples, the storage event triggers the Cloud Function, which streams the data directly into the destination table. Since BigQuery allows multiple streaming inserts at the same time, several functions can execute simultaneously without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't allow you to configure the rate at which messages are sent: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if processed one at a time. Limiting the function to one instance may lead to overhead for the above reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, the above may not be a big concern.
If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration deadline for it using the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise the great answer from Guillaume would completely get the job done.
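A sketch of creating such a short-lived, per-execution table; the project, dataset, and one-hour expiration are assumptions:

import datetime

from google.cloud import bigquery

client = bigquery.Client()

def create_temp_table(execution_id):
    table = bigquery.Table(f"my-project.my_dataset.staging_{execution_id}")  # placeholder
    # Expire the table automatically one hour from now, so abandoned
    # executions don't leave tables behind.
    table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
    return client.create_table(table)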

aws dynamodb stream lambda processes too quickly

I have a DynamoDB table that I send data into. There is a stream that is being processed by a Lambda, which rolls up some stats and inserts them back into the table.
My issue is that my Lambda is processing the events too quickly, so almost every insert is being sent back to the Dynamo table, and inserting them back into the table is causing throttling.
I need to slow my Lambda down!
I have set my concurrency to 1.
I had thought about just putting a sleep statement into the Lambda code, but that would be billable time.
Can I delay the Lambda to only start once every x minutes?
You can't easily limit how often the Lambda runs, but you could re-architect things a little bit and use a scheduled CloudWatch Event as a trigger instead of your DynamoDB stream. Then you could have the Lambda execute every x minutes, collate the stats for records added since the last run, and push them to the table.
I never tried this myself, but I think you could do the following:
Put a delay queue between the stream and your Lambda.
That is, you would have a new Lambda function just pushing events from the DDB stream to an SQS queue. You can set a delay of up to 15 minutes on the queue. Then set up your original Lambda to be triggered by the messages in this queue. Be wary of SQS limits, though.
As per the Lambda docs: "By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream only has one record in it, Lambda only sends one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires." Using this you can add a bit of a delay, and perhaps process the batch sequentially even after receiving it. Also, since execution speed is not your priority, you will save cost as well: fewer Lambda invocations, and no money spent on sleeping. From the AWS Lambda docs: "You are charged based on the number of requests for your functions and the duration, the time it takes for your code to execute."
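For reference, a sketch of setting that batch window when wiring the stream to the Lambda with boto3 (function name and stream ARN are placeholders):

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    FunctionName="stats-rollup",  # placeholder
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/...",  # placeholder
    StartingPosition="LATEST",
    BatchSize=100,
    # Buffer records for up to 5 minutes before invoking the function.
    MaximumBatchingWindowInSeconds=300,
)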
No, unfortunately you cannot do it.
Having the concurrency set to 1 will definitely help, but won't solve it on its own. What you could do instead is slightly increase your write capacity units to prevent throttling.
To circumvent the problem, though, #bwest's approach seems very good. I'd go with that.
Instead of putting in a delay or setting concurrency to 1, you can do the following:
Increase the batch size, so that you process a few events together. It will introduce some delay as well as cost less money.
Instead of putting the data back into DynamoDB, put it into another store where you are not charged by WCU but by the amount of memory/RAM you are using.
Have a CloudWatch-triggered Lambda that takes the data from this temporary store and puts it back into DynamoDB.
This will ensure a few things:
You can control the lag w.r.t. the staleness of the aggregated data (i.e. you can have 2 strategies defined, let's say 15 mins or 1000 events, whichever comes first).
Your Lambda won't have to discard events when you are writing aggregated data very often (this problem would be there even if you used SQS).

Invoke AWS Lambda function only once, at a single specified future time

I want to be able to set a time to invoke an AWS Lambda function, then have that function be invoked then and only then. For example, I want my Lambda function to run at 9:00pm on December 19th, 2017. I don't want it to repeat, I don't want it to invoke now, just at 9:00pm on the 19th.
I understand that CloudWatch provides Scheduled Events, and I was thinking that when a time to schedule this reminder for is entered, a CloudWatch Scheduled Event would be created to fire after that amount of time from now (so if you schedule it at 8:22pm to run at 9pm, it'll be 38 mins); it would then invoke the Lambda function at 9pm, which would delete the CloudWatch Scheduled Event. My issue with this is that when a CloudWatch Scheduled Event is created, it executes right then, and then again at the specified interval.
Any other ideas would be appreciated, as I can't think of another solution. Thanks in advance!
You can schedule a Lambda event using the following syntax:
cron(Minutes Hours Day-of-month Month Day-of-week Year)
Note: all fields are required and the time zone is UTC only.
Please refer to this AWS documentation for details.
Thanks
You can use the DynamoDB TTL feature to implement this easily. Simply do the following:
1- Put an item with a TTL set to the exact time you want the Lambda function to be invoked.
2- Configure DynamoDB Streams to trigger a Lambda function on the item's remove event.
Once the item/record expires, your Lambda will be invoked. You don't have to delete or clean up anything, as the item in DynamoDB is already gone.
NOTE: The approach is easy to implement and scales very well, but there's one precaution to mention: using DynamoDB TTL as a scheduling mechanism cannot guarantee exact time precision, as there might be a delay. The scheduled tasks are executed a couple of minutes behind.
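A minimal sketch of both halves, assuming a table named delayed_invocations with TTL enabled on a ttl attribute (both names hypothetical):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("delayed_invocations")  # hypothetical table, TTL on "ttl"

def schedule(message_id, run_at_epoch):
    # 1- The item expires (and is removed by DynamoDB) around run_at_epoch.
    table.put_item(Item={"id": message_id, "ttl": run_at_epoch})

def stream_handler(event, context):
    # 2- Stream-triggered Lambda: act only on expirations (REMOVE events).
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            message_id = record["dynamodb"]["Keys"]["id"]["S"]
            print(f"sending message {message_id}")  # placeholder for the real send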
You can schedule a step function which can wait until a specific point in time before invoking the lambda with an arbitrary payload.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
Something like this
const AWS = require('aws-sdk')

const stepFunctions = new AWS.StepFunctions()

// Execution names only allow a limited character set, so use URL-safe base64
const encode = (value) => Buffer.from(String(value)).toString('base64url')

// `email`, `timestamp` (ISO 8601) and `initiatedBy` are supplied by the caller
const payload = {
  stateMachineArn: process.env.SCHEDULED_LAMBDA_SF_ARN,
  name: `${encode(email)}-${encode(timestamp)}`, // Dedupe key
  input: JSON.stringify({
    timestamp, // the Wait state's TimestampPath reads this
    lambdaName: 'myLambdaName',
    lambdaPayload: {
      email,
      initiatedBy,
    },
  }),
}

await stepFunctions.startExecution(payload).promise()
I understand it's quite late to answer this question, but anyone who wants to use a cron expression to trigger an event (or call an API) only once can use the following example.
This event will be triggered only once, on January 1, 2025 at 12:00:00 GMT:
00 12 01 01 ? 2025
For those who do not have much knowledge of cron syntax:
Minutes Hours DayOfMonth Month DayOfWeek Year
I am using this with AWS CloudWatch Events.
Note: I did not have to specify a day of week ("?"), since I have given it a fixed date and that makes it redundant.
Invoking a Lambda function via Events is an asynchronous invocation. When using a CloudWatch Event to trigger a Lambda function on a cron schedule, you may still run into the Lambda function being triggered multiple times for the same cron schedule. See the following link:
https://cloudonaut.io/your-lambda-function-might-execute-twice-deal-with-it/
The workaround described there requires a DynamoDB table in your account, which is used to make your Lambda function idempotent.