I am writing a syncing/ETL app inside AWS. It works as follows:
The source of the data is outside of AWS
Whenever data is changed or added, AWS is alerted via API Gateway (REST)
The REST API triggers a Lambda function that does ETL and stores the data in CSV format to S3
This works fine for small tables. However, we have been dealing with larger amounts of data lately, and I have to switch to Fargate (EKS/ECS) instead of Lambda. As you can imagine, these will be long-running jobs and not cheap to perform. Usually, when the data changes, it changes multiple times within a period of 5 minutes, say 3 times. So the REST API gets pinged 3 times in a row and triggers the ETL job 3 times as well, which is very inefficient.
I came up with the idea that every time the REST API is triggered, we wait for 5 minutes: if the API has not been invoked again during the waiting period, do the ETL; otherwise do nothing. I think I can do the waiting using Step Functions. However, I cannot find a suitable way to store the hash/ID of the latest ping to the API in one single variable. I thought I could store the hash in an S3 object and, after 5 minutes, check whether it is the same as the variable in my step function, but apparently ordering is not guaranteed. I looked into SQS, but the fact that it is a FIFO queue is not very convenient, and it is way more than what I actually need. I am pretty sure that other people have had a similar issue and that there must be a standard solution for this problem. I could not find one by googling, hence my plea here.
Thanks
From what I understand, Amazon DynamoDB is the store you are looking for to save the state of your job.
Also, please note that SQS is not FIFO by default. Using SQS won't prevent you from storing your job state.
What I would do:
Trigger a job and store its state in DynamoDB. Do not launch further jobs until the job state is done.
Orchestrate the ETL from Step Functions (including the 5-minute wait)
You can also expire your jobs so DynamoDB will automatically clean them up with time.
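A minimal sketch of the "only the latest ping runs" check described in the question, assuming each ping writes a fresh token to one shared record and the delayed step compares its own token with the stored one. `LatestPingStore` is a hypothetical in-memory stand-in for a single DynamoDB item, and `record_ping` and `should_run_etl` are illustrative names, not AWS APIs.

```python
import uuid


class LatestPingStore:
    """Hypothetical in-memory stand-in for one DynamoDB item per source table."""

    def __init__(self):
        self._latest = {}

    def put(self, source_table, token):
        # Last writer wins: each new ping overwrites the previous token.
        self._latest[source_table] = token

    def get(self, source_table):
        return self._latest.get(source_table)


def record_ping(store, source_table):
    """Called when the REST API is pinged; returns the token for this ping."""
    token = uuid.uuid4().hex
    store.put(source_table, token)
    return token


def should_run_etl(store, source_table, my_token):
    """Called after the 5-minute wait; True only if no newer ping arrived."""
    return store.get(source_table) == my_token
```

With three pings inside the waiting period, only the execution started by the third ping sees its own token and proceeds. In a real table the read after the wait would presumably need to be strongly consistent (DynamoDB supports this on `GetItem`) so that a stale token is not returned.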
We have an AWS Lambda function which queries some data for a client from our DB and sends a report to the client. Some clients want daily reports, some might need weekly or monthly reports. The number of clients can go up to ~1000 and each client might have ~10 such reports.
So we are looking for a way to trigger the Lambda function with different parameters based on schedules set by each client.
For Example:
Client A wants a daily report of their data sent to abc@clienta.com, and Client B wants a weekly report of their data sent to xyz@clientb.com. So the Lambda function will be invoked twice on Sunday at 12 AM (for both clients) and once on each of Monday to Saturday at 12 AM (for Client A).
We found the following solutions on AWS, but both have some limitations.
Approach 1: Use CloudWatch Events
We can create a CloudWatch Events Rule for each client and each report that could trigger our Lambda function on each schedule.
Pros:
Simple setup, easy to implement.
Cons:
There is a limitation of 100 Event Rules per AWS Account. It's mentioned that we can contact AWS to get it increased, but we are not sure if it can be increased to the number we are looking for (Currently it is ~10k, but we would prefer a solution in which there is no such limit). Also, a limit of 100 per account gives an indication that this is not a suitable solution for such a use case.
Approach 2: Using Step Functions
For each client and each report, we can create one AWS State Machine. We can use the Iterator pattern in Step Functions to wait for a day/week/month and then re-invoke the Lambda Function.
Pros:
No limitations on number of State Machines, so this enables us to scale easily.
Cons:
Step Functions have a limitation that they can run for a year, at maximum. This will be a problem in our case because the users will need to get the reports for a much longer period. There is a way to overcome this in Step Functions. Just before it's about to reach the 1-year limit, we can cancel the execution and start a fresh execution. So overall, this solution looks complex.
Can someone suggest a better solution for this on AWS?
Do you really need a CloudWatch Events rule for each client? Why not use something like the following architecture?
Have CloudWatch kick off a Lambda that checks the schedules for all clients each day (or at whatever the most frequent report schedule you allow is). You don't want this to take a long time, so it just checks a database (e.g. DynamoDB) of schedules and drops metadata about any reports that need to be generated (e.g. type of report, client information, destination email) onto an SQS queue. Worst case, it executes and finds nothing to schedule, but this should only take seconds, so the cost of running it every day is very low.
Then you have a Lambda that consumes the queue and actually does the report generation and emailing. This report generator Lambda will scale and spin up as many instances as it needs to handle the messages on the queue. You can set a concurrency limit on the report generator Lambda to ensure it doesn't spin up too many at a time, if that is a concern once you have thousands of clients.
The definition and deployment of all these components can easily be automated via AWS SAM.
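As a rough sketch of the daily checker step, assuming schedule records are plain dictionaries with a `frequency` field: the function below picks out the reports due on a given date and shapes them as queue messages. The field names, the choice of Sunday for weekly reports, and the 1st of the month for monthly reports are all assumptions made for illustration.

```python
from datetime import date


def is_due(frequency, today):
    """Return True if a report with this frequency should run on `today`."""
    if frequency == "daily":
        return True
    if frequency == "weekly":
        # Assumption: weekly reports go out on Sundays (weekday() == 6).
        return today.weekday() == 6
    if frequency == "monthly":
        # Assumption: monthly reports go out on the 1st of the month.
        return today.day == 1
    raise ValueError(f"unknown frequency: {frequency}")


def due_reports(schedules, today):
    """Filter schedule records down to the messages to enqueue today."""
    return [
        {"client": s["client"], "report": s["report"], "email": s["email"]}
        for s in schedules
        if is_due(s["frequency"], today)
    ]
```

Each returned dictionary would be serialized and sent to the SQS queue, where the report generator Lambda picks it up.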
Hope this alternate approach gives you a few more ideas.
You can combine both approaches to get the best result.
Step 1: Use Step Functions to run your Lambdas.
Step 2: Trigger your state machine from CloudWatch, based on Step Functions execution events (SUCCEEDED, FAILED, etc.).
This way, when step 1 fails or reaches the end of its 1-year run, a CloudWatch event can start it again, based on the JSON input you pass.
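A sketch of what the event pattern for such a rule might look like, written as a Python dictionary. The `source` and `detail-type` values are the ones Step Functions emits for execution status changes to CloudWatch Events (now Amazon EventBridge); the ARN is a hypothetical placeholder, and the choice of statuses that should cause a restart is an assumption. `should_restart` is only a local helper that mirrors the pattern for testing.

```python
import json

# Hypothetical ARN; replace with the state machine that runs the report loop.
STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:ReportLoop"
)

# Event pattern for the rule that restarts the state machine.
event_pattern = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {
        # Assumption: restart on failure or timeout only.
        "status": ["FAILED", "TIMED_OUT"],
        "stateMachineArn": [STATE_MACHINE_ARN],
    },
}


def should_restart(event):
    """Local mirror of the pattern above, useful for unit tests."""
    detail = event.get("detail", {})
    return (
        event.get("source") in event_pattern["source"]
        and event.get("detail-type") in event_pattern["detail-type"]
        and detail.get("status") in event_pattern["detail"]["status"]
        and detail.get("stateMachineArn")
        in event_pattern["detail"]["stateMachineArn"]
    )


# JSON form, as it would be supplied when creating the rule.
event_pattern_json = json.dumps(event_pattern)
```

The rule's target would be the same state machine, with the original JSON input passed along so the new execution continues the same schedule.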
Preamble: I have a web app whose backend is based on a serverless architecture. It's basically an Amplify app hosted on AWS with a DynamoDB database. I've learnt it is possible to create a task scheduling system of sorts (more here). A quick summary of the article: "It's possible to create a task scheduling table that takes advantage of TTL and DynamoDB Streams to execute a Lambda function at specific times. The TTL specifies a set time for a record to be deleted; we can capture this delete event in a DynamoDB stream and run some tasks based on information from the stream."
Problem:
The goal is to send a series of emails to users who sign up for our service. Each user that signs up gets a series of "Getting Started" emails. The first of the emails is sent 24 hours after a user signs up, the second 3 days later and the third exactly 7 days after sign up.
I see how a cron job would be suitable here, but it just seems a bit inefficient to me. I would basically have to search the users table for users whose sign up time falls between a specific 24 hour period and send the email to the users whereas with a Task scheduler table I could add a task to the table ( something like send first email to user300 with a TTL of when I want it to be sent ) and listen for delete events to run the task. No need to run a cron job daily, just a function that handles each task as it comes.
I think this is more of a performance vs. storage problem. A task scheduler table would take up space: if we add every email to be sent to a user as a task in the table each time a user signs up (each email to a specific user is its own task), then the table grows by 3n records for every n users who sign up. But this may not really be a problem, as tasks are deleted after they are run. I do not know the performance cost of using a cron job for this particular task, hence I'm here. I may also be wrong, and the cost of running and updating this task scheduler table may be more than that of the cron job.
I initially thought of setting up a dummy user table and running both the cron and the task scheduler and documenting cost of running both, but you can imagine how much time and effort that would take.
So I guess my question is which is a more efficient solution in terms of performance and cost?
There is no perfect solution here. Keep in mind that DynamoDB TTL can take up to 48 hours to delete an expired item, so it's probably unacceptable for this. Cron jobs with Lambda are cheap and easy to set up. You could also use SQS and populate it from a daily cron job. Yan Cui wrote a great article about this problem: https://theburningmonk.com/2019/03/dynamodb-ttl-as-an-ad-hoc-scheduling-mechanism/
This may not exactly be an answer. Based on the Medium article you linked, the author had a plausible reason why TTL and DynamoDB Streams would be better than a cron job, which you reiterated. Setting up a cron job is easier and cheaper (free), and I doubt the performance will be that much worse unless the database is huge. I don't have any experience doing something like this, so I wouldn't know how large the database would have to be for it to make sense to switch over. Alternatively, you can have as many cron jobs as you want, so I don't see why you couldn't just set up a user-specific cron job whenever someone signs up.
You can setup a CloudWatch Event to fire a Lambda function on a regular schedule. The Lambda function can search a database for an applicable result set and perform other actions - send an email, a text message, etc.
Here is an AWS tutorial that covers a very similar use case with step by step instructions. This tutorial is implemented by using the AWS Java API (but you can implement it using other supported programming languages).
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/usecases/creating_scheduled_events
From a Cost perspective - Lambda allows 1M free requests per month. Details are here - https://aws.amazon.com/lambda/pricing/
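To make the scheduled-Lambda approach concrete for the "Getting Started" series, here is a minimal sketch of the per-user check the function could run. It assumes all three offsets (1, 3 and 7 days) are measured from the sign-up time, which is one reading of the question; the email names and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: all three offsets are measured from the sign-up time.
EMAIL_OFFSETS = {
    "getting_started_1": timedelta(days=1),
    "getting_started_2": timedelta(days=3),
    "getting_started_3": timedelta(days=7),
}


def emails_due(signed_up_at, window_start, window_end):
    """Names of emails whose send time falls in [window_start, window_end)."""
    return [
        name
        for name, offset in EMAIL_OFFSETS.items()
        if window_start <= signed_up_at + offset < window_end
    ]
```

A daily run would pass the previous 24 hours as the window, so each email is picked up by exactly one run as long as the windows are contiguous.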
I use Node.js AWS Lambdas. If I don't make calls to S3, DynamoDB, or KMS for some time (approx. 8 hours or more), the first call I make is usually painfully slow, up to 5 seconds. There's nothing complex in the queries themselves, e.g. get a 0.2 KB S3 object or query a DynamoDB table by index.
So, it looks like AWS "hibernates" these resources when they aren't in active use and when I call them for the 1st time after a while they spend some time to return from "hibernated" state. This is my assumption, but I couldn't find any information about it in docs. So, the questions are the following two:
Is my assumption about "hibernation" correct?
If 1st point is correct, then is there any way to mitigate these "cold" calls to AWS services except keeping those services "warm" by calling them every X minutes?
Edit
Just to avoid confusions - this is not about Lambda's cold starts. I'm aware of them, they exist, they have their own share in functions' latency. Times I measure are the exact times of calls to S3/DynamoDB etc. - after the lambda is started.
In all likelihood it is the Lambda function that is hibernating, not the other services:
A cold start occurs when an AWS Lambda function is invoked after not
being used for an extended period of time resulting in increased
invocation latency.
https://medium.com/@lakshmanLD/resolving-cold-start%EF%B8%8F-in-aws-lambda-804512ca9b61
And yes, you could set up a CloudWatch event to keep your Lambda function warm.
We have experienced the same issue for calls to SSM and DynamoDB. It's probably not these services that go into hibernation, but the parameters for calling them are cached on the lambda container, which means they need to be recreated when a new container is spawned.
Unfortunately, we have not found a solution other than pinging the lambda from time to time. In this case, you should execute a call to your services in the ping in order to see an improvement in the loading times. See also below benchmark.
AWS (zoewangg) acknowledged the slow startup issue in the 1.11.x Java SDK:
One of the main reasons is that 1.11.x SDK uses ApacheHttpClient under
the hood and initializing it can be expensive.
Check out https://aws.amazon.com/blogs/developer/tuning-the-aws-java-sdk-2-x-to-reduce-startup-time/
I have an application on an AWS EC2 instance that runs once daily. The application fetches some files from a web service, parses the files line by line, updates a database, updates S3 files based on changes in the database, sends notification emails to customers as well as a few other tasks.
This is a series of logical tasks that must take place in sequence, although some of the tasks can be thought of as sub-tasks that can be executed in parallel. All tasks are a combination of Perl scripts and Java programs, with a single Perl script acting as the manager that executes each in turn. Some tasks can take as long as 45 minutes to complete, and the whole process can take up to 3 hours in total.
I'd like to make this whole process serverless. My initial idea was to use AWS Lambda, whereby each task would execute as a Lambda function, until I discovered Lambda functions impose a 5 minute execution timeout. It seems like the AWS Step Functions service is actually a better fit for my use case, but my understanding is that this service is backed by Lambda, so the tasks will still have the 5 min execution limitation.
(I'm also aware that I would have to re-write my Perl scripts to a language supported by Lambda).
I assume that I can work around the execution time limit by refactoring my code into smaller functions that are guaranteed to complete in under 5 minutes. In my particular situation, though, this seems inefficient.
Currently the database update task processes lines from a file one at a time. For this to work with Lambda, a Lambda function would need to handle only a single line from the file (or a very small number of lines) in order to guarantee not spilling over 5 minutes execution time. This would involve opening and closing a connection with the database on every invocation of the Lambda function. Also, each line processed should result in an entry written to a file, to be stored in S3. Right now, I just keep a file handle in memory and write the file to S3 when all lines are processed, but with Lambda I would need to keep reading the file, updating it and writing it back to S3.
What I'm asking is:
Is my use case a bad fit for AWS Lambda and/or AWS Step Functions?
Have I misunderstood how these services work?
Is there another AWS service that would be a better fit for my use case?
After further research, I think AWS Batch might be a good idea.
What you want are called Activity Workers. Tl;dr: You register "activities" and each gets an ARN. Then you can put that ARN in the resource field of Task states and then you run some code (the "worker") somewhere (in a Lambda, on EC2, in your basement, wherever) that polls for tasks identified by that ARN, then calls back to report success or failure. Activity Workers can run for up to a year.
Step-by-step details at the AWS docs
In response to RTF's comment, here's a deeper dive: Suppose you have code to color turtles in color_turtles.pl. So what you do is call the CreateActivity API - see http://docs.aws.amazon.com/step-functions/latest/apireference/API_CreateActivity.html - giving the name "ColorTurtles" and it'll give you back an ARN, a string beginning arn:aws... Then in your state machine you make a Task state with that ARN as the value of the resource field. Then you add code to color_turtles.pl to poll the service with http://docs.aws.amazon.com/step-functions/latest/apireference/API_GetActivityTask.html - whenever a machine you're running gets to that task, it'll go look for activity workers polling. It'll give your polling worker the input for the task, then you process the input and generate some output, and call SendTaskSuccess or SendTaskFailure. All these are just REST HTTP calls, so you can run them anywhere and I mean anywhere; in a Lambda, on an EC2 instance, or on some computer anywhere on the Internet.
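The polling loop described above can be sketched as follows. The client is passed in as a parameter; in practice it would presumably be `boto3.client("stepfunctions")`, whose `get_activity_task`, `send_task_success` and `send_task_failure` methods correspond to the API calls mentioned. `run_worker_once`, the worker name, and the `handler` callback are illustrative names, not part of any AWS library.

```python
import json


def run_worker_once(sfn, activity_arn, handler, worker_name="worker-1"):
    """Poll once for a task; return True if a task was processed.

    `sfn` is expected to behave like boto3.client("stepfunctions").
    """
    # Long poll: returns without a task token if nothing was scheduled.
    task = sfn.get_activity_task(
        activityArn=activity_arn, workerName=worker_name
    )
    token = task.get("taskToken")
    if not token:
        return False
    try:
        # Task input arrives as a JSON string.
        result = handler(json.loads(task["input"]))
    except Exception as exc:
        # Report the failure so the state machine can retry or fail.
        sfn.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        # Output must also be a JSON string.
        sfn.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

A worker process would call this in a loop, for example `while True: run_worker_once(sfn, arn, color_turtles)`, where `color_turtles` stands in for whatever the existing script does with the task input.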
So to answer your questions:
1) Yeah, if you've got something that'll run for around 45 minutes, whilst you could engineer it with Lambda/Step Functions, you're probably better off getting an EC2 micro instance.
2) Nope, you've pretty much got it.
3) As above, you want to go with EC2 for this. There's a good article on using Data Pipelines to start/stop an EC2 instance here; by starting the instance only when you need it, the cost (if any) is negligible.
I have jobs that run in this fashion. Normally you can get away with a t2.micro instance, which is free tier eligible.
You can also run your Perl scripts on an EC2 instance, so no need to rewrite them!
It seems you are looking for a workflow solution on AWS. SWF and Step Functions are the two most popular ones. Step Functions is the more recent offering and is encouraged by AWS more than SWF.
SWF has native capability to handle long-running tasks; the downside is that you have to provide your own execution environment for deciders (you can't use Lambda).
With Step Functions, you can do this in two different ways. One approach is suggested by Tim in his answer. An alternative way to achieve the same thing is to use a job poller in Step Functions. A job poller calls (polls) your resource to find out whether the task is done and, if not, puts the execution into a wait state for a specified time. As mentioned above, the maximum execution time currently allowed for any workflow is 1 year. If you have tasks that may take longer than 1 year, you can't use Step Functions in its current form.
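A sketch of what a job poller state machine might look like in Amazon States Language, built here as a Python dictionary. The two Lambda ARNs are hypothetical placeholders, the 60-second interval is arbitrary, and the definition assumes the status-checking function returns a plain string such as `SUCCEEDED` or `FAILED`.

```python
import json

# Hypothetical Lambda ARNs; both functions are placeholders.
SUBMIT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:SubmitJob"
STATUS_ARN = "arn:aws:lambda:us-east-1:123456789012:function:CheckJobStatus"

job_poller = {
    "Comment": "Submit a long-running job, then poll until it finishes.",
    "StartAt": "SubmitJob",
    "States": {
        "SubmitJob": {
            "Type": "Task",
            "Resource": SUBMIT_ARN,
            "ResultPath": "$.job",
            "Next": "WaitBeforePoll",
        },
        # Pause between polls instead of keeping a Lambda running.
        "WaitBeforePoll": {"Type": "Wait", "Seconds": 60, "Next": "CheckStatus"},
        "CheckStatus": {
            "Type": "Task",
            "Resource": STATUS_ARN,
            "ResultPath": "$.status",
            "Next": "IsJobDone",
        },
        "IsJobDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "SUCCEEDED", "Next": "Done"},
                {"Variable": "$.status", "StringEquals": "FAILED", "Next": "JobFailed"},
            ],
            # Any other status: wait and poll again.
            "Default": "WaitBeforePoll",
        },
        "JobFailed": {
            "Type": "Fail",
            "Error": "JobFailed",
            "Cause": "The job reported FAILED.",
        },
        "Done": {"Type": "Succeed"},
    },
}

# JSON form of the definition, as supplied when creating the state machine.
definition_json = json.dumps(job_poller, indent=2)
```

Each Lambda invocation stays short: one call starts the job wherever it actually runs, and the status check only has to ask whether that job has finished.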
I would like to use AWS Lambda as a social media post scheduler, but I can't find an elegant way to do so. In our app, users create social media posts and set a time. We then post them via the social network's API at the time specified.
I need to be able to schedule a Lambda to run once at a scheduled time and with unique data (being the user's token and the body of the post) in order to accomplish it. Here's an example:
John wants to post to Twitter next Thursday at 2pm. He's scheduled a
post with the body "Hello world!" for that time via our web app. The
app will talk to AWS Lambda via the API and set a Lambda function to
fire one time next Thursday at 2pm. That function would fire a request
to the Twitter API with John's token and the body ("Hello world!").
Would love to be able to do this serverless with Lambda, but I can't find a great way. If you could pair a Cloudwatch scheduled event trigger with a unique payload, that might work, but I don't see that it's possible. Otherwise, it seems this would require creating a new Lambda function for each post with the data hard-coded or having the Lambda hit the database to look for the scheduled post. Creating potentially hundreds of bespoke Lambda functions seems like a huge mess, and hitting the database at Lambda runtime seems like undue stress on the database since we have all the data we need in-hand at the time we schedule.
Any suggestions for how I might accomplish this with Lambda? Is there another AWS service that is better suited to the task? Should I give up on serverless and just set up another EC2 instance to handle the scheduler?
You definitely don't want to be creating a function + event per scheduled task.
The scalable way to do this would be to schedule a single function to run regularly (e.g. hourly) and check a database to see if any posts were scheduled for the last hour (i.e. since the last run), and perform them if so.
The reason I am suggesting a database is because you need to manage your state (that is, the post payload/details) somewhere, and relying on CloudWatch Events for this is not the right way, for all the reasons you've listed in your question.
An alternative to a database would be to put the payload in S3, and have the scheduled function check a specific location/bucket for the payloads that need processing. Lambda to S3 communication is very fast, and you don't need to worry about load or network transfers.
You could use AWS Step Functions for this task. With these you can model a state machine which waits for the exact timestamp to trigger.
https://aws.amazon.com/step-functions/
The only drawback is that documentation is still pretty scarce, but if you log into the AWS console, it provides some samples showing how to implement those wait processes.
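A minimal sketch of such a state machine, assuming one execution is started per scheduled post. The `Wait` state with `TimestampPath` is standard Amazon States Language; the Lambda ARN, the input field names (`publish_at`, `user_token`, `body`) and the `execution_input` helper are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical ARN of the Lambda that calls the social network's API.
PUBLISH_ARN = "arn:aws:lambda:us-east-1:123456789012:function:PublishPost"

post_scheduler = {
    "StartAt": "WaitUntilPublishTime",
    "States": {
        # Sleeps until the timestamp found in the execution input.
        "WaitUntilPublishTime": {
            "Type": "Wait",
            "TimestampPath": "$.publish_at",
            "Next": "PublishPost",
        },
        # Receives the whole input, including the token and body.
        "PublishPost": {"Type": "Task", "Resource": PUBLISH_ARN, "End": True},
    },
}


def execution_input(publish_at, user_token, body):
    """Build the JSON input for one execution (one scheduled post)."""
    if publish_at.tzinfo is None:
        raise ValueError("publish_at must be timezone-aware")
    # Format as an ISO 8601 UTC timestamp, e.g. 2024-01-04T14:00:00Z.
    stamp = publish_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps(
        {"publish_at": stamp, "user_token": user_token, "body": body}
    )
```

The web app would start an execution with this input when the user schedules the post, so the payload travels with the execution and no database lookup is needed at publish time.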