We have a cron job which runs every 4hrs on an 10MB file and completes processing within 20 minutes.
We are migrating it to NAWS using cloud watch events to triggers lambda/fargate.
I am trying to figure out the best option among these three.
Lambda with concurrency : We can distribute the input file and run lambdas in parallel.
Lambda with daisy chaining : Create different lambda functions for different functionality so that all the components can run in parallel where all the component lambdas are running concurrently as well.
Fargate : I don't know much about Fargate except that is a container based platform, and Fargate might be suitable for long-running scheduled tasks which can be migrated to Fargate, with less refactoring required than for migrating to Lambda.
Any suggestions how to choose? And if I should consider any other options as well
Option 3 is a good fit for your use case. Basically you would write a Docker container to perform the processing, and configure a Cloudwatch event on a cron schedule. I also use Fargate for this purpose and it works great.
Option 1 sounds OK, but sounds like unnecessary extra steps to split the file up.
Option 2 sounds too complex.
You could also look at AWS Batch, but it might be overkill for this task.
Related
Say I have 4 or 5 data sources that I access through API calls. The data aggregation and mining is all scripted in a python file. Lets say the output is all structured data. I know there are plenty of considerations, but from a high level, what would some possible solutions look like if I ultimately wanted to run analysis in BI software?
Can I host the python script in Lambda and set a daily trigger to run the python file. And then have the output stored in RDS/Aurora? Or since the applications I'm running API calls to aren't in AWS, would I need the data to be in an AWS instance before running a Lambda function?
Or host the python script in an EC2 instance, use lambda to trigger a daily refresh that just stores the data in EC2-ESB or Redshift?
Just starting to learn AWS cloud architecture so my knowledge is fairly limited. Just seems like there can be multiple solutions to any problem so not sure if the 2 ideas above are viable.
You've mentioned two approaches which are working. Ultimately it very depends on your use case, budget etc.. and you are right, usually in AWS you will have different solutions that can solve the same problem. For example, another possible solution could be to Dockerize your Python script and run it on containers services (ECS/EKS). But considering you just started with AWS I will focus on the approaches you mentioned as it's probably the most 2 common ones.
In short, based on your description, I would not suggest to go with EC2 because it adds complexity to your use case and moreover extra costs. If you can imagine the final setup, you will need to configure and manage the instance itself, its class type, AMI, your script deployment, access to internet, subnets, etc. Also a minor thing to clarify: you would probably set a cron expression on it to trigger your script (not a lambda reaching the EC2 !). As you can see, quite a big setup for poor benefits (except maybe gaining some experience with AWS ;)) and the instance would be idle most of the time which is far from optimum.
If you just have to run a daily Python script and need to store the output somewhere I would suggest to use lambda for the processing, you can simply have a scheduled event (prefered way is now Amazon EventBridge instead) that triggers your lambda function once a day. Then depending on your output and how you need to process it, you can use RDS obviously from lambda using the Python SDK but you can also use S3 as blob storage if you don't need to run specific queries - for example if you can store your output in json format.
Note that one limitation to lambda is that it can only run for 15 minutes straight per execution. The good thing is that by default lambda has internet access so you don't need to care about any gateway setup and can reach your external endpoints.
Also from a cost perspective running one lambda/day combined with S3 should be free or almost free. The pricing in lambda is very cheap. Running 24/7 an EC2 instance or RDS (which is also an instance) will cost you some money.
Lambda with storage in S3 is the way to go. EC2 / EBS costs add up over time and EC2 will limit the parallelism you can achieve.
Look into Step Functions as a way to organize and orchestrate your Lambdas. I have python code that copies 500K+ files to S3 and takes a week to run. If I copy the files in parallel (500-ish at a time) this process takes about 10 hours. The parallelism is limited by the sourcing system as I can overload it by going wider. The main Lambda launches the file copy Lambdas at a controlled rate but also terminates after a few minutes of run time but returns the last file updated to the controlling Step Function. The Step Function restarts the main Lambda where the last one left off.
Since you have multiple sources you can have multiple top level Lambdas running in parallel all from the same Step Function and each launching a controlled number of worker Lambdas. You won't overwhelm S3 but you will want to make sure you don't overload your sources.
The best part of this is that it costs pennies (at the scale I'm using it).
Once the data is in S3 I'm copying it up to Redshift and transforming it. These processes are also part of the Step Function through additional Lambda Functions.
We are using a task-scheduler to run a series (sequence) of tasks/jobs (SOAP calls) no our site, every night. But, we want to use AWS services such as Step-Function and Lambda to achieve above requirement, as Task-scheduler seems less reliable.
Challenge with Lambda is, 15 min. max timeout. As some of our jobs take more than 1 hour to process, I am having trouble figuring out which service could suffice the request.
I am also looking into AWS Fargate, as an alternative.
Any suggestions/edits are welcome, on which AWS services I could use to run jobs which take up to 1 hour or more.
You could look into AWS Batch which is meant for long running jobs. You can use store your powershell scripts in a Docker container and schedule them to run. They will run until completion.
I have a java application that reads from a SQS queue and does some business processing and finally writes it to a datastore. As the SQS queue grows I want to be able to scale to read more messages and process them. Each SQS message will take about 15 to 20 minutes to process. I was looking at a service like AWS Fargate or AWS Beanstalk to deploy my application. Money is not a concern but usability is. What would be the best platform?
Fargate would be an ideal solution, as it has following advantages over Beanstalk:
It's serverless
More fine-grained control for custom application architectures.
No need to write EB extensions.
Build and Test image locally and Promote same to Fargate.
With application autoscaling, you can scale on the go.
Pricing is per second with a 1-minute minimum
FAQ:
https://aws.amazon.com/fargate/faqs/
Pricing:
https://aws.amazon.com/fargate/pricing/
I've had a very similar use case to this and I used Batch. (which was not available in 2014 when the question was asked)
https://aws.amazon.com/batch/
In my case I was processing audio and video files from the queue.
You can set a lambda to fire on the SQS queue and have that drop the job onto batch for processing.
If you have the minimum cluster size set to zero then you will have no servers running when there is no work to do, but you can have them autoscale up to process as much work as you require when the jobs come in.
The advantage compared to lambda is that the code that executes can be any container with as much resource as you want to throw at it.
For your use case it will be perfect, but for anything that can complete processing in a a few seconds or a minute it's worth making each job process more than one task per execution or all of the time will be spent firing up and shutting down containers.
I have a batch job that I need to run on AWS. I'm wondering what's the best service to use. The job needs to run once a day, so I think that naturally AWS Lambda with a CloudWatch Rule triggering it would do it. However, I'm starting to think that AWS Lambda is thought to be used as a service to handle requests. This AWS official library to integrate Spring-Boot is very oriented to handle HTTP requests, and when creating a lambda via AWS Console, only test cases that send an input to the lambda can be written.
Then, is this a use case for AWS Lambda? Also, these functions can run up to 15 minutes. What should I use if my job needs to run longer?
The purpose of Lambda, as compared to AWS EC2, is to simplify building smaller, on-demand applications that are responsive to events and new information.
If your batch is running within a limit of 15 minutes then you can go with a lambda function.
But if you want batch processing to be done, you should check AWS batch.
Here is nice article which demonstrates the usage of AWS batch.
If you are already using some batch framework like spring-batch, you can also take a look at ECS scheduled task with Fargate.
With ECS Fargate you can launch and stop container services that you need to run only at certain times.
Here are some related articles on Fargate event and scheduled task and Scheduled Tasks.
If you're confident that your function will only run at maximum of 15mins, AWS Lambda could be the solution. Here are the AWS Lambda limits that could help you decide on that.
Also note that lambda has cold start, it's when it will run slower at first but will eventually pick up the pace. Here are some good reads about it that could help you decide on the lambda direction, but feel free to check on any articles that could better explain at your disposal.
This one shows a brief lists that you would like to consider and the factors affecting it.
This one might have a deeper explanation of the cold start with regards to how it works internally.
What should I use if my job needs to run longer?
Depending on your infrastructure, you could maybe explore Scheduled Tasks
Currently I have a single server in amazon where I put all my cronjobs. I want to eliminate this single point of failure, and expose all my tasks as web services. I'd like to expose the services behind a VPC ELB to a few servers that will run the tasks when called.
Is there some service that Amazon (AWS) offers that can run a reoccurring job (really call a webservice) at scheduled intervals? I'd really like to be able to keep the cron functionality in terms of time/day specification, but farm out the HA of the driver (thing that calls endpoints at the right time) to AWS.
I like how SQS offers web endpoint(s), but from what I can tell you cant schedule them. SWF doesn't seem to be a good fit either.
AWS announced support for scheduled functions in Lambda at its 2015 re:Invent conference. With this feature users can execute Lambda functions on a scheduled basis using a cron-like syntax. The Lambda docs show an example of using Python to perform scheduled events.
Currently, the minimum resolution that a scheduled lambda can run at is 1 minute (the same as cron, but not as fine grained as systemd timers).
The Lambder project helps to simplify the use of scheduled functions on Lambda.
λ Gordon's cron example has perhaps the simplest interface for deploying scheduled lambda functions.
Original answer, saved for posterity.
As Eric Hammond and others have stated, there is no native AWS service for scheduled tasks. There are only workarounds and half solutions as mentioned in other answers.
To recap the current options:
The single-instance autoscale group that starts and stops on a schedule, as described by Eric Hammond.
Using a Simple Workflow Service timer, which is not at all intuitive. This case study mentions that JPL used SWF to build a distributed cron, but there are no implementation details. There is also a reference to a code example buried in the SWF code samples.
Run it yourself using something like cronlock.
Use something like the Unreliable Town Clock (UTC) to run Lambda functions on a schedule. Remember that Lambda cannot currently access resources within a VPC
Hopefully a better solution will come along soon.
Introducing Events in AWS Cloudwatch
You can schedule by minute, hourly, days or using CRON expression using console and without Lambda or any programming.
I just scheduled my ASP.net WEB API(HTTP Post) using SNS HTTP endpoint to execute every minute and it's working perfectly.
Is there some service that Amazon (AWS) offers that can run a reoccurring job at scheduled intervals?
This is one of a few single points of failure that people (including me) keep mentioning when designing architectures with AWS. Until Amazon solves it with a service, here's a hack I've published which is actively used by some companies.
AWS Auto Scaling can run and terminate instances using a recurring schedule specified in the cron format.
http://docs.amazonwebservices.com/AutoScaling/latest/APIReference/API_PutScheduledUpdateGroupAction.html
You can have the instance automatically run a process on startup.
If you don't know how long the job will last, you can set things up so that your job terminates the instance when it has completed.
Here's an article I wrote that walks through exact commands needed to set this up:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
Starting a whole instance just to kick off a set of jobs seems a bit like overkill, but if it's a t1.micro, then it only costs a couple pennies.
That t1.micro doesn't have to do the actual work either. Your instance could inject messages into SQS or through SNS so that the other redundant servers pick up the tasks.
This a hosted third party site that can regularly call scheduled scripts on your domain.
This will not work if you need your script to run in the shell, and not as Apache.
Sounds like this might be useful to you:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
Task Runner is a task agent application that polls AWS Data Pipeline
for scheduled tasks and executes them on Amazon EC2 instances, Amazon
EMR clusters, or other computational resources, reporting status as it
does so. Depending on your application, you may choose to:
Allow AWS Data Pipeline to install and manage one or more Task Runner
applications for you on computational resources that it manages
automatically. In this case, you do not need to install or configure
Task Runner as described in this section. This is the recommended
configuration.
Manually install and configure Task Runner on a computational resource
such as a long-running EC2 instance or a physical server. To do so,
use the procedures in this section.
Develop and install a custom task agent instead of Task Runner. The
procedures for doing so will depend on the implementation of the
custom task agent.
Amazon has introducted Lambda last year for NodeJS, yesterday Amazon added the features Scheduled Functions, VPC Support, and Python Support.
By leveraging Scheduled Function - a proper replacement for CRON can be attained.
More Info - http://aws.amazon.com/lambda/details/
As of August 2020, Amazon has moved the Lambda/CloudWatch events to a service called EventBridge (https://aws.amazon.com/eventbridge/). It was launched in July 2019, after most of the answers to this question.
Looks like this is a relatively new option from AWS BeanStalk:
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-periodictasks
Basically, they act like regular SQS receivers, but they're called on a cron schedule instead of in response to a SQS message.
SWF is a Web service from AWS that can be used to schedule tasks. Most of the work goes into specifying what a task and a schedule is.
http://milindparikh.blogspot.com/2015/07/introducing-diksha-aws-lambda-function.html is a scalable scheduler written against SWF.
CloudWatch Events are great, but there is a limit on their number. If you need a scale and willing to sacrifice the precision you could use DynamoDB's TTL as a timer.
The idea is to put items into a DynamoDB table with a TTL set to the time you need to run a task. DynamoDB will delete those items somewhere around the specified time (within 48 hours of expiration). Those deleted items will appear in the DynamoDB stream, associated with a table. A lambda function could listen the stream and take appropriate actions upon the deletions.
Read more in "DynamoDB TTL as an ad-hoc scheduling mechanism" by theburningmonk.com.
The AWS Elastic Load Balancers will ping your instances to check that they're healthy. You can add your cron-like tasks to the script that the ELB is pinging, and it will execute very regularly.
You'd want to add some logic so that each tasks is executed the right amount of times and at the right interval, but this could be accomplished with a database table that tracks executions. Each time the ELB pings your server, your server would check the database to see if any job is pending, and then execute that job.
The ELB will timeout if the script takes too long to execute, so it's important to not create a situation where your ELB health check will take many seconds to process the cron tasks. To overcome this, you can employ the AWS Simple Notification Service. Your ELB health check script can simply publish a message to an SNS topic, and then that topic can deliver the message via an HTTP request to your web server.
In other words:
ELB pings your EC2 instance...
EC2 instance checks for pending jobs and sends a message to SNS if any are found...
SNS notifies your app via HTTP...
The HTTP call from SNS is what actually processes the cron job