What would be the best way to run a Python script on the first of every month?
My situation: I want some data sent to a HipChat room, using the Python API, on the first of every month from AWS. The data I want to send is in a text file in an S3 bucket.
If your script can execute within Lambda's 15-minute limit, you could do this by creating a Python Lambda function and running it monthly via Lambda scheduled tasks. Running this once a month would stay well within the Lambda free tier, so your costs would be almost nothing.
If your script takes longer than 15 minutes to execute, then you would probably need to schedule it as a cron job on an EC2 instance.
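For illustration, here is a minimal sketch of such a Lambda handler. The bucket name, object key, room name and token are hypothetical, and it assumes HipChat's v2 room-notification endpoint:

# Sketch: read a text file from S3 and post it to a HipChat room.
# Bucket, key, room and token below are hypothetical placeholders.
import json
import urllib.request

import boto3

S3_BUCKET = "my-report-bucket"
S3_KEY = "monthly/report.txt"
HIPCHAT_URL = "https://api.hipchat.com/v2/room/MyRoom/notification"  # assumed v2 endpoint
HIPCHAT_TOKEN = "your-room-notification-token"

def lambda_handler(event, context):
    # Fetch the text file from S3.
    obj = boto3.client("s3").get_object(Bucket=S3_BUCKET, Key=S3_KEY)
    message = obj["Body"].read().decode("utf-8")

    # Post it to the room as a plain-text notification.
    req = urllib.request.Request(
        HIPCHAT_URL,
        data=json.dumps({"message": message, "message_format": "text"}).encode("utf-8"),
        headers={"Authorization": "Bearer " + HIPCHAT_TOKEN,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return {"status": resp.status}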
Create a Lambda function and use CloudWatch ==> Events ==> Rules to configure a schedule, using either:
1. AWS built-in rate expressions
2. Cron expressions
In your case a cron expression is the better option.
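For the first of every month at midnight UTC, the schedule expression is cron(0 0 1 * ? *). A minimal sketch of wiring the rule to the function with boto3 (rule name and function ARN are hypothetical):

# Sketch: create a CloudWatch Events rule that fires at 00:00 UTC on the
# first of every month, and point it at the Lambda function.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="monthly-report",                    # hypothetical rule name
    ScheduleExpression="cron(0 0 1 * ? *)",   # minute hour day-of-month month day-of-week year
)
events.put_targets(
    Rule="monthly-report",
    Targets=[{"Id": "monthly-report-lambda",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:monthly-report"}],
)

The function also needs a resource-based permission (lambda add-permission) allowing events.amazonaws.com to invoke it.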
Usually, we store our code on GitHub and then deploy it on AWS Lambda.
We are now challenged with a specific Node.js script:
- It takes roughly an hour to run, so we can't deploy it on Lambda.
- It needs to run just once a month.
- Once in a while we'll update the script in our GitHub repository, and we want the script on AWS to stay in sync when we make changes (e.g. using a pipeline).
- The script copies files from S3 and processes them locally. It does some heavy lifting with data.
What would be the recommended way to set this up on AWS?
The serverless approach fits nicely since you will run the work only once per month. Data transfer between Lambda and S3 (in the same region) is free. If Lambda suits your use case except for the execution time limit, and you can track the progress of the processing, you can create a Step Functions state machine that invokes your Lambda in a loop until all S3 data chunks have been processed. Each Lambda execution can take up to 15 minutes, and a state machine execution can run far longer than 1 hour. Regarding ops, you can have a trigger on your GitHub repository that publishes a new version of the Lambda; you can use AWS CloudFormation, CDK or any other suitable tool for that.
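As a sketch of what each loop iteration could look like (in Python for brevity; the same shape applies to Node.js), here is a chunked Lambda that a Choice state can re-invoke until done is true. The bucket, chunk size and per-object work are hypothetical:

# Sketch: process S3 objects in chunks so a Step Functions Choice state can
# re-invoke this Lambda until everything is done.
import boto3

s3 = boto3.client("s3")
CHUNK_SIZE = 100  # objects handled per invocation

def process(key):
    # Placeholder for the real heavy lifting on one object.
    print("processing", key)

def lambda_handler(event, context):
    # Resume from the continuation token returned by the previous iteration.
    kwargs = {"Bucket": "my-data-bucket", "MaxKeys": CHUNK_SIZE}
    if event.get("next_token"):
        kwargs["ContinuationToken"] = event["next_token"]

    page = s3.list_objects_v2(**kwargs)
    for obj in page.get("Contents", []):
        process(obj["Key"])

    # The state machine's Choice state loops back here while done is false.
    return {"done": not page.get("IsTruncated", False),
            "next_token": page.get("NextContinuationToken")}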
I am new to the AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using Terraform to create S3, Redshift and other supporting functionality. For loading the data I am using a Lambda function which gets triggered when the Redshift cluster is up. The Lambda function has the code to copy the data from S3 to Redshift. Currently the process seems to work fine, and the amount of data is currently low.
My question is:
This approach seems to work right now, but I don't know how it will behave once the volume of data increases, and what happens if the Lambda function times out.
Can someone please suggest an alternate way of handling this scenario, even if it can be handled without Lambda? One alternative I came across while searching this topic is AWS Data Pipeline.
Thank you
A serverless approach I've recommended clients move to in this case is the Redshift Data API (plus Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion, and if this is all you need to do then you're done.
If you need to take additional actions after the COPY, then you need a polling Lambda that checks when the COPY completes; this is enabled by the Redshift Data API. Once the COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
- launches the first Lambda (initiates the COPY)
- has a wait loop that calls the "status checker" Lambda every 30 seconds (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
- launches the additional-actions Lambda once the status checker says the COPY is complete
The Step Function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and Step Function as one unit.
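A minimal sketch of the two Lambdas, using boto3's redshift-data client; the cluster, database, table and S3 path are hypothetical:

# Sketch: start a COPY via the Redshift Data API and poll its status.
import boto3

rsd = boto3.client("redshift-data")

def start_copy(event, context):
    # Fire-and-forget: the COPY keeps running after this Lambda returns.
    resp = rsd.execute_statement(
        ClusterIdentifier="my-cluster",   # hypothetical
        Database="dev",
        DbUser="etl_user",
        Sql="COPY my_table FROM 's3://my-bucket/prefix/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' CSV;",
    )
    return {"statement_id": resp["Id"]}

def check_copy(event, context):
    # The Step Function's wait loop calls this until the status is FINISHED
    # (or FAILED/ABORTED, which the state machine should handle).
    status = rsd.describe_statement(Id=event["statement_id"])["Status"]
    return {"statement_id": event["statement_id"], "status": status}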
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement an alternative solution in the meantime.
I wouldn't recommend Data Pipeline as it might be overkill (it will start an EC2 instance to run your commands). Your problem is simply the timeout, so you could use either ECS Fargate or a Glue Python Shell job. Either of them can be triggered by a CloudWatch Event rule on an S3 event.
a. Using ECS Fargate, you'll have to take care of the Docker image and set up the ECS infrastructure, i.e. the task definition and cluster (simple for Fargate).
b. Using a Glue Python Shell job, you'll simply have to deploy your Python script to S3 (along with the required packages as wheel files) and link those files in the job configuration.
Both of these options are serverless, and you may choose one based on ease of deployment and your comfort level with Docker.
ECS doesn't have any timeout limit, while the timeout limit for Glue is 2 days.
Note: to trigger an AWS Glue job from a CloudWatch Event, you'll have to use a Lambda function, as CloudWatch Events doesn't support starting a Glue job as a target yet.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
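For reference, a sketch of that small trigger Lambda; the job name is hypothetical:

# Sketch: the Lambda that a CloudWatch Event invokes to start the Glue job.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Kick off the Glue Python Shell job; Glue enforces its own timeout.
    run = glue.start_job_run(JobName="s3-to-redshift-load")  # hypothetical name
    return {"job_run_id": run["JobRunId"]}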
I have a Python script which copies files from one S3 bucket to another S3 bucket. This script needs to run every Sunday at a specific time. I was reading some articles and answers, so I tried to use AWS Lambda + CloudWatch Events. The script runs for a minimum of 30 minutes, so would Lambda still be a good fit, given that Lambda can run for a maximum of 15 minutes? Or is there another way? I could create an EC2 box and run it as a cron job, but that would be expensive. Is there any other standard way?
A more appropriate way would be to use an AWS Glue Python Shell job, as it falls under the serverless umbrella and you're charged as you go.
This way you will only be charged for the time your code runs.
Also, you don't need to manage an EC2 instance for this. It is like an extended Lambda.
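As a sketch, the copy script itself could be as simple as this; the bucket names and prefix are hypothetical, and boto3 is available in the Glue Python Shell runtime:

# Sketch: server-side copy of all objects under a prefix between buckets.
import boto3

s3 = boto3.client("s3")
SRC, DST, PREFIX = "source-bucket", "destination-bucket", "exports/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # copy() does a managed server-side copy: data never leaves S3.
        s3.copy({"Bucket": SRC, "Key": obj["Key"]}, DST, obj["Key"])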
If the two buckets are supposed to stay in sync, i.e. all files from bucket #1 should eventually be synced to bucket #2, then there are various replication options in S3.
Otherwise look at S3 Batch Operations. You can derive the list of files that you need to copy from S3 Inventory which will give you additional context on the files, such as date/time uploaded, size, storage class etc.
Unfortunately, the Lambda 15-minute execution limit is a hard stop, so Lambda is not suitable for this use case as a single big-bang run.
You could use multiple Lambda calls to go through the objects one at a time and move them. However, you would need a DynamoDB table (or something similar) to keep track of what has been moved and what has not.
Another couple of options would be:
- S3 Replication, which will keep one bucket in sync with the other
- An S3 Batch Operations job
Or, if they are data files, you can always use AWS Glue.
You can certainly use Amazon EC2 for a long-running batch job.
A t3.micro Linux instance costs $0.0104 per hour, and a t3.nano is half that price, billed per second.
Just add a command at the end of the User Data script that will shut down the instance:
sudo shutdown -h now
If you launch the instance with Shutdown Behavior = Terminate, then the instance will self-terminate.
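A sketch of launching such a self-terminating instance with boto3; the AMI ID and script location are hypothetical:

# Sketch: launch a t3.micro that runs the batch job and then terminates itself.
import boto3

user_data = """#!/bin/bash
aws s3 cp s3://my-bucket/job.py /tmp/job.py
python3 /tmp/job.py
shutdown -h now
"""

boto3.client("ec2").run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical Amazon Linux AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    # With this set, the shutdown at the end of the user data script
    # terminates the instance instead of merely stopping it.
    InstanceInitiatedShutdownBehavior="terminate",
)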
I am writing a server with a serverless model, currently AWS Lambda, and I have a requirement to run a job at an exact datetime.
Currently I run a cron job with CloudWatch that executes my server every minute, finds all tasks whose timestamp is older than the present, then performs those tasks. This is both wasteful and sometimes runs up to a minute early or late relative to the actual time needed (because CloudWatch's maximum frequency is one trigger per minute). Not a desirable approach.
The work is also not the same every day: clients can ping my server with a dynamic datetime.
I wish there were a service like a message queue that could actively call a target URL at a scheduled timestamp. Is there something like that? It could be any service outside AWS if it can take a URL to request.
Thank you very much
Have you considered getting a small EC2 instance and then setting up cron jobs there? It can then publish events to SNS or directly call the required tasks. And you should be able to schedule new jobs dynamically as well.
You can use DynamoDB with TTL, DynamoDB Streams and AWS Lambda for this.
Since the schedule is dynamic and coming from the user, you can save those items in a DynamoDB table with its TTL set to the scheduled execution time.
When the TTL is reached for an item, DynamoDB deletes it and emits a record to the table's stream, which you can then use to trigger a Lambda function.
References:
DynamoDB Streams and Time To Live
DynamoDB Streams and AWS Lambda Triggers
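A minimal sketch of both sides, assuming a table named scheduled-tasks whose TTL attribute is expires_at (both names hypothetical). One caveat: TTL deletions are approximate rather than exact, and can lag the expiry time, which matters for this use case:

# Sketch: save a task with its TTL set to the scheduled time, and handle the
# stream record that DynamoDB emits when it expires the item.
import boto3

table = boto3.resource("dynamodb").Table("scheduled-tasks")

def schedule_task(task_id, run_at_epoch):
    # expires_at must be the table's configured TTL attribute (epoch seconds).
    table.put_item(Item={"task_id": task_id, "expires_at": run_at_epoch})

def lambda_handler(event, context):
    # Triggered by the table's stream; TTL deletions arrive as REMOVE records.
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            task_id = record["dynamodb"]["Keys"]["task_id"]["S"]
            print("running task", task_id)  # do the actual work here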
As a workaround, why not have the Lambda wake on a CloudWatch alert, then check for tasks every 5 seconds until 55 seconds have elapsed?
You likely already found a solution to this, but my service https://posthook.io may be a good fit for your use case. It lets you schedule 'hooks' with an API call like this:
curl https://api.posthook.io/v1/hooks \
  -H 'X-API-Key: ${POSTHOOK_API_KEY}' \
  -H 'Content-Type: application/json' \
  -d '{
    "path": "/webhooks/ph/event_reminder",
    "postAt": "2018-07-03T01:11:55Z",
    "data": {
      "eventID": 25
    }
  }'
Then from your Lambda function you can either use the data you passed in, or use the hook's unique ID to look something up in your database, and do the needed work. A free account allows you to schedule 500 of these requests a month.
The other solutions seem promising, but here is another one I found: using the Step Functions Wait state.
http://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
I cannot use it in my region yet, because my region is Singapore and it cannot be used across regions. For now I will try the DynamoDB solution above.
Update: as of 2018, Step Functions is generally available there and works as expected.
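For reference, a sketch of the wait-state approach: the Wait state reads the scheduled timestamp from the execution input, then the task runs. Names and ARNs are hypothetical:

# Sketch: create a state machine that waits until the timestamp passed in the
# execution input (e.g. {"run_at": "2018-07-03T01:11:55Z"}), then invokes the
# task Lambda.
import json

import boto3

definition = {
    "StartAt": "WaitUntil",
    "States": {
        "WaitUntil": {
            "Type": "Wait",
            "TimestampPath": "$.run_at",  # scheduled time from the input
            "Next": "RunTask",
        },
        "RunTask": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-southeast-1:123456789012:function:run-task",
            "End": True,
        },
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="scheduled-task",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",
)

Each client-requested schedule then becomes one start_execution call with its own run_at timestamp.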
Also as of 2018 there is Azure Logic Apps, an equivalent service to AWS Step Functions on Azure. It contains a delay connector that can schedule a delay until a given time:
https://learn.microsoft.com/en-us/azure/connectors/connectors-native-delay
In a specific RDS column I keep, as a date, the time when each user's trial ends.
I am going to check these dates in the database every day, and when only a few days are left until the end of a trial, I want to send an email message (with SES).
How can I run periodic tasks in AWS to check the database? I know that I can use:
Lambda
EC2 (or Elastic Beanstalk)
Is there any other solution which I've missed?
You can also use AWS Batch for this. It is a better fit if the job is heavy and takes more time to complete.
How long does it take to run your check? If it completes within Lambda's limits (see AWS Lambda Limits; the maximum execution time is now 15 minutes), then schedule it with Lambda: Schedule Expressions Using Rate or Cron.
Otherwise, the best option is AWS Data Pipeline. It is very easy to schedule and run your custom script periodically, but note that it bills a minimum of one instance-hour per run.
Go with Lambda here.
You can create a Lambda function and direct AWS Lambda to execute it on a regular schedule. You can specify a fixed rate (for example, execute a Lambda function every hour or every 15 minutes), or you can specify a cron expression.
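A minimal sketch of the scheduled check, assuming a MySQL RDS instance and the pymysql package bundled with the function; the host, credentials, table and addresses are hypothetical:

# Sketch: query trials ending within 3 days and email each user via SES.
import boto3
import pymysql  # not in the Lambda runtime; package it with the function

ses = boto3.client("ses")

def lambda_handler(event, context):
    conn = pymysql.connect(host="mydb.example.rds.amazonaws.com",
                           user="app", password="secret", database="app")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT email FROM users "
                        "WHERE trial_ends_at BETWEEN NOW() AND NOW() + INTERVAL 3 DAY")
            for (email,) in cur.fetchall():
                ses.send_email(
                    Source="noreply@example.com",
                    Destination={"ToAddresses": [email]},
                    Message={"Subject": {"Data": "Your trial is ending soon"},
                             "Body": {"Text": {"Data": "Your trial ends in a few days."}}},
                )
    finally:
        conn.close()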