Python pipeline on AWS Cloud

I have a few Python scripts that need to be executed in sequence on AWS, so what are the best and simplest options? These scripts are proof-of-concept, so they're a bit dirty, but they need to run overnight. Most of the scripts finish within 10 minutes, but a couple of them can take up to an hour running on a single core.
We don't have any servers such as Jenkins or Airflow; we are planning to use existing AWS services.
Please let me know. Thanks.

1) EC2 Instance (Manually controlled)
- Upload your scripts to an S3 bucket
- Use the default VPC
- Launch an EC2 instance
- Use an SSM remote session to log in
- Run the AWS CLI (aws s3 sync to download the scripts from S3)
- Run them manually
- Stop the instance when done
To keep it clean, make a shell script (or a master .py file) to do the work. If you want it to stop charging you money afterwards, add a command to stop the instance when complete.
This is the least amount of work.
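As a concrete illustration of the master .py approach, here is a minimal sketch. The script names are hypothetical; it assumes boto3 is installed, the instance role allows ec2:StopInstances, and IMDSv1 is enabled for the metadata lookup (IMDSv2-only instances need a session token first):

```python
# run_all.py -- minimal sketch of a master script for option 1.
import subprocess
import sys
import urllib.request

import boto3

SCRIPTS = ["step1.py", "step2.py", "step3.py"]  # hypothetical names

for script in SCRIPTS:
    # Run each script in sequence; abort the pipeline if one fails.
    result = subprocess.run([sys.executable, script])
    if result.returncode != 0:
        sys.exit(f"{script} failed with exit code {result.returncode}")

# Stop this instance when done so it stops charging you. The metadata
# endpoint returns the current instance's own ID.
with urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
) as resp:
    instance_id = resp.read().decode()

boto3.client("ec2").stop_instances(InstanceIds=[instance_id])
```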
2) If you want to run scripts daily
- Script out the work above (including modifying the Auto Scaling group at the end so it scales back down to zero boxes)
- Create an EC2 Auto Scaling group and launch it on a cron schedule
It will start up, do the work, and then shut down and stop charging you.
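A sketch of the scheduling piece with boto3; the group name and times are assumptions, and the script itself can scale the group back down at the end of the run via set_desired_capacity:

```python
# Sketch: scale an existing Auto Scaling group ("nightly-batch" is a
# hypothetical name) up on a cron schedule and back down afterwards.
# Recurrence expressions are standard 5-field cron, evaluated in UTC.
import boto3

autoscaling = boto3.client("autoscaling")

# Scale up to one box at 02:00 UTC every day...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="nightly-batch",
    ScheduledActionName="scale-up-nightly",
    Recurrence="0 2 * * *",
    MinSize=1, MaxSize=1, DesiredCapacity=1,
)

# ...and back down to zero a few hours later as a safety net.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="nightly-batch",
    ScheduledActionName="scale-down-nightly",
    Recurrence="0 6 * * *",
    MinSize=0, MaxSize=1, DesiredCapacity=0,
)
```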
3) Lambda
Pretty much like option 2, but AWS will do most of the work for you.
Either put all your scripts into one Lambda, or put each script into its own Lambda and have a master Lambda that does a sync invoke of each script in the order you want.
Have a CloudWatch Events rule trigger the master daily to do the work.
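A minimal sketch of the master Lambda; the per-script function names are placeholders:

```python
# Sketch of a "master" Lambda that invokes one Lambda per script,
# synchronously and in order. Function names are hypothetical.
import json

import boto3

lambda_client = boto3.client("lambda")
PIPELINE = ["script-one", "script-two", "script-three"]

def handler(event, context):
    for function_name in PIPELINE:
        # RequestResponse = synchronous invoke: wait for completion
        # before moving on to the next script.
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="RequestResponse",
            Payload=json.dumps(event),
        )
        # An unhandled error in the invoked function sets FunctionError.
        if response.get("FunctionError"):
            raise RuntimeError(
                f"{function_name} failed: {response['Payload'].read().decode()}"
            )
```

One caveat: a single Lambda invocation is capped at 15 minutes, so the scripts that run for up to an hour would not fit in a Lambda without being split up.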
I would say that if you are in POC mode, option 1 is the best decision. It is likely closest to how you are currently executing the scripts. This is what @jarmod recommended already.

You didn't mention which AWS resources your Python scripts need to access, or even the purpose of the scripts, so it is difficult to recommend a specific solution.
However, a good option is AWS Batch.
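Batch's job dependencies map naturally onto "run these scripts in order": each job starts only after the previous one succeeds. A sketch with hypothetical queue and job-definition names:

```python
# Sketch: submit the scripts as AWS Batch jobs chained via dependencies.
# The job queue and job definition are assumed to exist already.
import boto3

batch = boto3.client("batch")
previous_job_id = None

for name in ["step1", "step2", "step3"]:  # hypothetical script names
    kwargs = {
        "jobName": name,
        "jobQueue": "overnight-queue",          # hypothetical
        "jobDefinition": "python-script-job",   # hypothetical
        "containerOverrides": {"command": ["python", f"{name}.py"]},
    }
    if previous_job_id:
        # Each job waits for the previous job to finish successfully.
        kwargs["dependsOn"] = [{"jobId": previous_job_id}]
    previous_job_id = batch.submit_job(**kwargs)["jobId"]
```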

Related

How to run cron job only on single instance in AWS AutoScaling?

I have scheduled two cron jobs for my application.
My application server is in an Auto Scaling group, and I keep a minimum of two instances for high availability. Everything works fine, but the cron jobs run multiple times because of the two instances in the group.
I cannot limit the group to a single instance because the application is already in production and I prefer to keep it highly available.
How can I limit a cron job to execute on a single instance? Or should I use other services like AWS Lambda or AWS Elastic Beanstalk?
Firstly, you should consider whether running the crons on these instances is suitable at all. If you're trying to keep the group highly available and customers interact with it directly, what will the performance impact of the crons be?
Perhaps consider using a separate Auto Scaling group or instance with a total of one instance to run these crons. You could launch the instance or update the Auto Scaling group just before the cron needs to run, and then automate the shutdown after it has completed.
Otherwise, you would need a locking mechanism for your script. At the beginning of a run, the script checks whether a lock is already in place; if not, it writes a lock to confirm that it is in progress. To further reduce the chance of a collision between servers, consider adding jitter (a random number of seconds of sleep) to the start of your script. See the DynamoDB sketch after the list below.
Suitable technologies for writing a lock include:
- DynamoDB, using strongly consistent reads
- EFS for a Linux application, or FSx for a Windows application
- S3, using strong consistency
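As a sketch of the DynamoDB variant: the answer above mentions strongly consistent reads, and a conditional write is one common way to get the same guarantee atomically, since the write succeeds for exactly one caller. The table name and key schema here are assumptions.

```python
# Sketch: a DynamoDB conditional write as a run lock, so only one
# instance in the Auto Scaling group executes the cron job.
# Table "cron-locks" with hash key "lock_id" is a hypothetical schema.
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def acquire_lock(lock_id: str, ttl_seconds: int = 3600) -> bool:
    try:
        dynamodb.put_item(
            TableName="cron-locks",
            Item={
                "lock_id": {"S": lock_id},
                "expires_at": {"N": str(int(time.time()) + ttl_seconds)},
            },
            # The write succeeds only if no item with this key exists,
            # which makes lock acquisition atomic across instances.
            ConditionExpression="attribute_not_exists(lock_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another instance already holds the lock
        raise

if acquire_lock("nightly-job-2024-01-01"):  # e.g. key on job name + date
    print("lock acquired, running the job")  # actual cron work goes here
```

Keying the lock on the job name plus the date means it never needs to be released, and a DynamoDB TTL on expires_at can clean up old items.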
The solutions suggested by Chris Williams sound reasonable if using a Lambda function is not an option.
One way to simulate a cron job is to use CloudWatch Events (now known as EventBridge) in conjunction with AWS Lambda.
First, you write a Lambda function with the code that needs to be executed on a schedule.
You then create a schedule expression with EventBridge/CloudWatch Events, which supports cron syntax, in the same way as a crontab entry, and set the Lambda function as the target.
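A minimal sketch of that wiring in boto3; the rule name, function name, and ARN are placeholders:

```python
# Sketch: create a scheduled EventBridge/CloudWatch Events rule that
# triggers a Lambda function daily. Names and the ARN are hypothetical.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:nightly-job"

# cron(minutes hours day-of-month month day-of-week year), in UTC.
rule_arn = events.put_rule(
    Name="nightly-job-schedule",
    ScheduleExpression="cron(0 2 * * ? *)",  # 02:00 UTC daily
)["RuleArn"]

# Point the rule at the Lambda function...
events.put_targets(
    Rule="nightly-job-schedule",
    Targets=[{"Id": "nightly-job", "Arn": FUNCTION_ARN}],
)

# ...and allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName="nightly-job",
    StatementId="allow-eventbridge-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)
```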
You can enable termination protection on one of the instances. Attach the necessary role and permissions for Systems Manager; once the instance shows up under Managed Instances in Systems Manager, you can create a scheduled event in CloudWatch to run SSM documents. If you are running a bash script, convert it to an SSM document and set that document as the target, or use the AWS-RunShellScript document to run the commands directly.

What is the most efficient way to run scheduled commands on multiple EC2 instances?

I'm currently working on an environment requirement where we need to push the same file out to multiple EC2 instances running Windows on a scheduled interval. As it stands now, I see a few options and have tried each:
Windows Task Scheduler: run a basic task on a set schedule invoking the S3 sync CLI tool.
Cons I can see here: setting up the task on each EC2 instance (there are many).
Lambda: a scheduled Lambda job that utilizes SSM to run commands on each server in a resource group.
Cons: introduces another layer required to execute this task.
Run Command: using an AWS-RunRemoteScript document, run the script (stored in an S3 bucket) on target instances.
Cons: I'm not positive you can automate these commands on a schedule without adding another layer.
What is the most scalable path forward? Thanks in advance for your help.
Using the Run Command feature of AWS Systems Manager, together with either the Maintenance Windows feature of Systems Manager or CloudWatch Events to schedule the execution of Run Command, should work well here.
If you also tag instances appropriately, you can use the tag targeting feature of Run Command to ensure that all instances run the command (including new instances launched in the future as long as they are tagged).
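As an ad-hoc illustration of tag-targeted Run Command via boto3 (the same call can be registered as a Maintenance Window task): the tag key/value and the S3 path are assumptions, while AWS-RunPowerShellScript is a standard SSM document for Windows.

```python
# Sketch: run a command on every instance carrying a given tag via
# SSM Run Command. Tag key/value and the S3 path are hypothetical.
import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    # Tag targeting covers current AND future instances with this tag.
    Targets=[{"Key": "tag:FileSync", "Values": ["enabled"]}],
    DocumentName="AWS-RunPowerShellScript",
    Parameters={
        "commands": ["aws s3 sync s3://my-bucket/shared-files/ C:\\shared\\"]
    },
)
print(response["Command"]["CommandId"])  # track with list_command_invocations
```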
/Mats

What is the easiest way to launch parallel jobs on AWS?

My use case is as follows:
I have a python script which:
1. reads a file from S3
2. processes the file and outputs a new file
3. saves the output file to S3 (or maybe a database)
The python script has some dependencies which are managed via virtualenv.
What is the recommended/easiest way of running these scripts in parallel on AWS?
I see the following options:
AWS Batch: looks really complicated; I have to build my own Docker container, set up 3 different users, and it's not easy to debug.
AWS Lambda: a bit easier to set up, but I still have to wrap my script up into a Lambda function, and debugging doesn't seem too straightforward.
Slurm on manually spun-up EC2 instances: from a user perspective this is ideal; all I would have to do is create a jobs.sbatch file that loads the virtualenv and runs the script. The main drawback is that I have to install and configure Slurm.
What is the recommended way of handling this workflow?
Lambda will be suitable for you because you won't have to worry about scaling or setting everything up yourself. As for debugging, you can do it easily using sls wsgi serve.
You can use a publish/subscribe-style mechanism with an SQS queue containing the object keys to work on. Then you can have a group of EC2 instances or ECS tasks, each polling the queue and performing a single operation; the queue ensures each item is processed by a single worker. You can also create an Auto Scaling group for ECS and change the number of machines to tune the performance/cost trade-off.
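A minimal sketch of one such worker, with a hypothetical queue URL and a stubbed-out processing step:

```python
# Sketch: each EC2/ECS worker polls the queue for an S3 object key,
# processes it, and deletes the message only on success.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical

def process(key: str) -> None:
    ...  # read the S3 object, transform it, write the output file

while True:
    # Long polling (WaitTimeSeconds) avoids hammering the API.
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    ).get("Messages", [])
    for message in messages:
        process(message["Body"])  # the body carries the S3 object key
        # Deleting only after success means failed work becomes visible
        # again after the visibility timeout and is retried elsewhere.
        sqs.delete_message(
            QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
        )
```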

On AWS, run an AWS CLI command daily

I have an AWS CLI invocation (in this case, to launch a configured EMR cluster, run some steps, and then shut down), but I'm not sure how to go about running it daily.
I guess one way to do it is an EC2 micro instance running a cron job, or an ECS task in a micro instance that launches the command, but that all seems like it might be overkill. It looks like there's also a way to do it in Lambda, but from what I can tell it'd be kludgy.
This doesn't have to be a good long-term solution; something that's suitable until I can do it right (Data Pipelines) would work just fine.
Suggestions?
If it is not a strict requirement to use the AWS CLI, you can use one of the AWS SDKs instead to do this programmatically:
- Schedule a CloudWatch Events rule using a cron expression
- When triggered, the rule invokes a Lambda function
- Implement the Lambda function so it calls EMR using one of the supported SDKs (e.g. the EMR class in the AWS JavaScript SDK)
- Make sure that you have the IAM configuration in place
A full example is available in Schedule AWS Lambda Functions Using CloudWatch Events.
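For reference, a sketch of what such a function body could look like; this uses Python/boto3 rather than the JavaScript SDK mentioned above, and every cluster setting shown is a placeholder for whatever the existing CLI invocation passes:

```python
# Sketch of a Lambda handler that launches an EMR cluster. All cluster
# settings are placeholders; mirror your current CLI arguments here.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    response = emr.run_job_flow(
        Name="nightly-cluster",       # hypothetical
        ReleaseLabel="emr-6.15.0",    # hypothetical release
        Instances={
            "InstanceGroups": [{
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }],
            # Terminate the cluster once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[],  # the cluster's steps go here
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```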
Kludgy? Yes, some configuration is needed. However, if you take into account the amount of work required to launch EC2/ECS (and make sure it re-launches in the event of failure), I'd say it evens out.
Not sure about the whole task you are doing, but to avoid doing it manually, and to avoid setting up more AWS resources (as you mentioned), I would create a simple job in a Continuous Integration (CI) server like Jenkins, Bamboo, or CircleCI (the list can go on). I would assume you might already have a CI server running, so why not use it?

Automate AWS instance start and stop

I'm running an instance in AWS and it runs non-stop every day. It's an Ubuntu EC2 instance running Apache, the Mirth Connect tool, and a LAMP server. I want to run this instance only during a particular time window each day, and I'd prefer not to use any additional AWS services such as CloudWatch. Is there a way to achieve this?
The main purpose is to use Mirth Connect to fetch data from a MySQL database.
There are three solutions:
- AWS Data Pipeline: you can schedule the instance start/stop just like cron. It will cost you one hour of a t1.micro instance for every start/stop.
- AWS Lambda: define a Lambda function that gets triggered at a predefined time. Your Lambda function can start/stop instances, and your cost will be minimal or $0.
- Write a shell script and run it as a cron job or on demand. The script will use AWS CLI commands to start and stop the instance.
I used Data Pipeline for a long time before moving to Lambda. Data Pipeline is trivial: just paste the AWS CLI commands to stop and start instances. Lambda is more involved.
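As a sketch of the Lambda option: one function that starts or stops a fixed instance depending on the event it receives, with two scheduled CloudWatch Events rules (one passing "start", one passing "stop") invoking it daily. The instance ID and the event shape are assumptions.

```python
# Sketch: a single Lambda that starts or stops one instance based on
# the event payload. Instance ID and event format are hypothetical.
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical

def handler(event, context):
    # e.g. the "start" rule passes {"action": "start"},
    # and the "stop" rule passes {"action": "stop"}.
    if event.get("action") == "start":
        ec2.start_instances(InstanceIds=[INSTANCE_ID])
    else:
        ec2.stop_instances(InstanceIds=[INSTANCE_ID])
```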
I guess for that you'll need another machine that is on 24x7, on which you can write a cron job in Python using boto, or in any other language like bash.
I don't see how you could start a stopped instance without using some other machine.
Or you could have a simple Raspberry Pi at home that does the on/off work for you using the AWS CLI or simple Python. How about that? ;)