AWS Batch vs AWS CodeBuild - amazon-web-services

I am very new to AWS, and when I was looking for a way to download code from GitHub (a Python project), run it, and save the output to S3, the first service I found was CodeBuild.
So I implemented this kind of workflow using CodeBuild.
But now I have seen that AWS has a service called AWS Batch, and I am wondering whether I should migrate my architecture to AWS Batch.
Can you explain which one - AWS CodeBuild or AWS Batch - is more suitable for my case? When should I use AWS Batch instead of AWS CodeBuild?
Thank you very much.

TL;DR: AWS CodeBuild is the nicer choice for simple jobs.
My (reverse) experience...
I needed to run a simple job that pulls data from an external API, reads from and writes to an external database, and generates a CSV report.
The job takes ~1 hour to run, so AWS Lambda is out of the picture.
After some googling, I found AWS Batch and decided to give Creating a Simple “Fetch & Run” AWS Batch Job a try.
The steps required to get this "simple" job working:
Build a Docker image with the fetch & run script
Create an Amazon ECR repository for the image
Push the built image to ECR
Create a simple job script and upload it to S3
Create an IAM role to be used by jobs to access S3
Configure a compute environment
Create a job queue
Create a job definition that uses the built image
Submit and run a job that executes the job script from S3 (a minimal sketch of this last step follows below)
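For reference, a hedged sketch of that submission step with boto3; the queue, job definition, and environment variable names follow the "fetch & run" walkthrough and are placeholders here:

```python
# Sketch: submit the fetch-and-run job. Queue, job definition, and the S3 URL
# follow the names used in the AWS walkthrough and are placeholders here.
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="fetch-and-run-example",
    jobQueue="first-run-job-queue",       # the job queue created earlier
    jobDefinition="fetch_and_run",        # job definition built from the pushed image
    containerOverrides={
        "command": ["myjob.sh", "60"],    # arguments handed to the entrypoint script
        "environment": [
            # the entrypoint script uses these to download the job script from S3
            {"name": "BATCH_FILE_TYPE", "value": "script"},
            {"name": "BATCH_FILE_S3_URL", "value": "s3://my-bucket/myjob.sh"},
        ],
    },
)
print("Submitted job:", response["jobId"])
```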
After spending the time to create all these resources, it did not work out of the box. I found myself debugging random things I shouldn't have had to debug, such as:
Dockerfile
entrypoint script
ECS cluster
EC2 instance and autoscaling group
After failing to find simple practical examples, and realizing the amount of effort required, I decided to explore other solutions.
I stumbled onto Using AWS CodeBuild to execute administrative tasks and this post.
I've used AWS CodeBuild in the past for CI/CD pipelines and thought, "what the hell, let's give it a try". In a shorter amount of time, and with less effort, I was able to get a "CodeBuild job" running on a CloudWatch scheduler (see the sketch after the list below), with CodeBuild Slack notifications added:
Connect build project to your source code
Select a runtime environment
Create IAM role
Create a buildspec.yml and add runtime commands
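Here is a minimal sketch of the scheduled-trigger piece mentioned above, using boto3; the project name, role ARN, and schedule are hypothetical, and the buildspec.yml itself just lists the shell commands to run:

```python
# Sketch: schedule the CodeBuild project with a CloudWatch Events (EventBridge)
# rule. Project name, role ARN, and schedule expression are hypothetical.
import boto3

events = boto3.client("events")

# Run the build every day at 02:00 UTC.
events.put_rule(
    Name="nightly-report-build",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Point the rule at the CodeBuild project; the role must be allowed to call
# codebuild:StartBuild on that project.
events.put_targets(
    Rule="nightly-report-build",
    Targets=[{
        "Id": "start-codebuild",
        "Arn": "arn:aws:codebuild:us-east-1:123456789012:project/my-report-job",
        "RoleArn": "arn:aws:iam::123456789012:role/events-start-codebuild",
    }],
)
```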
One major advantage is that CodeBuild runs tasks in a full-blown Linux environment.
Drawbacks:
Max execution time of 8 hours
AWS CodeBuild was much easier to get working for my simple job.
Sorry for the long post, just wanted to share my experience with these 2 services.

AWS Batch is used for highly parallel computations, e.g., processing a large number of images at the same time:
AWS Batch enables you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources, and AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software.
Thus it's not suited to what you are trying to use it for. CodeBuild is the better choice, based on your description.
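To illustrate the kind of workload Batch is aimed at, here is a hedged sketch (boto3, with hypothetical queue and job definition names) of an array job that fans out over many inputs in parallel:

```python
# Sketch of a Batch array job: one submission fans out into many parallel
# child jobs (e.g., one per image). Queue/definition names are hypothetical.
import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="process-images",
    jobQueue="my-job-queue",
    jobDefinition="image-processor",
    # 1000 child jobs run concurrently; each container reads its own
    # AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick its input.
    arrayProperties={"size": 1000},
)
```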

Related

Can I improve my setup in AWS for running (machine learning) python scripts in a container when a file is uploaded to S3?

I have a working setup in AWS that looks something like:
The point is that whenever a file is uploaded to S3, it triggers a Lambda, which in turn triggers a CodeBuild project. The CodeBuild project is based on a Docker image (stored in ECR) and needs to run a few bash commands, mainly executing Python files in the Docker image. That actually works really well.
The files in S3 are updated approximately once a day, and each CodeBuild execution takes around 4 minutes.
I was asked why I am not using Fargate/SageMaker (the scripts are basically machine learning retraining and predictions). I was just wondering whether there would be any advantages in using Fargate and/or SageMaker for this. Is it possible, for example, to use Fargate and execute bash commands inside the container when triggered?
IIUC, you're wondering about the difference between CodeBuild and Fargate/SageMaker.
Price
Calculate the price of these three products using the links below.
Pricing Fargate
Pricing SageMaker
Pricing CodeBuild
As you said, you're using the Docker image as the main training tool, so maybe Fargate is more suitable for your scenario.
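On the "can I execute bash commands inside the container when triggered" part: yes, by overriding the container command when the task is run. A minimal sketch, assuming hypothetical cluster, task definition, container, and subnet names, with the S3-triggered Lambda starting the Fargate task:

```python
# Sketch: the S3-triggered Lambda runs a Fargate task from the same ECR image
# and overrides the command with the bash/Python calls to execute.
# Cluster, task definition, container name, and subnet are hypothetical.
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    ecs.run_task(
        cluster="ml-cluster",
        launchType="FARGATE",
        taskDefinition="retrain-and-predict",   # task definition pointing at the ECR image
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [{
                "name": "retrain",              # container name in the task definition
                "command": ["bash", "-c", "python retrain.py && python predict.py"],
            }]
        },
    )
```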

AWS Batch vs Spring Batch

I have been planning to migrate my batch processing from Spring Batch to AWS Batch. Can someone give me reasons to choose AWS Batch over Spring Batch?
Whilst both of these will play a role in orchestrating your batch workloads, a key difference is that AWS Batch will also manage the infrastructure you need to run the jobs/pipeline. AWS Batch lets you tailor the underlying cloud instances, or specify a broad array of instance types that will work for you. And it lets you make trade-offs: you can task it with managing a bag of EC2 Spot instances for you (for example), and then ask it to optimize time-to-execution over price (or prefer price to speed).
(For full disclosure, I work for the engineering team that builds AWS Batch).
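As a concrete illustration of that trade-off, here is a hedged sketch (boto3, with hypothetical names, subnets, and role ARNs) of a managed compute environment backed by Spot instances, where Batch is left to pick the instance types:

```python
# Sketch: a managed Batch compute environment built from EC2 Spot capacity.
# Names, subnets, security groups, and role ARNs are hypothetical.
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="spot-compute-env",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",  # favour pools least likely to be interrupted
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],                     # let Batch pick from the M/C/R families
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```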
I believe both work at different levels. Spring Batch provides a framework that reduces the boilerplate code you need in order to write a batch job, e.g., saving the state of a job in the job repository to provide restartability.
AWS Batch, on the other hand, is an infrastructure-level service that manages the underlying infrastructure and sets some environment variables that help differentiate the master node from the worker nodes.
In my opinion, both can work together to build a full-fledged, cost-effective batch job at scale on the AWS cloud.
AWS Batch is a fully managed solution for batch processing.
It has built in:
Queues with priority options
A runtime, which can be self-managed or auto-managed
A job repository and Docker images for the job definitions
Monitoring, dashboards, and integration with other AWS services like SNS (and from SNS to wherever you want)
On the other hand, Spring Batch is a framework that still needs some of your effort to manage it all: employing a queue, scaling (which becomes your headache), monitoring, etc.
My take is: if your company or app is on AWS, go for AWS Batch; you will save months of time and scale to a million jobs per day in no time. If you are on-prem or in a private cloud, go for Spring Batch after some research.

AWS Batch vs AWS Step functions for Control M migration

The goal is to migrate our jobs from Control M to AWS, but before I do that I want to better understand the differences between AWS Batch and AWS Step Functions. From what I've understood, AWS Step Functions seems more encompassing, in that I can have one of my steps run AWS Batch.
Can you explain the difference between AWS Batch and AWS Step Functions? Which is better suited as a migration target from Control M? (Maybe this is a matter of preference.)
AWS Batch is a service to run offline workloads. With Batch, you can easily set up your offline workload using Docker, defining the set of instance types and how many instances will run the workload.
AWS Step Functions is a serverless workflow management service. It only gives you a way to connect other AWS services; you cannot run a script in Step Functions itself, and you only define the workflow and the input/output passed between other AWS services.
That said, you can use both services to migrate Control M to AWS, possibly along with other AWS services like Lambda (for minor workloads), SNS (for e-mail), and S3 (for storage).
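To make the division of labour concrete, here is a hedged sketch (boto3, with hypothetical names and ARNs) of a Step Functions state machine whose only job is to submit a Batch job and wait for it to finish:

```python
# Sketch: Step Functions defines the workflow; the actual work runs in Batch
# via the batch:submitJob.sync integration. All names/ARNs are hypothetical.
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RunNightlyJob",
    "States": {
        "RunNightlyJob": {
            "Type": "Task",
            # .sync makes the state machine wait until the Batch job completes
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "nightly-job",
                "JobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/my-queue",
                "JobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/nightly-job:1",
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="control-m-style-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-batch-role",
)
```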

AWS-Batch vs EC2 vs AWS Workspaces for running batch scripts to load data to Redshift

I have multiple CSV files containing data for different tables, with file sizes varying from 1 MB to 1.5 GB. I want to process the data (replace/remove column values) row by row and then load it into existing Redshift tables. This is once-a-day batch processing.
AWS Lambda:
Lambda has memory limitations, so I was not able to run the process for the large CSV files.
EC2: I already have an EC2 instance where I am running a Python script to transform and load the data into Redshift.
I have to keep the EC2 instance running all the time; it holds all the Python scripts I want to run for all tables, plus the environment I created (Python installed, the psycopg library, etc.), which leads to more cost.
AWS Batch:
I created a container image which has all the setup needed to run the Python scripts, and pushed it to ECR.
I then set up an AWS Batch job, which can take this container image and run it through ECS.
This is more optimized; I only pay for the EC2 time used and ECR image storage.
But I will have to do all the development and unit testing on my personal desktop and then push a container; there is no inline AWS service to test with.
AWS Workspaces:
I am not very familiar with AWS WorkSpaces, but I need input: can it also be used like AWS Batch to start and stop an instance when required, run Python scripts on it, and edit or test scripts?
Also, can I schedule it to run every day at a defined time?
I need input on which service is the most suitable, optimized solution for this use case. It would also be great if anyone could suggest a better way to use the services I mentioned above.
Batch is best suited for your use case. I see that your concern about Batch is the development and unit testing on your personal desktop. You can automate that process using AWS ECR, CodePipeline, CodeCommit, and CodeBuild. Set up a pipeline to detect changes made to your code repo, build the image, and push it to ECR. Batch can pick up the latest image from there.
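A hedged sketch (boto3, with hypothetical repository, account ID, role ARN, and resource sizes) of the last link in that chain: registering the Batch job definition against the :latest tag so each daily run uses whatever image the pipeline pushed most recently:

```python
# Sketch: a Batch job definition that points at the :latest image in ECR,
# so the daily job always runs the image last pushed by the pipeline.
# Repository, role ARN, and resource sizes are hypothetical.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="csv-to-redshift",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/csv-loader:latest",
        "command": ["python", "load_to_redshift.py"],
        "jobRoleArn": "arn:aws:iam::123456789012:role/batch-redshift-job",
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},
        ],
    },
)
```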

Continuous Integration on AWS EMR

We have a long running EMR cluster that has multiple libraries installed on it using bootstrap actions. Some of these libraries are under continuous development and their codebase is on GitHub.
I've been looking to plug Travis CI with AWS EMR in a similar way to Travis and CodeDeploy. The idea is to get the code on GitHub tested and deployed automatically to EMR while using bootstrap actions to install the updated libraries on all EMR's nodes.
A solution I came up with is to use an EC2 instance in the middle, where Travis and CodeDeploy can first be used to deploy the code on the instance. After that, a launch script on the instance is triggered to create a new EMR cluster with the updated libraries.
However, the above solution means we need to create a new EMR cluster every time we deploy a new version of the system.
Any other suggestions?
You definitely don't want to maintain an EC2 instance to orchestrate a CI/CD process like that. First of all, it introduces a number of challenges: you need to deal with an entire server instance, keep it maintained, deal with networking, and apply monitoring and alerts to deal with availability issues, and even then you won't have availability guarantees, which may cause other problems. Most of all, maintaining an EC2 instance for a purpose like that is simply unnecessary.
I recommend that you investigate using AWS CodePipeline with a Step Functions state machine and Lambda.
The Step Functions state machine can be used to orchestrate the provisioning of your EMR cluster in a fully serverless environment. With CodePipeline, you can set up a webhook into your GitHub repo to pull your code and spin up a new deployment automatically whenever changes are committed to your master GitHub branch (or whatever branch you specify). You can use EMRFS to sync an S3 bucket or folder to your EMR file system for your cluster and then obtain the security benefits of IAM, as well as the additional consistency guarantees that come with EMRFS. With Lambda, you also get seamless integration into other services, such as Kinesis, DynamoDB, and CloudWatch, among many others, which will simplify many administrative and development tasks and enable you to have more sophisticated automation with minimal effort.
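For a sense of how little code the orchestration piece needs, here is a hedged sketch (boto3, with hypothetical names, release label, cluster sizing, and S3 paths) of a Lambda handler that a Step Functions task state could invoke to provision the cluster with your bootstrap actions:

```python
# Sketch: a Lambda function (invokable from a Step Functions task state) that
# provisions an EMR cluster and installs the libraries via a bootstrap action.
# Cluster sizing, release label, roles, and S3 paths are hypothetical.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    response = emr.run_job_flow(
        Name="ci-provisioned-cluster",
        ReleaseLabel="emr-5.29.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        BootstrapActions=[{
            "Name": "install-updated-libraries",
            "ScriptBootstrapAction": {
                # script uploaded to S3 by the pipeline
                "Path": "s3://my-artifacts/bootstrap/install_libs.sh"
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"ClusterId": response["JobFlowId"]}
```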
There are some great resources and tutorials for using CodePipeline with EMR, as well as in general. Here are some examples:
https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-apache-spark-applications-using-aws/
https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-ecs-ecr-codedeploy.html
https://chalice-workshop.readthedocs.io/en/latest/index.html
There are also great tutorials for orchestrating applications with Lambda Step Functions, including the use of EMR. Here are some examples:
https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://github.com/DavidWells/serverless-workshop/tree/master/lessons-code-complete/events/step-functions
https://github.com/aws-samples/lambda-refarch-imagerecognition
https://github.com/aws-samples/aws-serverless-workshops
In the very worst case, if all of those options fail, such as if you need very strict control over the startup process on the EMR cluster after the EMR cluster completes its bootstrapping, you can always create a Java JAR that is loaded as a final step and then use that to either execute a shell script or use the various Amazon Java libraries to run your provisioning commands. In even this case, you still have no need to maintain your own EC2 instance for orchestration purposes (which, in my opinion, still would be hard to justify even if it was running in a Docker container in Kubernetes) because you can easily maintain that deployment process as well with a fully serverless approach.
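If you do go that last-resort route, you don't necessarily need a custom JAR: here is a hedged sketch (boto3, with a hypothetical cluster ID and script path) of adding a final step that runs a shell script through the script-runner JAR that EMR provides, as an alternative to building your own Java JAR:

```python
# Sketch: add a final EMR step that runs a shell script via script-runner.jar,
# as an alternative to a custom Java JAR. IDs and paths are hypothetical.
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # the cluster ID returned by run_job_flow
    Steps=[{
        "Name": "post-bootstrap-provisioning",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": ["s3://my-artifacts/scripts/provision.sh"],
        },
    }],
)
```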
There are many great videos from the Amazon re:Invent conferences that you may want to watch to get a jump start before you dive into the workshops. For example:
https://www.youtube.com/watch?v=dCDZ7HR7dms
https://www.youtube.com/watch?v=Xi_WrinvTnM&t=1470s
Many more such videos are available on YouTube.
Travis CI also supports Lambda deployment, as mentioned here: https://docs.travis-ci.com/user/deployment/lambda/