AWS Batch vs Spring Batch - amazon-web-services

I have been planning to migrate my Batch processing from Spring Batch to AWS Batch. Can someone give me the reasons to Choose AWS Batch over Spring Batch?

Whilst both these things will play a role in orchestrating your batch workloads a key difference is that AWS Batch will also manage the infrastructure you need to run the jobs/pipeline. AWS Batch lets you to tailor the underlying cloud instances, or specifcy a broad array of instance types that will work for you. And it'll let you make trade-offs: you can task it with managing a bag of EC2 Spot instances for you (for example), and then ask it to optimize time-to-execution over price (or prefer price to speed).
(For full disclosure, I work for the engineering team that builds AWS Batch).

I believe both work at different levels .Spring batch provide framework that reduce the boiler plate code that you need in order to write a batch job.For eg. saving the state of job in Job repository that provide restartability.
On the contrary, AWS batch is an infrastructure framework that helps in managing infra and set some environment variable that help differentiate master node from slave node.
In my opinion both can work together to write a full fledged cost effective batch job at scale on AWS cloud.

Aws Batch is full blown SaaS solution for batch processing,
It has inbuilt
Queue with priority options
Runtime, which can be self managed and auto managed
Job repo, docker images for the job definitions
Monitoring , dashboards, integration with other AWS services like SNS and from SNS to where ever you want
On the other hand, batch is a framework which would still need some of your efforts to manage it all. like employing a queue, scaling is your headache, monitoring etc
My take is , if your company or app is on AWS , go for AWS batch , you will save months of time and get to scalability to a million jobs per day in no time . If you are on-perm or private go for spring batch with some research

Related

How to run a schedule job in a AWS?

In my application, I need to run a fargate job(Job1) which loops through a particular task and invokes multiple tasks of fargate Job(Job2). So I want to know what are the possible ways to run this whole operation as a scheduled task? I tried to create ECS cluster with 2 containers and schedule both job1, and job2 using cloud watch events to run. But i was wondering what is the use of AWS Batch? Was is it an alternative for Cloud watch events? Suggest your thoughts please
You could use AWS EventBridge for this task, it uses the same underlying API as CloudWatch Events but with some relevant architectural changes to better implement an event-driven architecture.
Here's the official documentation how to implement a schedule rule, you're looking to use a ECS Target
AWS Batch serves a different purpose than the one from your use case, as per their official documentation:
AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.
What you're trying to do is quite simple, I recommend you keep it simple and don't try to overcomplicate it.

AWS Batch vs AWS CodeBuild

I am very new in AWS and when I was searching something to download a code from GitHub (a python project), run it, and save the output in s3 the first service that I found was CodeBuild.
So I implement this kind of workflow using CodeBuild.
But now I have seen that AWS have a service called AWS Batch and I am wondering if I should migrate my arquitecture to AWS Batch.
Can you explain which one - AWS CodeBuild or AWS Batch - is more suitable with my case? When use AWS Batch instead of AWS CodeBuild?
Thank very much.
TLDR Summary: AWS Codebuild is the nicer choice for simple jobs.
My (reverse) experience...
I needed to run a simple job that pulls data from external api, read/write to external database, and generate a CSV report.
The job takes ~1 hour to run, so AWS Lambda is out of the picture.
After some googling, I found AWS Batch and decided to give Creating a Simple “Fetch & Run” AWS Batch Job a try.
The required steps to this "simple" job working:
Build a Docker image with the fetch & run script
Create an Amazon ECR repository for the image
Push the built image to ECR
Create a simple job script and upload it to S3
Create an IAM role to be used by jobs to access S3
Configure a compute environment
Create a job queue
Create a job definition that uses the built image
Submit and run a job that execute the job script from S3
After spending the time to create all these resources, it did not work out of the box. I found myself debugging random things I shouldn't have to debug such as:
Dockerfile
entrypoint script
ECS cluster
EC2 instance and autoscaling group
After failing to find simple practical examples, and realizing the amount of effort required, I decided to explore other solutions.
I stumbled onto Using AWS CodeBuild to execute administrative tasks and this post.
I've used AWS Codebuild in the past for CI/CD pipeline, and thought "what the hell, lets give it a try". In a shorter amount of time, I was able to get a "codebuild job" running on a cloudwatch scheduler and codebuild slack notifications added, with less effort:
Connect build project to your source code
Select a runtime environment
Create IAM role
Create a buildspec.yml and add runtime commands
One major advantage is that CodeBuild runs tasks in a full-blown Linux environment.
Drawbacks:
Max execution time of 8 hours
AWS Codebuild was much easier to get working for my simple job.
Sorry for the long post, just wanted to share my experience with these 2 services.
AWS Batch is used highly parallel computations, e.g., processing large number of images at the same time:
AWS Batch enables you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources, and AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software.
Thus its not suited for what you are trying to use it. CodeBuild is better choice, based on your description.

What is the difference between AWS Elastic MapReduce and AWS Kinesis Data Analytics?

I'm executing a Flink Job with this tools.
I think both can do exactly the same with the proper configuration. Does Kinesis Data Analytics do something that EMR can not do or vice versa?
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time.
Amazon Elastic Map Reduce provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR.
The major difference is maintainability and management from your side.
If you want more independent management and more control then I would say go for AWS EMR. Where its your responsibility to manage the EMR infrastructure as well as the Apache Flink cluster in it.
But if you want less control and more focus on application development and you need to deliver faster(tight deadline) then KDA is the way to go. Here AWS provides all the bells and whistles you need for running your application. This also easily sets up with AWS s3 as code source and provides a bare minimum Configuration Management using the UI.
It scales automatically as well.(Need to understand KCU though).
It provides the same Flink dashboard where you can monitor your application and AWS Cloudwatch integration for debugging your application.
Please go through this nice presentation and let me know it that helps.
Please let me know.
https://www.youtube.com/watch?v=c_LswkrwOvk
I will say one major difference between the two is that Kinesis does not provide a hosted Hadoop service unlike Elastic MapReduce (now EMR)
Got this same question also. This video was helpful in explaining with a real architecture scenario and AWS explanation here tries to explain how Kinesis and EMR can fit together with possible use cases.

Want to clear Big picture about the AWS Glue

I want to clear big picture about the aws Glue regarding some of the following aspects.
How AWS Glue prepare and provision its infrastructure? However it's serverless but how does it manage it?
How it's using apache spark and hadoop to solve so many ETL jobs at a time, Almost jobs of hundreds of AWS Glue customers from every region.
Thanks
AWS Glue uses EMR underneath. It spawns a new cluster with required number of executors (depending on configured DPU) when a new job starts. However, to improve cold start time they have a buffer of already provisioned EMR clusters for the most common number of DPUs. To manage all this they have a set of automated services that monitor state of each cluster, start a new ones etc.

Continuous Integration on AWS EMR

We have a long running EMR cluster that has multiple libraries installed on it using bootstrap actions. Some of these libraries are under continuous development and their codebase is on GitHub.
I've been looking to plug Travis CI with AWS EMR in a similar way to Travis and CodeDeploy. The idea is to get the code on GitHub tested and deployed automatically to EMR while using bootstrap actions to install the updated libraries on all EMR's nodes.
A solution I came up with is to use an EC2 instance in the middle, where Travis and CodeDeploy can be first used to deploy the code on the instance. After that a lunch script on the instance is triggered to create a new EMR cluster with the updated libraries.
However, the above solution means we need to create a new EMR cluster every time we deploy a new version of the system
Any other suggestions?
You definitely don't want to maintain an EC2 instance to orchestrate a CI/CD process like that. First of all, it introduces a number of challenges because then you need to deal with an entire server instance, keep it maintained, deal with networking, apply monitoring and alerts to deal with availability issues, and even then, you won't have availability guarantees, which may cause other issues. Most of all, maintaining an EC2 instance for a purpose like that simply is unnecessary.
I recommend that you investigate using Amazon CodePipeline with a Lambda Step Function.
The Step Function can be used to orchestrate the provisioning of your EMR cluster in a fully serverless environment. With CodePipeline, you can setup a web hook into your Github repo to pull your code and spin up a new deployment automatically whenever changes are committed to your master Github branch (or whatever branch you specify). You can use EMRFS to sync an S3 bucket or folder to your EMR file system for your cluster and then obtain the security benefits of IAM, as well as additional consistency guarantees that come with EMRFS. With Lambda, you also get seamless integration into other services, such as Kinesis, DynamoDB, and CloudWatch, among many others, that will simplify many administrative and development tasks, as well as enable you to have more sophisticated automation with minimal effort.
There are some great resources and tutorials for using CodePipeline with EMR, as well as in general. Here are some examples:
https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-apache-spark-applications-using-aws/
https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-ecs-ecr-codedeploy.html
https://chalice-workshop.readthedocs.io/en/latest/index.html
There are also great tutorials for orchestrating applications with Lambda Step Functions, including the use of EMR. Here are some examples:
https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://github.com/DavidWells/serverless-workshop/tree/master/lessons-code-complete/events/step-functions
https://github.com/aws-samples/lambda-refarch-imagerecognition
https://github.com/aws-samples/aws-serverless-workshops
In the very worst case, if all of those options fail, such as if you need very strict control over the startup process on the EMR cluster after the EMR cluster completes its bootstrapping, you can always create a Java JAR that is loaded as a final step and then use that to either execute a shell script or use the various Amazon Java libraries to run your provisioning commands. In even this case, you still have no need to maintain your own EC2 instance for orchestration purposes (which, in my opinion, still would be hard to justify even if it was running in a Docker container in Kubernetes) because you can easily maintain that deployment process as well with a fully serverless approach.
There are many great videos from the Amazon re:Invent conferences that you may want to watch to get a jump start before you dive into the workshops. For example:
https://www.youtube.com/watch?v=dCDZ7HR7dms
https://www.youtube.com/watch?v=Xi_WrinvTnM&t=1470s
Many more such videos are available on YouTube.
Travis CI also supports Lambda deployment, as mentioned here: https://docs.travis-ci.com/user/deployment/lambda/