I want to get a clearer big-picture view of AWS Glue regarding the following aspects.
How does AWS Glue prepare and provision its infrastructure? It is serverless, but how is that managed under the hood?
How does it use Apache Spark and Hadoop to run so many ETL jobs at a time, across hundreds of AWS Glue customers in every region?
Thanks
AWS Glue uses EMR underneath. When a new job starts, it spawns a cluster with the required number of executors (depending on the configured DPUs). However, to improve cold-start time, Glue keeps a buffer of already provisioned EMR clusters for the most common DPU counts. To manage all this, there is a set of automated services that monitor the state of each cluster, start new ones, and so on.
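For what it's worth, the DPU sizing mentioned above is the knob you control when you launch a job; Glue provisions the underlying capacity for you. A minimal sketch with boto3 (the job name and worker counts are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Start a run of an existing Glue job. For Spark jobs you size capacity via
# WorkerType/NumberOfWorkers; each G.1X worker corresponds to 1 DPU and each
# G.2X worker to 2 DPUs. The job name here is hypothetical.
response = glue.start_job_run(
    JobName="my-etl-job",
    WorkerType="G.1X",
    NumberOfWorkers=10,   # roughly 10 DPUs of capacity for this run
)

# Poll the run state; the cluster behind it is managed entirely by Glue.
run = glue.get_job_run(JobName="my-etl-job", RunId=response["JobRunId"])
print(run["JobRun"]["JobRunState"])
```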
I am trying to set up a data pipeline in AWS, ideally using serverless, hosted services.
However, one of the steps requires a large amount of RAM (120 GB) and cannot be broken down into smaller chunks.
Ideally I would also run the steps as containers, since the package requirements are a bit exotic.
So far it seems that neither AWS Glue nor MWAA handles more than 32 GB of RAM.
The one that does handle it is AWS Data Pipeline, which is being deprecated.
Am I missing some (hosted) options? Otherwise, I know that I can do things like running Flyte on managed k8s.
Regards,
Niklas
For such a use case, where you require a containerized approach and prefer it to be serverless, you can check out EMR Serverless:
Amazon EMR Serverless is a new deployment option for Amazon EMR. EMR
Serverless provides a serverless runtime environment that simplifies
the operation of analytics applications that use the latest open
source frameworks, such as Apache Spark and Apache Hive. With EMR
Serverless, you don’t have to configure, optimize, secure, or operate
clusters to run applications with these frameworks.
EMR Serverless helps you avoid over- or under-provisioning resources
for your data processing jobs. EMR Serverless automatically determines
the resources that the application needs, gets these resources to
process your jobs, and releases the resources when the jobs finish.
Additionally, you can build your own containers with custom images that contain your specific package requirements.
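A minimal sketch of what that looks like with boto3, assuming you have already pushed a custom image to ECR and created a suitable job execution role; all ARNs, image URIs, S3 paths, and the release label below are placeholders, and whether a single worker can reach your memory target depends on the per-worker limits of the release you pick:

```python
import boto3

emr = boto3.client("emr-serverless")

# Create a Spark application backed by a custom image containing the exotic packages.
app = emr.create_application(
    name="heavy-step",
    releaseLabel="emr-6.9.0",   # example release label
    type="SPARK",
    imageConfiguration={
        "imageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/heavy-step:latest"
    },
)

# Submit a job run; executor memory/cores are requested via Spark configuration.
emr.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EmrServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/heavy_step.py",
            # Check the per-worker memory limits for your release before sizing this.
            "sparkSubmitParameters": "--conf spark.executor.memory=100g --conf spark.executor.cores=16",
        }
    },
)
```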
And a note: Glue can handle this workload too. The G.2X worker type has 32 GB of memory, but it also has 128 GB of disk space, which a worker uses when it needs the extra space (for example, during a shuffle operation). You can also add your own custom packages per job.
How can I turn an EMR cluster on and off? The only option seems to be terminating it permanently. What if I do not need the cluster at night and I do not want to create a new cluster every morning?
You can't do this. Stopping an EMR cluster is not supported; you simply terminate it when you don't need it.
To protect your data, you should be using EMRFS, which lets the EMR cluster read data directly from S3. That way, there is no need to copy any data from S3 to HDFS.
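A small illustration of the EMRFS point: on EMR you can point Spark straight at S3 paths instead of staging data into HDFS first (the bucket, prefix, and column name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-s3").getOrCreate()

# With EMRFS, s3:// paths are read directly; nothing is copied into HDFS.
df = spark.read.parquet("s3://my-bucket/input/")   # hypothetical bucket/prefix
df.groupBy("some_column").count().show()           # hypothetical column
```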
You can enable the scale-up/scale-down policies available in the EMR UI and resize your cluster based on multiple metrics, e.g., RAM/CPU utilization. You can also create an external job that sends scale-up/scale-down commands to EMR via the AWS CLI, and schedule such jobs to run in the morning and in the evening.
From my experience, resizing works well on task nodes, while resizing core nodes requires an HDFS sync that only works if you aren't running any tasks on your EMR cluster.
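A sketch of that scheduled-resize idea using boto3 rather than the raw CLI; the cluster ID is a placeholder, and as noted above this is safest on task instance groups:

```python
import boto3

emr = boto3.client("emr")

CLUSTER_ID = "j-XXXXXXXXXXXXX"   # placeholder cluster id

# Find the task instance group (resizing task nodes avoids the HDFS sync issue).
groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

def resize(count: int) -> None:
    """Scale the task group to `count` instances; call from a morning/evening schedule."""
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": count}],
    )

resize(10)   # e.g. scale up in the morning
# resize(1)  # ...and back down in the evening
```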
I have been planning to migrate my batch processing from Spring Batch to AWS Batch. Can someone give me reasons to choose AWS Batch over Spring Batch?
Whilst both of these will play a role in orchestrating your batch workloads, a key difference is that AWS Batch will also manage the infrastructure you need to run the jobs/pipeline. AWS Batch lets you tailor the underlying cloud instances, or specify a broad array of instance types that will work for you. And it lets you make trade-offs: you can task it with managing a fleet of EC2 Spot Instances for you (for example), and then ask it to optimize time-to-execution over price (or prefer price to speed).
(For full disclosure, I work for the engineering team that builds AWS Batch).
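A rough sketch of that Spot trade-off with boto3; the VPC details, role ARNs, and instance types below are placeholders, and the allocation strategy is where the price-versus-speed preference is expressed:

```python
import boto3

batch = boto3.client("batch")

# A managed Spot compute environment: Batch picks, scales, and replaces instances for you.
batch.create_compute_environment(
    computeEnvironmentName="spot-ce",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        # SPOT_CAPACITY_OPTIMIZED favours pools less likely to be interrupted;
        # BEST_FIT-style strategies lean more towards price.
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["m5", "c5", "r5"],   # let Batch choose sizes within these families
        "subnets": ["subnet-aaaa1111"],        # placeholder
        "securityGroupIds": ["sg-bbbb2222"],   # placeholder
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```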
I believe the two work at different levels. Spring Batch provides a framework that reduces the boilerplate code you need in order to write a batch job, e.g., saving job state in a job repository to provide restartability.
AWS Batch, on the other hand, is an infrastructure-level service that manages the infra and sets some environment variables that help differentiate the main node from the worker nodes.
In my opinion, both can work together to build a full-fledged, cost-effective batch job at scale on the AWS cloud.
AWS Batch is a fully managed service for batch processing.
It has built in:
A queue with priority options
A runtime, which can be self-managed or fully managed
A job repository and Docker images for the job definitions
Monitoring, dashboards, and integration with other AWS services like SNS (and from SNS to wherever you want)
On the other hand, Spring Batch is a framework that still needs some effort from you to manage it all: supplying a queue, scaling, monitoring, and so on are your responsibility.
My take is: if your company or app is on AWS, go for AWS Batch; you will save months of time and get scalability to a million jobs per day in no time. If you are on-prem or in a private cloud, go for Spring Batch after some research.
I am very new to AWS, and when I was looking for a way to download code from GitHub (a Python project), run it, and save the output to S3, the first service I found was CodeBuild.
So I implemented this kind of workflow using CodeBuild.
But now I have seen that AWS has a service called AWS Batch, and I am wondering if I should migrate my architecture to AWS Batch.
Can you explain which one, AWS CodeBuild or AWS Batch, is more suitable for my case? When should I use AWS Batch instead of AWS CodeBuild?
Thank you very much.
TL;DR: AWS CodeBuild is the nicer choice for simple jobs.
My (reverse) experience...
I needed to run a simple job that pulls data from an external API, reads from and writes to an external database, and generates a CSV report.
The job takes ~1 hour to run, so AWS Lambda is out of the picture.
After some googling, I found AWS Batch and decided to give Creating a Simple “Fetch & Run” AWS Batch Job a try.
The required steps to get this "simple" job working:
Build a Docker image with the fetch & run script
Create an Amazon ECR repository for the image
Push the built image to ECR
Create a simple job script and upload it to S3
Create an IAM role to be used by jobs to access S3
Configure a compute environment
Create a job queue
Create a job definition that uses the built image
Submit and run a job that executes the job script from S3 (a submission sketch follows this list)
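For context, the last two steps boil down to a couple of boto3 calls like the following; the queue and definition names and the image URI are placeholders, and the environment variable names follow the fetch-and-run blog post's convention:

```python
import boto3

batch = boto3.client("batch")

# Job definition pointing at the fetch-and-run image pushed to ECR (placeholder URI).
batch.register_job_definition(
    jobDefinitionName="fetch-and-run",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/fetch-and-run:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"},
        ],
    },
)

# Submit a run that tells the entrypoint script which S3 object to fetch and execute.
batch.submit_job(
    jobName="report-run",
    jobQueue="my-queue",              # placeholder queue name
    jobDefinition="fetch-and-run",
    containerOverrides={
        "environment": [
            {"name": "BATCH_FILE_S3_URL", "value": "s3://my-bucket/jobs/report.sh"},
            {"name": "BATCH_FILE_TYPE", "value": "script"},
        ]
    },
)
```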
After spending the time to create all these resources, it did not work out of the box. I found myself debugging random things I shouldn't have had to debug, such as the:
Dockerfile
entrypoint script
ECS cluster
EC2 instance and autoscaling group
After failing to find simple practical examples, and realizing the amount of effort required, I decided to explore other solutions.
I stumbled onto Using AWS CodeBuild to execute administrative tasks and this post.
I've used AWS CodeBuild in the past for CI/CD pipelines, and thought "what the hell, let's give it a try". In a shorter amount of time, and with less effort, I was able to get a "CodeBuild job" running on a CloudWatch schedule, with CodeBuild Slack notifications added (see the scheduling sketch after this list):
Connect build project to your source code
Select a runtime environment
Create IAM role
Create a buildspec.yml and add runtime commands
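For reference, the CloudWatch/EventBridge scheduling part can be wired up with a couple of calls like these; the project ARN, role ARN, and cron expression are placeholders, and the role needs permission to call codebuild:StartBuild:

```python
import boto3

events = boto3.client("events")

# A nightly schedule rule (cron is in UTC; the expression is just an example).
events.put_rule(
    Name="nightly-report-build",
    ScheduleExpression="cron(0 2 * * ? *)",
)

# Point the rule at the CodeBuild project so each trigger starts a build.
events.put_targets(
    Rule="nightly-report-build",
    Targets=[{
        "Id": "codebuild-target",
        "Arn": "arn:aws:codebuild:us-east-1:123456789012:project/report-job",   # placeholder
        "RoleArn": "arn:aws:iam::123456789012:role/EventsInvokeCodeBuild",      # placeholder
    }],
)
```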
One major advantage is that CodeBuild runs tasks in a full-blown Linux environment.
Drawbacks:
Max execution time of 8 hours
AWS CodeBuild was much easier to get working for my simple job.
Sorry for the long post, just wanted to share my experience with these 2 services.
AWS Batch is meant for highly parallel computations, e.g., processing a large number of images at the same time:
AWS Batch enables you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources, and AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software.
Thus it's not suited for what you are trying to do; CodeBuild is the better choice, based on your description.
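To make the "highly parallel" point concrete, that fan-out is typically done with an array job: one submit_job call spawns many child containers, and each child picks its shard from the AWS_BATCH_JOB_ARRAY_INDEX environment variable. The queue and definition names below are placeholders:

```python
import boto3

batch = boto3.client("batch")

# One submission fans out into 1,000 child jobs; each container reads
# AWS_BATCH_JOB_ARRAY_INDEX to decide which image (or other shard of the
# input) it should process.
batch.submit_job(
    jobName="resize-images",
    jobQueue="image-queue",          # placeholder
    jobDefinition="image-resizer",   # placeholder
    arrayProperties={"size": 1000},
)
```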
I'm currently using some Glue jobs for minimal transformations and for sending data from S3/Athena tables to Redshift. We don't process a lot of data, so Glue is expensive, slow, and difficult to tune for this volume of data.
I couldn't find how to get started with EC2 for the code migration, credentials, and dependencies.
Maybe I can call a Lambda to process it on my EC2 instance? Can I run Spark on one node and then scale out to a cluster in the future? Should I migrate the Glue job to plain Python (not PySpark)?
I found that EMR will be expensive too for this volume; ideally I would start with the minimum.
I don't need the full solution, just a pointer in the right direction so I can start trying this.
Thank you!
Here are a few suggestions for your requirement:
Serverless options like Glue and Lambda are more suitable than a persistent EMR cluster or EC2.
AWS Lambda: you can consider using Lambda with Python modules if your volume of data is small and the transformations are minimal.
AWS Glue with Python Shell (not Spark): also a cost-effective solution.
AWS EC2: going the EC2 route is a legacy approach and more costly.
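As one concrete direction for the Lambda route, a small handler can trigger a Redshift COPY straight from S3 via the Redshift Data API, so no Spark cluster is involved at all; the cluster, database, user, role, and table names below are placeholders:

```python
import boto3

client = boto3.client("redshift-data")

def handler(event, context):
    """Triggered by an S3 event; loads the new object into Redshift via COPY."""
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # The Data API runs the statement asynchronously inside the cluster,
    # so the Lambda itself stays small and short-lived.
    return client.execute_statement(
        ClusterIdentifier="my-cluster",   # placeholder
        Database="analytics",             # placeholder
        DbUser="loader",                  # placeholder
        Sql=f"COPY public.events FROM '{s3_path}' "
            f"IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
            f"FORMAT AS PARQUET;",
    )
```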