I have a Spark Structured Streaming job that reads from AWS MSK (Kafka) and writes to AWS S3. Is it a good idea to keep a standing AWS EMR cluster always running for this? Or are there better ways to manage this infrastructure?
Please let me know if you need further details.
Either way, you need some worker pool that is consuming and writing.
Your other options include running Spark on EKS instead of YARN on EMR, or you could drop Spark and use the Kafka Connect S3 Sink on an EC2/EKS cluster.
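If you go the Kafka Connect route, the connector is configured through a JSON body posted to the Connect REST API. Here is a rough sketch in Python that just builds that body. The property keys are the ones used by the Confluent S3 sink connector as far as I know; the connector name, topic, bucket and region are placeholders I made up, so adjust them and check the keys against the connector version you install.

```python
import json


def s3_sink_config(name, topic, bucket, region, flush_size=1000):
    """Build a request body for the Kafka Connect REST API (POST /connectors).

    All argument values are placeholders supplied by the caller.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "topics": topic,
            "s3.bucket.name": bucket,
            "s3.region": region,
            # Number of records written before the connector commits a file to S3.
            "flush.size": str(flush_size),
            "tasks.max": "1",
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    body = s3_sink_config("msk-to-s3", "events", "my-example-bucket", "us-east-1")
    print(json.dumps(body, indent=2))
```

You would then POST the printed JSON to your Connect worker's `/connectors` endpoint.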
I currently have an AWS EMR cluster running HBase, and I am saving the data to S3. I want to migrate the data to a new EMR cluster in the same account. What is the proper way to migrate data from one EMR cluster to another?
Thank you
There are different ways to copy a table from one cluster to another:
Use the CopyTable utility. The disadvantage is that it can degrade region server performance, or you may need to disable the tables prior to the copy.
HBase snapshots (recommended). They have little impact on region server performance.
You can follow the AWS documentation to perform the snapshot/restore operations.
Basically you will do the following:
Create Snapshot
Export to S3
Import from S3
Restore to HBase
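The four steps above map roughly onto the commands below. This is only a sketch that assembles the command lines (it does not run anything), written from what I remember of the `hbase snapshot` tool and the AWS EMR snapshot docs, so verify the flags against your HBase version. The table name, snapshot name, bucket path and HDFS URI are hypothetical placeholders.

```python
import shlex


def snapshot_migration_commands(table, snapshot, s3_uri, target_hdfs_uri, mappers=2):
    """Return the commands for each stage of a snapshot-based migration."""
    old_cluster = [
        # 1. Create the snapshot on the source cluster.
        ["hbase", "snapshot", "create", "-n", snapshot, "-t", table],
        # 2. Export the snapshot to S3.
        ["hbase", "snapshot", "export", "-snapshot", snapshot,
         "-copy-to", s3_uri, "-mappers", str(mappers)],
    ]
    new_cluster = [
        # 3. Import the snapshot from S3 into the new cluster's HDFS.
        ["hbase", "snapshot", "export", "-D", f"hbase.rootdir={s3_uri}",
         "-snapshot", snapshot, "-copy-to", target_hdfs_uri,
         "-mappers", str(mappers)],
    ]
    # 4. Restore inside `hbase shell` on the new cluster. The disable/enable
    #    pair assumes the table already exists there.
    hbase_shell = [
        f"disable '{table}'",
        f"restore_snapshot '{snapshot}'",
        f"enable '{table}'",
    ]
    return {
        "old_cluster": [shlex.join(cmd) for cmd in old_cluster],
        "new_cluster": [shlex.join(cmd) for cmd in new_cluster],
        "hbase_shell": hbase_shell,
    }


if __name__ == "__main__":
    plan = snapshot_migration_commands(
        "mytable", "mysnap",
        "s3://my-example-bucket/hbase-snapshots",
        "hdfs://new-master:8020/user/hbase",
    )
    for stage, commands in plan.items():
        print(stage)
        for command in commands:
            print("  " + command)
```

On EMR you may need to run the import as the `hbase` user and set ownership on the copied files; check the AWS guide for the exact options.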
I am very new to AWS. When I was searching for a way to download code from GitHub (a Python project), run it, and save the output in S3, the first service I found was CodeBuild.
So I implemented this kind of workflow using CodeBuild.
But now I have seen that AWS has a service called AWS Batch, and I am wondering if I should migrate my architecture to it.
Can you explain which one, AWS CodeBuild or AWS Batch, is more suitable for my case? When should I use AWS Batch instead of AWS CodeBuild?
Thank you very much.
TL;DR: AWS CodeBuild is the nicer choice for simple jobs.
My (reverse) experience...
I needed to run a simple job that pulls data from an external API, reads from and writes to an external database, and generates a CSV report.
The job takes ~1 hour to run, so AWS Lambda is out of the picture.
After some googling, I found AWS Batch and decided to give Creating a Simple “Fetch & Run” AWS Batch Job a try.
The required steps to get this "simple" job working:
Build a Docker image with the fetch & run script
Create an Amazon ECR repository for the image
Push the built image to ECR
Create a simple job script and upload it to S3
Create an IAM role to be used by jobs to access S3
Configure a compute environment
Create a job queue
Create a job definition that uses the built image
Submit and run a job that executes the job script from S3
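For reference, the last step (submitting the job) looks roughly like this with boto3. This is a sketch, not what I actually ran: the function name, queue, job definition and S3 URL are placeholders, and the two environment variables are the ones the "Fetch & Run" image reads, as far as I remember from that blog post. The client is passed in so the snippet does not need AWS credentials just to be imported.

```python
def submit_fetch_and_run(batch_client, job_name, job_queue, job_definition, script_s3_url):
    """Submit an AWS Batch job that tells the fetch-and-run image which script to pull.

    `batch_client` is expected to be a boto3 Batch client,
    e.g. boto3.client("batch").
    """
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        containerOverrides={
            "environment": [
                # Assumed variable names, taken from the Fetch & Run example.
                {"name": "BATCH_FILE_TYPE", "value": "script"},
                {"name": "BATCH_FILE_S3_URL", "value": script_s3_url},
            ]
        },
    )
    return response["jobId"]


# Usage (requires boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   job_id = submit_fetch_and_run(
#       boto3.client("batch"), "report-job", "my-queue",
#       "fetch_and_run", "s3://my-example-bucket/myjob.sh")
```

Everything before this call (image, ECR repository, IAM role, compute environment, queue, job definition) still has to exist first, which is where most of the effort went.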
After spending the time to create all these resources, it did not work out of the box. I found myself debugging random things I shouldn't have to debug such as:
Dockerfile
entrypoint script
ECS cluster
EC2 instance and autoscaling group
After failing to find simple practical examples, and realizing the amount of effort required, I decided to explore other solutions.
I stumbled onto Using AWS CodeBuild to execute administrative tasks and this post.
I've used AWS CodeBuild in the past for CI/CD pipelines, and thought "what the hell, let's give it a try". In a shorter amount of time, and with less effort, I was able to get a "CodeBuild job" running on a CloudWatch schedule, with CodeBuild Slack notifications added:
Connect build project to your source code
Select a runtime environment
Create IAM role
Create a buildspec.yml and add runtime commands
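Once the project exists, you can also kick off a run on demand instead of waiting for the schedule. A minimal sketch with boto3, assuming a project that is already configured as in the steps above; the function name, project name and environment variable are hypothetical, and the client is passed in so nothing talks to AWS on import.

```python
def run_codebuild_job(codebuild_client, project_name, env=None):
    """Start a build of an existing CodeBuild project and return the build id.

    `codebuild_client` is expected to be a boto3 CodeBuild client,
    e.g. boto3.client("codebuild").
    """
    overrides = [
        {"name": key, "value": value, "type": "PLAINTEXT"}
        for key, value in (env or {}).items()
    ]
    kwargs = {"projectName": project_name}
    if overrides:
        # Per-run environment variables, visible to the buildspec commands.
        kwargs["environmentVariablesOverride"] = overrides
    response = codebuild_client.start_build(**kwargs)
    return response["build"]["id"]


# Usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   build_id = run_codebuild_job(
#       boto3.client("codebuild"), "csv-report-job", {"REPORT_DATE": "2020-01-01"})
```

The commands that do the actual work still live in buildspec.yml; this only triggers a run.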
One major advantage is that CodeBuild runs tasks in a full-blown Linux environment.
Drawbacks:
Max execution time of 8 hours
AWS CodeBuild was much easier to get working for my simple job.
Sorry for the long post, just wanted to share my experience with these 2 services.
AWS Batch is used for highly parallel computations, e.g., processing a large number of images at the same time:
AWS Batch enables you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources, and AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software.
Thus it's not suited to what you are trying to do. CodeBuild is the better choice, based on your description.
I want a clearer big picture of AWS Glue regarding the following aspects.
How does AWS Glue prepare and provision its infrastructure? It's serverless, but how does it manage that?
How does it use Apache Spark and Hadoop to run so many ETL jobs at a time, for hundreds of AWS Glue customers in every region?
Thanks
AWS Glue uses EMR underneath. It spawns a new cluster with the required number of executors (depending on the configured DPUs) when a new job starts. However, to improve cold start time, they keep a buffer of already provisioned EMR clusters for the most common DPU counts. To manage all this, they have a set of automated services that monitor the state of each cluster, start new ones, etc.
I read through the documentation for 1.7 on the process for AWS which has their EMR deployment as the recommended way (but sounds like that is because of ease of deployment). My ideal scenario would be to have a session cluster, mostly because I am most familiar with that and the cluster wouldn't have to be redeployed on code changes.
My biggest question is whether there are performance pros/cons between a cluster set up with Docker on EC2 (please correct me if I am wrong about the EC2 part; I'm still getting up to speed with AWS) and EMR. The data volume would be in the high end of hundreds of millions a day.
Thanks for any help in advance!