Job chaining on Amazon EMR? - amazon-web-services

I need to run two chained MapReduce jobs, using the output of the first job as the input for the second.
How can I achieve this on EMR?

You can add multiple jobs as steps and use S3 to store the intermediate results. The second MapReduce job then reads the intermediate results from S3 and finishes the work.
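As a rough illustration of the two-step approach, the sketch below builds two EMR step definitions in which the second step reads the S3 location the first step writes to. The jar location, the `job1`/`job2` arguments, and the S3 URIs are hypothetical placeholders; only the `add_job_flow_steps` call and the step dictionary layout come from boto3.

```python
def hadoop_jar_step(name, jar, args):
    """Build one EMR step definition that runs a custom JAR."""
    return {
        "Name": name,
        # If this step fails, cancel the remaining steps so the second
        # job never runs on missing or partial intermediate data.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


def chained_steps(jar, input_uri, intermediate_uri, output_uri):
    """Two steps: job 1 writes to intermediate_uri, job 2 reads from it.

    The "job1"/"job2" arguments are hypothetical; use whatever arguments
    your own JAR expects.
    """
    return [
        hadoop_jar_step("job-1", jar, ["job1", input_uri, intermediate_uri]),
        hadoop_jar_step("job-2", jar, ["job2", intermediate_uri, output_uri]),
    ]


def submit(cluster_id, steps):
    """Submit the steps to a running cluster (needs boto3 and credentials)."""
    import boto3

    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)["StepIds"]
```

Steps on a cluster run one after another by default, so submitting both in a single call should be enough to get job 2 to start only after job 1 has finished.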

Related

Execute only one Glue job at a time / sequential Glue job execution

Currently, we have the following AWS setup for executing Glue jobs: an S3 event triggers a Lambda function whose Python logic starts 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, multiple different Glue jobs run in parallel. How can I make it so that only one Glue job runs at any point in time, and any other Glue jobs submitted for execution wait in a queue until the currently running job has finished?
You can use Step Functions and specify in each step the job you want to run. That gives you control over the order: once step one completes, step two's job is started, and so on.
If you want a job queue so that the Glue jobs are triggered in sequence, you could consider a combination of SQS -> Lambda -> Glue jobs. Please refer to this SO question for details.
AWS Step Functions is another option, as suggested by Vaquar Khan.
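To make the Step Functions suggestion concrete, here is a minimal sketch that generates a state machine definition running a list of Glue jobs strictly one after another. The job names and state names are hypothetical; the `glue:startJobRun.sync` resource is the Step Functions integration that waits for a Glue job run to finish before moving on. Note that this only serializes jobs within a single execution; if several S3 events each start their own execution, those executions can still overlap.

```python
import json


def sequential_glue_definition(job_names):
    """Return an Amazon States Language definition (as a dict) that runs
    the given Glue jobs in order, each waiting for the previous one."""
    if not job_names:
        raise ValueError("at least one Glue job name is required")

    # State names must be unique, so prefix each with its position.
    state_names = [f"Run-{i}-{job}" for i, job in enumerate(job_names, start=1)]
    states = {}
    for i, job in enumerate(job_names):
        state = {
            "Type": "Task",
            # ".sync" makes the state wait until the Glue job run completes.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job},
        }
        if i + 1 < len(job_names):
            state["Next"] = state_names[i + 1]
        else:
            state["End"] = True
        states[state_names[i]] = state

    return {"StartAt": state_names[0], "States": states}


if __name__ == "__main__":
    # Hypothetical job names, printed as JSON for use as a state machine definition.
    print(json.dumps(sequential_glue_definition(["job-a", "job-b"]), indent=2))
```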

S3DistCP - Split source in multiples jobs

I have to copy data from S3 to HDFS on an EMR cluster.
I'm trying to reduce the execution time of my job.
Looking at the logs, the map input of the job is 1_000_000 files.
I need to split this into 100_00 files per job.
Is it possible to define this behavior in the command that adds the step in EMR?

Copy records(row by row) from S3 to SQS - Using AWS Batch

I have set up a Glue job that runs concurrently to process input files and writes the results to S3. The Glue job runs periodically (it is not a one-time job).
The output in S3 is in the form of CSV files. The requirement is to copy all of those records into AWS SQS. Assume there might be hundreds of files, each containing up to a million records.
Initially I was planning to have a Lambda event send the records row by row. However, the docs give a time limit of 15 minutes for Lambda: https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/#:~:text=You%20can%20now%20configure%20your,Lambda%20function%20was%205%20minutes.
Would it be better to use AWS Batch for copying the records from S3 to SQS? I believe AWS Batch can scale the process when needed and also perform the task in parallel.
I want to know whether AWS Batch is the right pick or whether I am overcomplicating the design.

What is the best way to copy large csv files from s3 to redshift?

I'm working on a task of copying CSV files from an S3 bucket to Redshift. I've found multiple ways to do so, but I'm not sure which one is best. Here's the scenario:
At regular intervals, multiple CSV files of around 500 MB - 1 GB each will be added to my S3 bucket. The data can contain duplicates. The task is to copy the data to a Redshift table while ensuring that no duplicate data ends up in Redshift.
Here are the options I found:
Create an AWS Lambda function which will be triggered whenever a file is added to the S3 bucket.
Use AWS Kinesis
Use AWS Glue
I understand Lambda should not be used for jobs that take more than 5 minutes. So should I use it or just eliminate this option?
Kinesis can handle large amounts of data, but is it the best way to do it?
I'm not familiar with Glue and Kinesis, but I read that Glue can be slow.
If anyone can point me in the right direction, it would be really helpful.
You can definitely make it work with Lambda if you use Step Functions and the S3 Select option to break the data into smaller chunks. Step Functions would manage your ETL orchestration, executing Lambdas that each pull a subset of the large data file via S3 Select. A pre-processing state (see the links below) could determine the execution requirements and then run multiple Lambdas, even in parallel if you wish. Those Lambdas would process the subsets of data to remove duplicates and perform any other ETL operations you require. You'd then take the processed data and write it to Redshift. Here are links that will help you put that architecture together:
Trigger State Machine Execution from S3 Event
Manage Lambda Processing Executions and workflow state
Use S3 Select to pull subsets from large data objects
Also, here's a link to a Python ETL pipeline example for the CDK that I built. It includes an example of an S3 event-driven Lambda along with data processing and DDB or MySQL writes, which should give you an idea of how to build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment for Redshift. Hope this helps.
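One way to picture the "pull subsets via S3 Select" step described above is to split the object into byte ranges and have each Lambda request one range. The sketch below is an assumption-laden illustration, not the answer's actual pipeline: the chunk size, bucket, and key are placeholders, the CSV is assumed to have no header row, and the exact boundary behaviour of `ScanRange` should be checked against the S3 documentation before relying on it.

```python
def scan_ranges(object_size, chunk_size):
    """Split an object of object_size bytes into consecutive byte ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        {"Start": start, "End": min(start + chunk_size, object_size)}
        for start in range(0, object_size, chunk_size)
    ]


def select_kwargs(bucket, key, scan_range):
    """Build keyword arguments for the boto3 S3 client's
    select_object_content call for one byte range of a headerless CSV."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": "SELECT * FROM S3Object s",
        # Assumption: the file has no header row.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"}},
        "OutputSerialization": {"CSV": {}},
        "ScanRange": scan_range,
    }


def read_chunk(bucket, key, scan_range):
    """Fetch one chunk as text (needs boto3 and credentials)."""
    import boto3

    s3 = boto3.client("s3")
    response = s3.select_object_content(**select_kwargs(bucket, key, scan_range))
    parts = [
        event["Records"]["Payload"].decode("utf-8")
        for event in response["Payload"]
        if "Records" in event
    ]
    return "".join(parts)
```

A pre-processing state could call `scan_ranges` with the object size and hand one range to each Lambda invocation. Because each Lambda only sees its own chunk, duplicates that span chunks would still need to be handled when loading into Redshift.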

Spark and continuous processing of data

I am new to Spark but I am reading up as much as I can. I have a small project where multiple data files (in gzip) are going to land continuously in an S3 bucket every hour. I need to be able to open and read these gzip files and consolidate/aggregate data across them, so I need to look at them holistically. What techniques and tools from AWS can be used? Do I create interim files in an S3 folder, hold DataFrames in memory, or use some database and blow away the data after each hour? I am looking for ideas more than a piece of code.
So far, in AWS, I have written a PySpark script that reads one file at a time and creates an output file in an output S3 folder. But that leaves me with multiple output files for each hour. It would be nice if there was one file for a given hour.
From a technology perspective, I am using an EMR cluster with just 1 master and 1 core node, PySpark and S3.
Thanks
You could use an AWS Glue ETL job written in PySpark. Glue jobs can be scheduled to run every hour.
I suggest reading the entire dataset, performing your operations, and then moving the data to another long-term storage location.
If you are working on a few GB of data, a PySpark job should complete within minutes. There's no need to keep an EMR cluster running for an hour if you'll only need it for 10 minutes. Consider using short-lived EMR clusters or a Glue ETL job.
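A minimal PySpark sketch of that idea, reading every gzip file for one hour and writing a single consolidated file, might look like the following. It assumes the files are CSV with a header row and are laid out under a hypothetical `yyyy/mm/dd/hh` prefix; neither assumption comes from the question, so adjust the reader and the path pattern to the real data.

```python
from datetime import datetime


def hour_glob(bucket, prefix, hour):
    """S3 glob matching all gzip files for one hour.

    Assumes a hypothetical yyyy/mm/dd/hh key layout.
    """
    return f"s3://{bucket}/{prefix}/{hour:%Y/%m/%d/%H}/*.gz"


def consolidate_hour(spark, bucket, in_prefix, out_prefix, hour):
    """Read all of an hour's files and write them back as one file.

    `spark` is an existing SparkSession (for example, the one provided
    by an EMR step or a Glue job).
    """
    # Spark decompresses .gz files transparently based on the extension.
    df = spark.read.csv(hour_glob(bucket, in_prefix, hour), header=True)

    # ... aggregate or consolidate across the whole hour here ...

    out_path = f"s3://{bucket}/{out_prefix}/{hour:%Y/%m/%d/%H}/"
    # coalesce(1) forces a single output part file for the hour.
    df.coalesce(1).write.mode("overwrite").csv(
        out_path, header=True, compression="gzip"
    )
    return out_path
```

`coalesce(1)` funnels the whole hour through one task, which is reasonable for a few GB but would become a bottleneck for much larger volumes.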
Athena supports querying GZipped data. If you're performing some sort of analysis, maybe executing an Athena query with a time range will work?
You could also use a CTAS (Create Table As Select) statement in Athena to copy data to a new location and perform basic ETL on it at the same time.
What exactly does your PySpark code do?
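For the CTAS suggestion above, a sketch of building such a statement for one hour of data could look like this. The table names, the `event_time` column, the Parquet output format, and the S3 locations are all hypothetical; it also assumes a table over the raw gzip files has already been defined in Athena.

```python
from datetime import datetime


def hourly_ctas(source_table, target_table, output_location, start, end,
                ts_column="event_time"):
    """Build an Athena CTAS statement copying rows in [start, end)
    from source_table into a new table stored at output_location."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = 'PARQUET', external_location = '{output_location}') AS\n"
        f"SELECT * FROM {source_table}\n"
        f"WHERE {ts_column} >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}'\n"
        f"  AND {ts_column} < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}'"
    )


def run_query(query, results_location):
    """Submit the query to Athena (needs boto3 and credentials)."""
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": results_location},
    )
    return response["QueryExecutionId"]
```

Replacing `SELECT *` with an aggregating query would let the same statement do the hourly consolidation as well as the copy.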