S3DistCp - Split source into multiple jobs - HDFS

I have to copy data from S3 to the HDFS of an EMR cluster.
I'm trying to reduce the execution time of my job.
Looking at the logs, the map input of the job is 1_000_000 files.
I need to split this into 100_00 files per job.
Is it possible to define this behavior in the command used to add the step in EMR?
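I'm not aware of an s3-dist-cp flag that caps the number of input files per invocation, but one pattern is to submit several steps and restrict each one to a subset of the source keys with --srcPattern. A rough boto3 sketch, assuming the keys can be split by a name prefix (the bucket, prefixes, and cluster ID below are placeholders, not values from the question):

# Hypothetical sketch: submit one s3-dist-cp step per key prefix so each
# step only copies a subset of the source files. Bucket, prefixes, and the
# cluster ID are placeholders.
import boto3

emr = boto3.client("emr")

# Suppose the source keys are spread across prefixes part-00 .. part-09.
prefixes = ["part-0{}".format(i) for i in range(10)]

steps = [
    {
        "Name": "s3-dist-cp {}".format(prefix),
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/input/",
                "--dest", "hdfs:///input/",
                # Only keys matching this regex are copied by this step.
                "--srcPattern", ".*{}.*".format(prefix),
            ],
        },
    }
    for prefix in prefixes
]

emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=steps)

Each step then runs with its own, smaller map input, at the cost of scheduling several steps instead of one.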

Related

Copy records (row by row) from S3 to SQS - Using AWS Batch

I have set up a Glue job which runs concurrently to process input files and writes them to S3. The Glue job runs periodically (it is not a one-time job).
The output in S3 is in the form of CSV files. The requirement is to copy all of those records into AWS SQS. Assume there might be hundreds of files, each containing up to a million records.
Initially I was planning to have a Lambda event send the records row by row. However, from the docs I see a 15-minute time limit for Lambda: https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/#:~:text=You%20can%20now%20configure%20your,Lambda%20function%20was%205%20minutes.
Would it be better to use AWS Batch for copying the records from S3 to SQS? I believe AWS Batch has the capability to scale the process when needed and also to perform the task in parallel.
I want to know whether AWS Batch is the right pick or whether I am overcomplicating the design.
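Whatever service ends up running it (Lambda, Batch, or the Glue job itself), the core of the work is reading a CSV object and pushing its rows to SQS in batches of up to 10 messages, which is the per-call limit of send_message_batch. A rough boto3 sketch for a single object (bucket, key, and queue URL are placeholders):

# Hypothetical sketch: read one CSV object from S3 and push its rows to SQS
# in batches of up to 10 messages. Bucket, key, and queue URL are placeholders;
# a very large file would need streaming rather than a single read().
import csv
import io
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

body = s3.get_object(Bucket="my-bucket", Key="glue-output/part-0000.csv")["Body"]
rows = csv.reader(io.StringIO(body.read().decode("utf-8")))
next(rows, None)  # skip the header row, assuming the CSV has one

batch = []
for row in rows:
    batch.append({"Id": str(len(batch)), "MessageBody": ",".join(row)})
    if len(batch) == 10:
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
        batch = []
if batch:
    sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)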

Performance issue with AWS EMR S3DistCp

I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket into another path in the same bucket. It is working, but it is extremely slow (around 600 MB transferred after more than 20 minutes).
Here is my EMR configuration:
1 master m5.xlarge
3 cores m5.xlarge
release label 5.29.0
The command:
s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
Am I missing something? I have read that S3DistCp can transfer a lot of files in a blink, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.
Thank you.
Here are the recommendations:
Use R-type instances. They provide more memory compared to M-type instances.
Use coalesce to merge the files at the source, since you have many small files (see the sketch after this list).
Check the number of mapper tasks. The more tasks there are, the lower the performance.
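For the coalesce suggestion, here is a rough PySpark sketch (paths, the output prefix, and the partition count are placeholders; it assumes the inputs are the JSON.gz objects from the command above and that roughly 128 MB output files are wanted):

# Hypothetical sketch of "coalesce at the source": read the many small
# JSON.gz objects, reduce the partition count, and write fewer, larger
# files back to S3. Paths and numbers below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.json("s3://my-bucket/input/")

# ~3.4 GB of data / ~128 MB per output file ≈ 27 output files.
df.coalesce(27).write.mode("overwrite") \
    .option("compression", "gzip") \
    .json("s3://my-bucket/output-compacted/")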

Spark and continuous processing of data

I am new to Spark, but I am reading up as much as I can. I have a small project where multiple data files (gzipped) will continuously land in an S3 bucket every hour. I need to be able to open/read these gzip files and consolidate/aggregate data across them, so I need to look at them in a holistic fashion. What techniques and tools from Amazon AWS can be used? Do I create interim files in an S3 folder, hold DataFrames in memory, or use some database and blow away the data after each hour? I am looking for ideas more than a piece of code.
So far, in AWS, I have written a PySpark script that reads one file at a time and creates an output file back in an output S3 folder. But that leaves me with multiple output files for each hour. It would be nice if there were one file for a given hour.
From a technology perspective, I am using an EMR cluster with just 1 master and 1 core node, PySpark, and S3.
Thanks
You could use an AWS Glue ETL job written in PySpark. Glue jobs can be scheduled to run every hour.
I suggest reading the entire dataset, performing your operations, and then moving the data to another long-term storage location.
If you are working on a few GB of data, a PySpark job should complete within minutes. There's no need to keep an EMR cluster running for an hour if you'll only need it for 10 minutes. Consider using short-lived EMR clusters or a Glue ETL job.
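For the "one file per hour" part, a rough PySpark sketch along those lines (the bucket layout, hour prefix, and aggregation below are assumptions, not details from the question; coalesce(1) is only reasonable while the hourly volume stays small):

# Hypothetical sketch: read every gzip file that landed for one hour,
# aggregate across all of them, and write a single consolidated output.
# Paths, the hour-based prefix layout, and the group-by key are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-consolidation").getOrCreate()

# Spark reads .gz files transparently; the glob picks up all files for the hour.
df = spark.read.csv("s3://my-bucket/incoming/2020-01-01-13/*.gz", header=True)

hourly = df.groupBy("some_key").agg(F.count("*").alias("events"))

# coalesce(1) forces a single output file, which is fine for a few GB per hour.
hourly.coalesce(1).write.mode("overwrite") \
    .csv("s3://my-bucket/consolidated/2020-01-01-13/", header=True)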
Athena supports querying GZipped data. If you're performing some sort of analysis, maybe executing an Athena query with a time range will work?
You could also use a CTAS (Create Table As Select) statement in Athena to copy data to a new location and perform basic ETL on it at the same time.
What exactly does your PySpark code do?

Job chaining on Amazon EMR?

I need to do 2 chained M/R jobs, so I would need to use the output of the first job as input for the second.
How can I achieve this on EMR?
You can add multiple jobs as steps and use S3 to store the intermediate results. The second MapReduce job can read the intermediate results from S3 and continue the work.
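A rough boto3 sketch of that pattern (cluster ID, jar locations, and S3 paths are placeholders): the first step writes its output to an intermediate S3 prefix, the second step reads from it, and steps on a cluster run in the order they are listed.

# Hypothetical sketch: chain two MapReduce jobs as EMR steps, passing data
# through an intermediate S3 prefix. Cluster ID, jars, and paths are placeholders.
import boto3

emr = boto3.client("emr")

intermediate = "s3://my-bucket/intermediate/"

steps = [
    {
        "Name": "job-1",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "s3://my-bucket/jars/job1.jar",
            "Args": ["s3://my-bucket/input/", intermediate],
        },
    },
    {
        "Name": "job-2",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # Reads the intermediate results written by job-1.
            "Jar": "s3://my-bucket/jars/job2.jar",
            "Args": [intermediate, "s3://my-bucket/final-output/"],
        },
    },
]

emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=steps)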

What controls the number of partitions when reading Parquet files?

My setup:
Two Spark clusters. One on EC2 and one on Amazon EMR. Both with Spark 1.3.1.
The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.
The code:
Read a folder containing 12 Parquet files and count the number of partitions
val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length
Observations:
On EC2 this code gives me 12 partitions (one per file, makes sense).
On EMR this code gives me 138 (!) partitions.
Question:
What controls the number of partitions when reading Parquet files?
I read the exact same folder on S3, with the exact same Spark release. This leads me to believe that there might be some configuration settings which control how partitioning happens. Does anyone have more info on this?
Insights would be greatly appreciated.
Thanks.
UPDATE:
It seems that the many partitions are created by EMR's S3 file system implementation (com.amazon.ws.emr.hadoop.fs.EmrFileSystem).
When removing
<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>
from core-site.xml (thereby reverting to Hadoop's S3 filesystem), I end up with 12 partitions.
When running with EmrFileSystem, it seems that the number of partitions can be controlled with:
<property><name>fs.s3n.block.size</name><value>xxx</value></property>
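One way to experiment without editing core-site.xml is to set the same property on the Hadoop configuration at runtime. A rough PySpark 1.3-style sketch (the block size value is just an example, sc._jsc is an internal handle rather than a public API, and whether the setting takes effect may depend on when the filesystem is first initialized):

# Hypothetical sketch: set the S3 block size used when computing splits at
# runtime instead of in core-site.xml. The 128 MB value is an example only.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-partitions")
sqlContext = SQLContext(sc)

# Larger "block" size -> fewer, larger splits when the filesystem reports sizes.
sc._jsc.hadoopConfiguration().set("fs.s3n.block.size", str(128 * 1024 * 1024))

logs = sqlContext.parquetFile("s3n://mylogs/")
print(logs.rdd.getNumPartitions())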
Could there be a cleaner way of controlling the # of partitions when using EmrFileSystem?