What controls the number of partitions when reading Parquet files?

My setup:
Two Spark clusters. One on EC2 and one on Amazon EMR. Both with Spark 1.3.1.
The EMR cluster was installed with emr-bootstrap-actions. The EC2 cluster was installed with Spark's default EC2 scripts.
The code:
Read a folder containing 12 Parquet files and count the number of partitions
val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length
Observations:
On EC2 this code gives me 12 partitions (one per file, makes sense).
On EMR this code gives me 138 (!) partitions.
Question:
What controls the number of partitions when reading Parquet files?
I read the exact same folder on S3, with the exact same Spark release. This leads me to believe that there might be some configuration settings which control how partitioning happens. Does anyone have more info on this?
Insights would be greatly appreciated.
Thanks.
UPDATE:
It seems that the many partitions are created by EMR's S3 file system implementation (com.amazon.ws.emr.hadoop.fs.EmrFileSystem).
When removing
<property><name>fs.s3n.impl</name><value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value></property>
from core-site.xml (thereby reverting to Hadoop's S3 filesystem), I end up with 12 partitions.
When running with EmrFileSystem, it seems that the number of partitions can be controlled with:
<property><name>fs.s3n.block.size</name><value>xxx</value></property>
Could there be a cleaner way of controlling the # of partitions when using EmrFileSystem?
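As a rough illustration of why a block-size setting can change the partition count: if the file system reports a smaller block size, each file is cut into more input splits. The sketch below is a simplified model (one split per block-sized chunk of each file), not the exact logic of EmrFileSystem or of the Parquet input format. The file sizes and block sizes are hypothetical.

```python
import math

MB = 1024 * 1024

def estimate_splits(file_sizes, block_size):
    """Simplified model: each file yields ceil(size / block_size) splits,
    and never fewer than one split per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

# Hypothetical folder: 12 files of 350 MB each.
files = [350 * MB] * 12

print(estimate_splits(files, 32 * MB))   # 132: small blocks, many splits per file
print(estimate_splits(files, 512 * MB))  # 12: block larger than each file, one split per file
```

Under this model, raising the block size above the largest file size brings the count back to one partition per file.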

Related

AWS Glue Outputting Empty Files on Sequential Runs

I am trying to automate an ETL pipeline that exports data from AWS RDS MySQL to AWS S3, and I am currently using AWS Glue for the job. When I do an initial load from RDS to S3, it captures all the data in the file, which is exactly what I want. However, when I add new data to the MySQL database and run the Glue job again, I get an empty file instead of the added rows. Any help would be much appreciated.
The AWS Glue documentation describes the bookmarking rules for JDBC sources. The important point to remember for JDBC sources is that the bookmark key values have to be in increasing or decreasing order, and Glue only processes new data since the last checkpoint.
Typically, either an autogenerated sequence number or a datetime column is used as the key for bookmarking.
For anybody who is still struggling with this (it drove me mad, because I thought my Spark code was wrong): disable bookmarking in the job details.
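To make the checkpoint behaviour concrete, here is a minimal conceptual sketch of a bookmark on an increasing key. It is not Glue's implementation; the function names and the `id` column are hypothetical. It shows how a run can legitimately return nothing when the newly inserted rows do not have keys beyond the last checkpoint.

```python
def rows_after_bookmark(rows, key, last_bookmark):
    """Return only rows whose bookmark key is greater than the checkpoint."""
    if last_bookmark is None:
        return list(rows)  # first run: everything is new
    return [r for r in rows if r[key] > last_bookmark]

def run_job(rows, key, last_bookmark):
    """Process new rows and advance the checkpoint to the largest key seen."""
    new_rows = rows_after_bookmark(rows, key, last_bookmark)
    if new_rows:
        last_bookmark = max(r[key] for r in new_rows)
    return new_rows, last_bookmark

table = [{"id": 1}, {"id": 2}, {"id": 3}]
first, bookmark = run_job(table, "id", None)       # initial load: all 3 rows

table.append({"id": 4})
second, bookmark = run_job(table, "id", bookmark)  # picks up only id 4

table.append({"id": 2})                            # key not beyond the checkpoint
third, bookmark = run_job(table, "id", bookmark)   # empty result
print(len(first), len(second), len(third))
```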

Performance issue with AWS EMR S3DistCp

I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20 minutes).
Here is my EMR configuration:
1 master m5.xlarge
3 cores m5.xlarge
release label 5.29.0
The command:
s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
Am I missing something? I have read that S3DistCp can transfer a lot of files in a blink, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.
Thank you.
Here are my recommendations:
- Use R-type instances. They provide more memory than M-type instances.
- Use coalesce to merge the source files, since you have many small files.
- Check the number of mapper tasks. The more tasks there are, the lower the performance.
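To see what the --groupBy pattern in the command above does to the file names, the following sketch applies the same regular expression locally with Python's `re` module. It assumes that files whose captured groups are identical end up in the same output group and that the pattern is matched against the whole path; the object keys are hypothetical.

```python
import re
from collections import defaultdict

# Pattern copied from the s3-dist-cp command in the question.
GROUP_BY = r".*input/(entry).*(.json.gz)"

def group_keys(keys, pattern=GROUP_BY):
    """Group object keys by the concatenation of their captured groups."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.fullmatch(key)
        if match is None:
            continue  # keys that do not match are left out of every group
        groups["".join(match.groups())].append(key)
    return dict(groups)

# Hypothetical keys for illustration only.
keys = [
    "my-bucket/input/entry-0001.json.gz",
    "my-bucket/input/entry-0002.json.gz",
    "my-bucket/input/other.json.gz",
]
for name, members in group_keys(keys).items():
    print(name, len(members))
```

With this pattern every matching file captures the same two strings, so all of them fall into a single group, which --targetSize then caps per output file.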

Spark and continuous processing of data

I am new to Spark but I am reading up as much as I can. I have a small project where multiple data files (in gzip) are going to continuously land in an S3 bucket every hour. I need to be able to open and read these gzip files and consolidate or aggregate data across them, so I need to look at them holistically. What techniques and tools from AWS can be used? Do I create interim files in an S3 folder, hold DataFrames in memory, or use some database and blow away the data after each hour? I am looking for ideas more than a piece of code.
So far, in AWS, I have written a PySpark script that reads one file at a time and creates an output file in an output S3 folder. But that leaves me with multiple output files for each hour. It would be nice if there were one file for a given hour.
From a technology perspective, I am using an EMR cluster with just 1 master and 1 core node, PySpark, and S3.
Thanks
You could use an AWS Glue ETL job written in PySpark. Glue jobs can be scheduled to run every hour.
I suggest reading the entire dataset, performing your operations, and then moving the data to another long-term storage location.
If you are working on a few GB of data, a PySpark job should complete within minutes. There's no need to keep an EMR cluster running for an hour if you'll only need it for 10 minutes. Consider using short-lived EMR clusters or a Glue ETL job.
Athena supports querying GZipped data. If you're performing some sort of analysis, maybe executing an Athena query with a time range will work?
You could also use a CTAS (Create Table As Select) statement in Athena to copy data to a new location and perform basic ETL on it at the same time.
What exactly does your PySpark code do?
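As a small local illustration of the "read the entire dataset, then write once" idea, the sketch below merges several gzip payloads into a single gzip output using only the Python standard library. It is not Spark or Glue code; the function name and the sample data are hypothetical, and it assumes the files are line-oriented UTF-8 text.

```python
import gzip

def consolidate(gzip_blobs):
    """Decompress each gzip payload, collect all lines, and
    return one gzip payload containing every line."""
    lines = []
    for blob in gzip_blobs:
        text = gzip.decompress(blob).decode("utf-8")
        lines.extend(text.splitlines())
    return gzip.compress(("\n".join(lines) + "\n").encode("utf-8"))

# Hypothetical files that landed during the same hour.
hour_files = [
    gzip.compress(b"a,1\nb,2\n"),
    gzip.compress(b"c,3\n"),
]
merged = consolidate(hour_files)
print(gzip.decompress(merged).decode("utf-8"))
```

In a Spark job the same effect is usually achieved by reading the whole hourly prefix in one go and writing the result once, rather than looping over files.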

How to fetch data from EMR Spark session?

I'm designing some ETL data pipelines with Airflow. Data transformations are done by provisioning an AWS EMR Spark cluster and sending it some jobs. The jobs read data from S3, process it, and write it back to S3 using the date as a partition.
For my last step, I need to load the S3 data into a data warehouse using SQL scripts that are submitted to Redshift by a Python script. However, I cannot find a clean way to retrieve which data needs to be loaded, i.e. which date partitions were generated during the Spark transformations (this can only be known during the execution of the job, not beforehand).
Note that everything is orchestrated through a Python script using boto3 library that is run from a corporate VM that cannot be accessed from outside.
What would be the best way to fetch this information from EMR?
For now I'm thinking about different solutions:
- Write the information into a log file. Get the data from Spark master node using SSH through Python script
- Write the information to an S3 file
- Write the information to a database (RDS?)
I'm struggling to determine the pros and cons of these solutions. I'm also wondering what would be the best way to signal that the data transformations are finished and that the metadata can be fetched.
Thanks in advance
The most straightforward option is to use S3 as your temporary storage. After your Spark job finishes (writing its results to S3), add one more step that writes to an S3 bucket the information you need in the next step.
The RDS approach would be similar to the S3 one, but it requires more work: you need to set up RDS, maintain a schema, and implement the code that talks to RDS.
With a temporary S3 file, once EMR has terminated and Airflow runs the next step, use Boto to fetch that file (the S3 path depends on your requirements), and that is it.
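A minimal sketch of the temporary-file idea: the Spark step serialises the list of date partitions it produced, and the orchestrating script parses it back. The manifest layout and function names are hypothetical. The actual upload and download would go through boto3 (for example `put_object` and `get_object` on an S3 client); they are only indicated in comments here so that the example runs without AWS access.

```python
import json

def build_manifest(partitions):
    """Serialise the date partitions written by the job (deduplicated, sorted)."""
    return json.dumps({"partitions": sorted(set(partitions))})

def read_manifest(body):
    """Parse the manifest and return the list of date partitions to load."""
    return json.loads(body)["partitions"]

# On the EMR side, after the transformations finish:
body = build_manifest(["2020-01-02", "2020-01-01", "2020-01-02"])
# s3.put_object(Bucket=<hypothetical bucket>, Key=<hypothetical key>, Body=body)

# On the orchestrator side, before building the Redshift load statements:
# body = s3.get_object(Bucket=..., Key=...)["Body"].read()
print(read_manifest(body))
```

Writing the manifest as the very last step also lets its presence act as the "transformations are finished" signal.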

How to work with csv using AWS EMR?

I'm copying .csv files into an S3 bucket and I need to join them as I would in a relational database. Is it possible to do this? I hope for your great minds. =)
You can do this using AWS Data Pipeline and EMR.
EMR supports CSV (and TSV) as input types, meaning it understands the files and can treat them as a table of data rows.
You keep the files in an S3 bucket, and that bucket location is mapped to a Hive table. Once this is done, you can issue Hive queries (which can include joins) and do most of the things you need.
I will point you to the documentation from here on. You will need to spend some time reading and understanding the entire setup, but once mastered it is very handy.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-s3tos3hivecsv.html
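For readers unsure what "joining CSV files like in a relational database" amounts to, here is a small local sketch of an inner join on a shared column, using only Python's `csv` module. It illustrates the join semantics only and is not the Hive or Data Pipeline setup described above; the column names and data are hypothetical.

```python
import csv
import io

def inner_join(left_csv, right_csv, key):
    """Inner-join two CSV texts on a shared column, like SQL's JOIN ... ON."""
    left_rows = list(csv.DictReader(io.StringIO(left_csv)))
    # Index the right-hand rows by join key so each lookup is constant time.
    right_by_key = {}
    for row in csv.DictReader(io.StringIO(right_csv)):
        right_by_key.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in right_by_key.get(left[key], []):
            joined.append({**right, **left})
    return joined

users = "id,name\n1,Ann\n2,Bob\n"
orders = "id,item\n1,book\n3,pen\n"
print(inner_join(users, orders, "id"))
```

Only rows whose key appears in both files survive, which is the behaviour a Hive `JOIN` over two CSV-backed tables would give.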