How to properly (scale-ably) read many ORC files into spark - amazon-web-services

I'd like to use EMR and Spark to process an AWS S3 inventory report generated in ORC format that has many ORC files (hundreds) and the total size of all the data is around 250GB.
Is there a specific or best-practice way to read all the files into one Dataset? It seems like I can pass the sqlContext.read().orc() method a list of files, but I wasn't sure whether this would scale and parallelize properly if I pass it a list of hundreds of files.
What is the best practice way of doing this? Ultimately my goal is to have the contents of all the files in one dataset so that I can run a sql query on the dataset and then call .map on the results for subsequent processing on that result set.
Thanks in advance for your suggestions.

Just specify the folder where your ORC files are located. Spark will automatically detect all of them and load them into a single DataFrame.
sparkSession.read.orc("s3://bucket/path/to/folder/with/orc/files")
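For the use case in the question (one dataset, a SQL query, then a map over the results), a PySpark version might look like the following sketch. The helper name and the view name "inventory" are hypothetical; the function only assumes it is handed an existing SparkSession.

```python
def query_orc_folder(spark, folder, sql, view_name="inventory"):
    """Load every ORC file under `folder` into one DataFrame and query it.

    `spark` is an existing SparkSession. `view_name` is a hypothetical
    name for the temporary view; `sql` should refer to that same name.
    """
    # One call reads all ORC files under the folder; Spark distributes
    # the files across executors, so no explicit file list is needed.
    df = spark.read.orc(folder)
    # Expose the DataFrame to Spark SQL under a temporary view name.
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)
```

Calling result.rdd.map(lambda row: row.asDict()) on the returned DataFrame would give the per-row processing step mentioned in the question.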
You shouldn't need to worry much about scalability, since Spark handles it using the default config that EMR provides for the selected EC2 instance type. You can experiment with the number of slave nodes and their instance type, though.
Besides that, I would suggest setting maximizeResourceAllocation to true so that executors use the maximum resources available on each slave node.
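On EMR, maximizeResourceAllocation is set through the "spark" configuration classification when the cluster is created. A minimal sketch that builds that configuration in Python (the function name is hypothetical) could be:

```python
import json


def emr_spark_configuration(maximize=True):
    """Build the EMR configuration list that sets maximizeResourceAllocation."""
    return [
        {
            "Classification": "spark",
            "Properties": {
                # EMR expects the value as a string, not a boolean.
                "maximizeResourceAllocation": "true" if maximize else "false"
            },
        }
    ]


# Print the JSON form that can be supplied at cluster creation time.
print(json.dumps(emr_spark_configuration(), indent=2))
```

The resulting JSON can be supplied when creating the cluster, for example through the --configurations option of aws emr create-cluster or the Configurations parameter of boto3's run_job_flow.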

Related

what does it mean "partitioned data" - S3

I want to use Netflix's outputCommitter (Using Spark with Amazon EMR).
In the README there are 2 options:
S3DirectoryOutputCommitter - for writing unpartitioned data to S3 with conflict resolution.
S3PartitionedOutputCommitter - for writing partitioned data to S3 with conflict resolution.
I tried to understand the differences but unsuccessfully. Can someone explain what is "partitioned data" in s3?
According to the Hadoop docs, "This committer an extension of the “Directory” committer which has a special conflict resolution policy designed to support operations which insert new data into a directory tree structured using Hive’s partitioning strategy: different levels of the tree represent different columns."
Search the Hadoop docs for the full details.
Be aware that the EMR committers are not the ASF S3A ones, so they take different config options and have their own docs. But since their work is a reimplementation of the Netflix work, they should do the same thing here.
I'm not familiar with outputCommitter, but partitioned data in Amazon S3 normally refers to splitting files among directories to reduce the amount of data that needs to be read from disk.
For example:
/data/month=1/
/data/month=2/
/data/month=3/
...
If a Hive-type query is run against the data with a clause like WHERE month=1, then it would only need to look in the month=1/ subdirectory, thereby saving 2/3rds of disk access.
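The pruning idea can be illustrated with a small pure-Python sketch. It only mimics how a query engine narrows down Hive-style column=value directories before reading any data; the function name is hypothetical.

```python
def prune_partitions(paths, column, value):
    """Return only the paths that contain a Hive-style `column=value` segment."""
    wanted = f"{column}={value}"
    # Compare whole path segments so that month=1 does not match month=10.
    return [p for p in paths if wanted in p.strip("/").split("/")]


paths = ["/data/month=1/", "/data/month=2/", "/data/month=3/"]
selected = prune_partitions(paths, "month", 1)
print(selected)  # ['/data/month=1/']
```

In Spark, a directory layout of this kind is what df.write.partitionBy("month") produces.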

Apache Spark - Write Parquet Files to S3 with both Dynamic Partition Overwrite and S3 Committer

I'm currently building an application with Apache Spark (pyspark), and I have the following use case:
Run pyspark with local mode (using spark-submit local[*]).
Write the results of my spark job to S3 in the form of partitioned Parquet files.
Ensure that each job overwrites the particular partition it is writing to, in order to ensure idempotent jobs.
Ensure that spark-staging files are written to local disk before being committed to S3, as staging in S3, and then committing via a rename operation, is very expensive.
For various internal reasons, all four of the above bullet points are non-negotiable.
I have everything but the last bullet point working. I'm running a pyspark application, and writing to S3 (actually an on-prem Ceph instance), ensuring that spark.sql.sources.partitionOverwriteMode is set to dynamic.
However, this means that my spark-staging files are being staged in S3, and then committed by using a delete-and-rename operation, which is very expensive.
I've tried using the Spark Directory Committer in order to stage files on my local disk. This works great unless spark.sql.sources.partitionOverwriteMode is set to dynamic.
After digging through the source code, it looks like the PathOutputCommitter does not support Dynamic Partition Overwriting.
At this point, I'm stuck. I want to be able to write my staging files to local disk, and then commit the results to S3. However, I also need to be able to dynamically overwrite a single partition without overwriting the entire Parquet table.
For reference, I'm running pyspark=3.1.2, and using the following spark-submit command:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
I get the following error when spark.sql.sources.partitionOverwriteMode is set to dynamic:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
My spark config is as follows:
self.spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "magic")
self.spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
)
self.spark.conf.set(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
)
self.spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
I'm afraid the S3A committers don't support the dynamic partition overwrite feature. That feature actually works by doing lots of renaming, so it misses the entire point of zero-rename committers.
The "partitioned" committer was written by Netflix for their use case of updating/overwriting single partitions in an active table. It should work for you, as it is the same use case.
Consult the documentation.
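As a rough sketch of what switching to the partitioned committer could look like, the settings below reuse the commit protocol classes from the question and swap the committer name. The helper names are hypothetical, and the choice of "replace" as the conflict mode and of leaving partitionOverwriteMode at "static" are assumptions to verify against the committer documentation.

```python
VALID_CONFLICT_MODES = ("fail", "append", "replace")


def partitioned_committer_settings(conflict_mode="replace"):
    """Spark settings that select the S3A "partitioned" staging committer."""
    if conflict_mode not in VALID_CONFLICT_MODES:
        raise ValueError(f"unknown conflict mode: {conflict_mode!r}")
    return {
        # Pick the partitioned staging committer instead of "magic".
        "spark.hadoop.fs.s3a.committer.name": "partitioned",
        # What to do when a partition being written already exists.
        "spark.hadoop.fs.s3a.committer.staging.conflict-mode": conflict_mode,
        # Same commit protocol bindings as in the question.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
        # Assumption: keep Spark's own dynamic overwrite off, since
        # PathOutputCommitProtocol rejects it in the reported error.
        "spark.sql.sources.partitionOverwriteMode": "static",
    }


def apply_settings(builder, settings):
    """Apply each key/value pair to a SparkSession.builder-like object."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, apply_settings(SparkSession.builder, partitioned_committer_settings()).getOrCreate() would create a session with these values set before startup.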

parquet files protection when appending

I have a problem when I try to do ETL on a large bunch of files on AWS.
The goal is to convert JSON files to Parquet files. Due to the size of the files, I have to do it batch by batch. Let's say I need to do it in 15 batches, i.e. 15 separate runs, to be able to convert all of them.
I am using write.mode("append").format("parquet") to write into parquet files in each glue pyspark job to do that.
My problem is that if one job fails for some reason, I don't know what to do: some partitions are updated while some are not, and some files in the batch have been processed while some have not. For example, if my 9th job failed, I am kind of stuck. I don't want to delete all the Parquet files and start over, but I also don't want to just re-run that 9th job and cause duplicates.
Is there a way to protect parquet files to only append new files into them if the whole job is successful?
Thank you!
Based on your comment and a similar experience I had with this problem, I believe this happens because of S3 eventual consistency. Have a look at Amazon S3 Data Consistency Model here https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html.
We found that using partitioned staging s3a committer with the conflict resolution mode replace made our jobs not fail.
Try the following parameters with your spark jobs:
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.name partitioned
Also have a read about the committers here:
https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
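For a plain spark-submit invocation, the two parameters above can be passed as --conf flags. The sketch below only renders those flags from a dictionary; the function name is hypothetical, and how a Glue job accepts Spark settings is not covered here.

```python
import shlex


def to_spark_submit_args(settings):
    """Turn a dict of Spark settings into a list of --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


settings = {
    "spark.hadoop.fs.s3a.committer.staging.conflict-mode": "replace",
    "spark.hadoop.fs.s3a.committer.name": "partitioned",
}
# shlex.join quotes each argument safely for a shell command line.
print(shlex.join(to_spark_submit_args(settings)))
```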
Hope this helps!
P.S. If all else fails and your files are not too big, you can use a hacky solution where you save your Parquet files locally and upload them when your Spark tasks are complete, but I personally do not recommend it.

Processing unpartitioned data with AWS Glue using bookmarking

I have data being written from Kafka to a directory in s3 with a structure like this:
s3://bucket/topics/topic1/files1...N
s3://bucket/topics/topic2/files1...N
.
.
s3://bucket/topics/topicN/files1...N
There is already a lot of data in this bucket and I want to use AWS Glue to transform it into parquet and partition it, but there is way too much data to do it all at once. I was looking into bookmarking and it seems like you can't use it to only read the most recent data or to process data in chunks. Is there a recommended way of processing data like this so that bookmarking will work for when new data comes in?
Also, does bookmarking require Spark or Glue to scan my entire dataset each time I run a job to figure out which files are newer than the last run's max_last_modified timestamp? That seems pretty inefficient, especially as the data in the source bucket continues to grow.
I have learned that Glue wants all similar files (files with same structure and purpose) to be under one folder, with optional subfolders.
s3://my-bucket/report-type-a/yyyy/mm/dd/file1.txt
s3://my-bucket/report-type-a/yyyy/mm/dd/file2.txt
...
s3://my-bucket/report-type-b/yyyy/mm/dd/file23.txt
All of the files under report-type-a folder must be of the same format. Put a different report like report-type-b in a different folder.
You might try putting just a few of your input files in the proper place, running your ETL job, placing more files in the bucket, running again, etc.
I tried this by getting the current files working (one file per day), then back-filling historical files. Note, however, that this did not work completely. Files were processed fine in s3://my-bucket/report-type/2019/07/report_20190722.gzip, but when I tried to add past files to s3://my-bucket/report-type/2019/05/report_20190510.gzip, Glue did not "see" or process the file in the older folder.
However, if I moved the old report to the current partition, it worked: s3://my-bucket/report-type/2019/07/report_20190510.gzip .
AWS Glue bookmarking works only with a select few formats, and only when the data is read using the glueContext.create_dynamic_frame.from_options function. In addition, job.init() and job.commit() must both be present in the Glue script.
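A minimal sketch of a Glue script with those pieces in place is shown below. The JSON input format, the function names, and the transformation_ctx labels are assumptions for illustration; the path layout follows the one in the question. Bookmarks also need to be enabled on the job itself (the --job-bookmark-option job-bookmark-enable job parameter), and Glue keys bookmark state on transformation_ctx.

```python
import sys


def s3_source_options(bucket, topic):
    """Connection options for one topic prefix from the question's layout."""
    return {"paths": [f"s3://{bucket}/topics/{topic}/"], "recurse": True}


def run_job(bucket, topic, output_path):
    # Glue libraries exist only inside the Glue runtime, so import them here.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # loads the bookmark state for this job

    # Assumption: the source files are JSON.
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=s3_source_options(bucket, topic),
        format="json",
        transformation_ctx=f"read_{topic}",  # bookmark state is keyed on this
    )
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
        transformation_ctx=f"write_{topic}",
    )
    job.commit()  # persists the bookmark so the next run skips these files
```

Inside a Glue job, run_job would be called once per topic prefix with the desired output location.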

How exactly does Spark on EMR read from S3?

Just a few simple questions on the actual mechanism behind reading a file on s3 into an EMR cluster with Spark:
Does spark.read.format("com.databricks.spark.csv").load("s3://my/dataset/").where($"state" === "WA") communicate the whole dataset into the EMR cluster's local HDFS and then perform the filter after? Or does it filter records when bringing the dataset into the cluster? Or does it do neither? If this is the case, what's actually happening?
The official documentation lacks an explanation of what's going on (or if it does have an explanation, I cannot find it). Can someone explain, or link to a resource with such an explanation?
I can't speak for the closed-source AWS one, but the ASF s3a: connector does its work in S3AInputStream.
Reading data is via HTTPS, which has slow startup time, and if you need to stop the download before the GET is finished, forces you to abort the TCP stream and create a new one.
To keep this cost down the code has features like
Lazy seek: when you do a seek(), it updates its internal pointer but doesn't issue a new GET until you actually do a read.
Choosing whether to abort() or read to the end of a GET, based on how much is left.
Has 3 IO modes:
"sequential", GET content range is from (pos, EOF). Best bandwidth, worst performance on seek. For: CSV, .gz, ...
"random": small GETs, min(block-size, length(read)). Best for columnar data (ORC, Parquet) compressed in a seekable format (snappy)
"adaptive" (new last week, based on some work from Microsoft on the Azure WASB connector). Starts off sequential; as soon as you do a backwards seek, it switches to random IO.
The code is all there, and improvements are welcome. The current perf work (especially random IO) is based on TPC-DS benchmarking of ORC data on Hive, BTW.
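The lazy seek behaviour described above can be modelled with a toy class. This is not the S3AInputStream code, only a sketch of the idea: seek() moves a pointer, and a simulated GET is issued only when a read happens at an offset the open stream is not already positioned at.

```python
class LazySeekStream:
    """Toy model of a lazily seeking reader over a remote object."""

    def __init__(self, data):
        self._data = data          # stands in for the S3 object
        self._pos = 0              # position the caller asked for
        self._stream_pos = None    # position of the open GET, if any
        self.get_count = 0         # number of simulated GET requests

    def seek(self, pos):
        # Only move the pointer; no request is issued here.
        self._pos = pos

    def read(self, n):
        # Issue a new GET only if no stream is open at the wanted offset.
        if self._stream_pos != self._pos:
            self.get_count += 1
            self._stream_pos = self._pos
        chunk = self._data[self._pos:self._pos + n]
        self._pos += len(chunk)
        self._stream_pos = self._pos
        return chunk


stream = LazySeekStream(b"0123456789")
stream.seek(2)           # no GET
stream.seek(7)           # still no GET
first = stream.read(2)   # first GET, returns b"78"
second = stream.read(1)  # continues the same GET, returns b"9"
print(first, second, stream.get_count)  # b'78' b'9' 1
```

Two seeks followed by two contiguous reads cost a single simulated GET, which is the saving lazy seek is after.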
Assuming you are reading CSV and filtering there, it'll be reading the entire CSV file and then filtering. This is horribly inefficient for large files. It is best to import the data into a columnar format and use predicate pushdown, so that the layers below can seek around the file when filtering and reading columns.
Loading data from S3 (s3://-) usually goes via EMRFS in EMR.
EMRFS accesses S3 directly (not via HDFS).
When Spark loads data from S3, it is stored as a Dataset in the cluster according to the StorageLevel (memory or disk).
Finally, Spark filters the loaded data.
When you specify files located on S3 they are read into the cluster. The processing happens on the cluster nodes.
However, this may be changing with S3 Select, which is now in preview.