Spark Dataframe loading 500k files on EMR - python-2.7

I am running a PySpark job on EMR (5.5.1) with Spark 2.1.0, Hadoop 2.7.3, Hive 2.1.1, Sqoop 1.4.6 and Ganglia 3.7.2, which loads data from S3. The input files are spread across multiple buckets, so I have a function that uses boto to traverse them and filter the files according to a pattern.
Cluster Size: Master => r4.xlarge , Workers => 3 x r4.4xlarge
Problem:
The function getFilePaths returns a list of s3 paths which is directly fed to spark dataframe load method.
Using Dataframe
file_list = getFilePaths() # ['s3://some_bucket/log.json.gz','s3://some_bucket/log2.json.gz']
schema = getSchema() # for mapping to the json files
df = sparkSession.read.format('json').load(file_list, schema=schema)
Using RDD
master_rdd = sparkSession.sparkContext.union(
map(lambda file: sparkSession.sparkContext.textFile(file), file_list)
)
df = sparkSession.createDataFrame(master_rdd, schema=schema)
The file_list can be huge (up to 500k files) because of the amount of data. Computing these paths only takes 5-20 minutes, but when I try to load them as a DataFrame, the Spark UI stays inactive for hours, i.e. nothing is processed at all. The inactive period is above 9 hours for 500k files and around 1.5 hours for 100k files.
The Ganglia metrics show that only the driver is running/processing while the workers are idle. No logs are generated until the Spark job has finished, and I have not had any success with 500k files.
I have tried s3, s3n connectors but no success.
Question:
What is the root cause of this delay?
How can I debug it properly?

In general, Spark/Hadoop prefer to have large files they can split instead of huge numbers of small files. One approach you might try though would be to parallelize your file list and then load the data in a map call.
I don't have the resources right now to test this out, but it should be something similar to this:
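import json
import boto3

file_list = getFilePaths()
schema = getSchema() # for mapping to the json files
paths_rdd = sc.parallelize(file_list)

def get_data(path):
    # path looks like 's3://some_bucket/log.json'; split it into bucket and key
    bucket, key = path.replace('s3://', '', 1).split('/', 1)
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, key)
    data = obj.get()['Body'].read().decode('utf-8')
    # skip blank lines such as the one after a trailing newline
    return [json.loads(r) for r in data.split('\n') if r.strip()]

rows_rdd = paths_rdd.flatMap(get_data)
df = spark.createDataFrame(rows_rdd, schema=schema)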
You could also make this a little more efficient by using mapPartitions instead, so you don't need to recreate the s3 object for every file.
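A minimal sketch of the per-partition variant, assuming each file holds one JSON document per line. The names make_partition_reader and client_factory are hypothetical, introduced only for illustration; the client is expected to expose a boto3-style get_object(Bucket=..., Key=...) call.

```python
import json

def make_partition_reader(client_factory):
    """Build a function suitable for rdd.mapPartitions().

    client_factory is a zero-argument callable (an assumption of this sketch)
    returning an object with a boto3-style get_object(Bucket=..., Key=...).
    """
    def read_partition(paths):
        client = client_factory()  # created once per partition, not per file
        for path in paths:
            # 's3://bucket/some/key' -> ('bucket', 'some/key')
            bucket, key = path.replace("s3://", "", 1).split("/", 1)
            body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
            for line in body.decode("utf-8").splitlines():
                if line.strip():  # ignore blank lines
                    yield json.loads(line)
    return read_partition
```

On the cluster this would be wired up roughly as rows_rdd = paths_rdd.mapPartitions(make_partition_reader(lambda: boto3.client('s3'))). Passing a factory rather than a client keeps the client from being created on the driver and shipped to the workers.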
EDIT 6/14/18:
With regards to handling the gzip data, you can decompress a stream of gzip data using python as detailed in this answer: https://stackoverflow.com/a/12572031/1461187 . Basically just pass in obj.get()['Body'].read() into the function defined in that answer.
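As a rough illustration of the gzip step, here is a sketch that uses only the standard library and assumes Python 3 and one JSON document per line. The function name parse_gzipped_json_lines is made up for this example; on Python 2.7, zlib.decompress(raw_bytes, 16 + zlib.MAX_WBITS) can stand in for gzip.decompress.

```python
import gzip
import json

def parse_gzipped_json_lines(raw_bytes):
    """Decompress a gzip payload and parse one JSON document per line."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Stand-in for the bytes returned by obj.get()['Body'].read()
sample = gzip.compress(b'{"id": 1}\n{"id": 2}\n')
records = parse_gzipped_json_lines(sample)
```

This reads the whole object into memory, which should be acceptable for files of a few tens of MB but not for very large ones.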

There are two performance issues surfacing here:
1. Reading the files: gzip files can't be split to share their workload across workers, though with 50 MB files there's little benefit in splitting things up.
2. The way the S3 connectors Spark uses mimic a directory structure is a real performance killer for complex directory trees.
Issue #2 is what slows down partitioning: the initial planning code that decides what work to do, which runs before any of the computation.
How would I go about trying to deal with this? Well, there's no magic switch here. But:
Have fewer, bigger files; as noted, Avro is good, and so are Parquet and ORC later.
Use a very shallow directory tree. Are these files all in one single directory? Or in a deep directory tree? The latter is worse.
Coalesce the files first.
I'd also avoid any kind of schema inference; it sounds like you aren't doing that (good!), but for anyone else reading this answer: know that for CSV and presumably JSON, schema inference means "read through all the data once just to work out the schema"
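One way to plan the "coalesce the files first" step is to group the small files into batches of roughly equal total size, then rewrite each batch as one larger file. The sketch below is only an illustration: group_by_size and the sample sizes are hypothetical, and it assumes the (path, size) pairs are already available from the bucket listing.

```python
def group_by_size(files, target_bytes):
    """Greedily pack (path, size) pairs into batches of about target_bytes."""
    batches, current, current_size = [], [], 0
    for path, size in files:
        # start a new batch when adding this file would exceed the target
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Toy sizes; real values would come from the S3 listing
files = [("a.json.gz", 50), ("b.json.gz", 50), ("c.json.gz", 50),
         ("d.json.gz", 120), ("e.json.gz", 10)]
batches = group_by_size(files, target_bytes=100)
```

A file larger than the target ends up in a batch of its own rather than being split.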

Related

AWS Glue Limit data read from S3 Bucket

I have a large bucket that contains more than 6M files.
I've run into the error "Failed to sanitize XML document destined for handler class", and I think this is the problem: https://github.com/lbroudoux/es-amazon-s3-river/issues/16
Is there a way I can limit how many files are read in the first runs?
This is what I have:
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "s3-sat-dth-prd", table_name = "datahub_meraki_user_data", transformation_ctx = "DataSource0")
Can I tell it to read only a folder in my bucket? Every folder within is named like this: partition=13/, partition=14/, partition=n/ and so on.
How can I work around this?
Thanks in advance.
There are three main ways (as far as I know) to avoid this situation.
1. Load from a prefix
In order to load files from a specific path in AWS Glue, you can use the below syntax.
from awsglue.dynamicframe import DynamicFrame

dynamic_frame = context.create_dynamic_frame_from_options(
    "s3",
    {
        'paths': ['s3://my_bucket_1/my_prefix_1'],
        'recurse': True,
        'groupFiles': 'inPartition',
        'groupSize': '1073741824'
    },
    format='json',
    transformation_ctx='DataSource0'
)
You can put multiple paths for paths and Glue will load from all of them.
2. Use Glue Bookmarks.
When you have millions of files in a bucket and you want to load only the new files (between the runs of your Glue job), you can enable Glue Bookmarks. It will keep track of the files it read in an internal index (which we don't have access to).
You can pass this as a parameter when you define the job.
MyJob:
  Type: AWS::Glue::Job
  Properties:
    ...
    GlueVersion: 2.0
    Command:
      Name: glueetl
      PythonVersion: 3
    ...
    DefaultArguments: {
      "--job-bookmark-option": job-bookmark-enable,
      ...
This will enable bookmarks defined with the name used for transformation_ctx when you load data. Yes, it's confusing that AWS uses the same parameter for multiple purposes!
It's also important not to forget job.commit() at the end of your Glue script, where job is an instance of the Job class imported with from awsglue.job import Job.
Then, when you use the same context.create_dynamic_frame_from_options() function with your root prefix and the same transformation_ctx, it will only load the new files in the prefix in the hierarchy. It saves a lot of hassle for us in looking for new files. Read the docs for more information on bookmarks.
3. Avoid small files.
AWS Glue will take ages to load the data if it is split across many small files. So, if you can control the file size, make the files at least 100 MB each. For instance, we were writing to S3 from a Firehose stream and could adjust the buffer size to avoid small files. This drastically reduced the loading time of our Glue job.
I hope these tips will help you. And feel free to ask any questions if you need further clarification.
There is a way to control the number of files read, called bounded execution. It's documented here: https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html
In the following examples you would be loading in 200 files at a time. Note you must enable Glue bookmarks for this to work correctly.
If you are using from_options it looks like this:
DataSource0 = glueContext.create_dynamic_frame.from_options(
    format_options={"withHeader": True, "separator": separator, "quoteChar": quoteChar},
    connection_type="s3",
    format="csv",
    connection_options={"paths": inputFilePath,
                        "boundedFiles": "200", "recurse": True},
    transformation_ctx="DataSource0"
)
If you are using from_catalog it looks like this:
DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database-name",
    table_name="table-name",
    additional_options={"boundedFiles": "200"},
    transformation_ctx="DataSource0"
)

Analysis of Log with Spark Streaming

I recently did analysis on a static log file with Spark SQL (find out stuff like the ip addresses which appear more than ten times). The problem was from this site. But I used my own implementation for it. I read the log into an RDD, turned that RDD to a DataFrame (with the help of a POJO) and used DataFrame operations.
Now I'm supposed to do a similar analysis using Spark Streaming for a streaming log file for a window of 30 mins as well as aggregated results for a day. The solution can again be found here but I want to do it another way. So what I've done is this
Use Flume to write data from the log file to an HDFS directory
Use JavaDStream to read the .txt files from HDFS
Then I can't figure out how to proceed. Here's the code I use
Long slide = 10000L; //new batch every 10 seconds
Long window = 1800000L; //30 mins
SparkConf conf = new SparkConf().setAppName("StreamLogAnalyzer");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, new Duration(slide));
JavaDStream<String> dStream = streamingContext.textFileStream(hdfsPath).window(new Duration(window), new Duration(slide));
Now I can't decide whether I should turn each batch into a DataFrame and do what I previously did with the static log file. Or is that approach time-consuming and overkill?
I'm an absolute noob to Streaming as well as Flume. Could someone please guide me with this?
Using DataFrame (and Dataset) is the most promoted approach in recent versions of Spark, so it's the right choice to go with. I think some of the confusion comes from the implicit nature of the stream here: you move files into HDFS rather than read from an event log.
The main point is to choose the right batch interval (the slide size in your snippet), so that the application finishes processing the data loaded in each interval within that interval and batches don't queue up.
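To make the batch-interval point concrete, here is a deliberately simplified model (not Spark code): batches are processed one at a time, every batch takes the same processing time, and the function reports how long each batch waits before it starts. The function name and the numbers are illustrative assumptions.

```python
def scheduling_delays(batch_interval_s, processing_time_s, n_batches):
    """Return how long each batch waits before its processing starts.

    Batch i becomes ready at (i + 1) * batch_interval_s and can only start
    once the previous batch has finished.
    """
    delays = []
    free_at = 0  # time at which the previous batch finishes
    for i in range(n_batches):
        ready_at = (i + 1) * batch_interval_s
        start_at = max(ready_at, free_at)
        delays.append(start_at - ready_at)
        free_at = start_at + processing_time_s
    return delays

keeps_up = scheduling_delays(10, 8, 5)       # processing faster than the interval
falls_behind = scheduling_delays(10, 12, 5)  # processing slower than the interval
```

In this model the delay stays at zero when processing takes 8 seconds per 10-second batch, and grows by 2 seconds per batch when processing takes 12 seconds, which is the batch queue the answer warns about.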

how to handle millions of smaller s3 files with apache spark

This problem has been driving me nuts, and it is starting to feel like Spark with S3 is not the right tool for this specific job. Basically, I have millions of small files in an S3 bucket. For reasons I can't necessarily get into, these files cannot be consolidated (for one, they are unique encrypted transcripts). I have seen questions similar to this one, and none of the suggested solutions has produced good results. The first thing I tried was wildcards:
sc.wholeTextFiles(s3aPath + "/*/*/*/*.txt").count();
Note: the count was more debugging on how long it would take to process the files. This job almost took an entire day with over 10 instances but still failed with the error posted at the bottom of the listing. I then found this link, where it basically said this isn't optimal: https://forums.databricks.com/questions/480/how-do-i-ingest-a-large-number-of-files-from-s3-my.html
Then I decided to try another solution, which I can't find at the moment, that said to load all of the paths and then union all of the RDDs:
ObjectListing objectListing = s3Client.listObjects(bucket);
List<JavaPairRDD<String, String>> rdds = new ArrayList<>();
List<JavaPairRDD<String, String>> tempMeta = new ArrayList<>();
//initializes objectListing
tempMeta.addAll(objectListing.getObjectSummaries().stream()
        .map(func)
        .filter(item -> item != null && item.getMediaType().equalsIgnoreCase("transcript"))
        .map(item -> SparkConfig.getSparkContext().wholeTextFiles("s3a://" + bucket + "/" + item.getFileName()))
        .collect(Collectors.toList()));
while (objectListing.isTruncated()) {
    objectListing = s3Client.listNextBatchOfObjects(objectListing);
    tempMeta.addAll(objectListing.getObjectSummaries().stream()
            .map(func)
            .filter(item -> item != null && item.getMediaType().equalsIgnoreCase("transcript"))
            .map(item -> SparkConfig.getSparkContext().wholeTextFiles("s3a://" + bucket + "/" + item.getFileName()))
            .collect(Collectors.toList()));
    if (tempMeta.size() > 5000) {
        rdds.addAll(tempMeta);
        tempMeta = new ArrayList<>();
    }
}
if (!tempMeta.isEmpty()) {
    rdds.addAll(tempMeta);
}
return SparkConfig.getSparkContext().union(rdds.get(0), rdds.subList(1, rdds.size()));
Then, even when I set the emrfs-site config to:
{
  "Classification": "emrfs-site",
  "Properties": {
    "fs.s3.consistent.retryPolicyType": "fixed",
    "fs.s3.consistent.retryPeriodSeconds": "15",
    "fs.s3.consistent.retryCount": "20",
    "fs.s3.enableServerSideEncryption": "true",
    "fs.s3.consistent": "false"
  }
}
Every time I ran the job, I got this error within 6 hours:
17/02/15 19:15:41 INFO AmazonHttpClient: Unable to execute HTTP request: randomBucket.s3.amazonaws.com:443 failed to respond
org.apache.http.NoHttpResponseException: randomBucket.s3.amazonaws.com:443 failed to respond
So first, is there a way to use smaller files with spark from s3? I don't care if the solution is suboptimal, I just want to try and get something working. I thought about trying spark streaming, since its internals are a little different with loading all of the files. I would then use fileStream and set newFiles to false. Then I could batch process them. However, that is not what spark streaming was built for, so I am conflicted in going that route.
As a side note, I generated millions of small files into hdfs, and tried the same job, and it finished within an hour. This makes me feel like it is s3 specific. Also, I am using s3a, not the ordinary s3.
If you are using amazon EMR, then you need to use s3:// URLs; the s3a:// ones are for the ASF releases.
A big issue is just how long it takes to list directory trees in S3, especially that recursive tree walk. The Spark code assumes it's a fast filesystem where listing directories and stating files is low cost, whereas in fact each operation takes 1-4 HTTPS requests, which, even on reused HTTP/1.1 connections, hurts. It can be so slow you can see the pauses in the log.
Where this really hurts is that it is the up front partitioning where a lot of the delay happens, so it's the serialized bit of work which is being brought to its knees.
Although there's some speedup in treewalking on S3a coming in Hadoop 2.8 as part of the S3a phase II work, wildcard scans of //*.txt form aren't going to get any speedup. My recommendation is to try to flatten your directory structure so that you move from a deep tree to something shallow, maybe even all in the same directory, so that it can be scanned without the walk, at a cost of 1 HTTP request per 5000 entries.
Bear in mind that many small files are pretty expensive anyway, including in HDFS, where they use up storage. There's a special aggregate format, HAR files, which are like tar files except that Hadoop, Hive and Spark can all work inside the file itself. That may help, though I've not seen any actual performance test figures there.
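As an illustration of the flattening recommendation, the sketch below computes a single-level name for each nested key, which a separate copy job could then use as the destination. The helper names and the "__" separator are assumptions for this example; the actual copying is not shown.

```python
def flatten_key(key, separator="__"):
    """Map a nested object key to a single-level name."""
    return key.strip("/").replace("/", separator)

def flatten_keys(keys, separator="__"):
    """Return {old_key: new_key}, refusing to continue on a name collision."""
    mapping = {}
    seen = set()
    for key in keys:
        new_key = flatten_key(key, separator)
        if new_key in seen:
            raise ValueError("collision on " + new_key)
        seen.add(new_key)
        mapping[key] = new_key
    return mapping

mapping = flatten_keys(["2017/02/15/a.txt", "2017/02/16/a.txt"])
```

Keeping the original path components in the new name preserves the information the directory tree carried, while the objects themselves end up under one shallow prefix.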

how to merge multiple parquet files to single parquet file using linux or hdfs command?

I have multiple small Parquet files generated as the output of a HiveQL job, and I would like to merge them into a single Parquet file.
What is the best way to do it using HDFS or Linux commands?
We used to merge text files using the cat command, but will this work for Parquet as well?
Can we do it in HiveQL itself when writing the output files, the way we do it with the repartition or coalesce method in Spark?
According to this https://issues.apache.org/jira/browse/PARQUET-460
Now you can download the source code and compile parquet-tools, which has a built-in merge command.
java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_dir/file_name
Or using a tool like https://github.com/stripe/herringbone
You can also do it using HiveQL itself, if your execution engine is mapreduce.
You can set a flag for your query, which causes hive to merge small files at the end of your job:
SET hive.merge.mapredfiles=true;
or
SET hive.merge.mapfiles=true;
if your job is a map-only job.
This will cause the Hive job to automatically merge many small Parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want just one file, make sure you set it to a value that is always larger than the size of your output. Also, make sure to adjust hive.merge.smallfiles.avgsize accordingly. Set it to a very low value if you want to make sure that Hive always merges files. You can read more about these settings in the Hive documentation.
Using DuckDB:
import duckdb
duckdb.execute("""
COPY (SELECT * FROM '*.parquet') TO 'merge.parquet' (FORMAT 'parquet');
""")

Reading many small files from S3 very slow

Loading many small files (>200000, 4kbyte) from a S3 Bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to get the data, though I cannot exactly figure out where the bottleneck is.
Pig Code Sample
data = load 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);
Hive Code Sample
CREATE EXTERNAL TABLE data (value STRING) LOCATION 's3://data-bucket/';
Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?
I tried the following without any noticeable effects:
Increase #Task Nodes
set hive.optimize.s3.query=true
manually set #mappers
Increase instance type from medium up to xlarge
I know that s3distcp would speed up the process, but I could only get better performance by doing a lot of tweaking including setting #workerThreads and would prefer changing parameters directly in my PIG/Hive scripts.
You can either:
use distcp to merge the files before your job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
have a Pig script that will do it for you, once.
If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:
-- false lets Pig combine small input files into one split per mapper; set to true for one mapper per file.
SET pig.noSplitCombination false;
-- choose the size so that SUM(size) / X = wanted number of mappers
SET pig.maxCombinedSplitSize 250000000;
Please provide metrics for those cases.
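The SUM(size) / X rule can be worked through with the numbers from the question (about 200,000 files of 4 KB each). This is only back-of-the-envelope arithmetic; the helper names are made up, and the real mapper count depends on how Pig actually combines the splits.

```python
import math

def combined_split_size(total_bytes, wanted_mappers):
    """Value for pig.maxCombinedSplitSize aiming at about wanted_mappers."""
    return int(math.ceil(total_bytes / float(wanted_mappers)))

def expected_mappers(total_bytes, split_size):
    """Approximate mapper count when files are combined up to split_size."""
    return int(math.ceil(total_bytes / float(split_size)))

total = 200000 * 4 * 1024                         # 819,200,000 bytes in total
mappers_at_250mb = expected_mappers(total, 250000000)
size_for_16 = combined_split_size(total, 16)
```

With the 250000000 setting shown above, this data set would be served by roughly 4 mappers; aiming for 16 mappers instead would mean a combined split size of 51200000 bytes.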