AWS Glue: limit data read from an S3 bucket

I have a large bucket that contains more than 6M files.
I've run into the error "Failed to sanitize XML document destined for handler class" and I think this is the problem: https://github.com/lbroudoux/es-amazon-s3-river/issues/16
Is there a way I can limit how many files are read in the first runs?
This is what I have:
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "s3-sat-dth-prd", table_name = "datahub_meraki_user_data", transformation_ctx = "DataSource0")
Can I tell it to read only one folder in my bucket? Every folder within is named like this: partition=13/, partition=14/, partition=n/, and so on.
How can I work around this?
Thanks in advance.

There are three main ways (as far as I know) to avoid this situation.
1. Load from a prefix
In order to load files from a specific path in AWS Glue, you can use the below syntax:
from awsglue.dynamicframe import DynamicFrame

dynamic_frame = context.create_dynamic_frame_from_options(
    "s3",
    {
        'paths': ['s3://my_bucket_1/my_prefix_1'],
        'recurse': True,
        'groupFiles': 'inPartition',
        'groupSize': '1073741824'
    },
    format='json',
    transformation_ctx='DataSource0'
)
You can put multiple entries in paths and Glue will load from all of them.
2. Use Glue Bookmarks.
When you have millions of files in a bucket and you want to load only the new files between runs of your Glue job, you can enable Glue bookmarks. Glue will keep track of the files it has read in an internal index (which we don't have access to).
You can enable this with a parameter when you define the job, for example in a CloudFormation template:
MyJob:
  Type: AWS::Glue::Job
  Properties:
    ...
    GlueVersion: 2.0
    Command:
      Name: glueetl
      PythonVersion: 3
      ...
    DefaultArguments: {
      "--job-bookmark-option": "job-bookmark-enable",
      ...
    }
This will enable bookmarks, which are keyed by the name you pass as transformation_ctx when you load data. Yes, it's confusing that AWS uses the same parameter for multiple purposes!
It's also important not to forget to add job.commit() at the end of your Glue script, where job is your awsglue.job.Job instance.
Then, when you use the same context.create_dynamic_frame_from_options() function with your root prefix and the same transformation_ctx, it will only load the files that are new under that prefix. This saves a lot of hassle in looking for new files. Read the docs for more information on bookmarks.
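For reference, here is a minimal sketch of what a bookmark-aware Glue script might look like (the bucket, prefix and job parameters are placeholders, not taken from the question):
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)  # bookmark state is tracked per job

# Use the same transformation_ctx on every run so the bookmark knows which files were already read
dynamic_frame = glue_context.create_dynamic_frame_from_options(
    "s3",
    {'paths': ['s3://my_bucket_1/my_prefix_1'], 'recurse': True},
    format='json',
    transformation_ctx='DataSource0'
)

# ... transformations and writes go here ...

job.commit()  # persists the bookmark so the next run only picks up new files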
3. Avoid very small files.
AWS Glue will take ages to load the data if you have lots of very small files. So, if you can control the file size, make the files at least 100 MB. For instance, we were writing to S3 from a Firehose stream and could adjust the buffer size to avoid very small files. This drastically improved the loading times for our Glue job.
I hope these tips will help you. And feel free to ask any questions if you need further clarification.

There is a way to control the number of files, called bounded execution. It's documented here: https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html
In the following examples you would be loading 200 files at a time. Note that you must enable Glue bookmarks for this to work correctly.
If you are using from_options it looks like this:
DataSource0 = glueContext.create_dynamic_frame.from_options(
    format_options={"withHeader": True, "separator": separator, "quoteChar": quoteChar},
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": inputFilePath,  # inputFilePath: your list of S3 paths/prefixes
        "boundedFiles": "200",
        "recurse": True
    },
    transformation_ctx="DataSource0"
)
If you are using from_catalog it looks like this:
DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database-name",
    table_name="table-name",
    additional_options={"boundedFiles": "200"},
    transformation_ctx="DataSource0"
)

Related

Spark: how to write dataframe to S3 efficiently

I am trying to figure out the best way to write data to S3 using (Py)Spark.
It seems I have no problem reading from the S3 bucket, but when I need to write it is really slow.
I've started the spark shell like so (including the hadoop-aws package):
AWS_ACCESS_KEY_ID=<key_id> AWS_SECRET_ACCESS_KEY=<secret_key> pyspark --packages org.apache.hadoop:hadoop-aws:3.2.0
This is the sample application
# Load several csv files from S3 to a Dataframe (no problems here)
df = spark.read.csv(path='s3a://mybucket/data/*.csv', sep=',')
df.show()
# Some processing
result_df = do_some_processing(df)
result_df.cache()
result_df.show()
# Write to S3
result_df.write.partitionBy('my_column').csv(path='s3a://mybucket/output', sep=',') # This is really slow
When I try to write to S3, I get the following warning:
20/10/28 15:34:02 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
Is there any setting I should change to get efficient writes to S3? As it is now it is really slow: it took about 10 minutes to write 100 small files to S3.
It turns out you have to manually specify the committer (otherwise the default one will be used, which isn't optimized for S3):
# 'bytebuffer' buffers uploads in memory instead of on disk: potentially faster but more memory intensive
result_df \
    .write \
    .partitionBy('my_column') \
    .option('fs.s3a.committer.name', 'partitioned') \
    .option('fs.s3a.committer.staging.conflict-mode', 'replace') \
    .option('fs.s3a.fast.upload.buffer', 'bytebuffer') \
    .mode('overwrite') \
    .csv(path='s3a://mybucket/output', sep=',')
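As a side note (not part of the original answer), the same S3A settings can also be applied once on the Spark session through the spark.hadoop. prefix instead of per write; a minimal sketch, assuming the hadoop-aws / S3A connector is on the classpath:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('s3a-write')
    # Same S3A committer settings as above, applied session-wide
    .config('spark.hadoop.fs.s3a.committer.name', 'partitioned')
    .config('spark.hadoop.fs.s3a.committer.staging.conflict-mode', 'replace')
    .config('spark.hadoop.fs.s3a.fast.upload.buffer', 'bytebuffer')
    .getOrCreate()
)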
Relevant documentation can be found here:
hadoop-aws
hadoop-aws-committers

How can I delete old files by name in S3 bucket?

Much like using prefixes in S3 Bucket → Management → Lifecycle rules, I'd like to prune old files that contain certain words.
I'm looking to remove files that start with Screenshot or have screencast in the filename and are older than 365 days.
Examples
/Screenshot 2017-03-19 10.11.12.png
folder1/Screenshot 2019-03-01 14.31.55.png
folder2/sub_folder/project-screencast.mp4
I'm currently testing if lifecycle prefixes work on files too.
You can write a program to do it, such as this Python script:
import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3', region_name='ap-southeast-2')

# Note: list_objects_v2 returns at most 1000 objects per call; use a paginator for large buckets
response = s3.list_objects_v2(Bucket='my-bucket')

keys_to_delete = [
    {'Key': obj['Key']}
    for obj in response['Contents']
    # LastModified is timezone-aware, so compare against an aware datetime
    if obj['LastModified'] < datetime(2018, 3, 20, tzinfo=timezone.utc)
    and ('Screenshot' in obj['Key'] or 'screencast' in obj['Key'])
]

# delete_objects accepts at most 1000 keys per request
if keys_to_delete:
    s3.delete_objects(Bucket='my-bucket', Delete={'Objects': keys_to_delete})
You could modify it to be "1 year ago" rather than a specific date.
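For example, a small sketch of that change (S3's LastModified timestamps are timezone-aware, so the cutoff should be too):
from datetime import datetime, timedelta, timezone

# Objects older than one year from now
cutoff = datetime.now(timezone.utc) - timedelta(days=365)
# ...then filter on: obj['LastModified'] < cutoff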
I don't believe that you can apply lifecycle rules with wildcards such as *screencast*, only with prefixes such as "taxes/" or "taxes/2010".
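For reference, a prefix-only lifecycle rule might look something like this with boto3 (bucket name, rule ID and prefix are placeholders; it only matches keys that start with the prefix, so it cannot catch screencast anywhere in the name):
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-old-screenshots',
            'Filter': {'Prefix': 'Screenshot'},  # prefix match only, no wildcards
            'Status': 'Enabled',
            'Expiration': {'Days': 365},
        }]
    }
)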
For your case, I would probably write a script (or perhaps an Athena query) to filter an S3 Inventory report for those files that match your name/age conditions, and then prune them.
Of course, you could write a program to do this as @John Rotenstein suggests. The one time that might not be ideal is if you have millions or billions of objects, because the time to enumerate the list of objects would be significant. But it would be fine for a reasonable number of objects.

Spark Dataframe loading 500k files on EMR

I am running a PySpark job on EMR (5.5.1) with Spark 2.1.0, Hadoop 2.7.3, Hive 2.1.1, Sqoop 1.4.6 and Ganglia 3.7.2, which loads data from S3. There are multiple buckets that contain input files, so I have a function which uses boto to traverse through them and filter them according to some pattern.
Cluster size: Master => r4.xlarge, Workers => 3 x r4.4xlarge
Problem:
The function getFilePaths returns a list of S3 paths which is fed directly to the Spark DataFrame load method.
Using Dataframe
file_list = getFilePaths() # ['s3://some_bucket/log.json.gz','s3://some_bucket/log2.json.gz']
schema = getSchema() # for mapping to the json files
df = sparkSession.read.format('json').load(file_list, schema=schema)
Using RDD
master_rdd = sparkSession.sparkContext.union(
    map(lambda file: sparkSession.sparkContext.textFile(file), file_list)
)
df = sparkSession.createDataFrame(master_rdd, schema=schema)
The file_list can be huge (up to 500k files) due to the large amount of data. Calculating these paths only takes 5-20 minutes, but when trying to load them as a DataFrame with Spark, the Spark UI remains inactive for hours, i.e. not processing anything at all. The inactivity period for 500k files is above 9 hours, while for 100k files it is around 1.5 hours.
Viewing Ganglia metrics shows that only the driver is running/processing while the workers are idle. There are no logs generated until the Spark job has finished, and I haven't had any success with 500k files.
I have tried the s3 and s3n connectors but with no success.
Question:
What is the root cause of this delay?
How can I debug it properly?
In general, Spark/Hadoop prefer to have large files they can split instead of huge numbers of small files. One approach you might try though would be to parallelize your file list and then load the data in a map call.
I don't have the resources right now to test this out, but it should be something similar to this:
import json
import boto3

file_list = getFilePaths()  # e.g. ['s3://some_bucket/log.json.gz', 's3://some_bucket/log2.json.gz']
schema = getSchema()  # for mapping to the json files

paths_rdd = sc.parallelize(file_list)

def get_data(path):
    # Runs on the executors: split the s3:// URI into bucket and key, then read the object
    s3 = boto3.resource('s3')
    bucket, _, key = path.replace('s3://', '').partition('/')
    data = s3.Object(bucket, key).get()['Body'].read().decode('utf-8')
    return [json.loads(r) for r in data.split('\n') if r]

rows_rdd = paths_rdd.flatMap(get_data)
df = spark.createDataFrame(rows_rdd, schema=schema)
You could also make this a little more efficient by using mapPartitions instead, so you don't need to recreate the S3 resource for every file.
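A rough sketch of that variant, under the same assumptions as the snippet above (paths are full s3:// URIs pointing at newline-delimited JSON):
def get_data_for_partition(paths):
    # One boto3 resource per partition instead of one per file
    s3 = boto3.resource('s3')
    for path in paths:
        bucket, _, key = path.replace('s3://', '').partition('/')
        data = s3.Object(bucket, key).get()['Body'].read().decode('utf-8')
        for line in data.split('\n'):
            if line:
                yield json.loads(line)

rows_rdd = paths_rdd.mapPartitions(get_data_for_partition)
df = spark.createDataFrame(rows_rdd, schema=schema)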
EDIT 6/14/18:
With regard to handling the gzip data, you can decompress a stream of gzip data using Python as detailed in this answer: https://stackoverflow.com/a/12572031/1461187. Basically, just pass obj.get()['Body'].read() into the function defined in that answer.
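If the objects are small enough to fit in memory, a simpler (non-streaming) variant is to decompress them with the standard library's gzip module; a sketch under that assumption:
import gzip
import json

import boto3

def get_gzipped_data(path):
    s3 = boto3.resource('s3')
    bucket, _, key = path.replace('s3://', '').partition('/')
    raw = s3.Object(bucket, key).get()['Body'].read()
    text = gzip.decompress(raw).decode('utf-8')  # whole object decompressed in memory
    return [json.loads(line) for line in text.split('\n') if line]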
There are two performance issues surfacing:
1. Reading the files: gzip files can't be split to have their workload shared across workers, though with 50 MB files there's little benefit in splitting things up anyway.
2. The way the S3 connectors Spark uses mimic a directory structure is a real performance killer for complex directory trees.
Issue #2 is what slows down partitioning: the initial work of deciding what to do, which happens before any of the computation.
How would I go about dealing with this? Well, there's no magic switch here, but you can:
1. Have fewer, bigger files; as noted, Avro is good, and so are Parquet and ORC.
2. Use a very shallow directory tree. Are these files all in one single directory, or in a deep directory tree? The latter is worse.
3. Coalesce the files first (see the sketch after this list).
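A rough sketch of that coalescing step (paths and the target partition count are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('coalesce-small-files').getOrCreate()

# Read the many small JSON files once, then rewrite them as fewer, larger,
# splittable files in a columnar format
df = spark.read.json('s3a://some_bucket/raw/')
df.coalesce(64).write.mode('overwrite').parquet('s3a://some_bucket/coalesced/')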
I'd also avoid any kind of schema inference; it sounds like you aren't doing that (good!), but for anyone else reading this answer: know that for CSV and presumably JSON, schema inference means "read through all the data once just to work out the schema"

Google ml-engine: takes forever to fill the queue

I have created TFRecord files that are stored in a Google Storage bucket. I have code running on ml-engine to train a model using the data in these TFRecords.
Each TFRecord file contains a batch of 20 examples and is approximately 8 MB in size. There are several thousand files in the bucket.
My problem is that it literally takes forever to start training. I have to wait about 40 minutes between the moment the package is loaded and the moment the training actually starts. I am guessing this is the time necessary to download the data and fill the queues?
The code is (slightly simplified for the sake of conciseness):
# Create a queue which will produce tf record names
filename_queue = tf.train.string_input_producer(files, num_epochs=num_epochs, capacity=100)

# Read the record
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Map for decoding the serialized example
features = tf.parse_single_example(
    serialized_example,
    features={
        'data': tf.FixedLenFeature([], tf.float32),
        'label': tf.FixedLenFeature([], tf.int64)
    })

train_tensors = tf.train.shuffle_batch(
    [features['data'], features['label']],
    batch_size=30,
    capacity=600,
    min_after_dequeue=400,
    allow_smaller_final_batch=True,
    enqueue_many=True)
I have checked that my bucket and my job share the same region parameter.
I don't understand what is taking so long: it should just be a matter of downloading a few hundred MBs (a few tens of TFRecord files should be enough to have more than min_after_dequeue elements in the queue).
Any idea what I am missing, or where the problem might be?
Thanks
Sorry, my bad. I was using a custom function to:
Verify that each file passed as a tf record actually exists.
Expand wild-card characters, if any
Turns out this is a very bad idea when dealing with thousands of files on gs://
I have removed this "sanity" check and it's working fine now.

how to handle millions of smaller s3 files with apache spark

So, this problem has been driving me nuts, and it is starting to feel like Spark with S3 is not the right tool for this specific job. Basically, I have millions of small files in an S3 bucket. For reasons I can't necessarily get into, these files cannot be consolidated (for one, they are unique encrypted transcripts). I have seen similar questions to this one, and no solution has produced good results. The first thing I tried was wildcards:
sc.wholeTextFiles(s3aPath + "/*/*/*/*.txt").count();
Note: the count was more for debugging how long it would take to process the files. This job took almost an entire day with over 10 instances but still failed with the error posted at the bottom of the listing. I then found this link, which basically said this isn't optimal: https://forums.databricks.com/questions/480/how-do-i-ingest-a-large-number-of-files-from-s3-my.html
Then I decided to try another solution that I can't find at the moment, which said to load all of the paths, then union all of the RDDs:
ObjectListing objectListing = s3Client.listObjects(bucket);
List<JavaPairRDD<String, String>> rdds = new ArrayList<>();
List<JavaPairRDD<String, String>> tempMeta = new ArrayList<>();

//initializes objectListing
tempMeta.addAll(objectListing.getObjectSummaries().stream()
        .map(func)
        .filter(item -> item != null && item.getMediaType().equalsIgnoreCase("transcript"))
        .map(item -> SparkConfig.getSparkContext().wholeTextFiles("s3a://" + bucket + "/" + item.getFileName()))
        .collect(Collectors.toList()));

while (objectListing.isTruncated()) {
    objectListing = s3Client.listNextBatchOfObjects(objectListing);
    tempMeta.addAll(objectListing.getObjectSummaries().stream()
            .map(func)
            .filter(item -> item != null && item.getMediaType().equalsIgnoreCase("transcript"))
            .map(item -> SparkConfig.getSparkContext().wholeTextFiles("s3a://" + bucket + "/" + item.getFileName()))
            .collect(Collectors.toList()));

    if (tempMeta.size() > 5000) {
        rdds.addAll(tempMeta);
        tempMeta = new ArrayList<>();
    }
}

if (!tempMeta.isEmpty()) {
    rdds.addAll(tempMeta);
}
return SparkConfig.getSparkContext().union(rdds.get(0), rdds.subList(1, rdds.size()));
Then, even when I set the emrfs-site config to:
{
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.consistent.retryPolicyType": "fixed",
        "fs.s3.consistent.retryPeriodSeconds": "15",
        "fs.s3.consistent.retryCount": "20",
        "fs.s3.enableServerSideEncryption": "true",
        "fs.s3.consistent": "false"
    }
}
I got this error within 6 hours every time I tried running the job:
17/02/15 19:15:41 INFO AmazonHttpClient: Unable to execute HTTP request: randomBucket.s3.amazonaws.com:443 failed to respond
org.apache.http.NoHttpResponseException: randomBucket.s3.amazonaws.com:443 failed to respond
So first, is there a way to use small files with Spark from S3? I don't care if the solution is suboptimal, I just want to try to get something working. I thought about trying Spark Streaming, since its internals are a little different when loading all of the files. I would then use fileStream and set newFiles to false, and batch process them. However, that is not what Spark Streaming was built for, so I am conflicted about going that route.
As a side note, I generated millions of small files in HDFS and tried the same job, and it finished within an hour. This makes me feel like it is S3-specific. Also, I am using s3a, not the ordinary s3.
If you are using Amazon EMR, then you need to use s3:// URLs; the s3a:// ones are for the ASF releases.
A big issue is just how long it takes to list directory trees in S3, especially that recursive tree walk. The Spark code assumes it's a fast filesystem where listing directories and stat-ing files is low cost, whereas in fact each operation takes 1-4 HTTPS requests, which, even on reused HTTP/1.1 connections, hurts. It can be so slow you can see the pauses in the log.
Where this really hurts is that a lot of the delay happens in the up-front partitioning, so it's the serialized bit of work which is being brought to its knees.
Although there's some speedup in tree-walking on S3A coming in Hadoop 2.8 as part of the S3A phase II work, wildcard scans of the /*/*/*/*.txt form aren't going to get any speedup. My recommendation is to try to flatten your directory structure so that you move from a deep tree to something shallow, maybe even all in the same directory, so that it can be scanned without the walk, at a cost of 1 HTTP request per 5000 entries.
Bear in mind that many small files are pretty expensive anyway, including in HDFS, where they use up storage. There's a special aggregate format, HAR files, which are like tar files except that Hadoop, Hive and Spark can all work inside the file itself. That may help, though I've not seen any actual performance test figures there.