I am trying to figure out the best way to write data to S3 using (Py)Spark.
Reading from the S3 bucket works fine, but writing is really slow.
I've started the spark shell like so (including the hadoop-aws package):
AWS_ACCESS_KEY_ID=<key_id> AWS_SECRET_ACCESS_KEY=<secret_key> pyspark --packages org.apache.hadoop:hadoop-aws:3.2.0
This is the sample application:
# Load several csv files from S3 to a Dataframe (no problems here)
df = spark.read.csv(path='s3a://mybucket/data/*.csv', sep=',')
df.show()
# Some processing
result_df = do_some_processing(df)
result_df.cache()
result_df.show()
# Write to S3
result_df.write.partitionBy('my_column').csv(path='s3a://mybucket/output', sep=',') # This is really slow
When I try to write to S3, I get the following warning:
20/10/28 15:34:02 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
Is there any setting I should change to get efficient writes to S3? Right now it is really slow; it took about 10 minutes to write 100 small files to S3.
It turns out you have to manually specify the committer (otherwise the default one will be used, which isn't optimized for S3):
(result_df
    .write
    .partitionBy('my_column')
    .option('fs.s3a.committer.name', 'partitioned')
    .option('fs.s3a.committer.staging.conflict-mode', 'replace')
    # Buffer in memory instead of on disk: potentially faster but more memory intensive
    .option('fs.s3a.fast.upload.buffer', 'bytebuffer')
    .mode('overwrite')
    .csv(path='s3a://mybucket/output', sep=','))
Relevant documentation can be found here:
hadoop-aws
hadoop-aws-committers
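If the per-write options don't take effect in your setup, the same S3A properties can also be set once for the whole session. This is a sketch, not a complete committer setup: the spark.hadoop. prefix simply forwards the properties to the Hadoop configuration, and the bucket path is the same placeholder as above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('s3a-committer-example')
         # Forward the S3A committer settings to the Hadoop configuration
         .config('spark.hadoop.fs.s3a.committer.name', 'partitioned')
         .config('spark.hadoop.fs.s3a.committer.staging.conflict-mode', 'replace')
         # Buffer uploads in memory instead of on disk
         .config('spark.hadoop.fs.s3a.fast.upload.buffer', 'bytebuffer')
         .getOrCreate())

result_df.write.partitionBy('my_column').mode('overwrite').csv(path='s3a://mybucket/output', sep=',')
See the hadoop-aws committer documentation linked above for any remaining wiring the committers may need.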
Related
I am looking to ingest data from a source to S3 using AWS Glue.
Is it possible to compress the ingested data in Glue to a specified size? For example: compress the data to 500 MB, and also be able to partition the data based on the compression value provided? If yes, how do I enable this? I am writing the Glue script in Python.
Compression and grouping are related but different things. Compression happens with Parquet output. However, you can use 'groupSize': '31457280' (30 MB) to specify the group size of the dynamic frame, which also becomes the size of the output files (at least most of them; the last one holds the remainder).
Also, pay attention to the Glue worker type and quantity, e.g. Maximum capacity 10 with Worker type Standard.
G.2X workers tend to create too many small files (it all depends on your situation/inputs).
If you do nothing but read many small files and write them unchanged in a large group, they will be "default compressed/grouped" into the groupSize. If you want to see drastic reductions in the written file size, then format the output as Parquet.
glueContext.create_dynamic_frame_from_options(connection_type="s3", format="json", connection_options={"paths": ["s3://yourbucketname/folder_name/2021/01/"], "recurse": True, "groupFiles": "inPartition", "groupSize": "31457280"})
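As a rough sketch of the write side (the dyf variable and output path are hypothetical), converting the output to Parquet would look something like this:
# 'dyf' is the dynamic frame read above; the output path is a placeholder
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://yourbucketname/output_folder/"},
    format="parquet"
)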
I have a large bucket that contains more than 6M files.
I've run into this error: Failed to sanitize XML document destined for handler class, and I think this is the problem: https://github.com/lbroudoux/es-amazon-s3-river/issues/16
Is there a way I can limit how many files are read in the first runs?
This is what I have: DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "s3-sat-dth-prd", table_name = "datahub_meraki_user_data", transformation_ctx = "DataSource0"). Can I tell it to read only one folder in my bucket? Every folder within is named like this: partition=13/, partition=14/, partition=n/, and so on.
How can I work around this?
Thanks in advance.
There are three main ways (as far as I know) to avoid this situation.
1. Load from a prefix
In order to load files from a specific path in AWS Glue, you can use the below syntax.
from awsglue.dynamicframe import DynamicFrame
dynamic_frame = context.create_dynamic_frame_from_options(
"s3",
{
'paths': ['s3://my_bucket_1/my_prefix_1'],
'recurse': True,
'groupFiles': 'inPartition',
'groupSize': '1073741824'
},
format='json',
transformation_ctx='DataSource0'
)
You can put multiple paths in paths and Glue will load from all of them.
2. Use Glue Bookmarks.
When you have millions of files in a bucket and you want to load only the new files (between the runs of your Glue job), you can enable Glue Bookmarks. It will keep track of the files it read in an internal index (which we don't have access to).
You can pass this as a parameter when you define the job.
MyJob:
  Type: AWS::Glue::Job
  Properties:
    ...
    GlueVersion: 2.0
    Command:
      Name: glueetl
      PythonVersion: 3
    ...
    DefaultArguments: {
      "--job-bookmark-option": "job-bookmark-enable",
      ...
This will enable bookmarks defined with the name used for transformation_ctx when you load data. Yes, it's confusing that AWS uses the same parameter for multiple purposes!
It's also important not to forget to add job.commit() at the end of your Glue script, where job is an instance of the Job class from awsglue.job.
Then, when you use the same context.create_dynamic_frame_from_options() function with your root prefix and the same transformation_ctx, it will only load the new files under that prefix. It saves a lot of hassle in looking for new files. Read the docs for more information on bookmarks.
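As a minimal sketch of the bookmark wiring (reusing the hypothetical bucket/prefix from above), a Glue script would look roughly like this:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
context = GlueContext(SparkContext.getOrCreate())
job = Job(context)
job.init(args['JOB_NAME'], args)

# transformation_ctx ties this read to the bookmark state
dynamic_frame = context.create_dynamic_frame_from_options(
    "s3",
    {'paths': ['s3://my_bucket_1/my_prefix_1'], 'recurse': True},
    format='json',
    transformation_ctx='DataSource0'
)

# ... transformations and writes go here ...

job.commit()  # persists the bookmark so the next run only picks up new files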
3. Avoid small file sizes
AWS Glue will take ages to load data if you have lots of small files. So, if you can control the file size, make the files at least 100 MB each. For instance, we were writing to S3 from a Firehose stream and could adjust the buffer size to avoid small files. This drastically reduced the load times of our Glue job.
I hope these tips will help you. And feel free to ask any questions if you need further clarification.
There is a way to control the number of files read, called bounded execution. It's documented here: https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html
In the following examples you would be loading in 200 files at a time. Note you must enable Glue bookmarks for this to work correctly.
If you are using from_options it looks like this:
DataSource0 = glueContext.create_dynamic_frame.from_options(
format_options={"withHeader": True, "separator": separator, "quoteChar": quoteChar},
connection_type="s3",
format="csv",
connection_options={"paths": inputFilePath,
"boundedFiles": "200", "recurse": True},
transformation_ctx="DataSource0"
)
If you are using from_catalog it looks like this:
DataSource0 = glueContext.create_dynamic_frame.from_catalog(
database = "database-name",
table_name= "table-name",
additional_options={"boundedFiles": "200"},
transformation_ctx="DataSource0"
)
Much like using prefixes in S3 Bucket > Management > Lifecycle rules, I'd like to prune old files whose names contain certain words.
I'm looking to remove files older than 365 days whose names start with Screenshot or contain screencast.
Examples
/Screenshot 2017-03-19 10.11.12.png
folder1/Screenshot 2019-03-01 14.31.55.png
folder2/sub_folder/project-screencast.mp4
I'm currently testing if lifecycle prefixes work on files too.
You can write a program to do it, such as this Python script:
from datetime import datetime, timezone
import boto3

s3 = boto3.client('s3', region_name='ap-southeast-2')
response = s3.list_objects_v2(Bucket='my-bucket')

# Keep only old objects whose key matches the naming patterns
keys_to_delete = [
    {'Key': obj['Key']}
    for obj in response['Contents']
    if obj['LastModified'] < datetime(2018, 3, 20, tzinfo=timezone.utc)
    and ('Screenshot' in obj['Key'] or 'screencast' in obj['Key'])
]

if keys_to_delete:
    s3.delete_objects(Bucket='my-bucket', Delete={'Objects': keys_to_delete})
You could modify it to be "1 year ago" rather than a specific date.
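For example (a small sketch), the fixed date can be replaced with a timezone-aware cutoff computed from the current time, which compares cleanly with S3's LastModified values:
from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=365)
# ... then use: if obj['LastModified'] < cutoff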
I don't believe that you can apply lifecycle rules with wildcards such as *screencast*, only with prefixes such as "taxes/" or "taxes/2010".
For your case, I would probably write a script (or perhaps an Athena query) to filter an S3 Inventory report for those files that match your name/age conditions, and then prune them.
Of course, you could write a program to do this as @John Rotenstein suggests. The one time that might not be ideal is if you have millions or billions of objects, because the time to enumerate the list of objects would be significant. But it would be fine for a reasonable number of objects.
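One caveat with the script above: list_objects_v2 returns at most 1,000 keys per call, so for larger buckets it needs a paginator. A sketch (same hypothetical bucket name) could look like this:
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client('s3', region_name='ap-southeast-2')
cutoff = datetime.now(timezone.utc) - timedelta(days=365)

keys_to_delete = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        if obj['LastModified'] < cutoff and (
                'Screenshot' in obj['Key'] or 'screencast' in obj['Key']):
            keys_to_delete.append({'Key': obj['Key']})

# delete_objects accepts at most 1,000 keys per request
for i in range(0, len(keys_to_delete), 1000):
    s3.delete_objects(Bucket='my-bucket',
                      Delete={'Objects': keys_to_delete[i:i + 1000]})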
I am running a pyspark job on EMR (5.5.1) with Spark 2.1.0, Hadoop 2.7.3, Hive 2.1.1, Sqoop 1.4.6 and Ganglia 3.7.2, which loads data from S3. There are multiple buckets that contain input files, so I have a function which uses boto to traverse them and filter them according to some pattern.
Cluster Size: Master => r4.xlarge , Workers => 3 x r4.4xlarge
Problem:
The function getFilePaths returns a list of S3 paths which is fed directly to the Spark dataframe load method.
Using Dataframe
file_list = getFilePaths() # ['s3://some_bucket/log.json.gz','s3://some_bucket/log2.json.gz']
schema = getSchema() # for mapping to the json files
df = sparkSession.read.format('json').load(file_list, schema=schema)
Using RDD
master_rdd = sparkSession.sparkContext.union(
map(lambda file: sparkSession.sparkContext.textFile(file), file_list)
)
df = sparkSession.createDataFrame(master_rdd, schema=schema)
file_list can be huge (up to 500k files) due to the large amount of data and files. Calculating these paths only takes 5-20 minutes, but when trying to load them as a dataframe with Spark, the Spark UI remains inactive for hours, i.e. not processing anything at all. The inactive period for 500k files is above 9 hours, while for 100k files it is around 1.5 hours.
Viewing Ganglia metrics shows that only the driver is running/processing while the workers are idle. No logs are generated until the Spark job has finished, and I haven't had any success with 500k files.
I have tried the s3 and s3n connectors, but with no success.
Question:
What is the root cause of this delay?
How can I debug it properly?
In general, Spark/Hadoop prefer to have large files they can split instead of huge numbers of small files. One approach you might try though would be to parallelize your file list and then load the data in a map call.
I don't have the resources right now to test this out, but it should be something similar to this:
import json
import boto3

file_list = getFilePaths()
schema = getSchema()  # for mapping to the json files
paths_rdd = sc.parallelize(file_list)

def get_data(path):
    # Each path is a full s3://bucket/key URL, so split it back into bucket and key
    bucket, key = path.replace('s3://', '').split('/', 1)
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, key)
    data = obj.get()['Body'].read().decode('utf-8')
    return [json.loads(r) for r in data.split('\n') if r]

rows_rdd = paths_rdd.flatMap(get_data)
df = spark.createDataFrame(rows_rdd, schema=schema)
You could also make this a little more efficient by using mapPartitions instead, so you don't need to recreate the boto3 resource for each file.
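A rough sketch of that variant (same assumptions as above: each path is a full s3:// URL) might look like this:
def get_data_partition(paths):
    # One boto3 resource per partition instead of one per file
    s3 = boto3.resource('s3')
    for path in paths:
        bucket, key = path.replace('s3://', '').split('/', 1)
        body = s3.Object(bucket, key).get()['Body'].read().decode('utf-8')
        for record in body.split('\n'):
            if record:
                yield json.loads(record)

rows_rdd = paths_rdd.mapPartitions(get_data_partition)
df = spark.createDataFrame(rows_rdd, schema=schema)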
EDIT 6/14/18:
With regard to handling the gzipped data, you can decompress a stream of gzip data in Python as detailed in this answer: https://stackoverflow.com/a/12572031/1461187. Basically, just pass obj.get()['Body'].read() into the function defined in that answer.
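For example (a sketch, following the same helper shape as above), the per-file loader could decompress the whole object in memory before parsing:
import gzip

def get_gzip_data(path):
    bucket, key = path.replace('s3://', '').split('/', 1)
    s3 = boto3.resource('s3')
    raw = s3.Object(bucket, key).get()['Body'].read()
    text = gzip.decompress(raw).decode('utf-8')  # decompress in memory before parsing
    return [json.loads(r) for r in text.split('\n') if r]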
There are two performance issues surfacing here:
Reading the files: gzip files can't be split to have their workload shared across workers, though with 50 MB files there's little benefit in splitting things up.
The way the S3 connectors used by Spark mimic a directory structure is a real performance killer for complex directory trees.
Issue #2 is what slows down partitioning: the initial work of deciding what to do, which happens before any of the computation.
How would I go about dealing with this? Well, there's no magic switch here. But:
Have fewer, bigger files; as noted, Avro is good, and so are Parquet and ORC later on.
Use a very shallow directory tree. Are these files all in one single directory, or in a deep directory tree? The latter is worse.
Coalesce the files first.
I'd also avoid any kind of schema inference; it sounds like you aren't doing that (good!), but for anyone else reading this answer: know that for CSV, and presumably JSON, schema inference means "read through all the data once just to work out the schema".
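For anyone who does want to avoid inference, a sketch of supplying an explicit schema (the field names here are hypothetical) looks like this:
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical fields; match these to the real JSON layout
schema = StructType([
    StructField('timestamp', LongType(), True),
    StructField('message', StringType(), True),
])

# An explicit schema skips the extra pass over the data that inference needs
df = spark.read.schema(schema).json(file_list)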
I am given a bucket on S3 that consists of Kinesis stream files compressed with 'snappy'. The folder structure and file format is s3://<mybucket>/yyyy/mm/dd/*.snappy.
I am trying to read this into an sqlContext in pyspark. Normally, I would specify the bucket as:
df = sqlContext.read.json('s3://<mybucket>/inputfile.json')
How do I get all these multi-part compressed files into my data frame?
UPDATE: It seems I made more progress using this same construct, but I'm running into heap size problems:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p
kill -9 %p"
# Executing /bin/sh -c "kill 6128
kill -9 6128"...
The data size is not that big, but somehow this decompression step seems to make things worse.
If you're trying to get all snappy files from all days / months / years, try something like this:
s3://<mybucket>/*/*/*/*.snappy
Where the first three *'s refer to the /yyyy/mm/dd/ subfolders.
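Applied to the original read (same placeholder bucket as in the question), that would be something like:
df = sqlContext.read.json('s3://<mybucket>/*/*/*/*.snappy')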
To prove this works, you can perform the following test:
I created a nestedDirectories folder and, inside it, nested some date folders:
nestedDirectories/
-- 2016/
-- -- 12/
-- -- -- 15/
-- -- -- -- data.txt
and inside the data.txt:
hello
world
I
Have
Some
Words
And then I ran pyspark:
>>> rdd = sc.textFile("/path/to/nestedDirectories/*/*/*/*.txt")
>>> rdd.count()
6
So, that star-pattern works for importing files into an RDD.
So, from here, if you have problems with memory and such, it may be that you have too many files that are each too small. This is known as the "small files problem": https://forums.databricks.com/questions/480/how-do-i-ingest-a-large-number-of-files-from-s3-my.html