Glue Dynamic Frame is way slower than regular Spark

In the image below we have the same Glue job run with three different configurations for how we write to S3:
1. Using a dynamic frame to write to S3
2. Using a pure Spark data frame to write to S3
3. Same as 1, but reducing the number of worker nodes from 80 to 60
All else being equal, the dynamic frame took 75 minutes to do the job while regular Spark took 10 minutes. The output was 100 GB of data.
The dynamic frame is also very sensitive to the number of worker nodes: slightly reducing their count makes it fail with memory issues after 2 hours of processing. This is surprising, as we would expect Glue, being an AWS service, to handle S3 write operations better.
The code difference was this:
if dynamic:
    df_final_dyn = DynamicFrame.fromDF(df_final, glueContext, "df_final")
    glueContext.write_dynamic_frame.from_options(
        frame=df_final_dyn,
        connection_type="s3",
        format="glueparquet",
        transformation_ctx="DataSink0",
        connection_options={"path": "s3://...",
                            "partitionKeys": ["year", "month", "day"]})
else:
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df_final.write.mode("overwrite").format("parquet").partitionBy("year", "month", "day") \
        .save("s3://.../")
Why such an inefficiency?
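As an aside, the conversion also works in the other direction: a DynamicFrame exposes a toDF() method, so a frame produced by earlier Glue steps can still go through the plain Spark writer. A minimal sketch, reusing the names from the snippet above:

df_back = df_final_dyn.toDF()  # DynamicFrame -> Spark DataFrame
df_back.write.mode("overwrite").format("parquet") \
    .partitionBy("year", "month", "day").save("s3://.../")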

Related

AWS Glue Pyspark Parquet write to S3 taking too long

I have an AWS Glue job (PySpark) that needs to load data from a centralized data lake of 350 GB+, prepare it, and load it into an S3 bucket partitioned by two columns. I noticed that it takes a really long time (around a day, even) just to load and write one week of data, and there are months of data that need to be written. I tried increasing the worker nodes but it does not seem to fix the problem.
My glue job currently has 60 G.1x worker nodes.
My SparkConf in the code looks like this
conf = pyspark.SparkConf().setAll([
    ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"),
    ("spark.speculation", "false"),
    ("spark.sql.parquet.enableVectorizedReader", "false"),
    ("spark.sql.parquet.mergeSchema", "true"),
    ("spark.sql.crossJoin.enabled", "true"),
    ("spark.sql.sources.partitionOverwriteMode", "dynamic"),
    ("spark.hadoop.fs.s3.maxRetries", "20"),
    ("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
])
I believe it does succeed in writing the files into partitions, however it takes a really long time to delete all the temporary spark-staging files it created. When I checked the tasks, this seems to be where most of the time goes.
2021-04-22 03:08:50,558 INFO [Thread-6] s3n.S3NativeFileSystem (S3NativeFileSystem.java:rename(1355)): rename s3://<bucket-name>/etl/sessions/.spark-staging-8df58afd-d6b2-4ca0-8611-429125abe2ae/p_date=2020-12-16/geo=u1hm s3://<bucket-name>/etl/sessions/p_date=2020-12-16/geo=u1hm
My write to S3 looks like this
finalDF.coalesce(50).write.partitionBy('p_date', 'geohash').save(
    "s3://{bucket}/{basepath}/{table}/".format(bucket=args['DataBucket'], basepath='etl', table='sessions'),
    format='parquet', mode="overwrite")
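One common way to shrink that rename-heavy commit step (a sketch of a general technique, not code from this post) is to reduce the number of staged files by repartitioning on the partition columns before writing, so that each output partition is produced by as few tasks as possible:

finalDF.repartition('p_date', 'geohash') \
    .write.partitionBy('p_date', 'geohash') \
    .format('parquet').mode('overwrite') \
    .save("s3://{bucket}/{basepath}/{table}/".format(bucket=args['DataBucket'], basepath='etl', table='sessions'))

Fewer, larger files per partition means fewer renames when the job commits.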
Any help would be appreciated.

AWS EMR | Total number of mappers when pointing to AWS S3

I am a bit curious to know how an EMR cluster decides the total number of mappers when we trigger Hive workloads pointing to an S3 location. In S3, data is not stored in the form of blocks, so which component creates the input splits and assigns a mapper to each one?
There are two ways to find the number of mappers needed to process your input data files:
The number of mappers depends on the number of Hadoop splits. If your files are smaller than HDFS or Amazon S3 split size, the number of mappers is equal to the number of files. If some or all of your files are larger than HDFS or Amazon S3 split size (fs.s3.block.size) the number of mappers is equal to the sum of each file divided by the HDFS/Amazon S3 block size.
The examples below assume 64 MB of block size (S3 or HDFS).
Example 1: You have 100 files of 60 MB each on HDFS = 100 mappers. Since each file is less than the block size, the number of mappers equals the number of files.
Example 2: You have 100 files of 80 MB each on Amazon S3 = 200 mappers. Each data file is larger than our block size, which means each file requires two mappers to process the file.
100 files * 2 mappers each = 200 mappers
Example 3: You have two 60 MB files, one 120 MB file, and two 10 MB files = 6 mappers. The two 60 MB files require one mapper each, the 120 MB file requires two mappers, and the two 10 MB files require a single mapper each.
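The same arithmetic as a small Python sketch (64 MB split size assumed, matching the examples above):

import math

def estimate_mappers(file_sizes_mb, block_size_mb=64):
    # One mapper per split: a file smaller than one block still needs a mapper,
    # larger files need roughly ceil(size / block size) mappers.
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

print(estimate_mappers([60] * 100))             # Example 1 -> 100
print(estimate_mappers([80] * 100))             # Example 2 -> 200
print(estimate_mappers([60, 60, 120, 10, 10]))  # Example 3 -> 6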
An easy way to estimate the number of mappers needed is to run your job on any Amazon EMR cluster and note the number of mappers calculated by Hadoop for your job. You can see this total by looking at JobTracker GUI or at the output of your job. Here is a sample of job output with the number of mappers highlighted:
13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/13 01:12:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/01/13 01:12:30 INFO mapred.JobClient: Rack-local map tasks=20
13/01/13 01:12:30 INFO mapred.JobClient: Launched map tasks=20
13/01/13 01:12:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=2329458
Reference: Amazon EMR Best Practices

Difficulty loading large dataframes from S3 jsonl data into glue for conversion to parquet: Memory Constraints and failed worker spawning

I am attempting to load large datasets from S3 in JSONL format using AWS glue. The S3 data is accessed through a glue table projection. Once they are loaded, I save them back to a different S3 location in Parquet format. For the most part this strategy works, but for some of my datasets, the glue job runs out of memory. On closer inspection, it would seem it is trying to load the entire large dataset onto one executor before redistributing it.
I have tried upgrading the worker size to G.1X from standard so that it would have more memory and CPU, and this did not work for me: the script would still crash.
It was recommended I try the techniques for partitioning input data outlined on this page:
https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
I tried to use the groupSize and groupFiles parameters, but these did not work for me: the script would still crash.
Finally, it was recommended that I set --config options as in a typical Spark environment so that my program could use more memory, but since I am working in AWS Glue I am not able to set the config.
The following code demonstrates what I'm attempting to do:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database=SOURCE_GLUE_DATABASE,
    table_name=SOURCE_TABLE
)

# yapf: disable
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={
        "path": OUTPUT_PATH,
    },
    format="parquet"
)
The expected result (and the one I see for most of my datasets) is that parquet data is written out to the new S3 location. In the worst cases, looking through the logs reveals that only one worker is being used, despite the job's maximum capacity being set to 10 or more. It seems that it just doesn't want to create more workers, and I can't understand why.
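For reference, the groupFiles / groupSize options from the linked grouping page are normally attached to the read itself. A hedged sketch only; the 128 MB group size is illustrative and not taken from the question:

datasource = glueContext.create_dynamic_frame.from_catalog(
    database=SOURCE_GLUE_DATABASE,
    table_name=SOURCE_TABLE,
    additional_options={
        "groupFiles": "inPartition",          # group small S3 files together
        "groupSize": str(128 * 1024 * 1024),  # target group size in bytes
    },
)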

How to pass a bigger .csv files to amazon sagemaker for predictions using batch transform jobs

I created a custom model and deployed it on SageMaker. I am invoking the endpoint using batch transform jobs. It works if the input file is small, i.e. when the number of rows in the CSV file is low. If I upload a file with around 200,000 rows, I get this error in the CloudWatch logs.
2018-11-21 09:11:52.666476: W external/org_tensorflow/tensorflow/core/framework/allocator.cc:113]
Allocation of 2878368000 exceeds 10% of system memory.
2018-11-21 09:11:53.166493: W external/org_tensorflow/tensorflow/core/framework/allocator.cc:113]
Allocation of 2878368000 exceeds 10% of system memory.
[2018-11-21 09:12:02,544] ERROR in serving: <_Rendezvous of RPC that
terminated with:
#011status = StatusCode.DEADLINE_EXCEEDED
#011details = "Deadline Exceeded"
#011debug_error_string = "
{
"created": "#1542791522.543282048",
"description": "Error received from peer",
"file": "src/core/lib/surface/call.cc",
"file_line": 1017,
"grpc_message": "Deadline Exceeded",
"grpc_status": 4
}
"
Any ideas what might be going wrong? This is the transform function I am using to create the transform job.
transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Batch-Transform',
    model_name='sagemaker-tensorflow-2018-11-21-07-58-15-887',
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://2-n2m-sagemaker-json-output/out_files/'
)

input_location = 's3://1-n2m-n2g-csv-input/smal_sagemaker_sample.csv'
transformer.transform(input_location, content_type='text/csv', split_type='Line')
The .csv file contains two columns, the customer's first and last name, which I then preprocess in SageMaker itself using input_fn().
The error looks to be coming from the gRPC client closing the connection before the server is able to respond. (There looks to be an existing feature request for the sagemaker-tensorflow-container, https://github.com/aws/sagemaker-tensorflow-container/issues/46, to make this timeout configurable.)
You could try out a few things with the sagemaker Transformer to limit the size of each individual request so that it fits within the timeout (a sketch follows this list):
Set a max_payload to a smaller value, say 2-3 MB (the default is 6 MB)
If your instance metrics indicate it has compute / memory resources to spare, try max_concurrent_transforms > 1 to make use of multiple workers
Split up your csv file into multiple input files. With a bigger dataset, you could also increase the instance count to fan out processing
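A minimal sketch of the first two suggestions, reusing the Transformer from the question (the values 2 and 2 are illustrative, not prescriptive):

transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Batch-Transform',
    model_name='sagemaker-tensorflow-2018-11-21-07-58-15-887',
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://2-n2m-sagemaker-json-output/out_files/',
    max_payload=2,                 # cap each request at ~2 MB (default is 6 MB)
    max_concurrent_transforms=2,   # only if the instance has CPU/memory headroom
)
transformer.transform(input_location, content_type='text/csv', split_type='Line')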
A change was made and merged in to allow users to configure the timeout through an environment variable, SAGEMAKER_TFS_GRPC_REQUEST_TIMEOUT.
https://github.com/aws/sagemaker-tensorflow-container/pull/135
https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/proxy_client.py#L30
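Environment variables set on a SageMaker Model are passed through to the serving container, so the variable above could be supplied when the model is (re)created. This is only a hedged sketch: the image, model artifact, and role values are placeholders, and the timeout value and its units are assumptions rather than something documented here.

import sagemaker

model = sagemaker.model.Model(
    image_uri="<tensorflow-serving-container-image>",    # placeholder
    model_data="s3://<bucket>/<model-artifact>.tar.gz",  # placeholder
    role="<execution-role-arn>",                         # placeholder
    env={"SAGEMAKER_TFS_GRPC_REQUEST_TIMEOUT": "600"},   # assumed to be seconds
)
transformer = model.transformer(instance_count=1, instance_type='ml.m4.xlarge')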

Spark - filter v. large data (400 GB) in small cluster (16) too long to save to s3

I am doing a very simple job at a very large scale.
I have 480 GB in JSON files in an S3 bucket.
val events = spark.read.textFile("s3a://input/")      // Dataset[String]
val filteredEvents = events.filter(_.contains("..."))
filteredEvents.write.text("s3a://output/")
After doing a lot of work for ~5 minutes, there is a last task that takes forever. I can see a lot of partial files on the S3 bucket but the job is not finished yet; there is a temporary folder and no success message. I waited for ~20 minutes and no change. Just this one last task that shows a huge scheduler delay.
I suppose this might be the workers sending data back to the scheduler; can't each worker write directly to S3?
My cluster has 16 m3.2xlarge nodes.
Am I trying a job too big for such a small cluster?
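For what it's worth, the long-running last task described here is typically the commit step renaming temporary files into place, the same behaviour discussed in the Glue question above, and the file-output-committer setting from that question's SparkConf is the usual knob. A hedged, PySpark-flavoured sketch (the question's own code is Scala):

from pyspark.sql import SparkSession

# Sketch only: the paths are the placeholders from the question.
spark = (
    SparkSession.builder
    .appName("filter-large-json")
    # Algorithm version 2 moves task output straight into the final directory,
    # avoiding a second round of renames at job commit time.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

events = spark.read.text("s3a://input/")
filtered = events.filter(events.value.contains("..."))  # "..." kept from the question
filtered.write.mode("overwrite").text("s3a://output/")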