Spark streaming job using custom jar on AWS EMR fails upon write

I am trying to convert a file (csv.gz format) into parquet using a streaming data frame. I have to use streaming data frames because the compressed files are ~700 MB in size. The job is run using a custom jar on AWS EMR. The source, destination and checkpoint locations are all on AWS S3. But as soon as I try to write to the checkpoint, the job fails with the following error:
java.lang.IllegalArgumentException:
Wrong FS: s3://my-bucket-name/transformData/checkpoints/sourceName/fileType/metadata,
expected: hdfs://ip-<ip_address>.us-west-2.compute.internal:8020
There are other Spark jobs running on the EMR cluster that read from and write to S3 successfully (but they are not using Spark streaming), so I do not think it is an issue with S3 file system access as suggested in this post. I also looked at this question, but the answers do not help in my case. I am using Scala 2.11.8 and Spark 2.1.0.
Following is the code I have so far:
...
val spark = conf match {
  case null =>
    SparkSession
      .builder()
      .appName(this.getClass.toString)
      .getOrCreate()
  case _ =>
    SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
}
// Read CSV files into a structured streaming DataFrame
val streamingDF = spark.readStream
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("timestampFormat", "dd-MMM-yyyy HH:mm:ss")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "")
  .schema(schema)
  .load(s"s3://my-bucket-name/rawData/sourceName/fileType/*/*/fileNamePrefix*")
  .withColumn("event_date", $"event_datetime".cast("date"))
  .withColumn("event_year", year($"event_date"))
  .withColumn("event_month", month($"event_date"))
// Write the results to Parquet
streamingDF.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket-name/transformedData/sourceName/fileType/")
  .option("compression", "gzip")
  .option("checkpointLocation", "s3://my-bucket-name/transformedData/checkpoints/sourceName/fileType/")
  .partitionBy("event_year", "event_month")
  .trigger(ProcessingTime("900 seconds"))
  .start()
I have also tried to use s3n:// instead of s3:// in the URI but that does not seem to have any effect.

TL;DR: upgrade Spark or avoid using S3 as the checkpoint location.
Apache Spark (Structured Streaming): S3 Checkpoint support
Also, you should probably specify the write path with s3a://.
A successor to the S3 Native, s3n:// filesystem, the S3a: system uses Amazon's libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL schema.
https://wiki.apache.org/hadoop/AmazonS3
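For what it's worth, here is a minimal sketch of that combination in PySpark (the question's code is Scala, but the idea is the same): write the output through s3a:// and keep the checkpoint off S3, for example on the cluster's HDFS. The schema and the HDFS checkpoint path below are placeholders, not the asker's actual values.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("csv-to-parquet-stream").getOrCreate()

# Placeholder schema; the real job would use the question's full schema
schema = StructType([
    StructField("event_datetime", TimestampType(), True),
    StructField("payload", StringType(), True),
])

streaming_df = (spark.readStream
    .format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("timestampFormat", "dd-MMM-yyyy HH:mm:ss")
    .schema(schema)
    .load("s3a://my-bucket-name/rawData/sourceName/fileType/")
    .withColumn("event_date", col("event_datetime").cast("date"))
    .withColumn("event_year", year(col("event_date")))
    .withColumn("event_month", month(col("event_date"))))

query = (streaming_df.writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket-name/transformedData/sourceName/fileType/")
    # checkpoint on HDFS instead of S3 (hypothetical path)
    .option("checkpointLocation", "hdfs:///checkpoints/sourceName/fileType/")
    .partitionBy("event_year", "event_month")
    .trigger(processingTime="900 seconds")
    .start())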

Related

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back to an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. It does not look nice in the hierarchy and causes confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?
[S3 console screenshot]
OK, finally after a few days of testing I found the solution. Before pasting the code, let me summarize what I have found:
Those $folder$ objects are created by Hadoop. Apache Hadoop creates these files when it creates a folder in an S3 bucket. Source 1
They are actually directory markers, written as path + /. Source 2
To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and this.
Read about S3, S3a and S3n here and here.
Thanks to @stevel's comment here.
Now the solution is to set the following configuration in the Spark context's Hadoop configuration:
from pyspark import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Route s3:// URIs through the S3A filesystem, which does not write $folder$ markers
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
To avoid the creation of _SUCCESS files you need to set the following configuration as well:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use the s3:// URI scheme for writing to the S3 bucket, e.g.:
myDF.write.mode("overwrite").parquet("s3://XXX/YY", partitionBy=["DDD"])
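As a side note, the same Hadoop options can also be passed through the SparkSession builder with the spark.hadoop. prefix, which avoids reaching into the private _jsc handle. A minimal sketch of that alternative, with placeholder data and the same placeholder path:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("no-folder-markers")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    .getOrCreate())

# Writes through this session should leave neither $folder$ nor _SUCCESS markers
df = spark.createDataFrame([(1, "a")], ["id", "DDD"])
df.write.mode("overwrite").parquet("s3://XXX/YY", partitionBy=["DDD"])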

Spark. Problem when writing a large file on aws s3a storage

I have an unexplained problem with uploading large files to s3a. I am using an EC2 instance with spark-2.4.4-bin-hadoop2.7 and a Spark DataFrame to write to s3a with V4 signing. I authenticate to S3 using an access key and secret key.
The procedure is as follows:
1) read a csv file from s3a into a Spark DataFrame;
2) process the data;
3) upload the DataFrame in parquet format to s3a.
If I run the procedure with a 400 MB csv file there is no problem, everything works fine. But when I do the same with a 12 GB csv file, an error appears while writing the parquet file to s3a:
Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2CA5F6E85BC36E8D, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I use the following settings:
import pyspark
from pyspark import SparkContext
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
sc = SparkContext()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
accesskey = input()
secretkey = input()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("fs.s3a.fast.upload", "true")
hadoopConf.set("fs.s3a.fast.upload", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.access.key", accesskey)
hadoopConf.set("fs.s3a.secret.key", secretkey)
I also tried to add these settings:
hadoopConf.set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')
hadoopConf.set('spark.speculation', "false")
hadoopConf.set('spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')
hadoopConf.set('spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')
but it didn’t help.
Again, the problem appears only with the large file.
I would appreciate any help. Thank you.
Try setting fs.s3a.fast.upload to true. Otherwise, the multipart upload support was only ever experimental in Hadoop 2.7 and you may have hit a corner case. Upgrade to the hadoop-2.8 versions or later and it should go away.
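A hedged sketch of what the fixed configuration might look like after following that advice; the endpoint and V4 flag are carried over from the question, the hadoop-aws version matches the 2.8.5 upgrade reported below, and the credential placeholders are not real values:
import os
from pyspark import SparkContext

# hadoop-aws 2.8+ per the answer; 2.8.5 is the version the asker reported working
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.8.5 pyspark-shell"

sc = SparkContext()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("fs.s3a.fast.upload", "true")   # keep it "true"; do not overwrite it later
hadoopConf.set("fs.s3a.access.key", "<ACCESS_KEY>")
hadoopConf.set("fs.s3a.secret.key", "<SECRET_KEY>")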
Updated hadoop from 2.7.3 to 2.8.5 and now everything works without errors.
Had the same issue. Made a Spark cluster on EMR (5.27.0) and configured it with Spark 2.4.4 on Hadoop 2.8.5. Uploaded the notebook that had my code to a notebook I made in EMR JupyterLab, ran it, and it worked perfectly!

Spark doesn't output .crc files on S3

When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.
Using the same job on AWS EMR and writing to S3, the .crc files are not written.
Is this normal? Is there a way to force the writing of .crc files on S3?
Those .crc files are just created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt and, on HDFS, switch to another datanode's copy of the data for the read and kick off a re-replication of one of the good copies.
On S3, stopping corruption is left to AWS.
What you can get off S3 is the etag of a file, which is the md5sum on a small upload; on a multipart upload it is some other string, which, again, changes when you upload it.
You can get at this value with the Hadoop 3.1+ version of the S3A connector, though it's off by default as distcp gets very confused when uploading from HDFS. For earlier versions you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 library (it's just a HEAD request, after all).
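For example, a minimal boto3 sketch of that HEAD request (bucket and key names are placeholders):
import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="my-bucket", Key="path/to/part-00000.parquet")
print(head["ETag"])  # md5 for single-part uploads; an opaque multipart tag otherwise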

Decompress a zip file in AWS Glue

I have a compressed gzip file in an S3 bucket. The files will be uploaded to the S3 bucket daily by the client. The gzip, when uncompressed, will contain 10 files in CSV format, all with the same schema. I need to uncompress the gzip file and, using a Glue data crawler, create a schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the zip file and creating a data catalog? Is there any Glue library available which we can use directly in the Python ETL script? Or should I opt for a Lambda or some other utility so that, as soon as the zip file is uploaded, it is decompressed and provided as an input to Glue?
Appreciate any replies.
Glue can do decompression, but it wouldn't be optimal, as the gzip format is not splittable (which means only one executor will work with it). More info about that here.
You can try decompressing with a Lambda and then invoking a Glue crawler on the new folder.
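A rough sketch of that Lambda route, assuming each upload is a single .gz object (an archive with several member files would need tarfile or zipfile instead); the bucket layout and crawler name are placeholders and error handling is omitted:
import gzip
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]            # e.g. incoming/data.csv.gz

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    csv_bytes = gzip.decompress(body)

    out_key = "uncompressed/" + key.rsplit("/", 1)[-1].replace(".gz", "")
    s3.put_object(Bucket=bucket, Key=out_key, Body=csv_bytes)

    glue.start_crawler(Name="my-crawler")    # crawler pointed at the uncompressed/ prefix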
Use glueContext.create_dynamic_frame.from_options and mention the compression type in the connection options. Similarly, output can also be compressed while writing to S3. The snippet below worked for bzip; change the format to gz|gzip and try.
I tried the target location in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, and I changed the generated code to read a compressed file from S3. It is not directly documented in the docs.
Not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a dataframe and back to a dynamic frame for a 400 MB compressed csv file in bzip format. Please note that execution time is different from the start_time and end_time shown in the console.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    'csv',
    {
        'separator': ';'
    }
)
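And, since output can also be compressed as noted above, a hedged sketch of the corresponding write (the output path is a placeholder; 'compression' in the connection options is the same mechanism used for the read):
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type='s3',
    connection_options={
        'path': 's3://bucketname/output/',
        'compression': 'gzip'
    },
    format='csv',
    format_options={'separator': ';'}
)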
I've written a Glue Job that can unzip s3 files and put them back in s3.
Take a look at https://stackoverflow.com/a/74657489/17369563

Using AWS S3 and Apache Spark with hdf5/netcdf-4 data

I've got a bunch of atmospheric data stored in AWS S3 that I want to analyze with Apache Spark, but I am having a lot of trouble getting it loaded into an RDD. I've been able to find examples online to help with discrete aspects of the problem:
- using h5py to read locally stored scientific data files via h5py.File(filename) (https://hdfgroup.org/wp/2015/03/from-hdf5-datasets-to-apache-spark-rdds/)
- using boto/boto3 to get data in text format from S3 into Spark via get_contents_as_string()
- mapping a set of text files to an RDD via keys.flatMap(mapFunc)
But I can't seem to get these parts to work together. Specifically, how do you load a netcdf file from S3 (using boto or directly, I'm not attached to using boto) in order to then use h5py? Or can you treat the netcdf file as binary, load it with sc.binaryFiles, and map it to an RDD?
Here are a couple of similar, related questions that weren't answered fully:
How to read binary file on S3 using boto?
using pyspark, read/write 2D images on hadoop file system
Using the netCDF4 and s3fs modules, you can do:
from netCDF4 import Dataset
import s3fs

s3 = s3fs.S3FileSystem()

filename = 's3://bucket/a_file.nc'
with s3.open(filename, 'rb') as f:
    nc_bytes = f.read()

# Open the downloaded bytes as an in-memory netCDF dataset
root = Dataset('inmemory.nc', memory=nc_bytes)
Make sure you are set up to read from S3. For details, here is the documentation.
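If you then want the data in an RDD (the original ask), one hedged option is to let Spark fetch the raw bytes with sc.binaryFiles and open each blob in memory with netCDF4, extracting plain arrays inside the map function since Dataset objects are not serializable. The bucket path and the "temperature" variable name are placeholders, and netCDF4 has to be installed on the executors:
from pyspark import SparkContext
from netCDF4 import Dataset

sc = SparkContext()

def extract(path_and_bytes):
    path, nc_bytes = path_and_bytes
    root = Dataset("inmemory.nc", memory=nc_bytes)
    try:
        # "temperature" is a hypothetical variable name
        return path, root.variables["temperature"][:]
    finally:
        root.close()

# Requires the S3A connector to be configured on the cluster
rdd = sc.binaryFiles("s3a://bucket/atmospheric-data/*.nc").map(extract)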