Doesn't Spark/Hadoop support SSE-KMS encryption on AWS S3?

I am trying to save an RDD to S3 with server-side encryption using a KMS key (SSE-KMS), but I am getting the following exception:
Exception in thread "main"
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400,
AWS Service: Amazon S3, AWS Request ID: 695E32175EBA568A, AWS Error
Code: InvalidArgument, AWS Error Message: The encryption method
specified is not supported, S3 Extended Request ID:
Pi+HFLg0WsAWtkdI2S/xViOcRPMCi7zdHiaO5n1f7tiwpJe2z0lPY1C2Cr53PnnUCj3358Gx3AQ=
Following is my test code that writes an RDD to S3 using SSE-KMS encryption:
val sparkConf = new SparkConf().
  setMaster("local[*]").
  setAppName("aws-encryption")
val sc = new SparkContext(sparkConf)

sc.hadoopConfiguration.set("fs.s3a.access.key", AWS_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", AWS_SECRET_KEY)
sc.hadoopConfiguration.setBoolean("fs.s3a.sse.enabled", true)
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
sc.hadoopConfiguration.set("fs.s3a.sse.kms.keyId", KMS_ID)

val s3a = new org.apache.hadoop.fs.s3a.S3AFileSystem
val s3aName = s3a.getClass.getName
sc.hadoopConfiguration.set("fs.s3a.impl", s3aName)

val rdd = sc.parallelize(Seq("one", "two", "three", "four"))
println("rdd is: " + rdd.collect().mkString(", "))
rdd.saveAsTextFile(s"s3a://$bucket/$objKey")
I am, however, able to write the RDD to S3 with AES256 encryption.
Does Spark/Hadoop expect a different value for KMS key encryption instead of "SSE-KMS"?
Can anyone please suggest what I am missing here or doing wrong?
Environment details are as follows:
Spark: 1.6.1
Hadoop: 2.6.0
Aws-Java-Sdk: 1.7.4
Thank you in advance.

Unfortunately, it seems that the existing versions of Hadoop (i.e. up to 2.8) do not support SSE-KMS :(
Following are the observations:
SSE-KMS is not supported as of Hadoop 2.8.1
SSE-KMS is supposed to be introduced in Hadoop 2.9
In Hadoop 3.0.0-alpha, SSE-KMS is supported.
The same observation applies to the AWS SDK for Java:
SSE-KMS was introduced in aws-java-sdk 1.9.5
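For anyone reading this on a newer stack: in Hadoop 2.9+/3.x the s3a connector takes the algorithm and the KMS key as two separate settings, fs.s3a.server-side-encryption-algorithm and fs.s3a.server-side-encryption.key. A minimal sketch, assuming those versions and placeholder credentials/key ARN:

// Sketch for Hadoop 2.9+ / 3.x only; AWS_ACCESS_KEY, AWS_SECRET_KEY and KMS_KEY_ARN are placeholders.
sc.hadoopConfiguration.set("fs.s3a.access.key", AWS_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", AWS_SECRET_KEY)
// "SSE-KMS" is the algorithm name; the key ID/ARN goes in a separate property.
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption.key", KMS_KEY_ARN)

sc.parallelize(Seq("one", "two", "three", "four"))
  .saveAsTextFile(s"s3a://$bucket/$objKey")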

Related

Distcp from S3 to HDFS

I'm trying to copy data from S3 to HDFS using the distcp tool. The problem is that the S3 bucket is accessed through a VPC endpoint and I don't know how to properly configure distcp. I have tried several configurations but none has worked. Currently I'm using the following command:
hadoop distcp \
  -Dfs.s3a.access.key=[KEY] \
  -Dfs.s3a.secret.key=[SECRET] \
  -Dfs.s3a.region=eu-west-1 \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint=https://bucket.vpce-[vpce id].s3.eu-west-1.vpce.amazonaws.com \
  s3a://[BUCKET NAME]/[FILE] \
  hdfs://[DESTINATION]/[FILE]
But I'm getting this error:
22/03/16 09:14:39 ERROR tools.DistCp: Exception encountered org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExistV2 on [BUCKET NAME]: com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'vpce' is wrong; expecting 'eu-west-1'
Any ideas on how distcp should be configured with VPC endpoints?
Thanks in advance
You need Hadoop 3.3.1 for this; then it should work. Ideally use 3.3.2, which is now out.
Grab the cloudstore JAR and use its storediag command to debug this before going near distcp.
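For illustration only, a rough sketch of what the command might look like on Hadoop 3.3.1+, which adds the fs.s3a.endpoint.region option so the region no longer has to be guessed from the VPC endpoint hostname (bucket, endpoint and paths are the same placeholders as above):

hadoop distcp \
  -Dfs.s3a.access.key=[KEY] \
  -Dfs.s3a.secret.key=[SECRET] \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint=https://bucket.vpce-[vpce id].s3.eu-west-1.vpce.amazonaws.com \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint.region=eu-west-1 \
  s3a://[BUCKET NAME]/[FILE] \
  hdfs://[DESTINATION]/[FILE]

And the storediag check mentioned above (the JAR name depends on the cloudstore release you download):

hadoop jar cloudstore.jar storediag s3a://[BUCKET NAME]/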

Assuming role in AWS is causing a credentials error

I am trying to use the Glue Schema Registry service in AWS with Scala (Java would also be useful) and I have tested two ways to assume a role, but both result in an error:
"Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()])"
I don't want to use environment variables so I tried STS to assume a role with the following code:
val assumeRoleRequest = AssumeRoleRequest.builder.roleSessionName(UUID.randomUUID.toString).roleArn("roleArn").build
val stsClient = StsClient.builder.region(Region.EU_CENTRAL_1).build
val stsAssumeRoleCredentialsProvider = StsAssumeRoleCredentialsProvider.builder.stsClient(stsClient).refreshRequest(assumeRoleRequest).build
val glueClient = GlueClient
  .builder()
  .region(Region.EU_CENTRAL_1)
  .credentialsProvider(stsAssumeRoleCredentialsProvider)
  .build()
Based on https://stackoverflow.com/a/62930761/17221117
The second way I tried follows the official AWS code documentation.
But it also fails... I don't understand whether this generates a token that I should then use, or whether just executing this code should work.
Anyone can help me with this?
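Not an authoritative answer, but for reference, a minimal sketch of the assume-role wiring with the AWS SDK for Java v2 (the role ARN is a placeholder). The StsAssumeRoleCredentialsProvider is passed straight to the Glue client builder and there is no separate token to handle; note, however, that the StsClient itself still needs base credentials from somewhere (profile, instance profile, etc.), which is usually what the "Unable to load credentials" error is complaining about.

import java.util.UUID
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.glue.GlueClient
import software.amazon.awssdk.services.glue.model.ListRegistriesRequest
import software.amazon.awssdk.services.sts.StsClient
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest

// The STS client resolves its own (base) credentials via the default chain.
val stsClient = StsClient.builder.region(Region.EU_CENTRAL_1).build

val assumeRoleRequest = AssumeRoleRequest.builder
  .roleArn("arn:aws:iam::123456789012:role/my-glue-role") // placeholder ARN
  .roleSessionName(UUID.randomUUID.toString)
  .build

// Refreshes the temporary credentials automatically as they expire.
val credentialsProvider = StsAssumeRoleCredentialsProvider.builder
  .stsClient(stsClient)
  .refreshRequest(assumeRoleRequest)
  .build

val glueClient = GlueClient.builder
  .region(Region.EU_CENTRAL_1)
  .credentialsProvider(credentialsProvider)
  .build

// Smoke test: list the schema registries visible to the assumed role.
val registries = glueClient.listRegistries(ListRegistriesRequest.builder.build).registries()
println(s"Visible registries: ${registries.size()}")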

difference between Interface S3Client and Class AmazonS3Client

I am creating a method that needs an S3 client as a parameter. I do not know what type I should declare it to be.
this is the doc for S3Client https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3Client.html
Ignore since answered (this is the doc for AmazonS3Client:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Client.html My question is which type is recommended and what are the differences between them? Thank you!)
Update:
I found another S3 client here: the AmazonS3 interface.
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3.html
However, setObjectTagging is supported in type AmazonS3 but not in type S3Client.
Does AmazonS3 provide more functionality than S3Client?
What if I need some function in AmazonS3 not in S3Client, or some in S3Client not in AmazonS3?
The AWS SDK for Java has two versions: V1 and V2. AmazonS3Client is the older V1 version while S3Client is the newer V2 version.
Amazon recommends using V2:
The AWS SDK for Java 2.x is a major rewrite of the version 1.x code base. It’s built on top of Java 8+ and adds several frequently requested features. These include support for non-blocking I/O and the ability to plug in a different HTTP implementation at run time.
You can find Amazon S3 V2 code examples in the Java Developer V2 DEV Guide here:
Developer guide - AWS SDK for Java 2.x
(At this point, the Amazon S3 Service guide does not have V2 examples in it.)
In addition, you can find all Amazon S3 V2 code examples in AWS Github here:
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/example_code/s3
If you are not familiar with developing apps using the AWS SDK for Java V2, it's recommended that you start here:
Get started with the AWS SDK for Java 2.x
(This getting started topic happens to use the Amazon S3 Java V2 API to help get you up and running with using the AWS SDK for Java V2)
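To make the version split concrete, here is a rough sketch (in Scala, though the Java calls are the same) of how each client is typically constructed and which interface type you would declare a method parameter as; the region is just an example:

// AWS SDK for Java V1 (legacy): AmazonS3 is the interface, AmazonS3Client the implementation.
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

val v1Client: AmazonS3 = AmazonS3ClientBuilder.standard()
  .withRegion(Regions.EU_CENTRAL_1)
  .build()

// AWS SDK for Java V2: S3Client is the interface you code against.
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client

val v2Client: S3Client = S3Client.builder()
  .region(Region.EU_CENTRAL_1)
  .build()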
Update:
You stated: However, setObjectTagging is supported in type AmazonS3 but not in type S3Client.
The way to tag an Object in an Amazon S3 bucket by using Java V2 API is to use this code:
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.s3.model.GetObjectTaggingRequest;
import software.amazon.awssdk.services.s3.model.GetObjectTaggingResponse;
import software.amazon.awssdk.services.s3.model.PutObjectTaggingRequest;
import software.amazon.awssdk.services.s3.model.Tag;
import software.amazon.awssdk.services.s3.model.Tagging;

// First need to get the existing tag set; otherwise the existing tags are overwritten.
GetObjectTaggingRequest getObjectTaggingRequest = GetObjectTaggingRequest.builder()
    .bucket(bucketName)
    .key(key)
    .build();

GetObjectTaggingResponse response = s3.getObjectTagging(getObjectTaggingRequest);

// The returned tag list is immutable - copy it into a modifiable list.
List<Tag> existingList = response.tagSet();
List<Tag> newTagList = new ArrayList<>(existingList);

// Create a new tag.
Tag myTag = Tag.builder()
    .key(label)
    .value(LabelValue)
    .build();

// Add the new tag to the list.
newTagList.add(myTag);

Tagging tagging = Tagging.builder()
    .tagSet(newTagList)
    .build();

PutObjectTaggingRequest taggingRequest = PutObjectTaggingRequest.builder()
    .key(key)
    .bucket(bucketName)
    .tagging(tagging)
    .build();

s3.putObjectTagging(taggingRequest);

Spark: problem when writing a large file to AWS s3a storage

I have an unexplained problem with uploading large files to s3a. I am using an EC2 instance with spark-2.4.4-bin-hadoop2.7 and a Spark DataFrame to write to s3a with V4 signing. I authenticate to S3 using an access key and secret key.
The procedure is as follows:
1) read a CSV file from s3a as a Spark DataFrame;
2) process the data;
3) upload the DataFrame in Parquet format to s3a.
If I run this procedure with a 400 MB CSV file there is no problem, everything works fine. But when I do the same with a 12 GB CSV file, an error appears while writing the Parquet file to s3a:
Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2CA5F6E85BC36E8D, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I use the following settings:
import pyspark
from pyspark import SparkContext
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
sc = SparkContext()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
accesskey = input()
secretkey = input()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("fs.s3a.fast.upload", "true")
hadoopConf.set("fs.s3a.fast.upload", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.access.key", accesskey)
hadoopConf.set("fs.s3a.secret.key", secretkey)
I also tried to add these settings:
hadoopConf.set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')
hadoopConf.set('spark.speculation', "false")
hadoopConf.set('spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')
hadoopConf.set('spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')
but it didn’t help.
Again, the problem appears only with large file.
I would appreciate any help. Thank you.
Try setting fs.s3a.fast.upload to true (in the snippet above, the second hadoopConf.set call overwrites it with the endpoint string);
otherwise, the multipart upload code was only ever experimental in Hadoop 2.7 and you may have hit a corner case. Upgrade to the Hadoop 2.8 versions or later and it should go away.
Updated hadoop from 2.7.3 to 2.8.5 and now everything works without errors.
Had the same issue. Made a Spark cluster on EMR (5.27.0) and configured it with Spark 2.4.4 on Hadoop 2.8.5. Uploaded my code to a notebook I created in EMR JupyterLab, ran it, and it worked perfectly!

Spark streaming job using custom jar on AWS EMR fails upon write

I am trying to convert a file (in csv.gz format) into Parquet using a streaming DataFrame. I have to use streaming DataFrames because the compressed files are ~700 MB in size. The job runs as a custom JAR on AWS EMR. The source, destination and checkpoint locations are all on AWS S3. But as soon as I try to write to the checkpoint, the job fails with the following error:
java.lang.IllegalArgumentException:
Wrong FS: s3://my-bucket-name/transformData/checkpoints/sourceName/fileType/metadata,
expected: hdfs://ip-<ip_address>.us-west-2.compute.internal:8020
There are other Spark jobs running on the EMR cluster that read from and write to S3 successfully (but they are not using Spark Streaming), so I do not think it is an issue with S3 file system access as suggested in this post. I also looked at this question, but the answers do not help in my case. I am using Scala 2.11.8 and Spark 2.1.0.
Following is the code I have so far
...
val spark = conf match {
  case null =>
    SparkSession
      .builder()
      .appName(this.getClass.toString)
      .getOrCreate()
  case _ =>
    SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
}

// Read CSV files into a structured streaming DataFrame
val streamingDF = spark.readStream
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("timestampFormat", "dd-MMM-yyyy HH:mm:ss")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "")
  .schema(schema)
  .load(s"s3://my-bucket-name/rawData/sourceName/fileType/*/*/fileNamePrefix*")
  .withColumn("event_date", $"event_datetime".cast("date"))
  .withColumn("event_year", year($"event_date"))
  .withColumn("event_month", month($"event_date"))

// Write the results to Parquet
streamingDF.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket-name/transformedData/sourceName/fileType/")
  .option("compression", "gzip")
  .option("checkpointLocation", "s3://my-bucket-name/transformedData/checkpoints/sourceName/fileType/")
  .partitionBy("event_year", "event_month")
  .trigger(ProcessingTime("900 seconds"))
  .start()
I have also tried to use s3n:// instead of s3:// in the URI but that does not seem to have any effect.
TL;DR: Upgrade Spark or avoid using S3 as the checkpoint location.
Apache Spark (Structured Streaming) : S3 Checkpoint support
Also you should probably specify the write path with s3a://
A successor to the S3 Native, s3n:// filesystem, the S3a: system uses Amazon's libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL schema.
https://wiki.apache.org/hadoop/AmazonS3
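Putting the two suggestions together, a rough sketch of the write with the output path on s3a:// and the checkpoint moved off S3 onto the cluster's HDFS (both paths are placeholders):

// Checkpoint on the cluster's default HDFS, output on s3a://
streamingDF.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket-name/transformedData/sourceName/fileType/")
  .option("compression", "gzip")
  .option("checkpointLocation", "hdfs:///checkpoints/sourceName/fileType/")
  .partitionBy("event_year", "event_month")
  .trigger(ProcessingTime("900 seconds"))
  .start()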