S3 Encryption not working from spark-submit command [duplicate]

I have configured an encryption-enabled EMR cluster (properties in emrfs-site.xml).
I am using DataFrame SaveMode.Append to write to s3n://my-bucket/path/ to save in S3,
but I am not able to see the objects getting AWS KMS encrypted.
However, when I do a simple insert from Hive on EMR, I can see the objects getting AWS KMS encrypted.
How can I encrypt files written from a DataFrame to S3 using SSE-KMS?

The problem was that we were using s3a to save the files from the Spark program on EMR. AWS officially doesn't support the use of s3a on EMR. Although we were able to save data to S3, it was not being encrypted. I tried using s3:// and s3n:// instead; the encryption works with both.
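For illustration, a minimal PySpark sketch of the working write path, assuming the cluster's emrfs-site.xml already enables SSE-KMS; the bucket and path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sse-kms-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Writing through the EMRFS schemes (s3:// or s3n://) picks up the SSE-KMS
# settings from emrfs-site.xml; s3a:// bypasses EMRFS, so objects written
# through it are not covered by the EMR-managed encryption configuration.
df.write.mode("append").parquet("s3://my-bucket/path/")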

Related

Export data from Elastic search to S3 using Glue job

I wish to transfer data from UltraWarm Elasticsearch to S3 in the source region using AWS Glue ETL. I am having difficulty finding documentation on it. Can someone help me with it?
You can use the custom connection type custom.spark and then set the OpenSearch configuration options, which start with the prefix es. Read more about it and see the example in the documentation here.
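As a rough sketch of such a job: the connection name, domain endpoint, index, and S3 path below are placeholders, and the es.* option names are assumed to follow the elasticsearch-hadoop connector.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read from OpenSearch through the custom Spark connector
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="custom.spark",
    connection_options={
        "connectionName": "my-opensearch-connection",  # Glue connection wrapping the connector
        "es.nodes": "https://my-domain.us-east-1.es.amazonaws.com",
        "es.port": "443",
        "es.nodes.wan.only": "true",
        "es.resource": "my-index",  # index to export
    },
)

# Write the export to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/opensearch-export/"},
    format="parquet",
)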
Alternatives
If you have the choice, it is always better to push directly from the application to S3 instead of pulling the data from OpenSearch into S3.
For a full dump, use the elasticsearch-dump command to copy the data from your OpenSearch cluster to your AWS S3 bucket.
For the input use your OpenSearch SERVICE_URI.
For the output, choose the Amazon S3 path including the file name that you want for your document.

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer the data to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the best way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also looked at the AWS documentation for Lambda and Glue but couldn't find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
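A minimal sketch of such a function, assuming an S3 event trigger; the cluster endpoint, table, and IAM role are placeholders, and psycopg2 would need to be packaged with the function (for example as a Lambda layer):

import os
import psycopg2

def handler(event, context):
    # Pull the bucket/key of the newly arrived object from the S3 event
    record = event["Records"][0]["s3"]
    s3_path = "s3://{}/{}".format(record["bucket"]["name"], record["object"]["key"])

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="mydb",
        user="loader",
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        # COPY loads the CSV directly from S3 into the target table
        cur.execute("""
            COPY my_schema.my_table
            FROM '{}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS CSV IGNOREHEADER 1;
        """.format(s3_path))
    conn.close()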
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: an AWS Lambda function that runs the COPY command into Redshift
A more fully-featured Node-based one: A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

Specify KMS Key to encrypt data during datasink in AWS Glue Job

I have a Scala job that reads table data and writes it to an S3 bucket in Parquet format. I want to encrypt the data using KMS before placing it in the S3 bucket. I know that the ETL job run has an option to select server-side encryption, but that uses AES256 and has no option to select KMS. Is it possible to do this?
I have already tried the solution given for a different question, but it didn't work.
Similar question previously asked

PySpark s3 Access with Multiple AWS Credential Profiles?

I'm writing a pyspark job that needs to read out of two different s3 buckets. Each bucket has different credentials, which are stored on my machine as separate profiles in ~/.aws/credentials.
Is there a way to tell pyspark which profile to use when connecting to s3?
When using a single bucket, I had set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in conf/spark-env.sh. Naturally, this only works for accessing 1 of the 2 buckets.
I am aware that I could set these values manually in pyspark when required, using:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")
But would prefer a solution where these values were not hard-coded in. Is that possible?
Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, as well as login details.
All fs.s3a options other than a small set of unmodifiable values (currently fs.s3a.impl) can be set on a per bucket basis.
The bucket specific option is set by replacing the fs.s3a. prefix on an option with fs.s3a.bucket.BUCKETNAME., where BUCKETNAME is the name of the bucket.
When connecting to a bucket, all options explicitly set will override the base fs.s3a. values.
source http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
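For example, in PySpark the per-bucket overrides could be passed as Hadoop options; this is a sketch, and the bucket names and keys are placeholders:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("two-bucket-access")
    # Credentials used only for bucket-a
    .config("spark.hadoop.fs.s3a.bucket.bucket-a.access.key", "ACCESS_KEY_A")
    .config("spark.hadoop.fs.s3a.bucket.bucket-a.secret.key", "SECRET_KEY_A")
    # Credentials used only for bucket-b
    .config("spark.hadoop.fs.s3a.bucket.bucket-b.access.key", "ACCESS_KEY_B")
    .config("spark.hadoop.fs.s3a.bucket.bucket-b.secret.key", "SECRET_KEY_B")
    .getOrCreate()
)

df_a = spark.read.parquet("s3a://bucket-a/some/path/")
df_b = spark.read.parquet("s3a://bucket-b/other/path/")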
s3n doesn't support the AWS credentials stored in ~/.aws/credentials. You should try to use Hadoop 2.7 and the new Hadoop S3 implementation, s3a, which uses the AWS SDK.
I'm not sure whether the current Spark release 1.6.1 works well with Hadoop 2.7, but Spark 2.0 definitely has no problem with Hadoop 2.7 and s3a.
For Spark 1.6.x, we made a dirty hack with the S3 driver from EMR... you can take a look at this doc: https://github.com/zalando/spark-appliance#emrfs-support
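If the goal is still to reuse the profiles in ~/.aws/credentials, one hedged option with s3a (since it goes through the AWS SDK) is to point the credential provider at a named profile; the provider plug-in and profile name below are assumptions, not something from the original answer:

import os
from pyspark.sql import SparkSession

# Profile name defined in ~/.aws/credentials (placeholder)
os.environ["AWS_PROFILE"] = "bucket-a-profile"

spark = (
    SparkSession.builder
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.profile.ProfileCredentialsProvider",
    )
    .getOrCreate()
)

df = spark.read.csv("s3a://bucket-a/data.csv")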

How to write mapreduce program with amazon ec2 and s3

I want to analyse data stored in Amazon S3. How can I write a Java program on Amazon EMR and access this data?
The data url is http://s3.amazonaws.com/aws-publicdatasets/trec/kba/FAKBA1/index.html
Amazon EMR has built-in integration with S3.
There are several guides on how to do it: here and this video.
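The question asks about a Java MapReduce program, but as a quick illustration of that built-in integration (shown here as a PySpark sketch rather than Java), an S3 path can be used directly as job input on EMR; only the public dataset path comes from the question:

from pyspark import SparkContext

sc = SparkContext(appName="read-public-dataset")

# On EMR, S3 paths can be passed directly as job input/output locations
lines = sc.textFile("s3://aws-publicdatasets/trec/kba/FAKBA1/index.html")
print(lines.take(5))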