I have a Scala job that reads table data and writes it to an S3 bucket in Parquet format. I want to encrypt the data using KMS before placing it in the S3 bucket. I know the ETL job run has an option to select server-side encryption, but that uses AES256 and has no option to select a KMS key. Is it possible to do this?
I have already tried the solution given for a similar question that was asked previously, but it didn't work.
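For reference, this is roughly what I am trying to end up with; the key ARN and paths are placeholders, and the Hadoop-level encryption properties are my guess at what is needed (untested):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("table-to-encrypted-parquet").getOrCreate()

// Guessed Hadoop settings for the EMR-style s3:// connector; for the S3A
// connector the equivalents would be fs.s3a.server-side-encryption-algorithm
// (SSE-KMS) and fs.s3a.server-side-encryption.key. The key ARN is a placeholder.
spark.sparkContext.hadoopConfiguration.set("fs.s3.enableServerSideEncryption", "true")
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3.serverSideEncryption.kms.keyId",
  "arn:aws:kms:us-east-1:111122223333:key/my-key-id")

// Stand-in for the actual table read.
val df = spark.read.parquet("s3://source-bucket/table/")

// The write whose output I want encrypted with the KMS key above.
df.write.parquet("s3://target-bucket/encrypted-output/")
```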
Is there a way to export data from a SQL Server query to an AWS S3 bucket as a CSV file?
I created the bucket:
arn:aws:s3:::s3tintegration
https://s3tintegration.s3.sa-east-1.amazonaws.com/key_prefix/
Can anybody help me?
If you are looking for an automated solution, there are several options in AWS.
Schedule or trigger a Lambda that connects to RDS, executes the query, and saves the result as a CSV file in the S3 bucket. Remember that the Lambda has to be in the same VPC and subnet as your SQL Server.
If the query takes a long time, you can use AWS Glue to run the task and write the output to S3 in CSV format. Glue can use a JDBC connection as well.
You can also use DMS, with SQL Server as the source and S3 as the target in CSV format. Note that DMS can migrate a full table or part of it, but not the result of a query.
If you are familiar with big data tools, you can use Hive to run your query and write the output to S3 in CSV format.
The quickest and easiest way to start is Lambda; a rough sketch is below.
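A minimal sketch of the Lambda option, assuming a Scala handler with the SQL Server JDBC driver packaged alongside it. The connection string, credentials, and query are placeholders (the bucket and prefix are the ones from the question), and the CSV writing is naive (no quoting or escaping):

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest
import java.sql.DriverManager

// Runs a fixed query against SQL Server and drops the result as a CSV object in S3.
class QueryToCsvHandler extends RequestHandler[java.util.Map[String, String], String] {

  override def handleRequest(input: java.util.Map[String, String], context: Context): String = {
    // Placeholder endpoint/credentials; in practice read them from environment
    // variables or Secrets Manager. The Lambda must run in the RDS VPC/subnet.
    val conn = DriverManager.getConnection(
      "jdbc:sqlserver://my-rds-endpoint:1433;databaseName=mydb",
      "db_user", "db_password")

    val csv = new StringBuilder
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT id, name, created_at FROM dbo.my_table") // placeholder query
      val meta = rs.getMetaData
      val cols = meta.getColumnCount
      // Header row, then data rows (nulls written as empty strings).
      csv.append((1 to cols).map(i => meta.getColumnName(i)).mkString(",")).append("\n")
      while (rs.next()) {
        csv.append((1 to cols).map(i => Option(rs.getString(i)).getOrElse("")).mkString(",")).append("\n")
      }
    } finally conn.close()

    // Upload to the bucket/prefix mentioned in the question.
    val s3 = S3Client.create()
    s3.putObject(
      PutObjectRequest.builder()
        .bucket("s3tintegration")
        .key("key_prefix/query_result.csv")
        .build(),
      RequestBody.fromString(csv.toString))

    "done"
  }
}
```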
I need a little help finding a better solution for my use case below.
I have an S3 bucket that contains input data encrypted with KMS KEY 1,
so I am able to set KMS KEY 1 on my Spark session using "spark.hadoop.fs.s3.serverSideEncryption.kms.keyId"
and read the data.
Now I want to write the data to another S3 bucket, but that bucket is encrypted with KMS KEY 2.
What I am currently doing is creating a Spark session with KEY 1, reading the data into a DataFrame, converting it to a Pandas DataFrame, killing the Spark session, recreating the Spark session (within the same AWS Glue job) with KMS KEY 2, converting the Pandas data from the previous step back into a Spark DataFrame, and writing it to the output S3 bucket.
But this approach sometimes causes datatype issues. Is there a better alternative to handle this use case?
Thanks in advance; your help is greatly appreciated.
You don't need to declare which key to use to decrypt data encrypted with S3-KMS; the key ID is attached as an attribute of the object. AWS S3 reads the encryption settings, sees the key ID, and sends the KMS-encrypted symmetric key off to AWS KMS, asking for it to be decrypted on behalf of the user/IAM role requesting the decryption. If the user/role has the right permission, S3 gets the unencrypted key back, decrypts the object, and returns it.
So to read data from the bucket encrypted with KMS KEY 1, you should be able to set the key to the KEY 2 value (or no encryption at all) and still get the data back.
Disclaimer: I haven't tested this with the EMR S3 connector, just the Apache S3A one, but since S3-KMS works the same everywhere, I'd expect this to hold. Encryption with client-supplied keys (S3-CSE) is a different story; there you do need the clients correctly configured, which is why S3A supports per-bucket configuration.
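If that holds, the whole job above collapses to one Spark session: configure only the write key (KEY 2) and let S3 decrypt the KEY 1 input on its own. A minimal sketch, using the property name from the question; bucket paths and the key ARN are placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// One Spark session for the whole job: reads decrypt automatically because the
// key ID travels with each object, so only the *write* key needs configuring.
val spark = SparkSession.builder()
  .appName("copy-between-kms-buckets")
  // Property name taken from the question (EMR-style connector); for S3A the
  // equivalents are fs.s3a.server-side-encryption-algorithm=SSE-KMS and
  // fs.s3a.server-side-encryption.key. The ARN below is a placeholder for KEY 2.
  .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId",
          "arn:aws:kms:us-east-1:111122223333:key/KEY-2-ID")
  .getOrCreate()

// Read from the bucket encrypted with KMS KEY 1; no key configuration is needed to decrypt.
val df = spark.read.parquet("s3://input-bucket-key1/path/")

// Write to the bucket whose objects should be encrypted with KMS KEY 2.
df.write.mode(SaveMode.Overwrite).parquet("s3://output-bucket-key2/path/")
```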
I wish to transfer data from a database like MySQL (RDS) to S3 using AWS Glue ETL.
I am having difficulty doing this; the documentation is really not good.
I found this link here on Stack Overflow:
Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?
So, based on that link, it seems that Glue cannot have an S3 bucket as a data destination, only as a data source.
I hope I am wrong on this.
But if one builds an ETL tool, one of the first basics on AWS should be for it to transfer data to and from an S3 bucket, the major form of storage on AWS.
So I hope someone can help with this.
You can add a Glue connection to your RDS instance and then use a Spark ETL script to write the data to S3.
You'll first have to crawl the database table using a Glue crawler. This creates a table in the Data Catalog, which the job can then use to transfer the data to S3; a rough sketch of such a script is shown below. If you do not wish to perform any transformations, you can also use the console steps for autogenerated ETL scripts.
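A rough sketch of such a Glue Scala ETL script, assuming the crawler has already created the catalog table; the database name, table name, and output path are placeholders:

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

// Reads the crawled catalog table and writes it out to S3 as CSV.
object RdsTableToS3 {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Source: the table the crawler created in the Data Catalog
    // (backed by the Glue connection to the RDS instance).
    val source = glueContext
      .getCatalogSource(database = "my_catalog_db", tableName = "my_rds_table")
      .getDynamicFrame()

    // Sink: CSV files under the given S3 prefix.
    glueContext
      .getSinkWithFormat(
        connectionType = "s3",
        options = JsonOptions("""{"path": "s3://my-output-bucket/rds-export/"}"""),
        format = "csv")
      .writeDynamicFrame(source)

    Job.commit()
  }
}
```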
I have also written a blog on how to Migrate Relational Databases to Amazon S3 using AWS Glue. Let me know if it addresses your query.
https://ujjwalbhardwaj.me/post/migrate-relational-databases-to-amazon-s3-using-aws-glue
Have you tried https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-copyrdstos3.html?
You can use AWS Data Pipeline - it has standard templates for full as well as incremental copies from RDS to S3.
I have 3 types of CSV files in my S3 bucket and want to flow them into their respective Redshift tables based on the CSV prefix. I am thinking of using Kinesis to stream the data to Redshift, as a file will be dropped into S3 every 5 minutes. I am completely new to AWS and not sure how to achieve this.
I have gone through the AWS documentation but am still not sure how to achieve this.
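Something along these lines is what I am picturing, in case it helps frame the question: an S3-event-triggered Lambda that routes each new file to a table by prefix and issues a COPY. This is an untested sketch; the prefix-to-table mapping, connection details, and IAM role are placeholders, and it assumes the Redshift JDBC driver is packaged with the function:

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import java.sql.DriverManager
import scala.jdk.CollectionConverters._

// Routes newly dropped CSV files to Redshift tables based on the key prefix.
class CsvToRedshiftRouter extends RequestHandler[S3Event, String] {

  // Hypothetical mapping: key prefix -> target Redshift table.
  private val tableByPrefix = Map(
    "orders_"    -> "staging.orders",
    "customers_" -> "staging.customers",
    "products_"  -> "staging.products"
  )

  override def handleRequest(event: S3Event, context: Context): String = {
    // Placeholder connection details; the Redshift JDBC driver is assumed on the classpath.
    val conn = DriverManager.getConnection(
      "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb",
      "db_user", "db_password")

    try {
      for (record <- event.getRecords.asScala) {
        val bucket = record.getS3.getBucket.getName
        val key    = record.getS3.getObject.getKey

        tableByPrefix.collectFirst { case (prefix, table) if key.startsWith(prefix) => table }
          .foreach { table =>
            // COPY pulls the file straight from S3 into the matching table.
            val copySql =
              s"""COPY $table FROM 's3://$bucket/$key'
                 |IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
                 |FORMAT AS CSV IGNOREHEADER 1""".stripMargin
            val stmt = conn.createStatement()
            try stmt.execute(copySql) finally stmt.close()
          }
      }
    } finally conn.close()

    "ok"
  }
}
```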
I have configured an encryption-enabled EMR cluster (properties in emrfs-site.xml).
I am using DataFrame SaveMode.Append to write to s3n://my-bucket/path/ to save the data in S3.
But I am not able to see the objects getting AWS KMS encrypted.
However, when I do a simple insert from Hive on EMR, I am able to see the objects getting AWS KMS encrypted.
How can I encrypt files written from a DataFrame to S3 using SSE-KMS?
The problem was that we were using s3a to save the files from the Spark program to S3 on EMR. AWS officially doesn't support the use of s3a on EMR. Though we were able to save data in S3, it was not encrypting the data. I tried using s3:// and s3n:// instead, and the encryption works with both.
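A minimal sketch of the working setup, assuming the SSE-KMS properties are already configured in emrfs-site.xml as described in the question; paths are placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// The SSE-KMS settings already live in emrfs-site.xml on the cluster; the only
// change needed here is the URI scheme used for the write.
val spark = SparkSession.builder().appName("sse-kms-write").getOrCreate()
val df = spark.read.parquet("s3://source-bucket/input/")

// EMRFS scheme (s3:// or s3n://): the cluster's encryption settings are applied.
df.write.mode(SaveMode.Append).parquet("s3://my-bucket/path/")

// The s3a:// connector bypasses EMRFS, so the emrfs-site settings are ignored
// and the objects land unencrypted (which is what we were seeing before the fix).
```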