spark read from different account s3 and write to my account s3 - amazon-web-services

I have a Spark job that needs to read data from an S3 bucket in another account (the Data Account) and process it.
Once the data is processed, the job should write the results back to an S3 bucket in my own account.
So I configured the access key and secret key of the Data Account in my Spark session like this:
val hadoopConf=sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key","DataAccountKey")
hadoopConf.set("fs.s3a.secret.key","DataAccountSecretKey")
hadoopConf.set("fs.s3a.endpoint", "s3.ap-northeast-2.amazonaws.com")
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
val df = spark.read.json("s3a://DataAccountS/path")
/* Reading succeeds */
df.limit(3).write.json("s3a://myaccount/test/")
With this, reading works fine, but I get the error below when writing.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 301, AWS Service: Amazon S3, AWS Request ID: A5E574113745D6A0, AWS Error Code: PermanentRedirect, AWS Error Message: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
But if I don't configure the Data Account details and just write some dummy data to my own S3 bucket from Spark, it works.
So how should I configure Spark so that both reading from the other account's S3 and writing to my account's S3 work?

If your Spark classpath has the Hadoop 2.7 JARs on it, you can put the secrets in the path, with a URL like s3a://DataAccountKey:DataAccountSecretKey@DataAccount/path. Be aware this will log the secrets everywhere.
Hadoop 2.8+ JARs will tell you off for logging your secrets everywhere, but they add per-bucket binding:
spark.hadoop.fs.s3a.bucket.DataAccount.access.key DataAccountKey
spark.hadoop.fs.s3a.bucket.DataAccount.secret.key DataAccountSecretKey
spark.hadoop.fs.s3a.bucket.DataAccount.endpoint s3.ap-northeast-2.amazonaws.com
Then, for all interaction with that bucket, these per-bucket options will override the main settings.
Note: if you want to use this, don't think that dropping hadoop-aws-2.8.jar into your classpath will work; you'll only get classpath errors. All of the hadoop-* JARs need to go to 2.8, and the aws-sdk needs updating too.
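As a rough illustration of the per-bucket binding above, here is a minimal Scala sketch that builds the three properties for one bucket. The object and method names (PerBucketS3A, options) are hypothetical; only the property keys come from the answer, and the key values are the placeholders from the question. With these in place, the global fs.s3a.endpoint should no longer need to point at the Data Account's region.

```scala
// Sketch only: PerBucketS3A and options() are made-up names for illustration.
object PerBucketS3A {
  /** Returns the spark.hadoop.fs.s3a.bucket.<bucket>.* entries for one bucket. */
  def options(bucket: String,
              accessKey: String,
              secretKey: String,
              endpoint: String): Map[String, String] = {
    val prefix = s"spark.hadoop.fs.s3a.bucket.$bucket."
    Map(
      prefix + "access.key" -> accessKey,
      prefix + "secret.key" -> secretKey,
      prefix + "endpoint"   -> endpoint
    )
  }

  def main(args: Array[String]): Unit = {
    // Placeholder values from the question; real keys should come from a secure source.
    val opts = options(
      "DataAccount",
      "DataAccountKey",
      "DataAccountSecretKey",
      "s3.ap-northeast-2.amazonaws.com")

    // Print in spark-defaults.conf form, one "key value" pair per line.
    opts.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k $v") }
  }
}
```

Each pair could also be passed to SparkSession.builder().config(key, value) before the session is created, assuming Hadoop 2.8+ JARs are on the classpath.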

Related

Unable to connect to S3 while creating Elasticsearch snapshot repository

I am trying to register a repository on AWS S3 to store Elasticsearch snapshots.
I am following the guide and ran the very first command listed in the doc.
But I get an Access Denied error when executing that command.
The role being used to perform operations on S3 is the AmazonEKSNodeRole.
I have assigned the appropriate permissions to the role to perform operations on the S3 bucket.
Also, here is another doc which suggests using Kibana for Elasticsearch versions > 7.2, but I am doing the same via cURL requests.
Below is the trust policy of the role through which I am making the request to register the repository in the S3 bucket.
Also, below are the screenshots of the permissions of the trusting and trusted accounts respectively -

Spark credential chain ordering - S3 Exception Forbidden

I'm running Spark 2.4 on an EC2 instance. I am assuming an IAM role and setting the key/secret key/token in the sparkSession.sparkContext.hadoopConfiguration, along with the credentials provider as "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider".
When I try to read a dataset from s3 (using s3a, which is also set in the hadoop config), I get an error that says
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 7376FE009AD36330, AWS Error Code: null, AWS Error Message: Forbidden
read command:
val myData = sparkSession.read.parquet("s3a://myBucket/myKey")
I've repeatedly checked the S3 path and it's correct. My assumed IAM role has the right privileges on the S3 bucket. The only thing I can figure at this point is that spark has some sort of hidden credential chain ordering and even though I have set the credentials in the hadoop config, it is still grabbing credentials from somewhere else (my instance profile???). But I have no way to diagnose that.
Any help is appreciated. Happy to provide any more details.
spark-submit will pick up your AWS environment variables and set them as the fs.s3a access key, secret key and session token, overwriting any you've already set.
If you only want to use the IAM credentials, just set fs.s3a.aws.credentials.provider to com.amazonaws.auth.InstanceProfileCredentialsProvider; it'll be the only one used.
Further Reading: Troubleshooting S3A
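One way to check whether the environment-variable override described above could apply is to list which AWS credential variables are set before launching the job. The sketch below assumes the standard AWS variable names (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN); the object and method names are hypothetical, and it prints only the variable names, never the values.

```scala
// Sketch only: AwsEnvCheck and present() are made-up names for illustration.
object AwsEnvCheck {
  // Assumed to be the variables spark-submit copies into the fs.s3a settings.
  val CredentialVars: Seq[String] =
    Seq("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")

  /** Returns the names (never the values) of credential variables that are set and non-empty. */
  def present(env: Map[String, String]): Seq[String] =
    CredentialVars.filter(name => env.get(name).exists(_.nonEmpty))

  def main(args: Array[String]): Unit = {
    val found = present(sys.env)
    if (found.isEmpty)
      println("No AWS credential environment variables are set.")
    else
      println("Set in the environment: " + found.mkString(", "))
  }
}
```

If any of these show up in the environment that launches spark-submit, they are a likely source of credentials other than the ones set in the Hadoop configuration.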

How to configure Spark running in local mode on Amazon EC2 to use the IAM role for S3

I'm running Spark 2 in local mode on an Amazon EC2 instance. When I try to read data from S3, I get the following exception:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively)
I can, but I would rather not set the access key and secret key manually in code, for security reasons.
The EC2 instance is set up with an IAM role that allows it full access to the relevant S3 bucket. For every other AWS API call this is sufficient, but Spark seems to ignore it.
Can I make Spark use this IAM role instead of the access key and secret key?
Switch to using the s3a:// scheme (with the Hadoop 2.7.x JARs on your classpath) and this happens automatically. The "s3://" scheme with non-EMR versions of Spark/Hadoop is not the connector you want (it's old, non-interoperable and has been removed from recent versions).
I am using hadoop-2.8.0 and spark-2.2.0-bin-hadoop2.7.
Spark-S3-IAM integration works well with the following AWS packages on the driver.
spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 ...
Scala code snippet:
sc.textFile("s3a://.../file.gz").count()
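If existing job code still carries s3:// or s3n:// paths, a small helper can rewrite them to the s3a:// scheme recommended above. This is only a sketch for non-EMR setups (on EMR, s3:// is a different connector); the object and method names are hypothetical and the bucket names in main are placeholders.

```scala
// Sketch only: S3ASchemes and toS3A() are made-up names for illustration.
object S3ASchemes {
  /** Rewrites s3:// and s3n:// paths to s3a://, leaving every other path untouched. */
  def toS3A(path: String): String =
    if (path.startsWith("s3://")) "s3a://" + path.stripPrefix("s3://")
    else if (path.startsWith("s3n://")) "s3a://" + path.stripPrefix("s3n://")
    else path

  def main(args: Array[String]): Unit = {
    // Placeholder paths; the last one is returned unchanged.
    Seq("s3://some-bucket/file.gz", "s3n://some-bucket/file.gz", "hdfs://nn/file.gz")
      .foreach(p => println(s"$p -> ${toS3A(p)}"))
  }
}
```

The rewritten path can then be passed to sc.textFile as in the snippet above.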

S3 download works from console, but not from commandline

Can anyone explain this behaviour:
When I try to download a file from S3, I get the following error:
An error occurred (403) when calling the HeadObject operation: Forbidden.
Commandline used:
aws s3 cp s3://bucket/raw_logs/my_file.log .
However, when I use the S3 console website, I'm able to download the file without issues.
The access key used by the commandline is correct. I verified this, and other AWS operations via commandline work fine. The access key is tied to the same user account I use in the AWS console.
So I assume you're sure about the IAM policy of your user and that the file exists in your bucket.
If you have set a default region in your configuration but the bucket was not created in that region (yes, S3 buckets are created in a region), the CLI will not find it. Make sure to add the region flag to the command:
aws s3 cp s3://bucket/raw_logs/my_file.log . --region <region of the bucket>
Other notes:
Make sure to upgrade to the latest version of the CLI.
It can also be caused by a system clock that is not synchronized. If you're not passing any sync parameters it might be OK, but I don't know the internals, and for some commands the CLI compares the system clock against S3; if you're out of sync, that might cause issues.
I had a similar issue due to having two-factor authentication enabled on my account. Check out how to configure 2FA for the aws cli here: https://aws.amazon.com/premiumsupport/knowledge-center/authenticate-mfa-cli/

Error During downloading files from S3 to EC2

While using the wget command to download files from Amazon S3 to an Amazon EC2 instance,
I get the following message and the file is not downloaded.
How can I solve this issue?
Command:
"wget https://s3.amazonaws.com/docsbucket/intro.doc"
Error message:
"Resolving s3.amazonaws.com... 207.171.163.225
Connecting to s3.amazonaws.com|207.171.163.225|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-03-20 13:06:00 ERROR 403: Forbidden."
You should launch your EC2 instance with the permission to read from your S3 buckets.
The easiest way to do it is by using roles. In the IAM (Identity and Access Management) service of AWS, you simply create a role that can read from S3. Then you launch your instance with this role. AWS will take care of getting the right credentials onto the instance, and you can get your S3 objects using the S3 CLI tools.
You can use the same "trick" to access other resources and to perform other actions on these resources.
You can read more about it in the AWS documentation: http://docs.aws.amazon.com/IAM/latest/UserGuide/role-usecase-ec2app.html
Unless the file is public, you will need to authenticate with keys to download the file. This is probably easiest done with a tool like s3cmd.
This worked after I gave read permission to everyone for the file.
Go to the Permissions tab -> Public Access -> click Everyone -> then grant the Read permission.