Access two different S3 buckets with two different credentials in Spark - amazon-web-services

I have a scenario where I read data from S3 Bucket from Account 1. This works as expected as I use TemporaryAWSCredentialsProvider. I have another S3 bucket in AWS Account 2 which is where I am running Spark (Spark on EKS). How to configure credentials for Spark to access or write event logs to S3 bucket in Account 2? I was using InstanceProfileCredentialsProvider since credentials for both the buckets are different and the IAM role is configured to allow the EKS instances/pods to have access to the bucket in this account, which seems to not work as expected and I think this is because I have both InstanceProfileCredentialsProvider and TemporaryAWSCredentialsProvider in the same SparkConf which might not be allowed. Is there a way to achieve this?

Related

Unable to configure SageMaker execution Role with access to S3 bucket in another AWS account

Requirement: Create SakeMaker GroundTruth labeling job with input/output location pointing to S3 bucket in another AWS account
High Level Steps Followed: Lets say, Account_A: SageMaker GroundTruth labeling job and Account_B: S3 bucket
Create role AmazonSageMaker-ExecutionRole in Account_A with 3 policies attached:
AmazonSageMakerFullAccess
Account_B_S3_AccessPolicy: Policy with necessary S3 permissions to access S3 bucket in Account_B
AssumeRolePolicy: Assume role policy for arn:aws:iam::Account_B:role/Cross-Account-S3-Access-Role
Create role Cross-Account-S3-Access-Role in Account_B with 1 policy and 1 trust relationship attached:
S3_AccessPolicy: Policy with necessary S3 permissions to access S3 bucket in the this Account_B
TrustRelationship: For principal arn:aws:iam::Account_A:role/AmazonSageMaker-ExecutionRole
Error: While trying to create SakeMaker GroundTruth labeling job with IAM role as AmazonSageMaker-ExecutionRole, it throws error AccessDenied: Access Denied - The S3 bucket 'Account_B_S3_bucket_name' you entered in Input dataset location cannot be reached. Either the bucket does not exist, or you do not have permission to access it. If the bucket does not exist, update Input dataset location with a new S3 URI. If the bucket exists, give the IAM entity you are using to create this labeling job permission to read and write to this S3 bucket, and try your request again.
In your high level step 2, the approach should change to using a Resource Policy on your S3 bucket that allows account A to write to it. Rather than expecting Account A to assume a role in Account B, which I don't believe Sagemeker will do. Therefore the general approach is to do the following:
Account A Sagemaker service is given has a iam policy with a that allows access to Account B bucket. (Basically what you've done).
Account B bucket is given a resource policy that allows Account A to access it.
The following article gives additional help on this topic: How can I provide cross-account access to objects that are in Amazon S3 buckets?
Reverted back to original approach where access to the SageMaker execution role was provided through direct S3 bucket policy.
While creating the GT job from console:
(i) Expects the user creating the job also to have access to the data in cross account S3 bucket; Updated bucket policy to have access for both SageMaker execution role as well as user
(ii) Expects the manifest in own account's S3 bucket; Fails with 403 if manifest is in cross account S3 bucket even though SageMaker execution role had access to the cross account S3 bucket
While creating the GT job from CLI: Above restrictions doesn't apply and was able to create the GT job.

Accessing S3 bucket data from EC2 instance through IAM

So I have created an IAM user and added a permission to access S3 then I have created an EC2 instance and SSH'ed into the it.
After giving "aws s3 ls" command, the reply was
"Unable to locate credentials. You can configure credentials by running "aws configure".
so what's the difference between giving IAM credentials(Key and Key ID) using "aws configure" and editing the bucket policy to allow s3 access to my instance's public IP.
Even after editing the bucket policy(JSON) to allow S3 access to my instance's public IP why am I not able to access the s3 bucket unless I use "aws configure"(Key and Key ID)?
Please help! Thanks.
Since you are using EC2 you should really use EC2 Instance Profiles instead of running aws configure and hard-coding credentials in the file system.
As for the your question of S3 bucket policies versus IAM roles, here is the official documentation on that. They are two separate tools you would use in securing your AWS account.
As for your specific command that failed, note that the AWS CLI tool will always try to look for credentials by default. If you want it to skip looking for credentials you can pass the --no-sign-request argument.
However, if you were just running aws s3 ls then that was trying to list all the buckets in your account, which you would have to have IAM credentials for. Individual bucket policies would not be taken into account in that scenario.
If you were running aws s3 ls s3://bucketname then that may have worked as aws s3 ls s3://bucketname --no-sign-request.
When you create iam user so there are two parts
policies
roles
Policies are attached to user, like what all services user can pr can't access
roles are attached to application, what all access that application can have
So you have to permit ec2 to access S3
There are two ways for that
aws configure
attach role to ec2 instance
while 1 is tricky and legthy , 2 is easy
Go to ec2-instance-> Actions -> Security -> Modify IAM role -> then select role (ec2+s3 access role)
thats it , you can simply do aws s3 ls from ec2 instance

How to collect logs from EC2 instance and store it in S3 bucket?

Step 1: Created an Amazon S3 Bucket
Step 2: Created an IAM User with Full Access to Amazon S3 and CloudWatch Logs
Step 3: Granted Permissions on an Amazon S3 Bucket
What should I do next?
A few things.
You're probably better off using an IAM instance profile. That way, your credentials are not static IAM user credentials.
If you want to only copy the logs to S3, I'd suggest setting up some scheduled job to use the AWS CLI to copy the directory with your logs to S3.
Alternatively, I'd suggest you install and configure the CloudWatch agent on your instance. From there, you can copy logs to S3 using the methodology outlined here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasks.html

spark s3 access without configuring keys and with only IAM role

I have a HDP cluster on AWS and I have one s3(in other account) also, my hadoop version is Hadoop 3.1.1.3.0.1.0-187
Now I want to read from the s3 (which is in different account) and process, then write the result to my s3(same account as cluster).
But as per the HDP guide Here tells, I can configure only one keys of either my account or other account.
But in my case I want to configure two account keys, so How to do do that ?
Due to some security reason, other account can not change the bucket policy to add IAM role which is created in my account , Hence I tried to access like below
Configured the keys of other account
Added IAM role(which has access policy for my bucket) of my account
but Still I got below error when I tried to access my account s3 from spark write
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3
What you need is to use the EC2 instance profile role. It is an IAM role that is attached to your instance: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
You first create a role with permissions that allow s3 access. Then you attach that role to your HDP cluster(EC2 autoscaling group and EMR can both achieve that).No IAM access key configuration needed on your side, although AWS still does that for you in the background. This is the s3 "outbound" access part.
The 2nd step is to set up the bucket policy to allow cross-account access: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
You will need to do this for each bucket in your different accounts. This is basically the "inbound" s3 access permission part.
You will encounter 400 if any part of your access(i.e., your instance profile role's permission, S3 bucket ACL, bucket policy, public access block setting and etc..) is denied in the permission chain. There are much more layers on the "inbound" side. So to start to get things working, if you are not IAM expert, try to start with a very open policy(use '*' wildcard) and then narrow things down.
If I've understood right
you want your EC2 VMs to access an S3 bucket to which the IAM role doesn't have access
your have a set of AWS login details for the external S3 bucket (login and password)
HDP3 has an default auth chain of, in order
per-bucket secrets. fs.s3a.bucket.NAME.access.key, fs.s3a.bucket.NAME.secret.key
config-wide secrets fs.s3a.access.key, fs.s3a.secret.key
env vars AWS_ACCESS_KEY and AWS_SECRET_KEY
the IAM Role (it does an HTTP GET to the 169.something server which serves up a new set of IAM role credentials at least once an hour)
What you need to try here is set up some per-bucket secrets for only the external source (either in a JCEKS file on all nodes in core-site.xml, or in the spark default. For example, if the external bucket was s3a://external, you'd have
spark.hadoop.fs.s3a.bucket.external.access.key AKAISOMETHING spark.hadoop.fs.s3a.bucket.external.secret.key SECRETSOMETHING
HDP3/Hadoop 3 can handle >1 secret in the same JCEKS file without problems. HADOOP-14507. my code. Older versions let you put username:secret in the URI, but that's such a security troublespot (everything logs those URIs as they aren't viewed as sensistive), that feature has been cut from Hadoop now. Stick to the JCEKs file with a per-bucket secret, falling back to IAM role for your own data
Note you can fiddle with the authentication list for ordering and behaviour: if you add use the TemporaryAWSCredentialsProvider then it'll support session keys as well, which is often handy.
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
</value>
</property>

Creating IAM Role on S3/Lambda

Everywhere I can see IAM Role is created on EC2 instance and given Roles like S3FullAccess.
Is it possible to create IAM Role on S3 instead of EC2? And attach that Role to S3 bucket?
I created IAM Role on S3 with S3FULLACCESS. Not able to attach that to the existing bucket or create a new bucket with this Role. Please help
IAM (Identity and Access Management) Roles are a way of assigning permissions to applications, services, EC2 instances, etc.
Examples:
When a Role is assigned to an EC2 instance, credentials are passed to software running on the instance so that they can call AWS services.
When a Role is assigned to an Amazon Redshift cluster, it can use the permissions within the Role to access data stored in Amazon S3 buckets.
When a Role is assigned to an AWS Lambda function, it gives the function permission to call other AWS services such as S3, DynamoDB or Kinesis.
In all these cases, something is using the credentials to call AWS APIs.
Amazon S3 never requires credentials to call an AWS API. While it can call other services for Event Notifications, the permissions are actually put on the receiving service rather than S3 as the requesting service.
Thus, there is never any need to attach a Role to an Amazon S3 bucket.
Roles do not apply to S3 as it does with EC2.
Assuming #Sunil is asking if we can restrict access to data in S3.
In that case, we can either Set S3 ACL on the buckets or the object in it OR Set S3 bucket policies.