Trouble integrating EMR with S3 - amazon-web-services

I am having trouble integrating EMR with S3 i.e to implement EMRFS
EMR Version: emr-5.4.0
When I run hdfs dfs -ls s3://pathto/bucket/ I get following error
ls: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: XXXX),
S3 Extended Request ID: XXXXX**
Please guide what is that, what I am missing ?
I have done following steps
Created a KMS Key for EMR
Added EMR_EC2_DefaultRole as key users in newly creates KMS Key
Created a S3 Server Side Encryption Security Config policy for EMR
Created new Inline policy for role/EMR_EC2_DefaultRole and EMR_DefaultRole for S3 bucket access
Created a EMR cluster manually with new EMR Security policy and following configuration classification
"fs.s3.enableServerSideEncryption": "true",
"fs.s3.serverSideEncryption.kms.keyId":"KEYID"

EMR, by default, will use instance profile credentials(EMR_EC2_DefaultRole) to access your S3 bucket. The error means this role does not have necessary permissions to access S3 bucket.
You will need to verify the IAM Role policy of that role to allow necessary S3 actions on both bucket and objects (Like s3:list*). Also check if you have any explicit Deny's etc.
http://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html
The access could also be denied because of a Bucket policy on set on the S3 bucket you are trying to access.
http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html
https://aws.amazon.com/blogs/security/iam-policies-and-bucket-policies-and-acls-oh-my-controlling-access-to-s3-resources/
Your EMR cluster could be using an VPC endpoint for S3 to access S3 rather than Internet/NAT. In that case, you'll also need to verify VPC endpoint policies as well.
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html#vpc-endpoints-policies-s3

Related

Unable to connect to S3 while creating Elasticsearch snapshot repository

I am trying to register a respository on AWS S3 to store ElasticSearch snapshots.
I am following guide and ran the very first command listed in the doc.
But I am getting the error Access Denied while executing that command.
The role that is being used to perform operations on S3 is the AmazonEKSNodeRole.
I have assigned the appropriate permissions to the role to perform operations on the S3 bucket.
Also, here is another doc which suggests to use kibana for ElasticSearch version > 7.2 but I am doing the same via cURL requests.
Below is trust Policy of the role through which I am making the request to register repository in the S3 bucket.
Also, below are the screenshots of the permissions of the trusting and trusted accounts respectively -

How to write log and data in Druid Deep Storage in AWS S3

We have a druid cluster setup and now i am trying to write the indexing-logs and data into S3 deep storage.
Following are the details
druid.storage.type=s3
druid.storage.bucket=bucket-name
druid.storage.baseKey=druid/segments
# For S3:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs
After running ingestion task i am getting below error
*Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: HCAFAZBA85QW14Q0; S3 Extended Request ID: 2ICzpVAyFcy/PLrnsUWZBJwEo7dFl/S2lwDTMn+v83uTp71jlEe59Q4/vFhwJU5/WGMYramdSIs=; Proxy: null*)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753) ~[aws-java-sdk-core-1.12.37.jar:?]
I tried to add the IAM role instance to the bucket level and same Role is running EC2 where Druid services are running.
Cam someone please guide what are the steps i am missing here.
I got it done!
I have created a new IAM role and created a policy where i have given permission to S3 bucket and subfolder
NOTE: Permission to S3 bucket is must
Example: If bucket name is "Bucket_1" and subfolder where Deep storage is configured is "deep_storage"
then make sure we should give permisson like:
**"arn:aws:s3:::Bucket_1"
"arn:aws:s3:::Bucket_1/*"**
I was missing with not giving to Bucket level permission and directly trying to give permission to sub folder level.
Also remove or comment out the below parameter from common.runtime.properties file from each servers of your Druid cluster
**druid.s3.accessKey=
druid.s3.secretKey=**
After this config I can see the data is getting successfully to S3 deep storage with IAM role and not with Secret & Access Key.

Unable to configure SageMaker execution Role with access to S3 bucket in another AWS account

Requirement: Create SakeMaker GroundTruth labeling job with input/output location pointing to S3 bucket in another AWS account
High Level Steps Followed: Lets say, Account_A: SageMaker GroundTruth labeling job and Account_B: S3 bucket
Create role AmazonSageMaker-ExecutionRole in Account_A with 3 policies attached:
AmazonSageMakerFullAccess
Account_B_S3_AccessPolicy: Policy with necessary S3 permissions to access S3 bucket in Account_B
AssumeRolePolicy: Assume role policy for arn:aws:iam::Account_B:role/Cross-Account-S3-Access-Role
Create role Cross-Account-S3-Access-Role in Account_B with 1 policy and 1 trust relationship attached:
S3_AccessPolicy: Policy with necessary S3 permissions to access S3 bucket in the this Account_B
TrustRelationship: For principal arn:aws:iam::Account_A:role/AmazonSageMaker-ExecutionRole
Error: While trying to create SakeMaker GroundTruth labeling job with IAM role as AmazonSageMaker-ExecutionRole, it throws error AccessDenied: Access Denied - The S3 bucket 'Account_B_S3_bucket_name' you entered in Input dataset location cannot be reached. Either the bucket does not exist, or you do not have permission to access it. If the bucket does not exist, update Input dataset location with a new S3 URI. If the bucket exists, give the IAM entity you are using to create this labeling job permission to read and write to this S3 bucket, and try your request again.
In your high level step 2, the approach should change to using a Resource Policy on your S3 bucket that allows account A to write to it. Rather than expecting Account A to assume a role in Account B, which I don't believe Sagemeker will do. Therefore the general approach is to do the following:
Account A Sagemaker service is given has a iam policy with a that allows access to Account B bucket. (Basically what you've done).
Account B bucket is given a resource policy that allows Account A to access it.
The following article gives additional help on this topic: How can I provide cross-account access to objects that are in Amazon S3 buckets?
Reverted back to original approach where access to the SageMaker execution role was provided through direct S3 bucket policy.
While creating the GT job from console:
(i) Expects the user creating the job also to have access to the data in cross account S3 bucket; Updated bucket policy to have access for both SageMaker execution role as well as user
(ii) Expects the manifest in own account's S3 bucket; Fails with 403 if manifest is in cross account S3 bucket even though SageMaker execution role had access to the cross account S3 bucket
While creating the GT job from CLI: Above restrictions doesn't apply and was able to create the GT job.

error 403 while creating emr cluster using my reducer and mapper?

I am trying to use my bucket to give the arguments for the EMR to create a cluster for it is giving me "All access to this object has been disabled (Service: Amazon S3; Status Code: 403; Error Code: AllAccessDisabled;"
I have used my Reducer and Mapper python files and my bucket's permission is public too
is there something wrong with my mapper and reducer files or am I missing a trick here
Make sure you've assigned your EMR cluster an IAM role that has adequate S3 access permissions. IAM enables you to grant permissions to users, groups, or resources (like your EMR cluster, in this instance) to be able to access other services or resources in AWS (like S3, which is currently giving you an access denied error).
To do this through EMRFS:
Navigate to the EMR console
click Security configurations (on left menu)
Scroll down to IAM roles for EMRFS
Enable Use IAM roles for EMRFS requests to Amazon S3
Add role mapping
Select desired IAM role (Admin)
Select whatever basis for access you prefer (User, group, or S3 bucket name prefix)
Here's a pic of what it looks like in console:
More on this available in the docs here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-emrfs-iam-roles.html

spark s3 access without configuring keys and with only IAM role

I have a HDP cluster on AWS and I have one s3(in other account) also, my hadoop version is Hadoop 3.1.1.3.0.1.0-187
Now I want to read from the s3 (which is in different account) and process, then write the result to my s3(same account as cluster).
But as per the HDP guide Here tells, I can configure only one keys of either my account or other account.
But in my case I want to configure two account keys, so How to do do that ?
Due to some security reason, other account can not change the bucket policy to add IAM role which is created in my account , Hence I tried to access like below
Configured the keys of other account
Added IAM role(which has access policy for my bucket) of my account
but Still I got below error when I tried to access my account s3 from spark write
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3
What you need is to use the EC2 instance profile role. It is an IAM role that is attached to your instance: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
You first create a role with permissions that allow s3 access. Then you attach that role to your HDP cluster(EC2 autoscaling group and EMR can both achieve that).No IAM access key configuration needed on your side, although AWS still does that for you in the background. This is the s3 "outbound" access part.
The 2nd step is to set up the bucket policy to allow cross-account access: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
You will need to do this for each bucket in your different accounts. This is basically the "inbound" s3 access permission part.
You will encounter 400 if any part of your access(i.e., your instance profile role's permission, S3 bucket ACL, bucket policy, public access block setting and etc..) is denied in the permission chain. There are much more layers on the "inbound" side. So to start to get things working, if you are not IAM expert, try to start with a very open policy(use '*' wildcard) and then narrow things down.
If I've understood right
you want your EC2 VMs to access an S3 bucket to which the IAM role doesn't have access
your have a set of AWS login details for the external S3 bucket (login and password)
HDP3 has an default auth chain of, in order
per-bucket secrets. fs.s3a.bucket.NAME.access.key, fs.s3a.bucket.NAME.secret.key
config-wide secrets fs.s3a.access.key, fs.s3a.secret.key
env vars AWS_ACCESS_KEY and AWS_SECRET_KEY
the IAM Role (it does an HTTP GET to the 169.something server which serves up a new set of IAM role credentials at least once an hour)
What you need to try here is set up some per-bucket secrets for only the external source (either in a JCEKS file on all nodes in core-site.xml, or in the spark default. For example, if the external bucket was s3a://external, you'd have
spark.hadoop.fs.s3a.bucket.external.access.key AKAISOMETHING spark.hadoop.fs.s3a.bucket.external.secret.key SECRETSOMETHING
HDP3/Hadoop 3 can handle >1 secret in the same JCEKS file without problems. HADOOP-14507. my code. Older versions let you put username:secret in the URI, but that's such a security troublespot (everything logs those URIs as they aren't viewed as sensistive), that feature has been cut from Hadoop now. Stick to the JCEKs file with a per-bucket secret, falling back to IAM role for your own data
Note you can fiddle with the authentication list for ordering and behaviour: if you add use the TemporaryAWSCredentialsProvider then it'll support session keys as well, which is often handy.
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
</value>
</property>