IAM Identity of AWS Glue job - AccessDenied Exception

I have a Glue job that reads data from a Glue catalog table and writes it back to S3 in Delta format.
The IAM role of the Glue job has s3:PutObject, List, Describe, and all the other permissions needed to interact with S3 (read and write). However, I keep running into this error -
2022-12-14 13:48:09,274 ERROR [Thread-9] output.FileOutputCommitter (FileOutputCommitter.java:setupJob(360)): Mkdirs failed to create glue-d-xxx-data-catalog-t--m-w:///_temporary/0
2022-12-14 13:48:13,875 WARN [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(73)): Lost task 5.0 in stage 1.0 (TID 6) (172.34.113.239 executor 2): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: >
Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: HG7ST1B44A6G30JC; S3 Extended Request ID: tR1CgoC1RcXZHEEZZ1DuDOvwIAmqC0+flXRd1ccdsY3C8PyjkEpS4wHDaosFoKpRskfH1Del/NA=; Proxy: null)
This error does not appear when I open up access to the S3 bucket with a wildcard principal (Principal: *) in the bucket policy. The job still fails if I set the principal to the same role the Glue job is associated with.
Now, my question is: does AWS Glue assume a different identity to run the job? The IAM role associated with the job has all the permissions needed to interact with S3, yet it throws the AccessDenied exception above (failed to create directory). The job only succeeds with the wildcard (*) in the bucket permissions.
To add some more context: this error does not happen when I use native Glue constructs such as dynamic frames and Spark data frames to read, process, and persist data into S3. It only happens with the Delta format.
Below is the sample code:
src_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="<db_name>", table_name="<table_name_glue_catalog>"
)
dset_df = src_dyf.toDF()  # dynamic frame to data frame conversion

# write the data frame into the s3 prefix in Delta format
glueContext.write_data_frame_from_catalog(
    frame=dset_df,
    database="xxx_data_catalog",
    table_name="<table_name>",
    additional_options=additional_options,  # contains the s3 path under the key "path"
)
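One way to check which identity the job actually runs with is to call STS from inside the job script. This is only a diagnostic sketch, not part of the original job; it assumes boto3 is available on the workers, which it is by default in Glue:

import boto3

# Print the identity the Glue job is running as, so it can be compared
# against the principal in the S3 bucket policy.
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])  # assumed-role ARN derived from the job's IAM role

If the printed ARN matches the role referenced in the bucket policy, the 403 is more likely coming from something layered on top of the role, such as a bucket policy condition or KMS key permissions.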

Related

AWS Glue Job - From S3 bucket to Redshift throws No Such Bucket

I'm trying to run a Glue job that loads data from a data catalog table I created previously into Redshift, and it's throwing this error:
An error occurred while calling o151.pyWriteDynamicFrame. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket;
Notes:
I have the PowerUser access role, so I have permission
The bucket exists
I have a connection between Glue and Redshift
Everything is in the same region
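For background, when Glue writes a dynamic frame to Redshift it stages the data in an S3 temporary directory, so the bucket named in a NoSuchBucket error can be the one behind the job's TempDir rather than the source bucket. Below is a minimal sketch of the usual write call; the connection name, table, and database are hypothetical placeholders, not taken from the question:

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["TempDir"])  # Glue passes --TempDir when a temp directory is configured
glueContext = GlueContext(SparkContext.getOrCreate())

# dyf is assumed to be the dynamic frame read from the catalog earlier in the job.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",   # hypothetical connection name
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir=args["TempDir"] + "/redshift/",
)

Double-checking that the bucket behind redshift_tmp_dir actually exists in the job's region is a quick way to rule this out.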

AWS Glue Spark job failing on DataFrame persist()

I have an AWS Glue Spark job that fails with the following error:
An error occurred while calling o362.cache. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ...; S3 Extended Request ID: ...; Proxy: null), S3 Extended Request ID: ...
I believe the error is thrown at the line where the Spark persist() method is called on a DataFrame. The Glue job is assigned an IAM role that has full S3 access (all locations/operations allowed), yet I'm still getting the S3 exception. I tried setting the "Temporary path" for the Glue job in the AWS Console to a specific S3 bucket with full access, and I also tried setting the Spark temporary directory to a specific S3 bucket with full access via:
import pyspark
from pyspark import SparkContext

conf = pyspark.SparkConf()
conf.set('spark.local.dir', 's3://...')  # attempted S3 path for Spark's scratch directory
self.sc = SparkContext(conf=conf)
which didn't help. It's very strange that the job is failing even with full S3 access. Not sure what to try next, any help would be really appreciated. Thank you!
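One thing worth noting: spark.local.dir is meant to be a local filesystem path on the workers, so pointing it at an S3 URI won't have the intended effect. Since the 403 means something in the job is still being denied on S3, a quick way to separate an IAM problem from a job-configuration problem is to probe the source prefix with boto3 from inside the job. This is only a diagnostic sketch; the bucket and prefix are placeholders, not values from the question:

import boto3
from botocore.exceptions import ClientError

# Diagnostic sketch: check that the job's role can list the source prefix
# before any Spark action forces the data to be read.
s3 = boto3.client("s3")
try:
    resp = s3.list_objects_v2(Bucket="my-source-bucket", Prefix="my/prefix/", MaxKeys=5)
    print("Listing OK:", [obj["Key"] for obj in resp.get("Contents", [])])
except ClientError as err:
    print("S3 access check failed:", err.response["Error"]["Code"])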

How to write log and data in Druid Deep Storage in AWS S3

We have a Druid cluster set up, and now I am trying to write the indexing logs and data into S3 deep storage.
Following are the details:
druid.storage.type=s3
druid.storage.bucket=bucket-name
druid.storage.baseKey=druid/segments
# For S3:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs
After running an ingestion task, I am getting the below error:
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: HCAFAZBA85QW14Q0; S3 Extended Request ID: 2ICzpVAyFcy/PLrnsUWZBJwEo7dFl/S2lwDTMn+v83uTp71jlEe59Q4/vFhwJU5/WGMYramdSIs=; Proxy: null)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753) ~[aws-java-sdk-core-1.12.37.jar:?]
I tried adding the IAM role to the bucket-level permissions, and the same role is attached to the EC2 instances where the Druid services are running.
Can someone please advise which steps I am missing here?
I got it done!
I created a new IAM role and a policy that grants permission to the S3 bucket and its subfolder.
NOTE: permission on the S3 bucket itself is a must.
Example: if the bucket name is "Bucket_1" and the subfolder where deep storage is configured is "deep_storage", then make sure you grant permission on both:
"arn:aws:s3:::Bucket_1"
"arn:aws:s3:::Bucket_1/*"
My mistake was not granting bucket-level permission and trying to grant permission only at the subfolder level.
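For reference, here is a minimal sketch of such an inline policy attached with boto3 rather than the console; the role name, policy name, and exact action list are illustrative assumptions, not the poster's actual configuration:

import json
import boto3

iam = boto3.client("iam")

# Illustrative inline policy: bucket-level permissions plus object-level permissions.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
         "Resource": "arn:aws:s3:::Bucket_1"},
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
         "Resource": "arn:aws:s3:::Bucket_1/*"},
    ],
}

iam.put_role_policy(
    RoleName="druid-deep-storage-role",   # hypothetical role name
    PolicyName="druid-s3-deep-storage",   # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)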
Also remove or comment out the below parameters from the common.runtime.properties file on each server of your Druid cluster:
druid.s3.accessKey=
druid.s3.secretKey=
After this change, I can see the data being written successfully to S3 deep storage using the IAM role rather than the secret and access keys.

Glue can't read S3 bucket

Description
I synced the data from another account with rclone and enabled 'acl=bucket-owner-full-control'.
rclone sync 607562784642://cdh-bba-itdata-sub-cmdb-src-lt7g 162611943124://bbatest
When I cataloged the bucket data into the Glue catalog with a crawler, the Glue crawler raised the following error:
[49b1d1bd-d3f0-4801-9668-04f8651b06f4] ERROR : Not all read errors will be logged. com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: CD0062EA0B2D0AA7; S3 Extended Request ID: k0oHoKviPcWAs8yrn+9daImiTZ0Fx6sssbGiPF/7YwTjxUwITSDQHd2uTgh3K6QAcxDkvzHREJA=), S3 Extended Request ID: k0oHoKviPcWAs8yrn+9daImiTZ0Fx6sssbGiPF/7YwTjxUwITSDQHd2uTgh3K6QAcxDkvzHREJA=
Official checklist
I have checked the items on the official checklist:
bucket owner ID
object owner ID
Both of them were the same, and there was no additional bucket policy.
VPC endpoints
bucket policy
IAM policy
None of these policies blocked Glue from accessing the S3 bucket.
The crawler cataloged data from other buckets successfully, so the Glue configuration was correct.
The bucket had a customer managed KMS key enabled, but I had forgotten to add the Glue role to the KMS key.
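One way to do that (among others, such as editing the key policy directly) is a KMS grant for the crawler's role. This is only a sketch; the key ID and role ARN are placeholders:

import boto3

kms = boto3.client("kms")

# Illustrative grant: lets the crawler's role decrypt objects encrypted with the CMK.
kms.create_grant(
    KeyId="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    GranteePrincipal="arn:aws:iam::111122223333:role/my-glue-crawler-role",
    Operations=["Decrypt", "DescribeKey"],
)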

AWS Glue cannot create database from crawler: permission denied

I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my datasource (the S3 bucket with the avro files), have it create the IAM role, and run it, and I get the following error:
Database does not exist or principal is not authorized to create tables. (Database name: zzz-db, Table name: avroavro_all) (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 78fc18e4-c383-11e9-a86f-736a16f57a42). For more information, see Setting up IAM Permissions in the Developer Guide (http://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html).
I tried to create this table in a new blank database (as opposed to an existing one with tables), I tried prefixing the names, I tried sourcing different schemas, and I tried using an existing role with Admin access. I thought the latter would work, but I keep getting the same error and have no idea why.
To be explicit, the service role I created has several policies that I assume are permissive enough to create tables:
The logs are vanilla:

19:52:52
[10cb3191-9785-49dc-8935-fb02dcbd69a3] BENCHMARK : Running Start Crawl for Crawler avro
19:53:22
[10cb3191-9785-49dc-8935-fb02dcbd69a3] BENCHMARK : Classification complete, writing results to database zzz-db
19:53:22
[10cb3191-9785-49dc-8935-fb02dcbd69a3] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
19:53:34
[10cb3191-9785-49dc-8935-fb02dcbd69a3] ERROR : Insufficient Lake Formation permission(s) on s3://zzz-data/avro-all/ (Database name: zzz-db, Table name: avroavro_all) (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 31481e7e-c384-11e9-a6e1-e78dc8223fae). For more information, see Setting up IAM Permissions in the Developer Guide (http://docs.aws.amazon.com/glu
19:54:44
[10cb3191-9785-49dc-8935-fb02dcbd69a3] BENCHMARK : Crawler has finished running and is in state READY
I had the same problem when I set up and ran a new AWS crawler after enabling Lake Formation (in the same AWS account). I had been running Glue crawlers for a long time and was stumped when I saw this new error.
After some trial and error, I found the root cause: when you enable Lake Formation, it adds an additional layer of permissions on new Glue databases created via the Glue crawler, and on any resource (Glue catalog, S3, etc.) that you register with the Lake Formation service.
To fix this, you have to grant the crawler's IAM role a proper set of Lake Formation permissions (CRUD) on the database.
You can manage these permissions in the AWS Lake Formation console (UI) under the Permissions > Data permissions section, or via the AWS CLI lakeformation commands.
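For reference, a minimal boto3 sketch of such a grant; the database name comes from the error above, while the role ARN and the exact permission list are placeholders:

import boto3

lf = boto3.client("lakeformation")

# Illustrative grant: give the crawler's role the database-level permissions it
# needs to create and update tables.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/my-crawler-role"},
    Resource={"Database": {"Name": "zzz-db"}},
    Permissions=["CREATE_TABLE", "ALTER", "DROP", "DESCRIBE"],
)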
I solved this problem by adding a grant in AWS Lake Formation -> Permissions -> Data locations. (Do not forget to add a forward slash (/) after the bucket name.)
I had to add the custom role I created for Glue to the "Data lake Administrators" grantees:
(Note: this resolves the crawler's denied access, but there may be a way to do it with lesser privileges.)
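A boto3 equivalent of the data-location grant might look like this sketch; the bucket and role ARNs are placeholders, and the location is assumed to already be registered with Lake Formation:

import boto3

lf = boto3.client("lakeformation")

# Illustrative data-location grant for the crawler's role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/my-crawler-role"},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::zzz-data"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)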
Make sure you grant the necessary permissions to your crawler's IAM role under this path:
Lake Formation -> Permissions -> Data lake permissions
(Grant the relevant Glue database permissions to your crawler's IAM role.)