I have an AWS Glue Spark job that fails with the following error:
An error occurred while calling o362.cache. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ...; S3 Extended Request ID: ...; Proxy: null), S3 Extended Request ID: ...
I believe the error is thrown at the line where the Spark persist() method is called on a DataFrame. The Glue job is assigned an IAM role that has full S3 access (all locations/operations allowed), yet I'm still getting the S3 exception. I tried setting the "Temporary path" for the Glue job in the AWS Console to a specific S3 bucket with full access, and I also tried setting the Spark temporary directory to a specific S3 bucket with full access via:
import pyspark
from pyspark import SparkContext

conf = pyspark.SparkConf()
conf.set('spark.local.dir', 's3://...')  # redirect Spark's temporary directory to S3
self.sc = SparkContext(conf=conf)
which didn't help either. It's very strange that the job fails even with full S3 access. I'm not sure what to try next; any help would be really appreciated. Thank you!
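One check I can run to narrow this down (a sketch; the bucket name is a placeholder, and it assumes boto3 is available in the job, which it normally is on Glue) is to confirm from inside the job which identity it actually runs as, and whether that identity can list the temporary bucket:

import boto3

# Sketch: print the identity the Glue job actually runs as, then probe the
# temporary bucket with that same identity. "my-temp-bucket" is a placeholder.
sts = boto3.client("sts")
print(sts.get_caller_identity()["Arn"])

s3 = boto3.client("s3")
# A 403 here means something other than the role's own policy (for example
# a bucket policy on the target bucket) is denying access.
s3.list_objects_v2(Bucket="my-temp-bucket", MaxKeys=1)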
I have a Glue job that reads data from a Glue Catalog table and writes it back into S3 in Delta format.
The IAM role of the Glue job has s3:PutObject, List, Describe, and all the other permissions needed to interact with S3 (read and write). However, I keep running into this error:
2022-12-14 13:48:09,274 ERROR [Thread-9] output.FileOutputCommitter (FileOutputCommitter.java:setupJob(360)): Mkdirs failed to create glue-d-xxx-data-catalog-t--m-w:///_temporary/0
2022-12-14 13:48:13,875 WARN [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(73)): Lost task 5.0 in stage 1.0 (TID 6) (172.34.113.239 executor 2): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: >
Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: HG7ST1B44A6G30JC; S3 Extended Request ID: tR1CgoC1RcXZHEEZZ1DuDOvwIAmqC0+flXRd1ccdsY3C8PyjkEpS4wHDaosFoKpRskfH1Del/NA=; Proxy: null)
This error does not appear when I open up S3 bucket access with a wildcard principal (Principal: *) in the S3 bucket permissions section. The job fails even if I change the principal section to the same role that the Glue job is associated with.
Now, my question is: does AWS Glue assume a different identity to run the job? The IAM role associated with the job has all the permissions needed to interact with S3, yet it throws the above AccessDenied exception (failed to create directory). However, the job succeeds with the wildcard (*) in the S3 permissions.
Just to add some more context: this error does not happen when I use native Glue constructs like dynamic frames or Spark data frames to read, process, and persist data into S3. It only happens with the Delta format.
Below is the sample code:
src_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="<db_name>",
    table_name="<table_name>"
)
dset_df = src_dyf.toDF()  # dynamic frame to data frame conversion

# Write the data frame into the S3 prefix in Delta format.
glueContext.write_data_frame_from_catalog(
    frame=dset_df,
    database="xxx_data_catalog",
    table_name="<table_name>",
    additional_options=additional_options  # key-value pairs, including the S3 path
)
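One way I can sanity-check this (a sketch; the role ARN, bucket, and prefix below are placeholders, not my real values) is to run the IAM policy simulator against the job role. It evaluates only the role's identity-based policy, so if it says "allowed" while the write still fails, the deny is most likely coming from the bucket policy rather than the role:

import boto3

# Sketch: simulate s3:PutObject for the Glue job role against the target prefix.
# The role ARN and bucket/prefix are placeholders.
iam = boto3.client("iam")
resp = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:role/my-glue-job-role",
    ActionNames=["s3:PutObject", "s3:GetObject"],
    ResourceArns=["arn:aws:s3:::my-bucket/delta-prefix/*"],
)
for result in resp["EvaluationResults"]:
    print(result["EvalActionName"], result["EvalDecision"])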
We have a Druid cluster set up, and now I am trying to write the indexing logs and data into S3 deep storage.
Following are the details:
druid.storage.type=s3
druid.storage.bucket=bucket-name
druid.storage.baseKey=druid/segments
# For S3:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs
After running an ingestion task I am getting the error below:
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: HCAFAZBA85QW14Q0; S3 Extended Request ID: 2ICzpVAyFcy/PLrnsUWZBJwEo7dFl/S2lwDTMn+v83uTp71jlEe59Q4/vFhwJU5/WGMYramdSIs=; Proxy: null)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779) ~[aws-java-sdk-core-1.12.37.jar:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753) ~[aws-java-sdk-core-1.12.37.jar:?]
I tried adding the IAM role to the bucket-level permissions, and the same role is attached to the EC2 instances where the Druid services are running.
Can someone please guide me on which steps I am missing here?
I got it done!
I created a new IAM role and a policy that grants permission to the S3 bucket and the subfolder.
NOTE: Permission on the S3 bucket itself is a must.
Example: If the bucket name is "Bucket_1" and the subfolder where deep storage is configured is "deep_storage",
then make sure to grant permissions like:
"arn:aws:s3:::Bucket_1"
"arn:aws:s3:::Bucket_1/*"
My mistake was not granting permission at the bucket level and instead trying to grant it directly at the subfolder level.
Also remove or comment out the parameters below in the common.runtime.properties file on each server of your Druid cluster:
druid.s3.accessKey=
druid.s3.secretKey=
After this configuration I can see the data getting written successfully to S3 deep storage with the IAM role, and not with the secret & access key.
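For reference, here is a minimal sketch of creating such a policy with boto3 (the policy name and the action list are illustrative; trim them to what your deep storage actually needs):

import json
import boto3

# Sketch: a policy granting both bucket-level and object-level access,
# mirroring the two ARNs above. The actions shown are an illustrative minimum.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::Bucket_1",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::Bucket_1/*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="druid-deep-storage-access",
    PolicyDocument=json.dumps(policy_document),
)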
We have a CodePipeline process set up, and all stages work except the CodeDeploy stage.
Our pipeline stage is as follows:
GenerateChangeSet for CloudFormation
ExecuteChangeSet for CloudFormation
Deploy for CodeDeploy
These stages were set up and configured by CodeStar.
Our GenerateChangeSet stage tries to access S3 to get our BuildArtifact, but fails with the following error:
Action execution failed
Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 40P7HSHQGWXSRA72; S3 Extended Request ID: I6hiCC7xx+YmnQMLfUnMzZziLDz/5b8uJWzOqWNZwSiVRCS14Q6UyVfss6q80teO5MAGuR9Xft4=; Proxy: null)
This suggests that CloudFormation cannot access S3, but I've checked and rechecked the policy that it uses, and it definitely has the correct permissions for accessing S3.
I'm not quite sure why this error is happening, given that the role policy does indeed grant access to S3. I even went with the nuclear option of granting the role full control over S3 (with a view to reverting it once I solved the issue), but to no avail; the error still occurs.
Has anyone encountered this before? Anyone know why it might be happening?
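One more check that seems worth doing (a sketch; the bucket and prefix are placeholders for the pipeline's artifact store): S3 returns 403 AccessDenied instead of 404 when the requested object does not exist and the caller lacks s3:ListBucket, so it is worth confirming that the BuildArtifact object is actually present:

import boto3

# Sketch: list the pipeline's artifact store to confirm the BuildArtifact exists.
# Bucket name and prefix are placeholders for your pipeline's values.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-codepipeline-artifact-bucket",
    Prefix="MyPipeline/BuildArtif/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["LastModified"])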
I discovered the issue. The pipeline was reading the CloudFormation template files (template.yml and template-configuration.yml) from the repo, but they had been removed at some point prior, so I was getting access denied errors for that resource.
I wish the error message were more explicit; it would have saved hours.
I am trying to read a Redshift table into an EMR cluster using pyspark. I am currently running my code in the pyspark shell, but eventually I want to make a script that I can submit using spark-submit. I am using 4 jar files so that pyspark can connect to and read data from Redshift.
I start pyspark using:
pyspark --jars minimal-json-0.9.5.jar,RedshiftJDBC4-no-awssdk-1.2.41.1065.jar,spark-avro_2.11-3.0.0.jar,spark-redshift_2.10-2.0.1.jar
Then I am running the code below:
key = "<key>"
secret = "<secret>"
redshift_url = "jdbc:redshift://<cluster>:<port>/<dbname>?user=<username>&password=<password>"
redshift_query = "select * from test"
# Embed the S3 credentials in the tempdir URL (s3a://KEY:SECRET@bucket/)
redshift_temp_s3 = "s3a://{}:{}@<bucket-name>/".format(key, secret)

data = (
    spark.read.format("com.databricks.spark.redshift")
    .option("url", redshift_url)
    .option("query", redshift_query)
    .option("tempdir", redshift_temp_s3)
    .option("forward_spark_s3_credentials", "true")
    .load()
)
Error Stacktrace:
WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket is not valid. (Service: Amazon S3; Status Code: 400; Error Code: InvalidBucketName; Request ID: FS6MDX8P2MBG5T0G; S3 Extended Request ID: qH1q9y1C2EWIozr3WH2Qt7ujoBCpwLuJW6W77afE2SKrDiLOnKvhGvPC8mSWxDKmR6Dx0AlyoB4=; Proxy: null), S3 Extended Request ID: qH1q9y1C2EWIozr3WH2Qt7ujoBCpwLuJW6W77afE2SKrDiLOnKvhGvPC8mSWxDKmR6Dx0AlyoB4=
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1828)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1412)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1374)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
Then it waits a couple of seconds before showing the correct output. I also see the folder created in the S3 bucket. I have not turned on bucket versioning, but the bucket does have a lifecycle configuration. I do not understand why it first shows the error and then the correct output as well.
Any help would be appreciated.
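One variation I may try (a sketch, not a confirmed fix; it reuses key, secret, redshift_url, and redshift_query from above, and the tempdir path is a placeholder): pass the S3 credentials through the Hadoop configuration instead of embedding them in the tempdir URL, which avoids escaping problems when the secret contains characters like "/":

# Sketch: supply the S3 credentials via the Hadoop configuration rather than
# the tempdir URL; <bucket-name> is a placeholder.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", key)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret)

data = (
    spark.read.format("com.databricks.spark.redshift")
    .option("url", redshift_url)
    .option("query", redshift_query)
    .option("tempdir", "s3a://<bucket-name>/redshift-temp/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)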
I am getting an access denied error when I try to run an Athena query from the root account. What am I doing wrong?
I have tried to create IAM user roles, but I'm not sure if I am doing it right. I just wanted to do a quick test:
Create an S3 bucket -> upload a CSV -> go to Athena -> pull data from S3 -> run a query
The error that I am getting is:
Your query has the following error(s):
Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: BF8CDA860116C79B; S3 Extended Request ID: 825bTOZNiWP1bUJUGV3Bg5NSzy3ywqZdoBtwYItrxQfr8kqDpGP1RBIHR6NFIBySgO/qIKA8/Cw=)
This query ran against the "sampledb" database, unless qualified by the query. Please post the error message on our forum or contact customer support with
Query Id: c08c11e6-e049-46f1-a671-0746da7e7c84.
If you are executing the query from the AWS Athena web console, ensure that you have access to the S3 bucket location of the table. You can extract the location from the SHOW CREATE TABLE command.
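If you prefer to do the same lookup from Python (a sketch; the table name is a placeholder, and "sampledb" is the database mentioned in the error), the Glue Data Catalog exposes the table's S3 location directly:

import boto3

# Sketch: read the S3 location registered for the Athena/Glue table.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="sampledb", Name="<table_name>")
print(table["Table"]["StorageDescriptor"]["Location"])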