Spark Redshift: error while reading redshift tables using spark - amazon-web-services

I am getting the below error while reading data from a Redshift table using Spark.
Below is the code:
Dataset<Row> dfread = sql.read()
.format("com.databricks.spark.redshift")
.option("url", url)
//.option("query","select * from TESTSPARK")
.option("dbtable", "TESTSPARK")
.option("forward_spark_s3_credentials", true)
.option("tempdir","s3n://test/Redshift/temp/")
.option("sse", true)
.option("region", "us-east-1")
.load();
error:
Exception in thread "main" java.sql.SQLException: [Amazon](500310) Invalid operation: Unable to upload manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid=,CanRetry 1
Details:
error: Unable to upload manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid 6FC2B3FD56DA0EAC,ExtRid I,CanRetry 1
code: 9012
context: s3://jd-us01-cis-machine-telematics-devl-data-processed/Redshift/temp/f06bc4b2-494d-49b0-a100-2246818e22cf/manifest
query: 44179
Can anyone please help?

You're getting a permission error from S3 when Redshift tries to access the files you're telling it to load.
Have you configured the access keys for S3 access before calling load()?
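// Note (assumption): the property prefix generally has to match the tempdir scheme,
// so with the s3n:// tempdir above the keys the connector actually picks up would be
// fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey.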
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ASDFGHJKLQWERTYUIOP")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "QaZWSxEDC/rfgyuTGBYHY&UKEFGBTHNMYJ")
You should be able to check which access key id was used from the Redshift side by querying the stl_query table.
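A minimal sketch of that check with plain JDBC (not Spark), assuming the same JDBC url as in the question and the Redshift or PostgreSQL JDBC driver on the classpath; the class name and argument handling are just for illustration, and 44179 is the query id reported in the error above:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckRedshiftQuery {
    public static void main(String[] args) throws Exception {
        // Pass the same JDBC url used in the question, e.g.
        // jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>
        String url = args[0];
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // 44179 is the query id from the error message above
             ResultSet rs = stmt.executeQuery(
                     "select query, starttime, substring(querytxt, 1, 200) "
                     + "from stl_query where query = 44179")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "  " + rs.getTimestamp(2) + "  " + rs.getString(3));
            }
        }
    }
}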

From the error "S3ServiceException:Access Denied", it seems Redshift does not have permission to access the S3 files. Please follow the steps below:
1. Add a bucket policy to that bucket that allows the Redshift account access.
2. Create an IAM role in the Redshift account that Redshift can assume.
3. Grant the newly created role permission to access the S3 bucket.
4. Associate the role with the Redshift cluster (see the sketch below).
5. Run COPY statements.
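Once the role is associated with the cluster (via the console or aws redshift modify-cluster-iam-roles), the connector's documented aws_iam_role option can replace forward_spark_s3_credentials. A rough sketch of the reader from the question, with a placeholder role ARN:
Dataset<Row> dfread = sql.read()
    .format("com.databricks.spark.redshift")
    .option("url", url)
    .option("dbtable", "TESTSPARK")
    // placeholder ARN: use the role you associated with the cluster
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3-access")
    .option("tempdir", "s3n://test/Redshift/temp/")
    .load();
Note that Spark itself still reads and writes the tempdir with its own Hadoop S3 credentials; the role only covers the COPY/UNLOAD that Redshift runs.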

Related

AWS Glue Job - From S3 bucket to Redshift throws No Such Bucket

I'm trying to run a Glue job that loads from a data catalog I created previously into Redshift, and it's throwing this error:
An error occurred while calling o151.pyWriteDynamicFrame. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket;
Notes:
I have the PowerUser access role, so I have permission
The bucket exists
I have a connection between glue and Redshift
It's in the same region

Amazon Athena error opening Hive split s3 path and Access Denied

I'm querying data from the Glue catalog. For some tables I can see the data, and for some tables I get the error below:
Error opening Hive split s3://test/sample/run-1-part-r-03 (offset=0, length=1156) using org.apache.hadoop.mapred.TextInputFormat: Permission denied on S3 path: s3://test/sample/run-1-part-r-03
I have given Athena full access.
Amazon Athena adopts the permissions from the user when accessing Amazon S3.
If the user can access the objects in Amazon S3, then they can access them via Amazon Athena.
Does the user who ran the command have access to those objects?

How to use requester pays bucket with AWS Redshift Spectrum?

I'm trying to query data in a requester-pays-enabled S3 bucket, but I get the following 403 error. The Redshift IAM user has permission for the bucket. How can I read the data using Redshift Spectrum and pass the requester-pays parameter?
ERROR:
[XX000][500310] [Amazon](500310) Invalid operation: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid...

Presto cannot access AWS S3 data

I'm trying to connect my local Presto to AWS Glue for metadata and S3 for data. I'm able to connect to Glue and run show tables; and desc <table>;. However, it gives me this error when I run select * from <table>;
Query <query id> failed: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: <id>; S3 Extended Request ID: <id>)
My hive.properties file looks like this
connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=<region>
hive.s3.use-instance-credentials=false
hive.s3.aws-access-key=<access key>
hive.s3.aws-secret-key=<secret key>
The error says the credentials are not recognized as valid. Since you can connect to Glue, it seems your environment or ~/.aws has some valid credentials. You should be able to utilize those credentials for S3 access as well.
For this, make sure you are using Presto 332 or better and remove hive.s3.use-instance-credentials, hive.s3.aws-access-key, hive.s3.aws-secret-key from your settings.
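With those keys removed (and Presto 332 or newer), the hive.properties from the question would shrink to something like this, so that S3 access falls back to the same credentials already being used for Glue:
connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=<region>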

AWS Glue cannot create database from crawler: permission denied

I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my datasource (the S3 bucket with the avro files), have it create the IAM role, and run it, and I get the following error:
Database does not exist or principal is not authorized to create tables. (Database name: zzz-db, Table name: avroavro_all) (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 78fc18e4-c383-11e9-a86f-736a16f57a42). For more information, see Setting up IAM Permissions in the Developer Guide (http://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html).
I tried to create this table in a new blank database (as opposed to an existing one with tables), I tried prefixing the names, I tried sourcing different schemas, and I tried using an existing role with Admin access. I thought the latter would work, but I keep getting the same error, and have no idea why.
To be explicit, the service role I created has several policies I assume are permissive enough to create tables:
The logs are vanilla:

19:52:52 [10cb3191-9785-49dc-8935-fb02dcbd69a3] BENCHMARK : Running Start Crawl for Crawler avro
19:53:22 [10cb3191-9785-49dc-8935-fb02dcbd69a3] BENCHMARK : Classification complete, writing results to database zzz-db
19:53:22 [10cb3191-9785-49dc-8935-fb02dcbd69a3] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
19:53:34 [10cb3191-9785-49dc-8935-fb02dcbd69a3] ERROR : Insufficient Lake Formation permission(s) on s3://zzz-data/avro-all/ (Database name: zzz-db, Table name: avroavro_all) (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 31481e7e-c384-11e9-a6e1-e78dc8223fae). For more information, see Setting up IAM Permissions in the Developer Guide (http://docs.aws.amazon.com/glu
19:54:44 [10cb3191-9785-49dc-8935-fb02dcbd69a3] BENCHMARK : Crawler has finished running and is in state READY
I had the same problem when I set up and ran a new AWS crawler after enabling Lake Formation (in the same AWS account). I've been running Glue crawlers for a long time and was stumped when I saw this new error.
After some trial and error, I found that the root cause of the problem is that when you enable Lake Formation, it adds an additional layer of permissions on new Glue database(s) created via the Glue crawler, and on any resource (Glue catalog, S3, etc.) that you add to the Lake Formation service.
To fix this problem, you have to grant the crawler's IAM role a proper set of Lake Formation permissions (CRUD) on the database.
You can manage these permissions in the AWS Lake Formation console (UI) under Permissions > Data permissions, or via the awscli lakeformation commands.
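Roughly, the awscli form would look like the following; the account id and role name are placeholders, and zzz-db is the database name from the logs above:
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<account-id>:role/<glue-crawler-role> \
  --permissions "CREATE_TABLE" "ALTER" "DROP" \
  --resource '{"Database": {"Name": "zzz-db"}}'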
I solved this problem by adding a grant in AWS Lake Formation -> Permissions -> Data locations. (Do not forget to add a forward slash (/) behind the bucket name.)
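At the command line, the equivalent data-location grant would look roughly like this (again with placeholder account id and role name; the S3 path is the one from the crawler error, and the location must already be registered with Lake Formation):
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<account-id>:role/<glue-crawler-role> \
  --permissions "DATA_LOCATION_ACCESS" \
  --resource '{"DataLocation": {"ResourceArn": "arn:aws:s3:::zzz-data/avro-all"}}'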
I had to add the custom role I created for Glue to the "Data lake Administrators" grantees:
(Note: this does solve the crawler's access-denied error, but there may be a way to do it with lesser privileges...)
Make sure you gave the necessary permissions to your crawler's IAM role in this path:
Lake Formation -> Permissions -> Data lake permissions
(Grant related Glue Database permissions to your crawler's IAM role)