Facing access denied when querying struct columns - amazon-web-services

I am able to query my table using Redshift Spectrum. However, when I try to access a column defined as a struct, I get the following error:
ERROR: Spectrum Scan Error: S3ServiceException:Access Denied,Status 403,Error AccessDenied
Any idea why this is happening?

I think I'm suffering this pain too. What happens if you run
set json_serialization_enable to true;
and then select the struct field?
Also try something like
select json_extract_path_text(structfield, 'key') from external_schema.table;
I suspect the S3 access error is misleading and that the real issue is a Glue table definition that isn't quite right.
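Putting those two steps together, a minimal sketch would look like the following (structfield, key, and external_schema.table are placeholders, not your actual names):
-- enable JSON serialization of nested Spectrum types for this session
set json_serialization_enable to true;
-- selecting the struct column should now return it serialized as JSON
select structfield from external_schema.table;
-- per the suggestion above, pull a single nested value out of that JSON
select json_extract_path_text(structfield, 'key') from external_schema.table;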

As you mentioned, you are using Amazon Redshift Spectrum.
When you get access denied, it is probably a permissions issue: check the IAM role attached to your Redshift cluster and the permissions it has been granted.
You mentioned your column is defined as a struct; did you create your external table like this? https://stackoverflow.com/a/66705424/13126651
In my case the external table was a single object with a single key whose value was an array of objects.
Example:
CREATE EXTERNAL TABLE jatinspectrum.extab (
enteries array<struct<title:varchar(4000),link:varchar(4000),author:varchar(4000),published_date:timestamp,category:array<varchar(4000)>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://xxxxxxxxxxxxxx/xxxxxxxxxxxxxx/xxxxxxxxxxx/';
Refer to the docs for IAM roles and external tables.
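Before touching the table definition, it may also be worth confirming from the Redshift side which options (including the IAM role) the external schema was created with and that the table is visible through the data catalog; a small sketch (the schema name jatinspectrum is just taken from the example above):
-- lists external schemas along with the options they were created with
select * from svv_external_schemas;
-- lists the external tables Redshift can see for that schema
select * from svv_external_tables where schemaname = 'jatinspectrum';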

Related

Does hierarchical partitioning work in AWS Athena/S3?

I am new to AWS and trying to use S3 and Athena for a use case.
I want the data saved as JSON files in S3 to be queried from Athena. To reduce the amount of data scanned, I have created a directory structure like this:
../customerid/date/*.json (format)
../100/2020-04-29/*.json
../100/2020-04-30/*.json
.
.
../101/2020-04-29/*.json
In Athena, the table structure has been created according to the data we are expecting, and two partitions have been created, namely customer (customerid) and dt (date).
I want to query all the data for customer '100' and limit my scan to its directory, for which I am trying to load the partition as follows:
alter table <table_name> add
partition (customer=100) location 's3://<location>/100/'
But I get the following error:
FAILED: SemanticException partition spec {customer=100} doesn't contain all (2) partition columns
Clearly it's not loading a single partition when multiple partition columns have been created.
Giving both partitions in the alter table command:
alter table <table_name> add
partition (customer=100, dt=2020-04-22) location 's3://<location>/100/2020-04-22/'
I get this error:
missing 'column' at 'partition' (service: amazonathena; status code: 400; error code: invalidrequestexception;
Am I doing something wrong?
Does this even work?
If not, is there a way to work with hierarchical partitions?
The issue is with the S3 hierarchical structure that you have used. It should be
../customer=100/dt=2020-04-29/*.json
instead of
../100/2020-04-29/*.json
If you have the data in S3 stored in the correct prefix structure as mentioned, then you can add all the partitions with a simple msck repair table <table_name> command, as sketched below.
Hope this clarifies.
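For example, assuming you rename the prefixes to the Hive-style layout and (re)create the table with both partition columns, the repair step could look like this (the payload column and SerDe choice are illustrative, not your exact DDL; <table_name> and <location> are placeholders as above):
CREATE EXTERNAL TABLE <table_name> (
  payload string
)
PARTITIONED BY (customer int, dt date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://<location>/';

-- registers every customer=.../dt=.../ prefix found under the table location
MSCK REPAIR TABLE <table_name>;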
I figured out the mistake I was making, so I wanted to share in case anyone finds themselves in the same situation.
For data not partitioned in Hive format (refer to this for Hive and non-Hive formats):
Taking the example above, the following is the alter command that works:
alter table <table_name> add
partition (customer=100, dt=date '2020-04-22') location 's3://<location>/100/2020-04-22/'
Notice the change in the syntax of the "dt" partition. My partition data type was set to "date", and not specifying it while loading the partition was what caused the error.
Although not giving the data type also works; we just need to use single quotes, which defaults the partition value to string/varchar:
alter table <table_name> add
partition (customer=100, dt='2020-04-22') location 's3://<location>/100/2020-04-22/'
I prefer giving the date data type while adding the partition, as that is how I configured the partition.
Hope this helps.
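For completeness, once a partition has been added either way, it is the filter on the partition columns that actually limits the scan to a single S3 prefix, e.g. (placeholders as above):
select count(*)
from <table_name>
where customer = 100
  and dt = date '2020-04-22';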

Error trying to access Amazon Redshift external table

I have Avro files in S3 which I want to be able to query via Redshift. I have used external tables with success in the past, but only in Parquet/JSON format, so I'm wondering whether I'm missing something because the data is in Avro format.
I set up a glue crawler to get hold of the schema of the files and that has worked fine. I can access the data in Athena. I've also set up an external schema in Redshift and can see the new external table exists when I query SVV_EXTERNAL_TABLES. However, when I come to query the new table I get the following error:
[XX000][500310] Amazon Invalid operation: Invalid
DataCatalog response for external table
"spectrum_google_analytics"."man": Cannot deserialize Table. Error:
I don't know why this would work for Athena but not Spectrum. Hoping you can help. Thanks!
The same issue happened to me as well when I was trying to use aws-cdk to deploy resources. It turns out that having no parameters in the properties of the Glue Table causes this weird behaviour (https://github.com/aws/aws-cdk/issues/7826). Add some property like classification=Parquet/JSON and try again; that worked for me.
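If the table already exists and redeploying is inconvenient, one alternative sketch is to set the property from Athena with ALTER TABLE … SET TBLPROPERTIES (the database and table names below are made up, and the classification value should match your data, e.g. avro in this case):
ALTER TABLE my_glue_db.my_avro_table SET TBLPROPERTIES ('classification' = 'avro');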

HIVE_BAD_DATA: Wrong type using parquets on AWS Athena

I've created a Glue Crawler to read files from S3 and create a table for each S3 path. The table health_users was created with the wrong type for a specific column: the column two_factor_auth_enabled was created as int instead of string.
Manually, I went to Glue Catalog and updated the schema of table health_users.
After that, I tried to run the query again on Athena and it is still throwing the same error:
Your query has the following error(s):
HIVE_BAD_DATA: Field two_factor_auth_enabled's type BOOLEAN in parquet is incompatible with type int defined in table schema
This query ran against the "test_parquets" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: c3a86b98-70a2-4c70-97d8-8bc377c455b8.
I've checked the table structure on Athena and the column two_factor_auth_enabled is a string (the attached file shows the table definition).
What's wrong with my solution? How can I fix this error?
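For reference, one way to double-check the definition Athena is actually resolving for this table, so it can be compared against the BOOLEAN type reported for the Parquet files (database and table names taken from the error message above):
SHOW CREATE TABLE test_parquets.health_users;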

AWS Athena - Rename column name

I am trying to change a column name in an AWS Athena table.
From old_name to new_name.
Normal DDL commands do not affect the table (they cannot be executed).
Is it possible to change a column name without deleting and re-creating the table from scratch?
I was mistaken; Athena uses Hive DDL syntax, so the correct command is:
ALTER TABLE %%table-name%% CHANGE %%old-column-name%% %%new-column-name%% <string>;
I based my answer on a Hive-related question.
You can find more about supported and unsupported DDLs here
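For illustration, a hypothetical concrete invocation of that command renaming a string column (the database, table, and column names here are made up):
ALTER TABLE my_database.my_table CHANGE old_name new_name string;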

Getting 0 rows while querying external table in Redshift

We created the schema as follows:
create external schema spectrum
from data catalog
database 'test'
iam_role 'arn:aws:iam::20XXXXXXXXXXX:role/athenaaccess'
create external database if not exists;
and table as follows:
create external table spectrum.Customer(
Subr_Id integer,
SUB_CURRENTSTATUS varchar(100),
AIN integer,
ACCOUNT_CREATED timestamp,
Subr_Name varchar(100),
LAST_DEACTIVATED timestamp)
partitioned by (LAST_ACTIVATION timestamp)
row format delimited
fields terminated by ','
stored as textfile
location 's3://cequity-redshiftspectrum-test/'
table properties ('numRows'='1000');
The access rights are as follows:
Roles with athenaQuickSight access, full Athena access, and S3 full access are attached to the Redshift cluster.
However, when we query as below, we are getting 0 records. Please help.
select count(*) from spectrum.Customer;
If your query returns zero rows from a partitioned external table, check whether a partition has been added to this external table. Redshift Spectrum only scans files in an Amazon S3 location that has been explicitly added using ALTER TABLE … ADD PARTITION. Query the SVV_EXTERNAL_PARTITIONS view to find existing partitions. Run ALTER TABLE … ADD PARTITION for each missing partition.
Reference
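As a sketch of both steps against the Customer table above (the partition value and S3 prefix here are hypothetical, since the real ones depend on how your data is laid out):
-- see which partitions Spectrum currently knows about
select * from svv_external_partitions where tablename = 'customer';
-- register a missing partition explicitly
alter table spectrum.Customer
add if not exists partition (LAST_ACTIVATION='2020-01-01 00:00:00')
location 's3://cequity-redshiftspectrum-test/last_activation=2020-01-01/';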
I had the same issue. Doing the above resolved it.
P.S. An explicit run of the ALTER TABLE command to create partitions can also be automated.