I have Avro files in S3 which I want to query via Redshift. I've used external tables successfully in the past, but only with Parquet/JSON data, so I'm wondering whether I'm missing something because the data is in Avro format.
I set up a Glue crawler to infer the schema of the files and that worked fine. I can access the data in Athena. I've also set up an external schema in Redshift, and I can see that the new external table exists when I query SVV_EXTERNAL_TABLES. However, when I come to query the new table I get the following error:
[XX000][500310] Amazon Invalid operation: Invalid
DataCatalog response for external table
"spectrum_google_analytics"."man": Cannot deserialize Table. Error:
I don't know why this would work for Athena but not Spectrum. Hoping you can help. Thanks!
The same issue happened to me as well when I was trying to use aws-cdk to deploy resources. It turns out that having no parameters in the properties of the Glue Table causes this weird behaviour (https://github.com/aws/aws-cdk/issues/7826). Add a property such as classification=parquet or classification=json and try again; that worked for me.
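For reference, here is a minimal Python CDK sketch of that workaround using the low-level CfnTable construct so the table parameters can be set explicitly. The database, table and column names, the S3 location and the serde are placeholders, and the classification value should match your actual data format (e.g. avro, parquet or json):

from aws_cdk import Stack
from aws_cdk import aws_glue as glue
from constructs import Construct

class SpectrumTableStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        glue.CfnTable(
            self, "ExternalTable",
            catalog_id=self.account,
            database_name="my_database",  # placeholder database name
            table_input=glue.CfnTable.TableInputProperty(
                name="my_table",          # placeholder table name
                table_type="EXTERNAL_TABLE",
                # Without a 'classification' parameter the table deployed fine
                # but Spectrum could not deserialize it (aws-cdk issue 7826).
                parameters={"classification": "json"},
                storage_descriptor=glue.CfnTable.StorageDescriptorProperty(
                    location="s3://my-bucket/my-prefix/",  # placeholder location
                    input_format="org.apache.hadoop.mapred.TextInputFormat",
                    output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                    serde_info=glue.CfnTable.SerdeInfoProperty(
                        serialization_library="org.openx.data.jsonserde.JsonSerDe"
                    ),
                    columns=[glue.CfnTable.ColumnProperty(name="id", type="string")],
                ),
            ),
        )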
Related
I am able to query my table using Redshift spectrum. However, when I try to access a column, defined as a struct, I am getting the following error:
ERROR: Spectrum Scan Error: S3ServiceException:Access Denied,Status 403,Error AccessDenied
Any idea why is this happening?
I think I'm suffering this pain too. What happens if you go
set json_serialization_enable to true;
And then select the struct field?
And try something like this:
select json_extract_path_text(structfield, 'key') from external_schema.table;
I suspect the S3 access error is bogus and that the real problem is a Glue table definition that is not quite right.
As you mentioned, you are using Amazon Redshift Spectrum.
When you get an access denied error, it is usually a permissions issue; check the IAM role attached to your Redshift cluster and the permissions granted to it.
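If it helps, here is a rough boto3 sketch for inspecting which IAM roles are attached to the cluster and which managed policies those roles carry. The cluster identifier is a placeholder; the role also needs read access to the S3 location behind the external table and access to the Glue Data Catalog:

import boto3

redshift = boto3.client("redshift")
iam = boto3.client("iam")

# Placeholder cluster identifier; replace with your own.
cluster = redshift.describe_clusters(ClusterIdentifier="my-redshift-cluster")["Clusters"][0]

for role in cluster.get("IamRoles", []):
    role_arn = role["IamRoleArn"]
    role_name = role_arn.split("/")[-1]
    print("Attached role:", role_arn)

    # List the managed policies attached to the role.
    attached = iam.list_attached_role_policies(RoleName=role_name)
    for policy in attached["AttachedPolicies"]:
        print("  policy:", policy["PolicyName"])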
You mentioned that your column is defined as a struct; did you create your external table like this? https://stackoverflow.com/a/66705424/13126651
In my case the external data was a single JSON object with a single key whose value was an array of objects.
Example:
CREATE EXTERNAL TABLE jatinspectrum.extab (
enteries array<struct<title:varchar(4000),link:varchar(4000),author:varchar(4000),published_date:timestamp,category:array<varchar(4000)>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://xxxxxxxxxxxxxx/xxxxxxxxxxxxxx/xxxxxxxxxxx/';
Refer to the docs for IAM roles and external tables.
In AWS Glue, although I have read the documentation, there is one thing I haven't understood. Below is what I understood.
Regarding crawlers: a crawler creates a metadata table for either an S3 path or a DynamoDB table. What I don't understand is how a Scala/Python script is able to retrieve data from the actual source (say DynamoDB or S3) using the metadata tables that were created.
val input = glueContext
.getCatalogSource(database = "my_data_base", tableName = "my_table")
.getDynamicFrame()
Does the above line retrieve data from the actual source via the metadata tables?
I would be glad if someone could explain what happens behind the scenes when a Glue script retrieves data via the metadata tables.
When you run a Glue crawler it fetches metadata from S3 or JDBC (depending on your requirement) and creates tables in the AWS Glue Data Catalog.
Now if you want to connect to this data/these tables from a Glue ETL job, you can do it in multiple ways depending on your requirement:
[from_options][1]: if you want to load directly from S3/JDBC without connecting to the Glue catalog.
[from_catalog][1]: if you want to load data from the Glue catalog, you link it with the catalog using the getCatalogSource method, as shown in your code. As the name implies, it uses the Glue Data Catalog as the source and loads the particular table you pass to this method.
Once it looks at your table definition, which points to a location, it makes a connection and loads the data present in the source.
Yes, you need to use getCatalogSource if you want to load tables from the Glue catalog.
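For comparison, here is a minimal sketch of both approaches using the Python (PySpark) DynamicFrame reader API, which mirrors the Scala calls above; the database, table and S3 path names are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# from_catalog: resolve the table definition in the Glue Data Catalog,
# then read the data from the location that definition points to.
catalog_frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_data_base",   # placeholder names from the question
    table_name="my_table",
)

# from_options: read straight from S3, bypassing the catalog entirely.
s3_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},  # placeholder path
    format="json",  # assumed format; adjust to your data
)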
Does the catalog look into the crawler and refer to the actual source to load the data?
Check out the diagram in this [link][2]. It will give you an idea of the flow.
What if the crawler is deleted before I run getCatalogSource? Will I still be able to load the data in that case?
The crawler and the table are two different components. It all depends on when the table is deleted. If you delete the table after your job has started executing, there will not be any problem. If you delete it before execution starts, you will encounter an error.
What if my source has many millions of records? Will it load all the records, and how does that work?
It is good to have large files in the source, as that avoids most of the small-files problem. Glue is based on Spark; it reads files that fit in memory and then does the computations. Check this [answer][3] and [this one][4] for best practices when reading larger files in AWS Glue.
[1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html
[2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
[3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main
[4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets
I've created a Glue crawler to read files from S3 and create a table for each S3 path. The table health_users was created with the wrong type for a specific column: the column two_factor_auth_enabled was created as int instead of string.
I manually went to the Glue Catalog and updated the schema of the table health_users.
After that, I tried to run the query again on Athena and it is still throwing the same error:
Your query has the following error(s):
HIVE_BAD_DATA: Field two_factor_auth_enabled's type BOOLEAN in parquet is incompatible with type int defined in table schema
This query ran against the "test_parquets" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: c3a86b98-70a2-4c70-97d8-8bc377c455b8.
I've checked the table structure on Athena and the column two_factor_auth_enabled is a string (the attached file shows the table definition):
What's wrong with my solution? How can I fix this error?
I've crawled a couple of XML files on S3 using AWS Glue, using a simple XML classifier:
However, when I try running any query on that data using AWS Athena, I get the following error (note that it's the simplest possible query I'm doing here):
HIVE_UNKNOWN_ERROR: Unable to create input format
Note that Athena can see my tables and it can see the columns, it just can't query them:
I noticed that there is someone with the same problem on the AWS Discussion forums: Athena XML Query Give HIVE Unknown Error but it got no love from anyone.
I know there is a similar question here about this error, but the query in that case targeted an RDS database rather than an S3 bucket as I have here.
Has anyone got a solution for this?
Sadly, as of this time (12/2018) Athena cannot query XML input, which is hard to understand when you hear that Athena, together with AWS Glue, can handle XML.
The output you are seeing from the AWS crawler is correct, though; it is just not doing what you think it is doing! For example, after your crawler has run you can see the tables but cannot execute any Athena queries. Go into your AWS Glue Catalog, click Tables on the right, click your table, and edit its properties; it will look something like this:
Notice how the input format is null? If you have any other tables you can look at their properties, or refer back to Athena's documentation on supported input formats. That is why you receive this error.
Solutions:
Convert your data to text/JSON/Avro/another supported format prior to upload.
Create an AWS Glue job which converts the source XML into a format Athena supports (ideally compressed ORC/Parquet); a sketch of such a job follows below.
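Here is a rough Python sketch of the second option, assuming the XML can be read with Glue's xml format; the bucket paths and the rowTag value are placeholders you would need to adapt:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw XML straight from S3; 'rowTag' must match the element that
# represents one record in your files (placeholder value here).
xml_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw-xml/"]},
    format="xml",
    format_options={"rowTag": "record"},
)

# Write it back out as Parquet, which Athena can query directly.
glue_context.write_dynamic_frame.from_options(
    frame=xml_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/converted-parquet/"},
    format="parquet",
)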
I have read several other posts about this, and in particular this question with an answer by greg about how to do it in Hive. I would like to know how to account for DynamoDB tables with a variable number of columns, though.
That is, the original DynamoDB table has rows that were added dynamically with different columns. I have tried to view the exportDynamoDBToS3 script that Amazon uses in their Data Pipeline service, but it has code like the following, which does not seem to map the columns:
-- Map DynamoDB Table
CREATE EXTERNAL TABLE dynamodb_table (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "MyTable");
(As an aside, I have also tried using the Data Pipeline system but found it rather frustrating, as I could not figure out from the documentation how to perform simple tasks like running a shell script without everything failing.)
It turns out that the Hive script I posted in the original question works just fine, but only if you are using the correct version of Hive. It seems that even with the install-hive command set to install the latest version, the version actually used depends on the AMI version.
After doing a fair bit of searching I managed to find the following in Amazon's docs (emphasis mine):
Create a Hive table that references data stored in Amazon DynamoDB. This is similar to the preceding example, except that you are not specifying a column mapping. The table must have exactly one column of type map<string, string>. If you then create an EXTERNAL table in Amazon S3 you can call the INSERT OVERWRITE command to write the data from Amazon DynamoDB to Amazon S3. You can use this to create an archive of your Amazon DynamoDB data in Amazon S3. Because there is no column mapping, you cannot query tables that are exported this way. **Exporting data without specifying a column mapping is available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.**
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html