I've crawled a couple of XML files on S3 using AWS Glue, using a simple XML classifier.
However, when I try running any query on that data using AWS Athena, I get the following error (note that it's the simplest possible query I'm doing here):
HIVE_UNKNOWN_ERROR: Unable to create input format
Note that Athena can see my tables and the columns; it just can't query them.
I noticed that there is someone with the same problem on the AWS Discussion Forums (Athena XML Query Give HIVE Unknown Error), but it got no response from anyone.
I know there is a similar question here about this error, but the query in that case targeted an RDS database, whereas mine targets an S3 bucket.
Has anyone got a solution for this?
Sadly, at this time (12/2018), Athena cannot query XML input, which is hard to understand when you may have heard that Athena along with AWS Glue can query XML.
The output you are seeing from the AWS crawler is correct, though; it's just not doing what you think it is. For example, after your crawler has run you can see the tables, but you cannot execute any Athena queries against them. Go into your AWS Glue Catalog, click Tables on the right, click your table, and edit its properties; it will look something like this:
Notice how the input format is null? If you have any other tables you can look at their properties, or refer back to the input format documentation for Athena. This is why you receive the error.
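You can check the same properties without the console. Here is a minimal sketch in Python with boto3; the database and table names are placeholders, not names from the question.

import boto3

# Inspect the table the crawler created; a null/empty InputFormat is the
# symptom behind the HIVE_UNKNOWN_ERROR above.
glue = boto3.client("glue")

table = glue.get_table(DatabaseName="mydatabase", Name="mytable")["Table"]
sd = table.get("StorageDescriptor", {})
print("InputFormat: ", sd.get("InputFormat"))
print("OutputFormat:", sd.get("OutputFormat"))
print("SerDe:       ", sd.get("SerdeInfo", {}).get("SerializationLibrary"))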
Solutions:
Convert your data to text/JSON/Avro/another supported format prior to upload.
Create an AWS Glue job which converts the XML source to a target format supported by Athena (hopefully compressed, with ORC/Parquet); see the job sketch after this list.
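For the second option, here is a minimal sketch of such a Glue job (PySpark, so it only runs inside a Glue job environment). The database, table, and bucket names are placeholders for whatever your crawler created.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read via the Data Catalog entry the XML crawler created...
source = glue_context.create_dynamic_frame.from_catalog(
    database="mydatabase", table_name="my_xml_table"
)

# ...and write it back out as Parquet, which Athena can query.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet",
)

job.commit()

After the job runs, crawl (or manually define) the Parquet output and query that table instead.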
I have been working with AWS Athena for a while and need to create a backup and version control of the views. I'm trying to build an automation for the backup to run daily and capture all the views.
I tried to find a way to copy all the views created in Athena using boto3, but I couldn't find one. With DBeaver I can see and export a view's SQL script, but from what I've seen only one at a time, which doesn't serve the goal.
I'm open to any approach.
I tried to find an answer to my question in the boto3 and DBeaver documentation, read threads on Stack Overflow, and did some Google searching, but none of it took me very far.
Views and Tables are stored in the AWS Glue Data Catalog.
You can query the AWS Glue Data Catalog (see Querying the AWS Glue Data Catalog - Amazon Athena) to obtain information about tables, partitions, columns, etc.
However, if you want to obtain the DDL that was used to create the views, you will probably need to use SHOW CREATE TABLE [db_name.]table_name:
Analyzes an existing table named table_name to generate the query that created it.
Have you tried using get_query_results in boto3?
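Putting these pieces together, here is a minimal sketch in Python with boto3: enumerate the views from the Glue Data Catalog (views are stored there with TableType VIRTUAL_VIEW), run SHOW CREATE VIEW for each through Athena, and read the DDL back with get_query_results. The database name and results bucket are placeholders.

import time
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

DATABASE = "mydatabase"
RESULTS = "s3://my-bucket/athena-results/"

def run_query(sql):
    # Submit the query and poll until it finishes (boto3 has no Athena waiter).
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULTS},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return "\n".join(row["Data"][0].get("VarCharValue", "") for row in rows)

# Views live in the Glue Data Catalog with TableType VIRTUAL_VIEW.
for page in glue.get_paginator("get_tables").paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        if table.get("TableType") == "VIRTUAL_VIEW":
            ddl = run_query(f'SHOW CREATE VIEW "{table["Name"]}"')
            print(ddl)  # or write each DDL to S3/git for the daily backup

Scheduling this (e.g. with a Lambda on an EventBridge rule) would give you the daily automated backup.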
I am trying to use the AWS SDK for JavaScript (Node.js) to run a query with AWS Athena and store the results in a table in AWS Glue in Parquet format (not just a CSV file).
If I am using the console, it is pretty simple with a CTAS query:
CREATE TABLE tablename
WITH (
external_location = 's3://bucket/tablename/',
FORMAT = 'parquet')
AS
SELECT *
FROM source
But with the AWS Athena JavaScript SDK I am only able to set an output file destination using the Workgroup or Output parameters and make a basic SELECT query; the results are output to a CSV file and are not indexed properly in AWS Glue, so this breaks the bigger process it is part of. If I try to call that query using the JavaScript SDK I get:
Table properties [FORMAT] are not supported.
I would be able to call that DDL statement using the Java SDK JDBC driver connection option.
Is anyone familiar with a solution or workaround with the Javascript SDK for Node.JS?
There is no difference between running the SQL you posted in the Athena web console, the AWS SDK for JavaScript, the AWS SDK for Java, or the JDBC driver; none of these process the SQL themselves, so if the SQL works in one of them it will work in all of them. It's only the Athena service that reads the SQL.
Check your SQL and make sure you are really using the same SQL in your code as you tried in the web console. If they are indeed the same, the error is somewhere else in your code, so post that too.
Update: the problem is the uppercase FORMAT. If you paste the code you posted into the Athena web console, it bugs out and doesn't run the query, but if you run it with the CLI or an SDK you get the error you posted. You did not run the same SQL in the console as in the SDK; if you had, you would have gotten the same error in both.
Use lowercase format and it will work.
This is definitely a bug in Athena; these properties should not be case sensitive.
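For reference, here is a minimal sketch of the fixed query submitted through an SDK; it uses boto3 (Python) rather than the JavaScript SDK, since the fix lives in the SQL string itself and applies the same way in any SDK. The database and output location are placeholders.

import boto3

athena = boto3.client("athena")

# Same CTAS as in the question, but with lowercase "format".
ctas = """
CREATE TABLE tablename
WITH (
    external_location = 's3://bucket/tablename/',
    format = 'parquet')
AS
SELECT *
FROM source
"""

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "mydatabase"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
print(response["QueryExecutionId"])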
I have Avro files in S3 which I want to be able to query via Redshift. I have used external tables with success in the past, but only with Parquet/JSON data, so I'm wondering whether I'm missing something because the data is in Avro format.
I set up a glue crawler to get hold of the schema of the files and that has worked fine. I can access the data in Athena. I've also set up an external schema in Redshift and can see the new external table exists when I query SVV_EXTERNAL_TABLES. However, when I come to query the new table I get the following error:
[XX000][500310] Amazon Invalid operation: Invalid
DataCatalog response for external table
"spectrum_google_analytics"."man": Cannot deserialize Table. Error:
I don't know why this would work for Athena but not Spectrum. Hoping you can help. Thanks!
The same issue happened to me as well when I was trying to use aws-cdk to deploy resources. It turns out that having no parameters in the properties of the Glue Table causes this weird behaviour (https://github.com/aws/aws-cdk/issues/7826). Add a property like classification=Parquet/JSON and try again; that worked for me (see the sketch below).
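If the table was created by a crawler rather than CDK, you can patch the parameter in place. Here is a minimal sketch in Python with boto3; the database and table names are taken from the error message in the question, and classification is set to avro since that is the file format here.

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="spectrum_google_analytics", Name="man")["Table"]

# get_table returns read-only fields (CreateTime, DatabaseName, ...) that
# update_table rejects, so copy only the writable ones into TableInput.
writable_keys = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
table_input = {k: v for k, v in table.items() if k in writable_keys}
table_input.setdefault("Parameters", {})["classification"] = "avro"

glue.update_table(DatabaseName="spectrum_google_analytics", TableInput=table_input)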
I have a .sql file filled with Athena queries.
Is there a way I can tell Athena to run the sql queries saved in s3://my-bucket/path/to/queries.sql?
In MySQL you can do something like this (based on an SO answer), but I'm curious whether it's possible in Athena:
mysql> source \home\user\Desktop\test.sql;
Is there a way I can tell Athena to run the sql queries saved in s3://my-bucket/path/to/queries.sql?
I think there is no direct way to tell Athena to run a query stored in S3.
In MySQL you can do something like this (based on an SO answer), but I'm curious whether it's possible in Athena.
If you want to do it at all, then yes, you should be able to run the queries using the AWS CLI.
Your steps would look like this (see the sketch after the list):
Get the query from S3 using the CLI and store it in a temp variable
Pass the query stored in the temp variable to the Athena query CLI
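The same idea in Python with boto3 (swapping the CLI for the SDK) would be a minimal sketch like the following. It assumes the file holds statements separated by semicolons, since Athena accepts only one statement per call; bucket, key, and database names are placeholders.

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Step 1: fetch the .sql file from S3.
body = s3.get_object(Bucket="my-bucket", Key="path/to/queries.sql")["Body"].read().decode("utf-8")

# Step 2: submit each statement to Athena individually.
for statement in (part.strip() for part in body.split(";")):
    if not statement:
        continue
    athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": "mydatabase"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )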
Hope this will help.
Currently using information_schema.tables to list all tables in my catalog.
What I am missing is a column telling me which S3 path each (external) table points to.
I've looked in all the information_schema tables, but cannot see this info.
The only place I've seen this via SQL is with the SHOW CREATE TABLE command, which doesn't give the result in a proper recordset.
Failing that... is there another way to keep tabs on all of your tables and their sources?
Many Thanks.
So, as above, I could find no way of doing this from the database itself.
The actual solution is below for interest (and in case anyone finds a better way).
From CLI:
Call AWS Glue get-tables and output the JSON to a file (see also the boto3 sketch after this list)
Sync the file to S3
Run an ETL job to convert the multi-line JSON into single-line JSON and place it in a new bucket
Crawl new bucket
Now query/unnest in Athena
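For the first step, note that get-tables already returns each table's S3 location in StorageDescriptor.Location, so if you only need the list (rather than querying it inside Athena), a minimal sketch in Python with boto3 may be enough on its own; the database name is a placeholder.

import boto3

glue = boto3.client("glue")

# Page through every table in the database and print its S3 location.
for page in glue.get_paginator("get_tables").paginate(DatabaseName="mydatabase"):
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "<no location>")
        print(f"{table['Name']}\t{location}")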
'Convoluted' is a word that comes to mind!
At least it gets the data I need where I need it.
Again, if anyone finds an easier way...?