I'm getting an error when running an Athena query against a Glue table created from an RDS database:
HIVE_UNKNOWN_ERROR: Unable to create input format
The tables are created using a crawler. The tables show up correctly in the Glue interface.
However, they do not show up in the Athena interface under the database. It says: "The selected database has no tables"
I do not see this behaviour when using a database created using an S3 file. Maybe this is related to the error. Does anybody have an idea?
I had the same problem. This is the answer I got from AWS Support:
I understand that you set up a Glue crawler to crawl your RDS PostgreSQL database but the tables are not visible in Athena.
The Athena service is designed to query tables that point to S3 as the data source. It cannot read data from non-S3 resources as of today.
So, unfortunately, this is not possible at the moment.
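To see which catalog tables Athena could read under that restriction, one option is to check each table's storage location. The sketch below assumes table definitions shaped like the Glue `GetTables` response (a `StorageDescriptor` with a `Location` string); the two sample tables and their names are hypothetical, and the non-S3 location format is only an illustration.

```python
def is_s3_backed(table: dict) -> bool:
    """Return True if a Glue table definition points at an S3 location."""
    location = table.get("StorageDescriptor", {}).get("Location", "")
    return location.startswith("s3://")


def split_by_backing(tables: list) -> tuple:
    """Split table names into (S3-backed, not S3-backed)."""
    s3_tables = [t["Name"] for t in tables if is_s3_backed(t)]
    other_tables = [t["Name"] for t in tables if not is_s3_backed(t)]
    return s3_tables, other_tables


# Hypothetical sample data, not taken from a real catalog.
sample_tables = [
    {"Name": "events", "StorageDescriptor": {"Location": "s3://my-bucket/events/"}},
    {"Name": "customers", "StorageDescriptor": {"Location": "mydb.public.customers"}},
]

queryable, not_queryable = split_by_backing(sample_tables)
```

In practice the list would come from the Glue API rather than from literals; only the tables in `queryable` would be expected to work in Athena without a federated connector.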
I created a table in a Glue database using a crawler job. The table was created successfully.
However, when I try to select data from that table in the Athena query editor, it gives me the error below:
Query:
select * from DB1.data_tbl;
Output:
Hive File Not Found: Partition location does not exist
I haven't found where the partition location is defined.
Please assist.
Athena, by default, can read only data in S3. It will not read your PostgreSQL databases. To connect to anything other than S3, you have to set up and use Amazon Athena Federated Query.
Alternatively, set up a Glue job to copy all data from your PostgreSQL database into S3, and then use Athena to query the data from S3.
I have set up audit log storage from Redshift in S3. Now I am planning to set up external tables on these audit logs. When I use an AWS Glue crawler to read those files, I get tons of tables: one table for each file. I was assuming that there would be two tables overall (as we log two of the activities). If anyone has had success reading Amazon Redshift audit logs using external tables, I would like to have your input.
Thanks
Why is the AWS Glue crawler creating multiple tables from my source data, and how can I prevent that from happening? - https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-multiple-tables/
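As a rough sketch of the kind of fix that article points toward: a crawler accepts a JSON `Configuration` string, and a grouping policy of `CombineCompatibleSchemas` asks it to combine files with compatible schemas under one include path into a single table. This is an assumption about what applies here; whether it yields exactly two tables for the audit logs depends on how the log files are laid out and whether their schemas are compatible. The helper name is hypothetical.

```python
import json


def single_schema_configuration() -> str:
    """Build a crawler Configuration string that groups compatible schemas."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


crawler_configuration = single_schema_configuration()
```

The resulting string would be passed as the crawler's `Configuration` value when creating or updating the crawler.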
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time, I don't care about transformations. This is a prototype, and I simply want to dump the DB to S3 to start testing the various tool chains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schema in the path; instead, type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error. I wouldn't categorize it as a syntax error. When I ran into this, it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
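The include-path rules from these answers can be captured in a small helper. This is only a sketch: `jdbc_include_path` and `jdbc_target` are hypothetical names, the connection name is made up, and the target dictionary assumes the `JdbcTargets` shape (`ConnectionName`, `Path`) used when defining a crawler through the API.

```python
from typing import Optional


def jdbc_include_path(database: str, schema: Optional[str] = None, table: str = "%") -> str:
    """Build a crawler include path; '%' matches every table."""
    parts = [database]
    if schema:
        # Only for engines that have schemas (e.g. 'dbo' on SQL Server).
        parts.append(schema)
    parts.append(table)
    return "/".join(parts)


def jdbc_target(connection_name: str, include_path: str) -> dict:
    """Wrap the include path in a JDBC target definition for a crawler."""
    return {"JdbcTargets": [{"ConnectionName": connection_name, "Path": include_path}]}


# MySQL / Oracle: no schema segment in the path.
mysql_path = jdbc_include_path("mydatabase")
# SQL Server: the default schema must be included.
sqlserver_path = jdbc_include_path("database_name", "dbo", "table_name")
# Hypothetical connection name.
targets = jdbc_target("my-rds-connection", mysql_path)
```

Here `mysql_path` is `mydatabase/%` and `sqlserver_path` is `database_name/dbo/table_name`, matching the two fixes described above.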
I have a table in the Glue catalog that was created by a Glue crawler after parsing JSON files in S3. When I query this table using Athena, I get the error below. A few things about this situation:
JSON files are in S3
The Glue crawler created the tables in the Glue catalog using the JSON SerDe
The table contains nested data types like array and struct
I get the same error when querying other regular fields (excluding the nested ones)
I am able to query the same Glue catalog table using Hive in EMR. I tried with and without nested data types and it works fine.
Amazon Athena experienced a transient error while executing this
query. Waiting a couple of minutes and retrying the query may solve
the problem. If you continue to see the issue, please contact customer
support for further assistance. We apologize for the inconvenience.
You will not be charged for this query.
I need suggestions on importing data from a Hadoop data lake (Kerberos authenticated) to AWS. All the tables in Hive should land in S3 and then need to be loaded into AWS RDS.
I have considered the following options:
1) AWS Glue?
2) Spark connecting to the Hive metastore?
3) Connecting to Impala from AWS?
There are around 50 tables to be imported. How can I maintain the schema? Is it better to import the data and then create a separate schema in RDS?
Personally, I would dump a list of all tables that need to be moved.
From that, run SHOW CREATE TABLE over them all, and save the queries.
Run distcp, or however else you want to move the data into S3 / EBS
Edit each create table command to specify a Hive table location that's in the cloud data store. I believe you'll need to make all these as external tables since you cannot put data directly into the Hive warehouse directory and have the metastore immediately know of it.
Run all the commands on the AWS Hive connection.
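The "edit each create table command" step can be scripted. The following is a minimal sketch, assuming the saved statements look like typical `SHOW CREATE TABLE` output with a single quoted `LOCATION` clause; the function name, table, and bucket are hypothetical, and a regex-based rewrite like this will not cover every DDL variant.

```python
import re


def retarget_ddl(ddl: str, new_location: str) -> str:
    """Make a Hive CREATE TABLE statement external and point it at new_location."""
    # Turn a managed table into an external one (no-op if already EXTERNAL).
    ddl = re.sub(
        r"^\s*CREATE\s+TABLE\b",
        "CREATE EXTERNAL TABLE",
        ddl,
        count=1,
        flags=re.IGNORECASE,
    )
    # Replace the quoted path of the first LOCATION clause.
    ddl, replaced = re.subn(
        r"\bLOCATION\s+'[^']*'",
        lambda _match: f"LOCATION '{new_location}'",
        ddl,
        count=1,
        flags=re.IGNORECASE,
    )
    if replaced == 0:
        raise ValueError("no LOCATION clause found in DDL")
    return ddl


# Hypothetical statement saved from SHOW CREATE TABLE.
original = (
    "CREATE TABLE `sales.orders`(\n"
    "  `id` int)\n"
    "STORED AS PARQUET\n"
    "LOCATION\n"
    "  'hdfs://nn:8020/warehouse/sales.db/orders'\n"
    "TBLPROPERTIES ('a'='b')"
)
rewritten = retarget_ddl(original, "s3://my-bucket/sales/orders/")
```

Raising an error when no `LOCATION` clause is found is a deliberate choice here: appending one blindly could put it after `TBLPROPERTIES`, which Hive would reject.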
I have coworkers who have used CircusTrain
Impala and Spark are meant for processing. You're going to need to deal mostly with the Hive metastore here.