How to share an Athena Iceberg table with another account - amazon-web-services

I've recently been looking into the Apache Iceberg table format to reduce Athena query times on a Glue table with a large number of partitions; the additional features would be a bonus (transactions, row-level updates/deletes, time-travel queries, etc.). I've successfully built the tables and confirmed that they address the issue at hand, but I'd now like to be able to share the table with another AWS account. We've done this previously using Lake Formation cross-account grants and also the method described here, but both approaches raise errors in the alternate account when trying to query the shared table. I've also tried using a bucket policy and registering a duplicate Glue table in the other account, which doesn't throw an error, but no rows are found when querying.
Is this currently possible to do? I'm aware that I could achieve this by providing role access into the account with the Iceberg table, but this complicates interaction with the table from other services in the alternate account. Any ideas appreciated.
Edit: When querying the Lake Formation table I see 'Generic internal error - access denied'; it's documented that Iceberg tables don't work with Lake Formation, so this is expected. When querying the table shared via cross-account data catalog I see 'HIVE_METASTORE_ERROR: Table storage descriptor is missing SerDe info' when running a SELECT query and 'FAILED: SemanticException Unable to fetch table XXXXXXXXX. Unable to get table: java.lang.NullPointerException' when running SHOW CREATE TABLE or DESCRIBE. I can successfully run SHOW TBLPROPERTIES.

As of now Apache Iceberg Lake Formation integration is not supported:
Lake Formation – Integration with AWS Lake Formation is not supported.
https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html

Related

Is it possible to use Athena + DocumentDBConnector + Lake Formation?

I created a sort of data warehouse so you can make SQL queries in documentDB. I did so using Athena and a documentDBConnector.
(https://docs.aws.amazon.com/athena/latest/ug/athena-prebuilt-data-connectors-docdb.html)


However, I’d also like to set very deep and specific permission for a user who makes these queries in Athena and be able to specify table / column level permissions and I’m trying to see if I can do so using Lake Formation provided that I’m also using that documentDBConnector.
It doesn't seem like I can specify that docDB connector as a data source in Lake Formation.
If this is not possible, does anyone know of any other ideas that would let me specify more detailed permissions for making Athena Queries?

What is the simplest way to extract all 26 tables from a single DynamoDB db into AWS Glue Catalog

I am trying to build AWS QuickSight reports using AWS Athena that builds the specific views for said reports. However, I seem to only be able to select a single table in creating the Glue job despite being able to select all tables I need for the crawler of the entire DB from Dynamo.
What is the simplest route to get a complete extract of all tables that is queryable in Athena?
I don't want to connect the reports directly to DynamoDB as it's a production database and I want to create some separation to avoid any performance degradation by a poor query etc.

Create Athena resources with Terraform

I would like to create via Terraform an Athena database including tables and views. I have already searched a lot and found some posts, e.g. here: Create AWS Athena view programmatically
I know that I can use Terraform provisioners to execute AWS CLI commands to create these resources, for example like this: AWS Athena Create table view with SQL
But I don't want to do that. I want to create everything (as far as possible) with Terraform so that I don't have to worry about lifecycle etc.
As far as I understand, an Athena database can be a Glue database, depending on the source you choose. If I choose the AWSDataCatalog (Glue) as data source in Athena, it should not matter if I create an Athena database or a Glue database with Terraform, correct?
In Glue I can also create tables, but no views. Do the Glue tables automatically correspond to Athena tables? How can I create Athena views? I would like to create everything with SQL DDL, just like you can do it in the AWS Web Console. How does this work via Terraform? If this functionality is not available, what is the best way to go? I am grateful for every tip and help!
Athena uses the Glue Data Catalog to store metadata about databases, tables, and views. All Athena tables are Glue tables. However, not all Glue tables work with Athena – you can create tables in Glue that won't be visible in Athena, and you can create tables that will be visible but won't work (for example cause runtime errors when you query them).
Athena uses Glue Data Catalog for views, but the format is very specific to Athena, unlike regular tables which can be made interoperable with for example Spark.
In an answer to the question you link to I explain in detail the anatomy of an Athena view. I have created views with CloudFormation with that information so it can be done with Terraform too. Unless you write code you will have to jump through all the hoops and repeat most of the information as Presto metadata, unfortunately.
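The answer above says the view definition has to be repeated as Presto metadata. The following is an added example, not the answerer's code: it generates that metadata as a Glue `TableInput` so the view can be managed by a tool such as Terraform or CloudFormation. The field layout (the `VIRTUAL_VIEW` table type, the `presto_view` parameter, and base64-encoded JSON inside `ViewOriginalText`) is an assumption about how Athena stores views that is not spelled out on this page; the database, view, and column names are constructed.

```python
import base64
import json

# Presto -> Hive type names for the types this example handles (a constructed subset).
PRESTO_TO_HIVE = {"varchar": "string", "bigint": "bigint", "double": "double",
                  "boolean": "boolean", "date": "date", "timestamp": "timestamp"}
PREFIX, SUFFIX = "/* Presto View: ", " */"


def athena_view_table_input(database, name, sql, columns):
    """Build a Glue TableInput for an Athena view.

    columns is a list of (column_name, presto_type) in SELECT order. The same
    column list is written twice: as Presto metadata and as Hive columns.
    """
    unknown = sorted({t for _, t in columns if t not in PRESTO_TO_HIVE})
    if unknown:
        raise ValueError(f"no Hive mapping for Presto types: {unknown}")
    presto_meta = {
        "originalSql": sql,
        "catalog": "awsdatacatalog",
        "schema": database,
        "columns": [{"name": n, "type": t} for n, t in columns],
    }
    encoded = base64.b64encode(json.dumps(presto_meta).encode("utf-8")).decode("ascii")
    return {
        "Name": name,
        "TableType": "VIRTUAL_VIEW",
        "Parameters": {"presto_view": "true", "comment": "Presto View"},
        "ViewOriginalText": f"{PREFIX}{encoded}{SUFFIX}",
        "ViewExpandedText": "/* Presto View */",
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": PRESTO_TO_HIVE[t]} for n, t in columns],
            "SerdeInfo": {},
        },
    }


def decode_view(table_input):
    """Recover the Presto metadata from a view's ViewOriginalText."""
    text = table_input["ViewOriginalText"]
    if not (text.startswith(PREFIX) and text.endswith(SUFFIX)):
        raise ValueError("not an Athena view definition")
    return json.loads(base64.b64decode(text[len(PREFIX):-len(SUFFIX)]))


# Constructed sample view over a hypothetical table example_db.orders.
EXAMPLE_SQL = "SELECT order_id, customer FROM example_db.orders"
EXAMPLE = athena_view_table_input(
    "example_db", "orders_view", EXAMPLE_SQL,
    [("order_id", "bigint"), ("customer", "varchar")])
```

The returned dictionary has the shape the Glue `CreateTable` API takes as `TableInput`, and `decode_view` can be used to inspect a view that already exists in the catalog.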

Why is my AWS Glue crawler not creating any tables?

I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time, I don't care about transformations; this is a prototype and I simply want to dump the DB to S3 to start testing the various tool chains.
I've set up a Glue database and tested the connection to RDS successfully.
I am using the AWS-provided Glue IAM service role.
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema with MyDatabase. Oracle and MySQL don't support schema in the path, instead type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error. I wouldn't categorize it as a syntax error. When I ran into this it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name

HIVE_UNKNOWN_ERROR when running AWS Athena query on Glue table (RDS)

I'm getting an error when running an Athena query against a Glue table created from an RDS database:
HIVE_UNKNOWN_ERROR: Unable to create input format
The tables are created using a crawler. The tables show up correctly in the Glue interface:
However, they do not show up in the Athena interface under the database. It says: "The selected database has no tables"
I do not see this behaviour when using a database created using an S3 file. Maybe this is related to the error. Does anybody have an idea?
I had the same problem. This is the answer that I have got from AWS Support:
I understand that you set up a Glue crawler to crawl our RDS PostgreSQL database but the tables are not visible in Athena.
Athena service is designed to query tables that point to S3 as data-source. It cannot read data from non-S3 resources as of today.
So, unfortunately not possible at the moment.