This is the error we get in Athena: HIVE_UNKNOWN_ERROR: Error creating an instance of com.facebook.presto.hive.lakeformation.CachingLakeFormationCredentialsProvider
The bucket is registered with Lake Formation
The role used for querying Athena has been given full access in Lake Formation to the database and all the tables in it.
The role has also been granted access to the underlying S3 bucket in the Data Locations section of Lake Formation.
Contacted AWS support. Turns out the problem was that I had "-" and "." in my Athena database name. According to Athena documentation:
"The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character." (https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-names)
For some reason this was not a problem when we were working outside of Lake Formation, but as soon as we registered the S3 location in LF, it started failing. I have confirmed that removing those characters from the database name solves the problem.
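If it's useful, here is a minimal check (plain Python; the names below are just examples) you could run before registering a location, to catch database or table names that break the documented pattern:

    import re

    # Per the Athena/Glue naming guidance quoted above: only lowercase
    # letters, numbers, and underscores are safe.
    VALID_NAME = re.compile(r"^[a-z0-9_]+$")

    def is_safe_name(name: str) -> bool:
        """Return True if the name uses only the allowed characters."""
        return bool(VALID_NAME.match(name))

    print(is_safe_name("my-database.v2"))   # False -> rename before using with Lake Formation
    print(is_safe_name("my_database_v2"))   # True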
Make sure you include the slash (/) after the bucket name.
I've recently been looking into the Apache Iceberg table format to reduce Athena query times on a Glue table with a large number of partitions; the additional features would be a bonus (transactions, row-level updates/deletes, time-travel queries, etc.). I've successfully built the tables and confirmed that they address the issue at hand, but I'd now like to be able to share a table with another AWS account. We've done this previously using Lake Formation cross-account grants and also the method described here, but both approaches raise errors in the other account when trying to query the shared table. I've also tried using a bucket policy and registering a duplicate Glue table in the other account, which doesn't throw an error, but no rows are found when querying.
Is this currently possible to do? I'm aware that I could achieve this by providing role access into the account with the iceberg table but this complicates interaction with the table from other services in the alternate account. Any ideas appreciated.
Edit: When querying the lake formation table I see 'Generic internal error - access denied', it's documented that Iceberg tables don't work with Lake Formation so this is expected. When querying the table shared via cross account data catalog I see 'HIVE_METASTORE_ERROR: Table storage descriptor is missing SerDe info' when running a SELECT query and 'FAILED: SemanticException Unable to fetch table XXXXXXXXX. Unable to get table: java.lang.NullPointerException' when running SHOW CREATE TABLE or DESCRIBE. I can successfully run SHOW TBLPROPERTIES.
As of now, Apache Iceberg integration with Lake Formation is not supported:
Lake Formation – Integration with AWS Lake Formation is not supported.
https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html
I tried to import data from an external Amazon S3 bucket (the Dynamic Yield Daily Activity Stream, as it happens) into BigQuery by using the Data Transfer tab.
I created a new dataset in my project and an empty table with no schema (since the S3 data is Parquet, am I right that I don't need to add a schema to the table?).
I then made a new data transfer with the S3 bucket credentials, selecting my new data set and table as the destination. I have tried multiple times but I get the same error, "Failed to obtain the location of the source S3 bucket. Additional details: Access Denied"
However, when checking with the owner of the bucket they have confirmed 100% that I do have the correct access, and on their end they have successfully pulled data from the bucket. I have been able to pull data from the bucket using Cloudberry Explorer myself too, with the same credentials.
So what have I done wrong? Is it because I didn't define the table schema? Or something else? Maybe the data set location is wrong? What else could be the problem?
Thanks
According to the BigQuery Documentation for Amazon S3 transfers, you do need the schema definition for the table.
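If it helps, here is a rough sketch of creating the destination table with an explicit schema using the google-cloud-bigquery Python client (the table ID and fields below are made up; match them to the columns in your Parquet files):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical destination table and schema -- adjust to your Parquet files.
    table_id = "my-project.my_dataset.daily_activity_stream"
    schema = [
        bigquery.SchemaField("event_time", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
    ]

    # Create the empty table the S3 transfer will load into.
    client.create_table(bigquery.Table(table_id, schema=schema))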
Best of luck!
This is definitely an access issue. It can stem from two places though:
Your Access Key ID and Secret are incorrect
Your S3 URI is incorrect
What does your S3 URI look like? Sometimes access is given to an individual "folder" or object rather than a whole bucket.
I get the exact same error when accessing an incorrect S3 bucket with a valid ID and Key.
And to confirm: the table needs to be created ahead of time.
And finally, since you're using Parquet, it does work with an empty, schemaless table.
*I used star notation to grab the file: s3://mys3bucket/*
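To narrow down which of the two it is, you could reproduce the checks outside BigQuery with boto3 (the bucket name and keys below are placeholders, and I'm assuming the transfer's "location" error corresponds to the GetBucketLocation call):

    import boto3

    # Use the same key pair you gave the transfer configuration.
    s3 = boto3.client(
        "s3",
        aws_access_key_id="AKIA...",      # placeholder
        aws_secret_access_key="...",      # placeholder
    )

    # If this raises AccessDenied, the keys lack s3:GetBucketLocation on the
    # bucket, even though object-level reads may still work elsewhere.
    print(s3.get_bucket_location(Bucket="mys3bucket"))

    # Quick listing under the same prefix used in the transfer URI.
    print(s3.list_objects_v2(Bucket="mys3bucket", Prefix="", MaxKeys=5))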
I understand the concept of data lake zones in S3, and I am looking at establishing three zones: LANDING, STAGING, and CURATED. If I were in an Azure environment, I would create the Data Lake and have multiple folders acting as the various zones.
How would I do the equivalent in AWS? Would it be a separate bucket for each zone (s3://landing_data/, s3://staging_data, s3://curated_data) or a single bucket with multiple folders (e.g. s3://bucket_name/landing/..., s3://bucket_name/staging/)? I understand that S3 buckets are nothing more than containers.
Also, would I be able to mount multiple S3 buckets on Databricks AWS? If so is there any reference documentation?
Is there any best/recommended approach given that we can read and write to S3 in multiple ways?
I looked at this as well:
S3 Performance Best Practices
There is no single solution - the actual implementation depends on the amount of data, number of consumers/producers, etc. You need to take into account AWS S3 limits, like:
By default you may have only 100 buckets in an account (although this limit can be increased)
You may issue 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix (directory) in a single bucket (although the number of prefixes is not limited)
You can mount each of the buckets, or individual folders, into the Databricks workspace as described in the documentation. But this is really not recommended from a security standpoint, as everyone in the workspace will have the same permissions as the role that was used for mounting. Instead, just use full S3 URLs in combination with instance profiles.
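A minimal sketch of that full-URL approach from a Databricks notebook (the bucket and folder names are just examples; this assumes the cluster already runs with an instance profile that can reach the buckets):

    # Read from the landing zone and write to the staging zone directly by
    # S3 URL -- no mounts; access comes from the cluster's instance profile.
    raw = spark.read.json("s3://landing-data/source_system/2023/")

    cleaned = raw.dropDuplicates()

    cleaned.write.mode("overwrite").parquet("s3://staging-data/source_system/")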
I've checked the AWS FAQ and other resources but cannot find an answer. I could contact AWS technical support, but I do not have permission to do so.
I've checked the S3 bucket that stores query results from Athena; however, it does not seem to contain results from queries run through Athena via QuickSight.
Is there somewhere else that Athena stores query results when it is used via QuickSight?
thanks!
Athena always stores query results on S3. QuickSight probably just uses a different bucket. There should be queries from QuickSight in the query history (possibly in a work group that is not the primary), if you look at the query execution of one of these you should be able to figure out where the output is stored (e.g. aws athena get-query-execution --region AWS_REGION --query-execution-id ID and look for OutputLocation in the result).
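If you prefer boto3, roughly the same lookup across all workgroups (the region below is a placeholder) looks like this:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

    # Check non-primary workgroups too -- QuickSight may use its own.
    for wg in athena.list_work_groups()["WorkGroups"]:
        name = wg["Name"]
        recent = athena.list_query_executions(WorkGroup=name, MaxResults=5)
        for qid in recent["QueryExecutionIds"]:
            qe = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
            output = qe.get("ResultConfiguration", {}).get("OutputLocation")
            print(name, qid, output)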
This answer comes a bit late, but the QuickSight documentation says:
"The Athena workgroup must have an associated S3 output location."
So it seems QuickSight always stores the query results of a particular data source in the bucket that is configured as the output location for the Athena workgroup associated with that data source.
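Based on that, checking the workgroup's configured output location directly should also show where the results land; a small sketch (the workgroup name is a placeholder):

    import boto3

    athena = boto3.client("athena")

    # Replace with the workgroup your QuickSight data source is associated with.
    wg = athena.get_work_group(WorkGroup="primary")["WorkGroup"]
    result_config = wg.get("Configuration", {}).get("ResultConfiguration", {})
    print(result_config.get("OutputLocation"))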
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time I don't care about transformations; this is a prototype and I simply want to dump the DB to S3 to start testing the various toolchains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want, e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema with MyDatabase. Oracle and MySQL don't support schema in the path, instead type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
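For reference, here is roughly how that include path sits in the crawler definition when created with boto3 (the connection, role, and database names below are placeholders):

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="rds-mysql-crawler",
        Role="AWSGlueServiceRole-default",     # placeholder Glue service role
        DatabaseName="glue_catalog_db",        # Glue database where tables are created
        Targets={
            "JdbcTargets": [
                {
                    "ConnectionName": "my-rds-connection",  # the tested RDS connection
                    "Path": "mydatabase/%",                 # note the /% -- this was the fix
                }
            ]
        },
    )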
Ryan Fisher is correct in the sense that it's an error, though I wouldn't categorize it as a syntax error. When I ran into this, it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name