How to connect AWS Athena to an existing Hive Metastore

I need to integrate the AWS Athena service with an existing Hive Metastore (not AWS Glue).
Can you please let me know how I can connect Athena to the Hive Metastore?

Athena works only with its own metastore or the related AWS Glue metastore. It will not work with an external metastore.
However, you can set up multiple tables or databases on the same underlying S3 storage. So if you wrote data to S3 using an external metastore, you could query those files with Athena, after setting up an appropriate database and table definition in Athena's metastore.
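As a rough sketch of that approach, the table definition can be created programmatically with boto3 (the bucket, database, table, and column names below are hypothetical placeholders):

    import boto3

    # Sketch: register data already sitting in S3 (written via the external
    # metastore) as an Athena table. Assumes the mydb database already exists
    # (e.g. created with CREATE DATABASE mydb). All names are placeholders.
    athena = boto3.client("athena", region_name="us-east-1")

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events (
        id STRING,
        ts TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/events/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )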

Amazon Athena just released a new feature (in preview now) that allows you to connect Athena to your Apache Hive Metastore. You can see the announcement here. Detailed steps to add the Hive Metastore connector are available in the Athena documentation.
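Per the documentation, the connector is deployed as a Lambda function and then registered as an Athena data catalog. A minimal boto3 sketch, assuming the Lambda is already deployed (the catalog name and Lambda ARN here are hypothetical):

    import boto3

    # Sketch only: register a deployed Hive metastore connector Lambda as an
    # Athena data catalog. The catalog name and Lambda ARN are placeholders.
    athena = boto3.client("athena", region_name="us-east-1")

    athena.create_data_catalog(
        Name="hms-catalog",
        Type="HIVE",
        Parameters={
            "metadata-function": "arn:aws:lambda:us-east-1:123456789012:function:hms-connector"
        },
    )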

Another way is to export the Hive metadata to a file using a command like
command = "hive -f " + schema + "_tables.hql -S >> " + schema + ".output"
where schema is the name of the schema to be exported, and then import the table definitions into Athena using Groovy driven from Python.
Instructions to set up Groovy can be found at the link
https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-athena-importing-hive-metastores
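A minimal sketch of the export step, assuming each <schema>_tables.hql file already contains the SHOW CREATE TABLE statements for that schema, as in the linked blog (schema names below are hypothetical):

    import subprocess

    # Sketch: dump each schema's table DDL to a file via the hive CLI,
    # mirroring the command above. Schema names are placeholders.
    schemas = ["sales", "marketing"]

    for schema in schemas:
        command = "hive -f " + schema + "_tables.hql -S >> " + schema + ".output"
        subprocess.run(command, shell=True, check=True)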

Related

Can we access AWS Glue Tables using jdbc?

I need to access some tables that are in AWS Glue, which I am using as a metastore. I wanted to know if Glue provides any JDBC endpoint to connect to it, just like Hive does.
I understand that it is possible to read data into AWS Glue from other databases like MySQL, Oracle, etc. using JDBC, but my requirement is the opposite: I have to read from AWS Glue using JDBC. Please help if it is possible, as I could not find a reference for this.
To access the data in the Glue catalog, follow these steps:
1) Run the crawler and update the table in the Glue catalog.
2) To access these tables through a JDBC or ODBC endpoint, you need Athena.
3) Download the driver from this link.
4) Read the docs for building the URL for your region here.
Also go through this documentation for additional properties.
Hope it helps.
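As a Python illustration of the same idea, here is a sketch using PyAthena, which is a DBAPI client rather than a JDBC driver but queries the same Athena endpoint the JDBC driver does (bucket, region, and table names are hypothetical):

    # pip install pyathena
    from pyathena import connect

    # Sketch: query a Glue-catalogued table through Athena from Python.
    conn = connect(
        s3_staging_dir="s3://my-bucket/athena-results/",  # placeholder bucket
        region_name="us-east-1",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM mydb.my_glue_table LIMIT 10")
    for row in cursor.fetchall():
        print(row)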

AWS Athena catalog other than Glue

It's known that AWS Athena comes integrated with AWS Glue as its data catalog. Is there any way to configure Athena to use a different catalog, e.g. to point to a different Hive metastore (for example, on EC2 instances) managed by the user?
Athena integration with a Hive Metastore is a new feature, now available in preview mode. You can find the details of how to use this feature in the documentation.

Using AWSGlue as Hive Metastore where data is in S3

Apologies if this has been answered elsewhere (but I don't think it has). I'm trying to use AWS Glue as an external metastore for Hive via an EMR cluster.
I have some data stored as text files on S3, and via the AWS Glue web console I created a table definition over those text files.
I also started up an EMR cluster following the directions here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
When I SSH into my EMR cluster and try to access Hive, I expected the table I created in AWS Glue to appear when I ran a "show tables" command, but instead I get the following error message when starting the interactive Hive shell:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: Please migrate your Catalog to enable access to this database (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: e6b2a87b-fe5a-11e8-8ed4-5d1e42734679))
It seems like there's some permission error involved here. I'm using EMR_EC2_DefaultRole for my EC2 Instance Profile, so I didn't think this would happen.
Am I missing something obvious?
Thanks for any help you can provide!
Kindly attach the AWS Glue and S3 full-access policies to your current IAM role; that should do it.
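If you prefer to do that programmatically, a boto3 sketch (using the EMR_EC2_DefaultRole mentioned in the question; in practice you would scope these managed policies down rather than grant full access):

    import boto3

    # Sketch: attach the AWS-managed Glue and S3 full-access policies to the
    # EC2 instance role used by the EMR cluster. Full access is for testing
    # only; narrow the permissions for real workloads.
    iam = boto3.client("iam")

    for policy_arn in (
        "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
        "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    ):
        iam.attach_role_policy(RoleName="EMR_EC2_DefaultRole", PolicyArn=policy_arn)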
In order to solve this issue, you will have to migrate your existing Athena catalog to the Glue Data Catalog, as explained here.
To confirm that your Athena catalog has been migrated, execute the following command using the AWS CLI:
aws glue get-catalog-import-status --catalog-id <aws-account-id> --region <region>
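The same check can be done from Python with boto3 (the account ID and region are placeholders):

    import boto3

    # Sketch: confirm the Athena-to-Glue catalog migration has completed.
    glue = boto3.client("glue", region_name="us-east-1")
    status = glue.get_catalog_import_status(CatalogId="123456789012")
    print(status["ImportStatus"])  # includes an ImportCompleted flag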
I faced exactly the same issue recently and was able to get past it by upgrading Athena to the Glue Data Catalog.
I am not using Athena or Redshift Spectrum to query the table either, but on the Glue and Athena consoles there was a message saying:
To use the AWS Glue Data Catalog as a common metadata repository for Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, you need to upgrade your Athena Data Catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Amazon Athena or Redshift Spectrum.
Soon after upgrading, I was able to query the table through the Hive and Spark shells from the EMR cluster.
This might be because your default and other databases were created via Athena earlier and need to be upgraded to Glue before they can be used from EMR, since default is the database Hive uses when you open the Hive shell on EMR. To run queries, you have to use DBName.tableName or explicitly switch to the database that contains your table.
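To illustrate that last point, a sketch using PyHive against the cluster's HiveServer2 (host, database, and table names are hypothetical):

    # pip install pyhive
    from pyhive import hive

    # Sketch: connect to HiveServer2 on the EMR master node.
    conn = hive.Connection(host="emr-master.example.com", port=10000)
    cursor = conn.cursor()

    # Either fully qualify the table with its database...
    cursor.execute("SELECT * FROM mydb.mytable LIMIT 10")

    # ...or switch databases explicitly instead of relying on `default`.
    cursor.execute("USE mydb")
    cursor.execute("SELECT * FROM mytable LIMIT 10")
    print(cursor.fetchall())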

Using MySQL Aurora as source in Athena

I would like to use a few Aurora (MySQL) tables as a source when creating external tables in AWS Athena, because those tables are mutable and get updated often. I see that Hive and Presto support this by using org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler. Is there an equivalent feature in AWS Athena?
Amazon Athena is dedicated to running interactive ad hoc SQL queries against data on Amazon S3; the mentioned feature isn't supported yet.
https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
You have two solutions in this situation:
1) You can use the Glue or EMR service.
https://docs.aws.amazon.com/glue/latest/dg/console-connections.html
2) You can export the data from Aurora to S3 and then query it with Athena.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.SaveIntoS3.html
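A sketch of option 2 in Python using the pymysql client, assuming the Aurora cluster has been given an IAM role that allows writing to S3 (connection details, bucket, and table names are hypothetical):

    import pymysql

    # Sketch: export an Aurora MySQL table to S3 with SELECT ... INTO OUTFILE S3.
    # The S3 URI embeds the region; all values below are placeholders.
    conn = pymysql.connect(
        host="my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
        user="admin",
        password="...",
        database="mydb",
    )
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT * FROM mytable "
            "INTO OUTFILE S3 's3-us-east-1://my-bucket/exports/mytable' "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
        )

The exported files can then be covered by an Athena external table pointing at s3://my-bucket/exports/.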

How to import hive tables from Hadoop datalake to AWS RDS?

I need suggestions on importing data from a Hadoop data lake (Kerberos-authenticated) to AWS. All the tables in Hive should land in S3 and then need to be loaded into AWS RDS.
I have considered the following options:
1) AWS Glue?
2) Spark connecting to the Hive metastore?
3) Connecting to Impala from AWS?
There are around 50 tables to be imported. How can I maintain the schema? Is it better to import the data and then create a separate schema in RDS?
Personally, I would dump a list of all tables that need to be moved.
From that, run SHOW CREATE TABLE over them all, and save the queries.
Run distcp, or however else you want to move the data into S3 / EBS.
Edit each CREATE TABLE command to specify a Hive table location in the cloud data store. I believe you'll need to make all of these external tables, since you cannot put data directly into the Hive warehouse directory and have the metastore immediately know of it.
Run all the commands on the AWS Hive connection.
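A rough sketch of those steps in Python (the table names, paths, and S3 bucket are hypothetical):

    import re
    import subprocess

    # Sketch: dump each table's DDL, point the LOCATION at the cloud copy of
    # the data, and mark the table EXTERNAL. The resulting .hql files can then
    # be run against the AWS-side Hive.
    tables = ["db1.orders", "db1.customers"]  # placeholder table list

    for table in tables:
        ddl = subprocess.run(
            ["hive", "-S", "-e", "SHOW CREATE TABLE " + table],
            capture_output=True, text=True, check=True,
        ).stdout
        ddl = ddl.replace("CREATE TABLE", "CREATE EXTERNAL TABLE", 1)
        ddl = re.sub(
            r"LOCATION\s+'[^']*'",
            "LOCATION 's3://my-bucket/warehouse/" + table + "'",
            ddl,
        )
        with open(table + ".hql", "w") as f:
            f.write(ddl)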
I have coworkers who have used CircusTrain.
Impala and Spark are meant for processing. You're going to need to deal mostly with the Hive metastore here.