I need to access some tables in AWS Glue, which I am using as a metastore. Does Glue provide a JDBC endpoint to connect to, the way Hive does?
I understand that it is possible to read data into AWS Glue from other databases such as MySQL or Oracle using JDBC, but my requirement is the opposite: I have to read from AWS Glue using JDBC. Please help if this is possible, as I could not find a reference for it.
To access data from the Glue catalog, follow these steps:
Run the crawler and update the table in the Glue catalog.
To access these tables through a JDBC or ODBC endpoint, you need Athena.
Download the driver from this link.
Read the docs here for creating the URL according to your region.
Also go through this documentation for additional properties.
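As a rough illustration of the URL step, the snippet below assembles a connection string in the style used by the Simba-based Athena JDBC driver (2.x). The property names `AwsRegion` and `S3OutputLocation` are an assumption tied to that driver generation, and the region and results bucket are placeholders, so check the documentation for the driver version you actually download.

```python
def athena_jdbc_url(region: str, s3_output_location: str) -> str:
    """Build a JDBC URL in the style of the Simba-based Athena driver (2.x).

    Assumption: the driver accepts the AwsRegion and S3OutputLocation
    properties. Other driver versions may use different property names.
    """
    if not s3_output_location.startswith("s3://"):
        raise ValueError("Athena query results must be written to an s3:// location")
    return (
        f"jdbc:awsathena://AwsRegion={region};"
        f"S3OutputLocation={s3_output_location};"
    )


# Placeholder values: replace with your own region and results bucket.
url = athena_jdbc_url("us-east-1", "s3://my-athena-results/output/")
print(url)
```

Any JDBC client can then use this URL together with the driver JAR, and the tables it sees are the ones registered in the Glue catalog.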
Hope it helps
I'm trying to understand how to properly connect Redshift Spectrum with Hudi data.
It looks like I can directly create a Redshift external table for data managed in Apache Hudi, as described in the following documentation: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html The other way is to integrate Hudi with the AWS Glue Data Catalog, as mentioned here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html and then access the Hudi tables with Redshift Spectrum via the AWS Glue Data Catalog.
I have the same need for Apache Spark on AWS EMR. It looks like I can use Hudi directly from EMR or via the AWS Glue Data Catalog.
Right now, I don't understand which way to choose. Could you please advise what the benefit of using Hudi via the AWS Glue Data Catalog is, or whether I should use it directly from Redshift Spectrum and AWS EMR?
Spark on EMR needs a catalog (a Hive metastore, if you will), so using the AWS Glue Catalog is an option.
If you elect to use Glue as the metastore, then use it as the source for all data, unless errors are evident, in which case use the Hudi API for Spark.
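For reference, Spark on EMR is usually pointed at the Glue Data Catalog through a `spark-hive-site` configuration classification supplied when the cluster is created. The sketch below only builds that configuration as a Python structure and prints it as JSON; the helper name is made up for illustration, and how you pass the result to EMR (console, CLI, or SDK) is left out.

```python
import json

# Client factory class that makes Spark SQL on EMR use the Glue Data Catalog
# as its Hive-compatible metastore.
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configuration() -> list:
    """Return an EMR 'Configurations' list (hypothetical helper name)."""
    return [
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY
            },
        }
    ]


configurations = glue_catalog_configuration()
print(json.dumps(configurations, indent=2))
```

With this in place, tables registered in Glue (including Hudi tables synced to it) are visible to Spark SQL on the cluster under the same names that Redshift Spectrum sees.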
AWS Athena comes integrated with AWS Glue as its data catalog. Is there any way to configure Athena to use a different catalog, for example to point it to a different Hive metastore (e.g. on EC2 instances) managed by the user?
Athena integration with a Hive Metastore is a new feature, now available in preview mode. You can find the details of how to use this feature in the documentation.
Can we execute a SQL query inside a DMS task so that it fetches only the required data and not the whole database?
If that is not possible, which AWS service can be used to fetch query-based data from an on-prem data source into AWS S3?
You can use filters and/or exclude fields: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.html
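To make the filter and exclude-field options concrete, here is a minimal sketch that builds a DMS table-mapping document with one selection rule carrying a source filter and one transformation rule that removes a column. The schema, table, and column names are invented for illustration; the rule keys follow the table-mapping format described in the linked documentation.

```python
import json


def selection_rule_with_filter(schema, table, column, operator, value):
    """Selection rule that only migrates rows matching a source filter."""
    return {
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "filtered-selection",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
        "filters": [
            {
                "filter-type": "source",
                "column-name": column,
                "filter-conditions": [
                    {"filter-operator": operator, "value": value}
                ],
            }
        ],
    }


def remove_column_rule(schema, table, column):
    """Transformation rule that drops one column from the migrated table."""
    return {
        "rule-type": "transformation",
        "rule-id": "2",
        "rule-name": "drop-column",
        "rule-target": "column",
        "object-locator": {
            "schema-name": schema,
            "table-name": table,
            "column-name": column,
        },
        "rule-action": "remove-column",
    }


# Hypothetical names: migrate sales.orders rows with order_id >= 1000,
# leaving out the internal_notes column.
table_mappings = {
    "rules": [
        selection_rule_with_filter("sales", "orders", "order_id", "gte", "1000"),
        remove_column_rule("sales", "orders", "internal_notes"),
    ]
}
mapping_json = json.dumps(table_mappings, indent=2)
print(mapping_json)
```

The resulting JSON is what goes into the task's table mappings. Note that this is row and column filtering rather than arbitrary SQL, so joins or aggregations still need another tool.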
Contact me if you have problems.
As an alternative to DMS, you can use AWS Glue, retrieving data from the on-prem database with a PySpark DataFrame and writing it to either S3 or AWS RDS. This works very well. The only downside is the cost.
This solution supports both tables and SQL queries as input for data extraction.
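A minimal sketch of how the table-or-SQL input could look with Spark's JDBC data source. The helper names, the connection URL, and the credentials are placeholders, and `extract_to_s3` needs a live Spark session (for example inside a Glue job), so it is defined here but not called.

```python
from typing import Optional


def jdbc_read_options(url: str, user: str, password: str, *,
                      table: Optional[str] = None,
                      sql: Optional[str] = None) -> dict:
    """Build options for spark.read.format("jdbc") from a table name or a query."""
    if (table is None) == (sql is None):
        raise ValueError("Provide exactly one of table or sql")
    options = {"url": url, "user": user, "password": password}
    if table is not None:
        options["dbtable"] = table
    else:
        # Wrapping the query as a derived table makes the source database
        # execute it, so only the selected rows cross the network.
        options["dbtable"] = f"({sql}) src"
    return options


def extract_to_s3(spark, options: dict, s3_path: str) -> None:
    """Read through JDBC and write Parquet to S3 (requires a Spark session)."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("overwrite").parquet(s3_path)


# Placeholder connection details for an on-prem MySQL database.
opts = jdbc_read_options(
    "jdbc:mysql://onprem-host:3306/shop",
    "etl_user",
    "secret",
    sql="SELECT id, total FROM orders WHERE total > 100",
)
print(opts["dbtable"])
```

In a real job the password would come from a secrets store rather than a literal, and the JDBC driver JAR for the source database has to be available to the job.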
I'm getting an error when running an Athena query against a Glue table created from an RDS database:
HIVE_UNKNOWN_ERROR: Unable to create input format
The tables are created using a crawler. The tables show up correctly in the Glue interface:
However, they do not show up in the Athena interface under the database. It says: "The selected database has no tables"
I do not see this behaviour with a database created from an S3 file, so maybe it is related to the error. Does anybody have an idea?
I had the same problem. This is the answer that I got from AWS Support:
I understand that you set up a Glue crawler to crawl your RDS PostgreSQL database but the tables are not visible in Athena.
The Athena service is designed to query tables that point to S3 as their data source. It cannot read data from non-S3 resources as of today.
So, unfortunately, this is not possible at the moment.
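One way to see which catalog tables fall under this limitation is to look at each table's storage location. The sketch below assumes table definitions shaped like the entries returned by the Glue `get_tables` API, where the location lives under `StorageDescriptor.Location`. The two sample tables and their location values are made up for illustration.

```python
def is_s3_backed(glue_table: dict) -> bool:
    """Return True if a Glue table definition points at data stored in S3."""
    location = glue_table.get("StorageDescriptor", {}).get("Location", "")
    return location.startswith("s3://")


# Hypothetical table definitions, shaped like Glue get_tables entries.
tables = [
    {"Name": "events", "StorageDescriptor": {"Location": "s3://my-bucket/events/"}},
    {"Name": "customers", "StorageDescriptor": {"Location": "mydb.public.customers"}},
]

# Tables crawled from a JDBC source do not have an s3:// location.
queryable = [t["Name"] for t in tables if is_s3_backed(t)]
print(queryable)
```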
I need to integrate the AWS Athena service with an existing Hive Metastore (not AWS Glue).
Can you please let me know how I can connect Athena to a Hive Metastore?
Athena works only with its own metastore or the related AWS Glue metastore. It will not work with an external metastore.
However, you can set up multiple tables or databases on the same underlying S3 storage. So if you wrote data to S3 using an external metastore, you could query those files with Athena, after setting up an appropriate database and table definition in Athena's metastore.
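As a sketch of that approach, the snippet below generates a `CREATE EXTERNAL TABLE` statement over an existing S3 prefix and shows how it could be submitted to Athena with boto3. The database, table, columns, Parquet format, and bucket names are all assumptions for illustration. `run_in_athena` needs AWS credentials and is therefore defined but not called.

```python
def external_table_ddl(database, table, columns, s3_location):
    """Build DDL for an Athena table over files that already exist in S3."""
    cols = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


def run_in_athena(ddl: str, output_location: str) -> str:
    """Submit the DDL to Athena and return the query execution id."""
    import boto3  # imported lazily so the DDL helper works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical table written earlier through an external Hive metastore.
ddl = external_table_ddl(
    "analytics",
    "page_views",
    [("user_id", "string"), ("viewed_at", "timestamp")],
    "s3://my-data-lake/page_views/",
)
print(ddl)
```

The column list and storage format have to match what the external metastore recorded for the same files, since Athena reads the data as described by its own table definition.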
Amazon Athena just released a new feature (in preview now) that allows you to connect Athena to your Apache Hive Metastore. You can see the announcement here. Detailed steps to add the Hive Metastore connector are available in the Athena documentation.
Another way is to export the Hive metadata to a file using the command:
command="hive -f "+schema+"_tables.hql -S >> "+schema+".output"
where schema is the name of the schema to be exported, and then import the table definitions into Athena using Groovy from Python.
Instructions for setting up Groovy can be found at this link:
https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-athena-importing-hive-metastores