I would like to use a few Aurora (MySQL) tables as sources when creating external tables on AWS Athena, because those tables are mutable and get updated often. I see that Hive and Presto support this via org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler. Is there an equivalent feature in AWS Athena?
Amazon Athena is dedicated to running interactive ad hoc SQL queries against data on Amazon S3; the feature you mention isn't supported yet.
https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
You have two solutions in this situation:
1) You can use the AWS Glue or Amazon EMR service, which can connect to Aurora over JDBC.
https://docs.aws.amazon.com/glue/latest/dg/console-connections.html
2) You can export the data from Aurora to S3 and then query it with Athena (see the sketch after the link below).
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.SaveIntoS3.html
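To make option 2 concrete, here is a minimal Python sketch. The endpoint, bucket, and customers table names are all placeholders, and it assumes the Aurora cluster has an IAM role attached that allows writing to the bucket:

import pymysql
import boto3

# Step 1: export the table from Aurora MySQL to S3 with
# SELECT ... INTO OUTFILE S3 (the feature described in the link above).
conn = pymysql.connect(host="my-aurora-endpoint", user="admin",
                       password="***", database="mydb")
with conn.cursor() as cur:
    cur.execute("""
        SELECT * FROM customers
        INTO OUTFILE S3 's3://my-export-bucket/customers/data'
        FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
        OVERWRITE ON
    """)
conn.close()

# Step 2: define an external table in Athena over the exported files,
# then query it like any other Athena table.
athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString="""
        CREATE EXTERNAL TABLE IF NOT EXISTS mydb.customers (
            id INT, name STRING, email STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 's3://my-export-bucket/customers/'
    """,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

Note that the export is a point-in-time copy; since your source tables change often, you would need to re-run or schedule it to keep the Athena side current.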
Related
I'm trying to understand how to properly connect Redshift Spectrum with Hudi data.
It looks like I can directly create a Redshift external table for data managed in Apache Hudi, as described in the following documentation: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html The other way is to integrate Hudi with the AWS Glue Data Catalog, as mentioned here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html and then access Hudi tables with Redshift Spectrum via the AWS Glue Data Catalog.
I have the same need for Apache Spark on AWS EMR: it looks like I can use Hudi directly from EMR or via the AWS Glue Data Catalog.
Right now I don't understand which way to choose. Could you please advise what the benefit of using Hudi via the AWS Glue Data Catalog is, or do I need to use it directly from Redshift Spectrum and AWS EMR?
Given that Spark on EMR needs a catalog, a Hive metastore if you will, using the AWS Glue Data Catalog is an option.
If you elect to use Glue as the metastore, then use it as the source for all data, unless errors are evident, in which case use the Hudi API for Spark directly.
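For what it's worth, here is a minimal PySpark sketch of the two access paths, assuming an EMR cluster already configured to use the Glue Data Catalog as its Hive metastore; the table name mydb.hudi_trips and the S3 base path are placeholders:

from pyspark.sql import SparkSession

# Kryo is the serializer Hudi recommends for Spark jobs.
spark = (SparkSession.builder
         .appName("hudi-read-example")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .enableHiveSupport()
         .getOrCreate())

# Path 1: via the Glue Data Catalog. Any engine that can see the catalog
# (Spark on EMR, Athena, Redshift Spectrum) resolves the table by name.
df_catalog = spark.sql("SELECT * FROM mydb.hudi_trips")

# Path 2: directly via the Hudi datasource, bypassing the catalog and
# reading from the table's S3 base path instead.
df_direct = spark.read.format("hudi").load("s3://my-bucket/hudi/trips/")

df_catalog.show(5)
df_direct.show(5)

The catalog route is what lets Redshift Spectrum and Spark share one set of table definitions; the direct route only helps the Spark job that already knows the path.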
Is it possible to query things in an RDS database using Athena? Or do I somehow have to get my data out of RDS and copy it into an S3 bucket so that Athena can query it from there? If that is the case, how can I know which tables are in my RDS instance? Is there a way to explore all the schemas of a database with Glue?
A feature was created for exactly this reason last year: Federated Queries.
Using this, you can query a large number of data sources beyond just S3.
If you're using either MySQL or Postgres in RDS then you can make use of the JDBC connector, with additional instructions here.
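As a hedged illustration, assuming a MySQL data source has already been registered in Athena under the placeholder catalog name mysql_catalog, a federated query is then just an ordinary Athena query against that catalog:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Federated catalogs are addressed the same way as native ones:
# catalog.database.table.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "mysql_catalog"."mydb"."orders" LIMIT 10',
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])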
It's known that AWS Athena comes integrated with AWS Glue as its data catalog. Is there any way to configure Athena to use a different catalog, e.g. to point to a different Hive metastore (e.g. on EC2 instances) managed by the user?
Athena integration with a Hive Metastore is a new feature, now available in preview mode. You can find the details of how to use this feature in the documentation.
I am new to AWS. I want to use AWS Glue for an ETL process.
Could we use AWS Glue to analyze an RDS database and store the analyzed data in an RDS MySQL table using an ETL job?
Thanks
Yes, it's possible. We have used S3 to store our raw data, from where we read the data in AWS Glue and perform UPSERTs to RDS Aurora as part of our ETL process. You can use either an AWS Glue trigger or a Lambda S3 event trigger to call the Glue job.
We have used pymysql / mysql.connector in AWS Glue, since we have to do UPSERTs. Bulk loading data directly from S3 is also supported for RDS MySQL (Aurora). A minimal sketch of the UPSERT pattern is below.
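In this sketch the endpoint, credentials, and the customers table are placeholders; the upsert itself is MySQL's native INSERT ... ON DUPLICATE KEY UPDATE:

import pymysql

# Connect to the Aurora MySQL cluster (connection details are placeholders).
conn = pymysql.connect(host="my-aurora-endpoint", user="etl_user",
                       password="***", database="mydb")

rows = [(1, "alice", "alice@example.com"),
        (2, "bob", "bob@example.com")]

sql = """
    INSERT INTO customers (id, name, email)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), email = VALUES(email)
"""

with conn.cursor() as cur:
    cur.executemany(sql, rows)  # batches the rows into one multi-row insert
conn.commit()
conn.close()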
I need to integrate the AWS Athena service with an existing Hive Metastore (not AWS Glue).
Can you please let me know how I can connect Athena to the Hive Metastore?
Athena works only with its own metastore or the related AWS Glue metastore. It will not work with an external metastore.
However, you can set up multiple tables or databases on the same underlying S3 storage. So if you wrote data to S3 using an external metastore, you could query those files with Athena, after setting up an appropriate database and table definition in Athena's metastore.
Amazon Athena just released a new feature (in preview now) that allows you to connect Athena to your Apache Hive Metastore. You can see the announcement here. Detailed steps to add the Hive Metastore connector are available in the Athena documentation.
Another way is to export the Hive metadata to a file using a command like
command = "hive -f " + schema + "_tables.hql -S >> " + schema + ".output"
where schema is the name of the schema to be exported, and then import the table definitions into Athena using Groovy (driven from Python); a runnable version of the export command is sketched after the link below.
Instructions to set up Groovy can be found at the link:
https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-athena-importing-hive-metastores
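As a runnable sketch of the export step above, assuming schema_tables.hql has already been generated (e.g. a file of SHOW CREATE TABLE statements for each table in the schema):

import subprocess

schema = "mydb"  # name of the Hive schema to export (placeholder)

# Run the generated HQL file through the hive CLI in silent mode and
# append the emitted DDL to <schema>.output for the import step.
command = "hive -f " + schema + "_tables.hql -S >> " + schema + ".output"
subprocess.run(command, shell=True, check=True)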