Setting up AWS Glue to crawl Qubole - amazon-web-services

Currently I work with Qubole to access Hive data. I've added metadata from several databases, and want to add all the Hive metadata to AWS Glue. Is this possible? Any help is appreciated.

Yes, you can add all the metadata from Qubole to the AWS Glue Data Catalog, but the Glue crawler doesn't support Qubole as a source.
You can use the AWS Glue Sync Agent to synchronize metadata changes from the Hive metastore in Qubole to the AWS Glue Data Catalog. Refer to this for more on how to set it up.

Related

Connect Redshift Spectrum/ AWS EMR with Hudi directly or via AWS Glue Data Catalog

I'm trying to understand how to properly connect Redshift Spectrum with Hudi data.
It looks like I can create a Redshift external table directly over data managed by Apache Hudi, as described in the documentation (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html). The other way is to integrate Hudi with the AWS Glue Data Catalog, as mentioned here (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html), and then access the Hudi tables from Redshift Spectrum via the AWS Glue Data Catalog.
I have the same need for Apache Spark on AWS EMR. It looks like I can use Hudi directly from EMR or via the AWS Glue Data Catalog.
Right now, I don't understand which way to choose. Could you please advise what the benefit is of using Hudi via the AWS Glue Data Catalog, or should I use it directly from Redshift Spectrum and AWS EMR?
Spark on EMR needs a catalog (a Hive metastore, if you will), so using the AWS Glue Data Catalog is an option.
If you elect to use Glue as the metastore, use it as the source for all data, unless errors are evident, in which case use the Hudi API for Spark.
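For illustration, here is a minimal sketch of the EMR configuration that points Hive and Spark SQL at the Glue Data Catalog. The factory class name and the two classifications come from the EMR documentation on using Glue as a metastore; the helper name glue_metastore_configurations is hypothetical, and Hudi-specific settings are not covered here.

```python
import json

# Client factory class that AWS documents for using the Glue Data Catalog
# as the metastore for Hive and Spark SQL on EMR.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configurations():
    """Build EMR configuration objects for Hive and Spark SQL (hypothetical helper)."""
    prop = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        # Applies to Hive on the cluster.
        {"Classification": "hive-site", "Properties": dict(prop)},
        # Applies to Spark SQL on the cluster.
        {"Classification": "spark-hive-site", "Properties": dict(prop)},
    ]


if __name__ == "__main__":
    # Print JSON that could be supplied as the cluster's software configuration.
    print(json.dumps(glue_metastore_configurations(), indent=2))
```

The printed JSON is the kind of document you would pass as the cluster's configurations when creating it; whether you also need extra settings for Hudi depends on your EMR release.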

Can we access AWS Glue Tables using jdbc?

I need to access some tables in AWS Glue, which I am using as a metastore. I wanted to know whether Glue provides a JDBC endpoint to connect to, just like Hive does.
I understand that it is possible to read data into AWS Glue from other databases such as MySQL or Oracle using JDBC, but my requirement is the opposite: I have to read from AWS Glue using JDBC. Please help if this is possible, as I could not find a reference for it.
To access data through the Glue catalog, follow these steps:
Run the crawler to create or update the table in the Glue catalog.
To access these tables through a JDBC or ODBC endpoint, you need Athena.
Download the driver from this link.
Read the docs here on building the connection URL for your region.
Also go through this documentation for additional properties.
Hope it helps
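As a rough sketch of the "create the URL according to your region" step: the snippet below assembles a connection string in the AwsRegion/S3OutputLocation form used by the Simba-based Athena JDBC driver. That URL shape is an assumption about the driver version you download, the helper name athena_jdbc_url is hypothetical, and credentials are deliberately left out, so check the driver documentation before relying on it.

```python
def athena_jdbc_url(region, s3_output_location):
    """Build an Athena JDBC URL (hypothetical helper).

    Assumes the Simba-based driver's property-style URL; older drivers
    use a different format.
    """
    if not s3_output_location.startswith("s3://"):
        raise ValueError("s3_output_location must be an s3:// URI")
    # Athena writes query results to this S3 location.
    return (
        f"jdbc:awsathena://AwsRegion={region};"
        f"S3OutputLocation={s3_output_location};"
    )


if __name__ == "__main__":
    # Region and bucket are placeholders.
    print(athena_jdbc_url("us-east-1", "s3://my-bucket/athena-results/"))
```

Any JDBC client configured with the downloaded driver can then use this URL to query the tables registered in the Glue catalog.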

AWS Athena catalog other than Glue

It's known that AWS Athena comes integrated with AWS Glue for data catalog. Is there any way to configure Athena to use a different catalog e.g. to point to a different Hive metastore (e.g. on EC2 instances) managed by user?
Athena integration with a Hive Metastore is a new feature, now available in preview mode. You can find the details of how to use this feature in the documentation.

Using AWSGlue as Hive Metastore where data is in S3

Apologies if this has been answered elsewhere (but I don't think it has). I'm trying to use AWS Glue as an external metastore for Hive via an EMR cluster.
I have some data stored as text files on S3, and via the AWSGlue web console I created a table definition over those text files.
I also started up an EMR cluster following directions here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
When I ssh into my EMR cluster and try to access Hive, I was expecting to find that the table I created in AWSGlue would exist when I ran a "show tables" command, but instead I get the following error message when starting the interactive Hive shell:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: Please migrate your Catalog to enable access to this database (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: e6b2a87b-fe5a-11e8-8ed4-5d1e42734679))
It seems like there's some permission error involved here. I'm using EMR_EC2_DefaultRole for my EC2 Instance Profile, so I didn't think this would happen.
Am I missing something obvious?
Thanks for any help you can provide!
Attach the AWS Glue and S3 full-access policies to your current IAM role. That should do it.
To solve this issue, you will have to migrate your existing Athena catalog to the Glue Data Catalog, as explained here.
To confirm that your Athena catalog has been migrated, run the following command with the AWS CLI:
aws glue get-catalog-import-status --catalog-id <aws-account-id> --region <region>
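The same check can be sketched in Python with boto3, whose Glue client exposes get_catalog_import_status. The helper name catalog_import_completed is hypothetical, and the client is passed in so the function itself does not need AWS credentials to be imported.

```python
def catalog_import_completed(glue_client, catalog_id):
    """Return True if the Athena catalog has been migrated to Glue.

    glue_client is expected to behave like boto3.client("glue");
    catalog_id is the AWS account ID.
    """
    response = glue_client.get_catalog_import_status(CatalogId=catalog_id)
    # The response carries an ImportStatus structure with an
    # ImportCompleted boolean; treat a missing field as "not migrated".
    status = response.get("ImportStatus", {})
    return bool(status.get("ImportCompleted", False))


# Usage sketch (requires boto3 and valid credentials):
#   import boto3
#   glue = boto3.client("glue", region_name="<region>")
#   print(catalog_import_completed(glue, "<aws-account-id>"))
```

If this returns False, the "Please migrate your Catalog" error above is the expected symptom.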
I faced exactly the same issue recently and got past it by upgrading the Athena catalog to the Glue Data Catalog.
I am also not using Athena or Redshift Spectrum to query the table, but the Glue and Athena consoles showed this message:
To use the AWS Glue Data Catalog as a common metadata repository for Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, you need to upgrade your Athena Data Catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Amazon Athena or Redshift Spectrum.
Soon after upgrading, I was able to query the table through the Hive and Spark shells on the EMR cluster.
This might be because your default database and others were created earlier via Athena and must be upgraded to Glue before they can be used from EMR, since default is the database Hive uses when you start the Hive shell on EMR. To query your table, use DBName.tableName or explicitly switch to the database that contains it.

AWS Glue Data Catalog, temporary tables and Apache Spark createOrReplaceTempView

According to AWS Glue Data Catalog documentation https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Temporary tables are not supported.
It is not clear to me whether "temporary tables" also covers the temporary views that can be created in Apache Spark via the DataFrame.createOrReplaceTempView method.
So, in other words, I can't use the DataFrame.createOrReplaceTempView method with AWS Glue and the AWS Glue Data Catalog, am I right? Can I only work with permanent tables/views there right now, and must I use an AWS EMR cluster for full-featured Apache Spark functionality?
You can use DataFrame.createOrReplaceTempView() in AWS Glue. You first have to convert the DynamicFrame to a DataFrame using toDF().
But these views remain scoped to the current Glue job run and won't be accessible from other Glue jobs, other runs of the same job, or Athena.
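A minimal sketch of that pattern inside a Glue job follows. It assumes glue_context is the GlueContext the job script already created; the helper name and the database, table, and view names in the usage note are hypothetical placeholders.

```python
def query_catalog_table_via_temp_view(glue_context, database, table_name,
                                      view_name, sql):
    """Read a catalog table, expose it as a temp view, and run Spark SQL.

    glue_context is expected to be an awsglue.context.GlueContext
    (hypothetical helper; names passed in are up to the caller).
    """
    # Load the table registered in the Glue Data Catalog as a DynamicFrame.
    dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    # Temp views are a DataFrame feature, so convert first.
    data_frame = dynamic_frame.toDF()
    # The view lives only in this job run's Spark session.
    data_frame.createOrReplaceTempView(view_name)
    return glue_context.spark_session.sql(sql)


# Usage sketch inside a Glue job script:
#   result = query_catalog_table_via_temp_view(
#       glue_context, "my_db", "my_table", "my_view",
#       "SELECT COUNT(*) AS n FROM my_view",
#   )
#   result.show()
```

Nothing is written back to the Data Catalog here, which is why the view disappears when the job run ends.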