Deploy Hive tables to Dataproc cluster using Google Deployment Manager

I would like to migrate my Hive tables to an existing Dataproc cluster. Is there any way to deploy the tables using Google Deployment Manager? I have checked the list of supported resource types in GDM but could not locate a Hive resource type. However, there is a dataproc.v1.cluster resource type available, and I was able to successfully deploy a Dataproc cluster. Now, is there any way to deploy my Hive DDLs within the cluster?

For customizing your cluster, we recommend initialization actions.
If your Hive tables are all set up as EXTERNAL with data in GCS or BigQuery, you could benefit from setting up the metastore on Cloud SQL.
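As an illustration, an initialization action along these lines (a sketch; the script name and bucket paths are placeholders) could pull your DDL file from Cloud Storage and run it through the Hive CLI on the master node:

#!/bin/bash
# hive-ddl-init.sh (hypothetical name) -- apply Hive DDLs when the cluster starts.
# Only run on the master node, where the Hive CLI can reach the metastore.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  gsutil cp gs://my-bucket/ddl/create_tables.hql /tmp/create_tables.hql
  hive -f /tmp/create_tables.hql
fi

You would then reference the script at cluster creation, e.g. gcloud dataproc clusters create my-cluster --initialization-actions gs://my-bucket/hive-ddl-init.sh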

Related

Spark on Kubernetes (non-EMR) with AWS Glue Data Catalog

I am running Spark jobs on EKS, submitted from Jupyter notebooks.
We have all our tables in an S3 bucket, and their metadata sits in the Glue Data Catalog.
I want to use the Glue Data Catalog as the Hive metastore for these Spark jobs.
I see that this is possible when Spark runs on EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
but is it possible from Spark running on EKS?
I have seen this code released by AWS:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
but I can't tell whether patching the Hive jars is necessary for what I'm trying to do.
I also need the hive-site.xml file for connecting Spark to the metastore; how can I get this file from the Glue Data Catalog?
I found a solution for that.
I created a new Spark image following these instructions: https://github.com/viaduct-ai/docker-spark-k8s-aws
and finally, in my job YAML file, I added some configurations:
sparkConf:
  ...
  # Route both s3:// and s3a:// URIs through the S3A filesystem implementation
  spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
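To point those jobs at the Glue Data Catalog itself, the settings below are a sketch of what I would try, assuming the Glue client jars from the awslabs repository above are baked into the image's classpath (the factory class is the same one the EMR docs use; with spark.hadoop.* properties you should not need a separate hive-site.xml):

sparkConf:
  ...
  # Use the Hive catalog, and swap its metastore client for the Glue implementation
  spark.sql.catalogImplementation: "hive"
  spark.hadoop.hive.metastore.client.factory.class: "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"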

How can we interact with Dataproc Metastore to fetch list of databases and tables?

I am using Dataproc Metastore as the metastore service with GCP. How can I interact with it to fetch the list of databases and tables? Is it possible to do this without running a Dataproc cluster?
Edit: I have to fetch the metadata without running a Dataproc cluster.
Since I am using the Dataproc Metastore service to store metadata, I need to fetch the metadata directly from it.
The Dataproc Metastore API is used to manage the Dataproc Metastore service instance itself (get/create/update etc.). As mentioned in one of the comments, you can use the thrift URI (you will find the URI under the Configuration tab of the metastore service if you are using the console).
Once you have a thrift client that connects to the thrift URI, you can fetch databases and tables. Although you can use the thrift API to create databases and tables as well, the typical use case is to configure a big data processing engine/framework like Spark or Hive to use the metastore, rather than interacting with it directly.
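For example, with the third-party hmsclient Python package, a minimal sketch (the host below is a placeholder for your service's thrift endpoint, which must be network-reachable from wherever this runs, e.g. a VM on the same VPC network):

# pip install hmsclient
from hmsclient import hmsclient

# Host and port come from the thrift URI on the metastore service's configuration tab
client = hmsclient.HMSClient(host='your-metastore-host', port=9083)
with client as c:
    for db in c.get_all_databases():
        print(db)
        for table in c.get_all_tables(db):
            print('  ' + table)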

Can GCP Dataproc sqoop data (or run other jobs on) from local DB?

Can GCP Dataproc sqoop import data from a local DB to put into GCP Storage (without GCP VPC)?
We have a remote Oracle DB connected to our local network via a VPN tunnel, from which a Hadoop cluster extracts data each day using Apache Sqoop. We would like to replace this process with a GCP Dataproc cluster to run the Sqoop jobs and GCP Storage.
I found this article that appears to be doing something similar, Moving Data with Apache Sqoop in Google Cloud Dataproc, but it assumes that users have a GCP VPC (which I did not intend to purchase).
So my questions are:
Without this VPC connection, would the Dataproc cluster know how to get the data from the DB on our local network using the job submission API?
If so, how would this work (perhaps I do not understand enough about how Hadoop jobs work / get data)?
If not, is there some other way?
Without using VPC/VPN you will not be able to grant Dataproc access to your local DB.
Instead of using VPC, you can use VPN if it meets your needs better: https://cloud.google.com/vpn/docs/
The only other option is to open up your local DB to the Internet so that Dataproc can access it without VPC/VPN, but that is inherently insecure.
Installing the GCS connector on-prem might work in this case. It will not require VPC/VPN.
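If you do establish connectivity (for example via Cloud VPN), the existing Sqoop flow maps onto the Dataproc job submission API. A rough sketch, where the cluster name, jar locations, and connection details are all placeholders:

gcloud dataproc jobs submit hadoop \
  --cluster=my-cluster \
  --class=org.apache.sqoop.Sqoop \
  --jars=gs://my-bucket/jars/sqoop-1.4.7-hadoop260.jar,gs://my-bucket/jars/ojdbc8.jar \
  -- import \
  --connect jdbc:oracle:thin:@//oracle-host.internal:1521/ORCL \
  --username sqoop_user \
  --password-file gs://my-bucket/secrets/oracle-password.txt \
  --table MY_TABLE \
  --target-dir gs://my-bucket/sqoop-output/MY_TABLE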

AWS EMR Presto not finding correct Hive schemas using AWS Glue

So I am having an issue executing Presto queries via AWS EMR.
I have launched an EMR cluster running Hive/Presto and using AWS Glue as the metastore.
When I SSH into the master node and start the Hive CLI, running "show schemas;" shows me the three databases that we have in AWS Glue.
If I then enter the Presto CLI and run "show schemas from hive;" I only see two: "default" and "information_schema".
For the life of me I cannot figure out why Presto is not able to see the same Hive schemas.
It is a basic EMR cluster launched mostly with default settings.
Can someone point me in the direction of what I should be looking for? I have checked the hive.properties file and it looks good; I am just at a loss as to why Presto is not able to see the same info as Hive.
I do have the following configuration set:
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    },
    "Configurations": []
  }
]
The AWS docs (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) make it seem that this should be plug and play, but I am obviously missing something.
Starting from Amazon EMR release version 5.10.0 you can. Simply set the hive.metastore.glue.datacatalog.enabled property to true, as follows:
[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore.glue.datacatalog.enabled": "true"
    }
  }
]
Optionally, you can manually set hive.metastore.glue.datacatalog.enabled=true in the /etc/presto/conf/catalog/hive.properties file on the master node. If you use this method, make sure that hive.table-statistics-enabled=false is set in the properties file, because the Data Catalog does not support Hive table and partition statistics. If you change the value on a long-running cluster to switch metastores, you must restart the Presto server on the master node (sudo restart presto-server).
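For reference, the relevant lines of /etc/presto/conf/catalog/hive.properties would then look like this (a sketch; the rest of the generated file stays as EMR wrote it):

# Use the Glue Data Catalog instead of a Hive metastore (EMR-specific property)
hive.metastore.glue.datacatalog.enabled=true
# Required with Glue, which does not support Hive table/partition statistics
hive.table-statistics-enabled=false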
Source: AWS Docs
Looks like this has been solved in emr-5.10. You want to add the following config:
{"Classification":"presto-connector-hive","Properties":{"hive.metastore.glue.datacatalog.enabled": "true"}}
Source: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html
The recent 0.198 release of Presto now supports AWS Glue as a metadata source.
Add support for using AWS Glue as the metastore. Enable it by setting the hive.metastore config property to glue.
https://prestodb.io/docs/current/release/release-0.198.html
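In that case, the Hive connector's catalog properties file would look something like this (a sketch; connector.name can vary with how Presto is packaged):

# etc/catalog/hive.properties for open-source Presto >= 0.198
connector.name=hive-hadoop2
hive.metastore=glue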

Setting hive properties in Amazon EMR?

I'm trying to run a Hive query using Amazon EMR, and am also trying to get Apache Tez to work with it, which, from what I understand, requires setting the hive.execution.engine property to tez in the Hive configuration.
I understand that Hive properties can usually be set with set hive.{...}, or in hive-site.xml, but I don't know how either of those works in Amazon EMR.
So: is there a way to set Hive configuration properties in Amazon EMR, and if so, how?
Thanks!
You can do this in two ways:
1) DIRECTLY WITHIN A SINGLE HIVE SCRIPT (.hql file)
Just put your properties at the beginning of your Hive HQL script, like:
set hive.execution.engine=tez;
CREATE TABLE...
2) VIA APPLICATION CONFIGURATIONS
When you create an EMR cluster, you can specify Hive configurations that apply for the entire cluster's lifetime. This can be done either via the AWS Management Console or via the AWS CLI.
a) AWS Management Console
Open the AWS EMR service and click the Create cluster button.
Click Go to advanced options at the top.
Be sure to select Hive among the applications, then enter a JSON configuration like the one below, where you can set any property you would usually have in the hive-site.xml configuration (the Tez property is shown as an example). You can optionally load the JSON from an S3 path.
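A sketch of such a configuration, reconstructing the example described above, which sets the execution engine to Tez via the hive-site classification:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.execution.engine": "tez"
    }
  }
]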
b) AWS CLI
As stated in detail here, you can specify the Hive configuration at cluster creation using the --configurations flag, like below:
aws emr create-cluster --configurations file://configurations.json \
  --release-label emr-5.9.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
  --auto-terminate
The JSON file has the same content shown above in the Management Console example.
Again, you can optionally specify an S3 path instead:
--configurations https://s3.amazonaws.com/myBucket/configurations.json
Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.
You can execute Hive commands either interactively (by logging into the master node) or via scripts (submitted as job 'steps').
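For instance, a script stored on S3 can be submitted as a step with the AWS CLI; a sketch, with the cluster ID and script path as placeholders:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=HIVE,Name="Nightly Hive script",ActionOnFailure=CONTINUE,Args=[-f,s3://my-bucket/scripts/query.hql]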
You would be responsible for installing Tez on Amazon EMR yourself. I found this forum post: TEZ on EMR