Bulk Hive table creation in Google Dataproc - google-cloud-platform

I am very new to Google Cloud Platform, and I am doing a POC for moving a Hive application (tables and jobs) to Google Dataproc. The data has already been moved to Google Cloud Storage.
Is there a built-in way to create all the Hive tables in Dataproc in bulk, instead of creating them one by one at the Hive prompt?

Dataproc supports the Hive job type, so you can use the gcloud command:
gcloud dataproc jobs submit hive --cluster=CLUSTER \
-e 'create table t1 (id int, name string); create table t2 ...;'
or
gcloud dataproc jobs submit hive --cluster=CLUSTER -f create_tables.hql
You can also SSH into the master node, then use beeline to execute the script:
beeline -u jdbc:hive2://localhost:10000 -f create_tables.hql
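The .hql file itself is just plain HiveQL DDL, which you can keep in a file or generate. As a minimal sketch (the bucket path, table names, and columns below are hypothetical), you could write it with a heredoc and submit it with either of the commands above; since your data already sits in Cloud Storage, external tables pointing at the bucket are a natural fit:
# Sketch only: hypothetical tables and GCS paths; submit the resulting
# file with gcloud or beeline as shown above.
cat > create_tables.hql <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS t1 (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'gs://my-bucket/hive/t1/';
CREATE EXTERNAL TABLE IF NOT EXISTS t2 (id INT, amount DOUBLE)
  STORED AS PARQUET
  LOCATION 'gs://my-bucket/hive/t2/';
EOF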

Related

GCP CLI Command to remove Data Catalog Tag

What is the GCP CLI command to remove (detach) a Data Catalog tag from a BigQuery dataset, and also the CLI command to update a tag?
I am able to do it manually; how can I do it using a gcloud command in Cloud Shell?
You can use the gcloud data-catalog tags update and gcloud data-catalog tags delete commands.
The tricky part here is obtaining values for the --entry-group and --entry parameters: BigQuery entries are automatically ingested into Data Catalog and get automatically assigned entry group and entry IDs. To get these values, use the gcloud data-catalog entries lookup command.
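As a rough end-to-end sketch (the project, dataset, location, and the ENTRY_GROUP_ID/ENTRY_ID/TAG_ID values are placeholders you would read from the command output):
# Look up the auto-ingested Data Catalog entry for the BigQuery dataset
gcloud data-catalog entries lookup \
  '//bigquery.googleapis.com/projects/MY_PROJECT/datasets/MY_DATASET'
# The returned entry name contains the entry group ID and entry ID.

# List the tags attached to that entry, then delete (detach) the one you want
gcloud data-catalog tags list \
  --location=LOCATION --entry-group=ENTRY_GROUP_ID --entry=ENTRY_ID
gcloud data-catalog tags delete TAG_ID \
  --location=LOCATION --entry-group=ENTRY_GROUP_ID --entry=ENTRY_ID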

Getting DataProc output in GCP Logging

I have a DataProc job that outputs some logs during the execution. I can see those logs in the Job output.
My cluster is created according to the documentation with the following properties:
dataproc:jobs.file-backed-output.enable=true
dataproc:dataproc.logging.stackdriver.enable=true
dataproc:dataproc.logging.stackdriver.job.driver.enable=true
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true
I can see all the system logs in Logging, but not the output from my job. The most I found is a URL to the rolling output file (not even a concrete file).
Is there any chance I can forward job output to Logging?
As per the documentation, the cluster can be created with spark:spark.submit.deployMode=cluster so that the output is logged to the YARN user logs group. But whenever I do that, my job fails with:
21/03/15 16:20:16 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
I was able to create a cluster and submit jobs as follows.
I went to Stackdriver and refreshed the page.
After refreshing, I could see the Cloud Dataproc Job logging filter.
I also noticed that for both jobs I ran, the job output was logged under 'Any Log Level'; I'm not sure if you are using any log-level filtering.
Are you able to see Cloud Dataproc Job in the logging filter after passing dataproc:dataproc.logging.stackdriver.job.driver.enable=true and refreshing the page?
Are you using one of the supported image versions?
Repro steps:
Cluster Creation:
gcloud dataproc clusters create log-exp --region=us-central1 \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true'
Job Submission: PySpark
gcloud dataproc jobs submit pyspark \
gs://dataproc-examples/pyspark/hello-world/hello-world.py \
--cluster=log-exp \
--region=us-central1
Job Submission: Spark
gcloud dataproc jobs submit spark \
--cluster=log-exp \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
-- 100
[Screenshots: Cloud Dataproc Job filter; Logging level]
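If you prefer the command line, the same driver output can be read back with gcloud once it lands in Logging. A rough sketch, assuming the cloud_dataproc_job resource type and dataproc.job.driver log name described in the Dataproc logging docs (project and job ID are placeholders):
# Pull driver output for one job from Cloud Logging (placeholders throughout)
gcloud logging read \
  'resource.type="cloud_dataproc_job" AND log_name:"dataproc.job.driver" AND resource.labels.job_id="JOB_ID"' \
  --project=MY_PROJECT --limit=50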

Is there a gcloud command to export a spanner database?

I would like to automate exports of our Spanner database to Google Cloud Storage. Is this possible using the gcloud SDK? I could not find a command for this.
Is there any other recommended way to back up Spanner databases?
The export and import pipelines are Dataflow templates that can be started using a gcloud command.
See the third paragraph in:
https://cloud.google.com/spanner/docs/export
And how to run the template in:
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloud_spanner_to_gcs_avro
(Select the gcloud tab in the section on executing the template.)
Yes, it is possible to do this using gcloud, but it is not a direct Cloud Spanner command. The detailed documentation is here.
Essentially, you use gcloud to run a Cloud Dataflow job that exports or backs up your data to GCS, using a command like the following:
gcloud dataflow jobs run [JOB_NAME] \
--gcs-location='gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro' \
--region=[DATAFLOW_REGION] \
--parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],outputDir=[YOUR_GCS_DIRECTORY]'
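Once the template is running, the export is just a regular Dataflow job, so you can watch it from gcloud as well; for example (region and job ID are placeholders in the same style as above):
# List active Dataflow jobs and inspect the export job started above
gcloud dataflow jobs list --region=[DATAFLOW_REGION] --status=active
gcloud dataflow jobs show [JOB_ID] --region=[DATAFLOW_REGION]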

Does GCP dataproc include webhcat?

I would like to know if GCP's DataProc supports WebHCat. Googling hasn't turned up anything.
So, does GCP DataProc support/provide WebHCat and if so what is the URL endpoint?
Dataproc does not provide WebHCat out of the box; however, it's trivial to create an initialization action such as:
#!/bin/bash
# Initialization action: install the WebHCat server package non-interactively
apt-get install -y hive-webhcat-server
WebHCat will be available on port 50111:
http://my-cluster-m:50111/templeton/v1/ddl/database/default/table/my-table
Alternatively, it is possible to set up a JDBC connection to HiveServer2 (available by default):
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC
As of now, you can use the Dataproc Hive WebHCat optional component to activate Hive WebHCat during cluster creation:
gcloud dataproc clusters create $CLUSTER_NAME --optional-components=HIVE_WEBHCAT
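Whichever route you take, a quick sanity check is to hit WebHCat's status endpoint from the master node ("templeton" is WebHCat's service name):
# From an SSH session on the master node; expects roughly {"status":"ok","version":"v1"}
curl http://localhost:50111/templeton/v1/status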

How to run Queries on Redshift Database using AWS CLI

I want to run queries by passing them as strings to some supported AWS CLI command.
I can see that the Redshift-specific AWS CLI commands don't include anything that can execute queries remotely.
Link: https://docs.aws.amazon.com/cli/latest/reference/redshift/index.html
Need help on this.
You need to use psql; there is no API interface to Redshift.
Redshift is loosely based on PostgreSQL, however, so you can connect to the cluster using the psql command-line tool.
https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-from-psql.html
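For example, a one-off query from the CLI could look like the following sketch (the endpoint, port, database, and user are placeholders; psql will prompt for the password or read PGPASSWORD):
# All connection values below are placeholders for your own cluster
psql -h mycluster.abc123xyz.us-west-2.redshift.amazonaws.com \
     -p 5439 -U awsuser -d dev \
     -c "select * from stl_query limit 1;"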
You can use the Redshift Data API to execute queries on Redshift via the AWS CLI.
aws redshift-data execute-statement \
    --region us-west-2 \
    --secret-arn arn:aws:secretsmanager:us-west-2:123456789012:secret:myuser-secret-hKgPWn \
    --cluster-identifier mycluster-test \
    --sql "select * from stl_query limit 1" \
    --database dev
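Note that the Data API is asynchronous: execute-statement returns a statement Id, which you then poll and fetch results for, roughly like this (the statement id below is a placeholder taken from the previous command's output):
# Poll the statement status, then fetch the result rows
aws redshift-data describe-statement --id <statement-id> --region us-west-2
aws redshift-data get-statement-result --id <statement-id> --region us-west-2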
https://aws.amazon.com/about-aws/whats-new/2020/09/announcing-data-api-for-amazon-redshift/
https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html
https://docs.aws.amazon.com/cli/latest/reference/redshift-data/index.html#cli-aws-redshift-data