Does GCP dataproc include webhcat? - google-cloud-platform

I would like to know if GCP's DataProc supports WebHCat. Googling hasn't turned up anything.
So, does GCP DataProc support/provide WebHCat and if so what is the URL endpoint?

Dataproc does not provide WebHCat out of the box, however, its trivial to create an initialization action such as:
#!/bin/bash
apt-get install hive-webhcat-server
WebHCat will be available on port 50111:
http://my-cluster-m:50111/templeton/v1/ddl/database/default/table/my-table
Alternatively, it is possible to setup a JDBC connection to HiveServer2 (available by default):
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC

As of now you can use Dataproc Hive WebHCat component to activate Hive WebHCat during cluster creation:
gcloud dataproc clusters create $CLUSTER_NAME --optional-components=HIVE_WEBHCAT

Related

Can you access the Airflow CLI within Google Cloud Composer?

I'm aware that many of the common Airflow management commands are made available through the gcloud CLI. However, I'm troubleshooting some DAG scheduling and would like to use the schedule and next_execution commands directly on the cluster.
Is there an easy way to do this?
It's possible to access the full Airflow CLI by using kubectl exec to SSH into Composer pods. To do so, obtain the name of the GKE cluster associated with your environment, and get cluster credentials for it:
gcloud container clusters get-credentials $CLUSTER_NAME --zone=$ZONE
Then, use kubectl to check for the Composer namespace, and then find a pod and SSH to it:
kubectl get namespaces | grep composer
kubectl get pods --namespace=$NAMESPACE | grep airflow
kubectl exec -it --namespace=$NAMESPACE $POD_NAME -- bash
From within a pod, you can use airflow with any command supported by that version of Airflow. However, it should also be noted that this also provides full access to commands that can make your environment permanently unusable (such as resetdb), so they should be used with care.

Generating a kubeconfig for access to an Amazon EKS cluster

Given a scenario where I have two Kubernetes clusters, one hosted on AWS EKS and the other on another cloud provider, I would like to manage the EKS cluster from the other cloud provider. What's the easiest way to authenticate such that I can do this?
Would it be reasonable to generate a kubeconfig, where I embed the result from aws get-token (or something like that) to the cluster on the other cloud provider? Or are these tokens not persistent?
Any help or guidance would be appreciated!
I believe the most correct is the way described in Create a kubeconfig for Amazon EKS
yes, you create kubeconfig with aws eks get-token and later add newly created config to KUBECONFIG environment variable , eg
export KUBECONFIG=$KUBECONFIG:~/.kube/config-aws
or you can add it to .bash_profile for your convenience
echo 'export KUBECONFIG=$KUBECONFIG:~/.kube/config-aws' >> ~/.bash_profile
For detailed steps please refer to provided url.
I had this use case where I needed to work with multi-cloud providers.
So I created kubech to deal with that situation and manage multiple clusters simultaneously.
Assuming that you have a linux platform on the second cloud provider, you can use the following command for generating kube config file:
aws eks update-kubeconfig --region <region-code> --name <cluster-name>
You can change the file using --kubeconfig flag.
Ref: https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html

Import manually created K8s cluster into KOps

It's been sometime I've visited all the web pages carrying word "KOps import" but did not find a way to import my manually created K8s cluster. Manually created cluster means "Deployed Infra on AWS using Terraform and Kubernetes using Terraform's provisioner script as Shell script". Now as I see managing the environment manually is a pain, I look forward to move it under KOps. For that I have done the following so far:
Installed aws cli, kubectl and kops in my local machine.
Created KOps user with policies AmazonEC2FullAccess,
AmazonRoute53FullAccess, AmazonS3FullAccess, IAMFullAccess,
AmazonVPCFullAccess and generated access and secret keys.
Configured credentials using aws configure.
Created S3 bucket to store state.
Set env variables like Region and Cluster name.
Finally, ran kops import command as below:
kops import cluster --region ${REGION} --name ${OLD_NAME}
But encountered below error:
Cluster.kops "jjm-prod-use1-kubernetes" not found
Verbosed:
$ kops import cluster --region ${REGION} --name ${OLD_NAME} -v 10
I0131 16:32:12.059651 25683 factory.go:68] state store s3://kops-state-store-jjm
I0131 16:32:13.133145 25683 s3context.go:194] found bucket in region "us-east-1"
I0131 16:32:13.133174 25683 s3fs.go:220] Reading file "s3://kops-state-store-jjm/jjm-prod-use1-kubernetes/config"
Which made me serious about posting this question. Is there any possible way where a K8s cluster created except using kubeup.sh can be brought under the control of KOps ? Please advise.
Note: There's no way I can re-create (destroy and create) the clusters as they are running in production.
EDIT: I know this can be achieved only the cluster was setup using kubeup.sh. But is there any other way ?
That is only possible with cluster bootstrapped via kube-up.sh script as officialy announced in Kops documentation pages. Actually, kube-up.sh has been excluded from the list of supported Kubernetes installation tools for AWS. Although, cluster composed by kube-up.sh provides a lot of customization settings which are specifically applicable to AWS, the initial script uses environmental variables to define these settings. Therefore, I assume that it's quite hard to achieve in your case.

aws emr zeppelin doesn't have jdbc interpreter

I created a aws emr cluster with hadoop, spark and zeppelin.
Following the document https://zeppelin.apache.org/docs/0.8.0/interpreter/jdbc.html , which says
Fill Interpreter name field with whatever you want to use as the
alias(e.g. mysql, mysql2, hive, redshift, and etc..). Please note that
this alias will be used as %interpreter_name to call the interpreter
in the paragraph. Then select jdbc as an Interpreter group.
But jdbc option didn't show :
The emr zeppelin version is /usr/lib/zeppelin/zeppelin-web-0.8.0.war I have checked.
What should I do ?
This answer was resolved in the AWS thread.
TLDR; just run the following line and restart Zeppelin:
sudo /usr/lib/zeppelin/bin/install-interpreter.sh -n jdbc
Then Restart Zeppelin with:
sudo stop zeppelin
sudo start zeppelin

spark-submit from outside AWS EMR cluster

I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't yet been able to be sure that what I was reading referred to launching a job from the master of the cluster or some arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts &/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop#${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop#${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
You can setup the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/