I created an AWS EMR cluster with Hadoop, Spark, and Zeppelin.
I am following the document https://zeppelin.apache.org/docs/0.8.0/interpreter/jdbc.html , which says:
Fill Interpreter name field with whatever you want to use as the
alias(e.g. mysql, mysql2, hive, redshift, and etc..). Please note that
this alias will be used as %interpreter_name to call the interpreter
in the paragraph. Then select jdbc as an Interpreter group.
But the jdbc option didn't show up:
I have checked the EMR Zeppelin version: it is 0.8.0 (/usr/lib/zeppelin/zeppelin-web-0.8.0.war).
What should I do?
This issue was resolved in the AWS thread.
TL;DR: just run the following line and restart Zeppelin:
sudo /usr/lib/zeppelin/bin/install-interpreter.sh -n jdbc
Then Restart Zeppelin with:
sudo stop zeppelin
sudo start zeppelin
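If you want to double-check that the installer worked, the interpreter files should appear under Zeppelin's interpreter directory (the path below assumes EMR's default Zeppelin layout):
# The jdbc directory should exist after install-interpreter.sh has run
ls /usr/lib/zeppelin/interpreter/jdbc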
I'm trying to automatically run the Airflow webserver and scheduler in a VM on boot using startup scripts. I followed the documentation here: https://cloud.google.com/compute/docs/instances/startup-scripts/linux . Here is my script:
export AIRFLOW_HOME=/home/name/airflow
cd /home/name/airflow
nohup airflow scheduler >> scheduler.log &
nohup airflow webserver -p 8080 >> webserver.log &
The .log files are created, which means the script is being executed, but the webserver and the scheduler don't actually start.
Any apparent reason?
I have tried replicating the Airflow webserver startup script on a GCP VM using the documentation.
Steps followed to run the Airflow webserver startup script on a GCP VM:
Create a Service Account. Give it minimum access to BigQuery with the BigQuery Job User role and to Dataflow with the Dataflow Worker role. Click Add Key / Create new key / Done. This will download a JSON key file.
Create a Compute Engine instance. Select the Service Account created.
Install Airflow libraries. Create a virtual environment using miniconda.
Initialize your metadata database and register at least one admin user using the commands:
airflow db init
airflow users create -r Admin -u username -p mypassword -e example@mail.com -f yourname -l lastname
Whitelist your IP for port 8080: create a firewall rule and attach it to the GCP VM instance. Now go to the terminal and start the webserver with:
airflow webserver -p 8080
Open another terminal and start the scheduler:
export AIRFLOW_HOME=/home/acachuan/airflow-medium
cd airflow-medium
conda activate airflow-medium
airflow db init
airflow scheduler
We want Airflow to start immediately after the Compute Engine instance starts, so we can create a Cloud Storage bucket, create a startup script, upload the file there, and keep it as a backup.
Now pass the Linux startup script from Cloud Storage to the VM. Refer to Passing a startup script that is stored in Cloud Storage to an existing VM.
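A minimal sketch of that flow, with a hypothetical bucket name, instance name, and zone:
# Upload the startup script to a bucket of your own (names here are placeholders)
gsutil cp airflow-startup.sh gs://my-airflow-bucket/airflow-startup.sh
# Point the existing VM at the script stored in Cloud Storage
gcloud compute instances add-metadata my-airflow-vm \
    --zone us-central1-a \
    --metadata startup-script-url=gs://my-airflow-bucket/airflow-startup.sh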
Note: the error PermissionDenied desc = The caller does not have permission means you don't have sufficient permissions; you need to request access from your project, folder, or organization admin, depending on the assets you are trying to access. Also, to access files created by the root user you need the appropriate read, write, or execute permissions. Refer to File permissions.
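As a rough illustration of the file-permissions point, using the paths from the question (the chown target is an assumption):
# Files created by the root-run startup script may not be readable by your login user
ls -l /home/name/airflow
# One possible fix: hand ownership of the Airflow home back to your user
sudo chown -R name:name /home/name/airflow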
I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't been able to tell whether what I was reading referred to launching a job from the cluster's master node or from some arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts &/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
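For instance, Spark defaults can be baked in at cluster creation through a configuration classification, so spark-submit does not have to repeat them. A rough sketch with placeholder values:
# Sketch only: the executor memory value and cluster settings are placeholders
aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-5.12.0 \
  --applications Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=myKey \
  --instance-type m3.xlarge --instance-count 3 \
  --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executor.memory":"4g"}}]'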
You can setup the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/
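Once the step has been added, its progress can be checked from the same machine with the EMR CLI (the ids below are placeholders):
# List recent steps and their states for the cluster
aws emr list-steps --cluster-id j-xxxxx
# Or inspect a single step once you know its id
aws emr describe-step --cluster-id j-xxxxx --step-id s-xxxxx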
I would like to know if GCP's DataProc supports WebHCat. Googling hasn't turned up anything.
So, does GCP DataProc support/provide WebHCat and if so what is the URL endpoint?
Dataproc does not provide WebHCat out of the box; however, it's trivial to create an initialization action such as:
#!/bin/bash
# Install the Hive WebHCat (Templeton) server
apt-get install -y hive-webhcat-server
WebHCat will be available on port 50111:
http://my-cluster-m:50111/templeton/v1/ddl/database/default/table/my-table
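A quick sanity check once the init action has run (the host name follows the example above) could be:
# WebHCat (Templeton) exposes a status endpoint on the same port
curl http://my-cluster-m:50111/templeton/v1/status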
Alternatively, it is possible to setup a JDBC connection to HiveServer2 (available by default):
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC
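As a sketch of the JDBC route, beeline (which ships with Hive on the cluster) can connect to HiveServer2 on its default port; the host name mirrors the example above:
# Connect to HiveServer2 on the master and list tables in the default database
beeline -u "jdbc:hive2://my-cluster-m:10000/default" -e "SHOW TABLES;"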
As of now you can use the Dataproc Hive WebHCat optional component to activate Hive WebHCat during cluster creation:
gcloud dataproc clusters create $CLUSTER_NAME --optional-components=HIVE_WEBHCAT
How can I run a periodic job in the background on an EMR cluster?
I have a script.sh with a cron job and an application.py in S3, and I want to launch the cluster with this command:
aws emr create-cluster \
--name "Test cluster" \
--release-label emr-5.12.0 \
--applications Name=Hive Name=Pig Name=Ganglia Name=Spark \
--use-default-roles \
--ec2-attributes KeyName=myKey \
--instance-type m3.xlarge \
--instance-count 3 \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/script.sh"]
Finally, I want the cron job from script.sh to execute application.py.
I don't understand how to install cron on the master node; the Python file also needs some libraries, and they should be installed too.
You need to SSH into the master node, and then perform the crontab setup from there, not on your local machine:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html
Connect to the Master Node Using SSH
Secure Shell (SSH) is a network protocol you can use to create a
secure connection to a remote computer. After you make a connection,
the terminal on your local computer behaves as if it is running on the
remote computer. Commands you issue locally run on the remote
computer, and the command output from the remote computer appears in
your terminal window.
When you use SSH with AWS, you are connecting to an EC2 instance,
which is a virtual server running in the cloud. When working with
Amazon EMR, the most common use of SSH is to connect to the EC2
instance that is acting as the master node of the cluster.
Using SSH to connect to the master node gives you the ability to
monitor and interact with the cluster. You can issue Linux commands on
the master node, run applications such as Hive and Pig interactively,
browse directories, read log files, and so on. You can also create a
tunnel in your SSH connection to view the web interfaces hosted on the
master node. For more information, see View Web Interfaces Hosted on
Amazon EMR Clusters.
crontab is installed by default on the Linux system, so you don't need to install it manually.
To schedule the Spark job with cron, follow the steps below:
Log in to the master node (SSH into the master).
run command
crontab -e
Add the line below to the crontab and save it (:wq in vi).
*/15 * * * * /script-path/script.sh
Cron will now run the job every 15 minutes.
Please refer to this link to learn more about cron.
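To tie this back to the question, the script.sh that the step runs could itself install the Python dependencies, pull application.py from S3, and register the cron entry. A rough sketch (the library name is a placeholder; the S3 path follows the question):
#!/bin/bash
# Rough sketch, executed once as the EMR step on the master node
sudo pip install boto3                       # placeholder for the libraries application.py needs
aws s3 cp s3://mybucket/script-path/application.py /home/hadoop/application.py
# Append a cron entry so application.py runs every 15 minutes via spark-submit
(crontab -l 2>/dev/null; echo "*/15 * * * * spark-submit /home/hadoop/application.py") | crontab -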
I want to run queries by passing them as strings to some supported AWS command through its CLI.
I can see that the Redshift-specific commands listed here don't include anything that says they can execute queries remotely.
Link : https://docs.aws.amazon.com/cli/latest/reference/redshift/index.html
Need help on this.
You need to use psql. There is no API interface to Redshift.
Redshift is loosely based on PostgreSQL, however, so you can connect to the cluster using the psql command-line tool.
https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-from-psql.html
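For example, once psql is installed locally, a query can be run non-interactively; the endpoint, user, and database below are placeholders:
# Port 5439 is Redshift's default; replace host, user, and database with your own
psql -h mycluster.abc123xyz0.us-west-2.redshift.amazonaws.com -p 5439 -U awsuser -d dev -c "select count(*) from stl_query;"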
You can use the Redshift Data API to execute queries on Redshift using the AWS CLI.
aws redshift-data execute-statement \
    --region us-west-2 \
    --secret arn:aws:secretsmanager:us-west-2:123456789012:secret:myuser-secret-hKgPWn \
    --cluster-identifier mycluster-test \
    --sql "select * from stl_query limit 1" \
    --database dev
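execute-statement is asynchronous and returns a statement Id, which you can then use to check status and fetch rows:
# Check whether the statement has finished (the Id comes from execute-statement's output)
aws redshift-data describe-statement --id <statement-id>
# Once the status is FINISHED, retrieve the result set
aws redshift-data get-statement-result --id <statement-id>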
https://aws.amazon.com/about-aws/whats-new/2020/09/announcing-data-api-for-amazon-redshift/
https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html
https://docs.aws.amazon.com/cli/latest/reference/redshift-data/index.html#cli-aws-redshift-data