Environment: AWS EMR emr-5.11.1, Zeppelin 0.7.3, Spark 2.2.1
Problem: the Zeppelin pyspark interpreter is not submitting jobs as applications in YARN
As per this, I have made the following changes, with no effect:
set SPARK_HOME
added spark.executor.memory=5g, spark.cores.max,
master=yarn-client, and spark.home in the pyspark interpreter tab in Zeppelin
added spark.dynamicAllocation.enabled = true in yarn-site.xml
restarted the interpreter and the Zeppelin process
Please help.
Solution 1
I had the same problem. Please upgrade to 0.8.0; the newest version solves this issue.
Solution 2
Edit $ZEPPELIN_HOME/conf/zeppelin-env.sh and add export SPARK_SUBMIT_OPTIONS="--num-executors 10 --driver-memory 8g --executor-memory 10g --executor-cores 4".
If you don't have zeppelin-env.sh, copy zeppelin-env.sh.template and rename it to zeppelin-env.sh.
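Putting those two pieces together, a minimal sketch of the change (the resource values are the ones quoted above; SPARK_HOME=/usr/lib/spark is the usual EMR location, and the restart command can differ between EMR releases):

# on the EMR master node
cd $ZEPPELIN_HOME/conf
cp zeppelin-env.sh.template zeppelin-env.sh   # only if zeppelin-env.sh does not exist yet
# point Zeppelin at the EMR-installed Spark and pass resources to spark-submit
echo 'export SPARK_HOME=/usr/lib/spark' >> zeppelin-env.sh
echo 'export SPARK_SUBMIT_OPTIONS="--num-executors 10 --driver-memory 8g --executor-memory 10g --executor-cores 4"' >> zeppelin-env.sh
# restart Zeppelin so the interpreter picks up the change
sudo stop zeppelin && sudo start zeppelin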
Solution 3
Edit $SPARK_CONF_DIR/spark-defaults.conf and add or modify the settings you want.
After that, restart your server.
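For example, a sketch of spark-defaults.conf entries matching the settings mentioned in the question (the values are illustrative, not recommendations):

# $SPARK_CONF_DIR/spark-defaults.conf
spark.master                     yarn
spark.submit.deployMode          client
spark.executor.memory            5g
spark.dynamicAllocation.enabled  true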
Related
I am new to Airflow and need assistance on how to install Airflow on k8s.
My needs are:
1. How to build a Docker image of Airflow only for the webserver and scheduler
2. How to build a separate Docker image for MySQL
3. How to create airflow.cfg with the Kubernetes executor?
4. Any sample would be appreciated.
If you mean Apache Airflow, you can just use the existing Helm chart described here: https://airflow.apache.org/docs/apache-airflow/stable/kubernetes.html
Unfortunately, from your description it is not clear what kind of problem you want to solve.
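If it helps, a minimal sketch of installing that official chart (the release name and namespace are arbitrary choices, not requirements):

# add the Apache Airflow Helm repository and install the chart into its own namespace
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow --namespace airflow --create-namespace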
I am running Sqoop 1.4.7 on AWS EMR 5.21.1 and am trying to import data from a database. I have successfully been able to do this manually on an EMR cluster created through the EMR Console with Sqoop installed.
Here are the preliminary steps I performed in order to run Sqoop on EMR:
Download the JDBC Driver
Move the JDBC driver to the /usr/lib/sqoop/lib directory
I was able to successfully run a Sqoop import when SSH'd into an EMR cluster, after installing the driver with these commands:
wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar
sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/
However, when I try to run these commands from an EMR bootstrap script, I get the error:
/usr/lib/sqoop/lib/: No such file or directory
After doing some investigation, I realized this is because "Bootstrap actions execute before core services, such as Hadoop or Spark, are installed", as found here.
So the /usr/lib/sqoop/lib directory doesn't exist when I run my bootstrap steps.
Here are some solutions that work, but they feel like workarounds:
Create the /usr/lib/sqoop/lib directory in my bootstrap script and then place the jar in it
Add the jar to this directory as an EMR step. (Turns out this is the correct approach; see the accepted answer below.)
What is the correct way of installing this JDBC driver on EMR?
The second option is the correct way to do it. The documentation explains how to run bash scripts as an EMR step.
You can also use command-runner.jar as the jar and set the arguments to:
bash -c "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar;sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"
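For reference, a sketch of submitting that as a step from the AWS CLI (the cluster ID and step name are placeholders, and the shorthand quoting may need adjustment for your shell):

aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=InstallSqoopJdbcDriver,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[bash,-c,"wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar; sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"]'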
I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through SSH), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the Spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}
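For context, a sketch of the full notebook cell: in sparkmagic-based EMR notebooks this JSON is typically passed through the %%configure magic, where -f forces the Livy session to restart with the new configuration.

%%configure -f
{"conf": {"spark.pyspark.python": "/usr/bin/python3.7",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type": "native",
          "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"}}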
I'm running emr-5.12.0, with Amazon 2.8.3, Hive 2.3.2, Hue 4.1.0, Livy 0.4.0, Spark 2.2.1 and Zeppelin 0.7.3 on 1 m4.large as my master node and 1 m4.large as core node.
I am trying to execute a bootstrap action that configures some parts of the cluster. One of these includes the line:
sudo sed -i '/zeppelin.pyspark.python/c\ \"zeppelin.pyspark.python\" : \"python3\",' /etc/alternatives/zeppelin-conf/interpreter.json
It makes sure that Zeppelin uses python3.4 instead of python2.7. It works fine if I execute this in the terminal after SSH'ing to the master node, but it fails when I submit it as a Custom JAR step in the AWS web interface. I get the following error:
sed: can't read /etc/alternatives/zeppelin-conf/interpreter.json: No such file or directory
Command exiting with ret '2'
The same thing happens if I use
sudo sed -i '/zeppelin.pyspark.python/c\ \"zeppelin.pyspark.python\" : \"python3\",' /etc/zeppelin-conf/interpreter.json
Obviously I could just change it from the Zeppelin UI, but I would like to include it in the bootstrap action.
Thanks!
It turns out that a bootstrap action submitted through the AWS EMR web interface is submitted as a regular EMR step, so it's only run on the master node. This can be seen if you click 'AWS CLI export' in the cluster web interface. The intended bootstrap action is listed as a regular step.
Using the command line to launch a cluster with a bootstrap action bypasses this problem, so I've just used that.
Edit: Looking back at the web interface, it's pretty clear that I was adding regular steps instead of bootstrap actions. My bad!
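For reference, a sketch of a command-line launch with a proper bootstrap action (the cluster name, instance settings, and script location are placeholders):

aws emr create-cluster \
  --name "zeppelin-python3" \
  --release-label emr-5.12.0 \
  --applications Name=Hadoop Name=Hive Name=Hue Name=Livy Name=Spark Name=Zeppelin \
  --instance-type m4.large --instance-count 2 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/configure-zeppelin.sh,Name=ConfigureZeppelin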
Three steps and I have a spark context in my IPython notebook:
1.) Launch spark on EC2 using these instructions.
2.) Install anaconda and py4j on every node (set PATH accordingly).
3.) Log in to the master, cd to the Spark folder, then run:
MASTER=spark://<public DNS>:7077 PYSPARK_PYTHON=~/anaconda2/bin/python PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="*"' ./bin/pyspark
This process makes the IPython notebook available on <master public DNS>:8888, which is great, but ... I am currently using a csshx-style solution to accomplish step 2.
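For context, a rough sketch of what that step 2 amounts to as a plain ssh loop (the host list, key path, remote user, and installer URL/version are placeholders; this is essentially what the csshx approach does by hand):

# run from the master or any machine that can reach the workers
for host in $(cat worker_hosts.txt); do
  ssh -i ~/mykey.pem root@"$host" '
    wget -q https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh -O anaconda.sh &&
    bash anaconda.sh -b -p ~/anaconda2 &&
    ~/anaconda2/bin/pip install py4j &&
    echo "export PATH=~/anaconda2/bin:\$PATH" >> ~/.bash_profile'
done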
Question:
How can I set install requirements (on AWS or elsewhere) so that the spark-ec2 script spins up machines with the desired setup?
If that's not possible or simply clunky, what would you suggest? (Command-line-only solutions are preferred.)