I am learning Spark fundamentals and, in order to test my PySpark application, I created an EMR cluster with Spark, YARN, Hadoop, and Oozie on AWS. I am able to successfully execute a simple PySpark application from the driver node using spark-submit. I have the default /etc/spark/conf/spark-defaults.conf file created by AWS, which uses the YARN resource manager. Everything runs fine and I can monitor the Tracking URL as well.
But I am not able to tell whether the Spark job is running in 'client' mode or 'cluster' mode. How do I determine that?
Excerpts from /etc/spark/conf/spark-defaults.conf:
spark.master yarn
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address ip-xx-xx-xx-xx.ec2.internal:18080
spark.history.ui.port 18080
spark.shuffle.service.enabled true
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.executor.memory 4743M
spark.executor.cores 2
spark.yarn.executor.memoryOverheadFactor 0.1875
spark.driver.memory 2048M
Excerpts from my PySpark job:
import os.path
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from boto3.session import Session
conf = SparkConf().setAppName('MyFirstPySparkApp')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
....# access S3 bucket
....
....
Is there a deployment mode called 'yarn-client' or is it just 'client' and 'cluster'?
Also, why is "num-executors" not specified in the config file by AWS? Is that something I need to add?
Thanks
It is determined by the --deploy-mode option you pass when you submit the job; see the documentation.
Once you access the Spark History Server from the EMR console or through its web UI, you can find the spark.submit.deployMode option in the Environment tab. In my case, it is client mode.
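You can also read the same property from inside the running application. Here is a minimal PySpark sketch (the property name is the one shown in the Environment tab; the "client" fallback is just a default I'm assuming for when the property isn't set):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("deploy-mode-check").getOrCreate()
# spark.submit.deployMode is the same property shown in the History Server's Environment tab.
deploy_mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "client")
print("Running in", deploy_mode, "mode")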
By default a Spark application runs in client mode, i.e. the driver runs on the node you submit the application from. Details about these deployment configurations can be found here. One easy way to verify it is to kill the running process by pressing Ctrl+C in the terminal after the job reaches the RUNNING state. If it is running in client mode, the app will die. If it is running in cluster mode, it will continue to run, because the driver is running on one of the worker nodes in the EMR cluster. A sample spark-submit command to run the job in cluster mode would be
spark-submit --master yarn \
--py-files my-dependencies.zip \
--num-executors 10 \
--executor-cores 2 \
--executor-memory 5g \
--name sample-pyspark \
--deploy-mode cluster \
package.pyspark.main
By default, the number of executors is set to 1. You can check the default values for all Spark configs here.
Related
I want to connect to a DocumentDB that has TLS enabled. I could do that from a Lambda function with the rds-combined-ca-bundle.pem file copied alongside the Lambda code. I could not do the same with Databricks, because every node of the cluster needs this file; when Spark tries to connect, it always times out. I tried to create init scripts by following the link below:
https://learn.microsoft.com/en-us/azure/databricks/kb/python/import-custom-ca-cert
However, it does not help either. Let me know if anyone has any clue about this kind of use case.
Note: I can connect to a TLS-disabled DocumentDB from the same Databricks instance.
If you are experiencing connection timeout errors when using an init script to import the rds-combined-ca-bundle.pem file on your Spark cluster, try the following steps:
Make sure that the rds-combined-ca-bundle.pem file is available on the driver node of your Spark cluster. The init script will only be executed on the driver node, and you will encounter connection timeout errors otherwise.
Use the --conf option when starting spark-shell or spark-submit to specify the location of the rds-combined-ca-bundle.pem file on the driver node, for example:
spark-shell --conf spark.mongodb.ssl.caFile=path/to/rds-combined-ca-bundle.pem
Check the Spark cluster logs to see whether the init script is executing correctly or encountering any errors.
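If you build the connection from a notebook instead of spark-shell, the same setting can be passed when constructing the SparkSession. A minimal PySpark sketch, assuming the spark.mongodb.ssl.caFile key from the command above (adjust it if your connector expects a different key) and a DBFS path where the bundle was copied; on Databricks the session usually already exists, so you may need to set this as a cluster-level Spark config instead:
from pyspark.sql import SparkSession
# Path is an assumption: wherever your init script (or a manual copy) placed the CA bundle.
CA_BUNDLE = "/dbfs/FileStore/certs/rds-combined-ca-bundle.pem"
spark = (SparkSession.builder
         .appName("documentdb-tls-check")
         # Same key as the spark-shell example above.
         .config("spark.mongodb.ssl.caFile", CA_BUNDLE)
         .getOrCreate())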
I have a PySpark script, stored both on the master node of an AWS EMR cluster and in an S3 bucket, that fetches over 140M rows from a MySQL database and stores the sum of a column back in the log files on S3.
When I spark-submit the PySpark script from the master node, the job completes successfully and the output is stored in the log files in the S3 bucket.
However, when I spark-submit the PySpark script from the S3 bucket using the commands below (run in the terminal after SSH-ing into the master node):
spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
This returns an Error: Missing application resource. error.
spark-submit s3://bucket_name/my_script.py
This shows:
20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
... 20 more
I read about having to add a Spark step to the AWS EMR cluster to submit a PySpark script stored on S3.
Am I correct in saying that I would need to create a step in order to submit my PySpark job stored on S3?
In the 'Add Step' window that pops up on the AWS Console, the 'Application location' field says that I have to type in the location of the JAR file. What JAR file are they referring to? Does my PySpark script have to be packaged into a JAR file (and if so, how do I do that), or do I just mention the path to my PySpark script?
In the same 'Add Step' window, under the spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If not, why not?
I have gone through the AWS EMR documentation. I have so many questions because I dove nose-first into the problem and only researched when an error popped up.
Your spark-submit command should be:
spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py
--py-files is used if you want to pass Python dependency modules, not the application code.
When you add a step in EMR to run a Spark job, the JAR location is the path to your Python file, i.e. s3://bucket_name/my_script.py
No, it's not mandatory to use a step to submit a Spark job.
You can also use spark-submit directly.
To submit a PySpark script as a step, please refer to the AWS docs and Stack Overflow.
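If you do go the step route, a step is essentially a spark-submit command wrapped in command-runner.jar. A minimal boto3 sketch (the region, cluster ID, and step name are placeholders, not values from the question):
import boto3
# Placeholders: substitute your own region, cluster ID, and script location.
emr = boto3.client("emr", region_name="us-east-1")
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "my-pyspark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar simply runs the command given in Args.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://bucket_name/my_script.py"],
        },
    }],
)
print(response["StepIds"])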
For problem 1:
By default, Spark will use Python 2.
You need to add two configs.
Go to $SPARK_HOME/conf/spark-env.sh and add:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: If you have any custom bundle, add it using --py-files.
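As a quick sanity check after changing these settings, you can confirm from a PySpark session which interpreter the driver and the executors actually picked up. A minimal sketch (the app name is arbitrary):
import sys
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("python-version-check").getOrCreate()
sc = spark.sparkContext
def executor_python_version(_):
    import sys  # imported on the executor, not the driver
    return sys.version
print("Driver Python:  ", sys.version)
print("Executor Python:", sc.parallelize([0], 1).map(executor_python_version).first())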
For problem 2:
A hadoop-assembly JAR exists in /usr/share/aws/emr/emrfs/lib/; it contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
You need to add this to your classpath.
A better option, in my opinion, is to create a symbolic link from the hadoop-assembly JAR into HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action.
I'm setting up a new Dataproc cluster and using an initialization action to run a custom script. The script runs fine on the 2 worker nodes but does not execute on the master node.
I tried looking for logs under /var/log/dataprog-initilization-*.log but was unable to find the file on the master node.
Has anyone else faced this issue before?
Thanks in advance!!
gcloud command:
gcloud dataproc clusters create test-cluster \
--region=us-central1 --zone=us-central1-a \
--master-machine-type=n1-standard-4 --master-boot-disk-size=200 \
--initialization-actions=gs://dp_init_data/init2.sh --initialization-action-timeout="2m" \
--num-workers=2 --worker-machine-type=n1-standard-8 --worker-boot-disk-size=200
DataNode error log:
2019-07-11 03:29:22,123 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-268987178-10.32.1.248-1562675355441 (Datanode Uuid 71664f82-1d23-4184-b19b-28f86b01a251) service to exp-gcp-kerberos-m.c.exp-cdh-prod.internal/10.32.1.248:8051 Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(10.32.1.60:9866, datanodeUuid=71664f82-1d23-4184-b19b-28f86b01a251, infoPort=0, infoSecurePort=9865, ipcPort=9867, storageInfo=lv=-57;cid=CID-aee57974-1706-4b8c-9654-97da47ad0464;nsid=128710770;c=1562675355441)
According to your DataNode error log, it seems you are expecting the init action to run first on the master and then on the workers. But init actions run in parallel on all nodes, so you have to add logic to synchronize the master and the workers yourself. You can simply add a wait in the workers, or, if you want something more reliable, write a flag file to GCS when the master init is done and poll for that file in the workers, as in the sketch below.
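A minimal sketch of that synchronization in Python, assuming your init script is (or calls) Python and that you have a staging bucket; the bucket name, flag object path, and timeout are placeholders:
import time
import requests
from google.cloud import storage
BUCKET = "my-staging-bucket"      # placeholder
FLAG = "init-flags/master-done"   # placeholder object written by the master
TIMEOUT_SECONDS = 600
def dataproc_role():
    # Dataproc exposes the node role ("Master" or "Worker") via instance metadata.
    return requests.get(
        "http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-role",
        headers={"Metadata-Flavor": "Google"}).text
def main():
    blob = storage.Client().bucket(BUCKET).blob(FLAG)
    if dataproc_role() == "Master":
        # ... master-specific init work goes here ...
        blob.upload_from_string("done")  # signal the workers
    else:
        deadline = time.time() + TIMEOUT_SECONDS
        while not blob.exists():
            if time.time() > deadline:
                raise TimeoutError("master init flag never appeared")
            time.sleep(10)
        # ... worker-specific init work that depends on the master goes here ...
if __name__ == "__main__":
    main()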
I have a scenario where I have an AWS EMR setup with a few applications such as Spark, Hadoop, Hive, HCatalog, Zeppelin, Sqoop, etc., and another server which runs only Airflow.
I am working on a requirement where I want to move MySQL tables (which live on yet another RDS instance) to Hive using Sqoop, and this has to be triggered by Airflow.
Is it possible to achieve this using the SqoopOperator available in Airflow, given that Airflow is on a remote server? I believe not; if so, is there any other way to achieve this?
Thanks in advance.
Yes, this is possible. I'll admit the documentation on how to use the operators is lacking, but if you understand the concept of hooks and operators in Airflow, you can figure it out by reading the code of the operator you're looking to use. In this case, you'll want to read through the SqoopHook and SqoopOperator codebase. Most of what I know how to do with Airflow comes from reading the code; while I haven't used this operator, I'll try to help you out here as best I can.
Let's assume you want to execute this sqoop command:
sqoop import --connect jdbc:mysql://mysql.example.com/testDb --username root --password hadoop123 --table student
And you have a Sqoop server running on a remote host, which you can reach with the Sqoop client at http://scoop.example.com:12000/sqoop/.
First, you'll need to create the connection in the Airflow Admin UI; call the connection sqoop. For the connection, fill in the host as scoop.example.com, the schema as sqoop, and the port as 12000. If you have a password, you will need to put it into a file on your server, and in Extra fill out a JSON string that looks like {"password_file": "/path/to/password.txt"} (see the inline code about this password file).
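If you prefer to script the connection instead of clicking through the UI, here is a minimal sketch, assuming Airflow 1.x and access to the Airflow metadata database from the host the code runs on; all values mirror the example connection above:
import json
from airflow import settings
from airflow.models import Connection
# Values mirror the example connection described above.
conn = Connection(
    conn_id="sqoop",
    conn_type="sqoop",
    host="scoop.example.com",
    schema="sqoop",
    port=12000,
    extra=json.dumps({"password_file": "/path/to/password.txt"}))
session = settings.Session()
session.add(conn)
session.commit()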
After you set up the connection, you can create a task using the SqoopOperator in your DAG file. It might look like this:
from airflow.contrib.operators.sqoop_operator import SqoopOperator  # Airflow 1.x contrib path
sqoop_mysql_import = SqoopOperator(
    task_id='sqoop_mysql_import',  # every Airflow operator needs a task_id
    conn_id='sqoop',
    table='student',
    username='root',
    password='password',
    driver='jdbc:mysql://mysql.example.com/testDb',
    cmd_type='import',
    dag=dag)
The full list of parameters you might want to pass for imports can be found in the code here.
You can see how the SqoopOperator (and really the SqoopHook which the operator leverages to connect to Sqoop) translates these arguments to command line commands here.
Really this SqoopOperator just works by translating the kwargs you pass into sqoop client CLI commands. If you check out the SqoopHook, you can see how that's done and probably figure out how to make it work for your case. Good luck!
To troubleshoot, I would recommend SSHing into the server you're running Airflow on and confirming that you can run the Sqoop client from the command line and connect to the remote Sqoop server.
Try adding a step that uses script-runner.jar. Here is more.
aws emr create-cluster --name "Test cluster" --release-label emr-5.16.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m4.large --instance-count 3 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
Then you can do it like this.
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator  # Airflow 1.x contrib paths
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
SPARK_TEST_STEPS = [
    {
        'Name': 'demo',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://cn-northwest-1.elasticmapreduce/libs/script-runner/script-runner.jar',
            'Args': [
                "s3://d.s3.d.com/demo.sh",
            ]
        }
    }
]
step_adder = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=SPARK_TEST_STEPS,
    dag=dag
)
step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('add_steps', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)
step_adder.set_downstream(step_checker)
demo.sh looks like this:
#!/bin/bash
sqoop import --connect jdbc:mysql://mysql.example.com/testDb --username root --password hadoop123 --table student
I killed the hiveserver2 process (after finding the PID with ps aux | grep -i hiveserver2) on my EMR cluster with one master and two workers. Before killing hiveserver2, I was able to browse and query Hive in my browser via HUE. I tried restarting it with hive --service hiveserver2, but now I can't connect from HUE anymore; it either hangs or says that it can't connect to <publicDNS>:10000.
My use case is that I want to modify the hive configuration of my EMR cluster without shutting down the cluster. Is that possible at all?
initctl list
status hive-server2
sudo restart hive-server2
sudo stop hive-server2
sudo start hive-server2
How do I restart a service in Amazon EMR?
Hive configurations can be added before you launch your cluster, not after the cluster is ready. You can add them as configuration settings in the bootstrap step.
E.g., you can add your configurations to hive-site.xml using the following syntax (in Java):
Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.vectorized.execution.enabled","true");
hiveProperties.put("hive.vectorized.execution.reduce.enabled","true");
hiveProperties.put("hive.execution.engine","Tez");
hiveProperties.put("hive.auto.convert.join","true");
hiveProperties.put("hive.exec.parallel","true");
Configuration myHiveConfig = new Configuration()
.withClassification("hive-site")
.withProperties(hiveProperties);
List <Application> apps = new ArrayList<Application>();
apps.add(new Application().withName("Hadoop"));
apps.add(new Application().withName("Hive"));
apps.add(new Application().withName("Spark"));
//apps.add(new Application().withName("Pig"));
//apps.add(new Application().withName("Zeppelin-Sandbox"));
RunJobFlowRequest request = new RunJobFlowRequest()
.withName("abc")
.withReleaseLabel(emrVersion) //"emr-4.3.0"
.withServiceRole("EMR_DefaultRole")
.withConfigurations(myHiveConfig)
.withInstances(
new JobFlowInstancesConfig()
.withInstanceCount(numberofInstances)
.withKeepJobFlowAliveWhenNoSteps(true)
.withTerminationProtected(false)
.withMasterInstanceType(mserverType)
.withSlaveInstanceType(sserverType)
)
.withApplications(apps)
.withJobFlowRole("EMR_EC2_DefaultRole")
.withSteps(generalSteps);
More details in link below:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html
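If you are working from Python rather than Java, the same idea can be expressed with boto3's run_job_flow. A minimal sketch; the region, release label, instance types, and counts are placeholders mirroring the Java example rather than required values:
import boto3
emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder
hive_properties = {
    "hive.vectorized.execution.enabled": "true",
    "hive.vectorized.execution.reduce.enabled": "true",
    "hive.execution.engine": "Tez",
    "hive.auto.convert.join": "true",
    "hive.exec.parallel": "true",
}
response = emr.run_job_flow(
    Name="abc",
    ReleaseLabel="emr-4.3.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=[{"Classification": "hive-site", "Properties": hive_properties}],
    Instances={
        "MasterInstanceType": "m4.large",   # placeholder
        "SlaveInstanceType": "m4.large",    # placeholder
        "InstanceCount": 3,                 # placeholder
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])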