I am using spark-submit and tried to set the UI port in the jar with .setExecutorEnv("spark.ui.port", "4050") on the Spark context, but it still tried to hit 4040. I then tried putting --conf spark.ui.port=4050 after spark-submit and before --class CLASSNAME, but that didn't work either; this time it failed with "Error: Unrecognized option '--conf'". How do I get around this? The actual problem I'm running into is that there is an active Spark application that others are using, which is preventing this spark-submit from starting its Jetty server on the default port. It is not falling back to other ports, so I'm trying to force it onto a different one.
--conf spark.ui.port=4050 is a Spark 1.1 feature. You can set it in your code instead, for example:
val conf = new SparkConf().setAppName("SimpleApp").set("spark.ui.port", "4050")
val sc = new SparkContext(conf)
I want to connect to a DocumentDB which has TLS enabled. I could do that from a Lambda function with the rds-combined-ca-bundle.pem copied alongside the Lambda code. I could not do the same with Databricks, as every node of the cluster needs this file; when Spark tries to connect, it always times out. I tried to create the init scripts by following the link below:
https://learn.microsoft.com/en-us/azure/databricks/kb/python/import-custom-ca-cert
However, it does not help either. Let me know if anyone has any clue about this kind of use case.
Note: I can connect to a TLS-disabled DocumentDB from the same Databricks instance.
If you are experiencing connection timeout errors when using an init script to import the rds-combined-ca-bundle.pem file on your Spark cluster, try the following steps:
Make sure that the rds-combined-ca-bundle.pem file is available on the driver node of your Spark cluster; you will encounter connection timeout errors otherwise. The init script will only be executed on the driver node.
Use the --conf option when starting spark-shell or spark-submit to specify the location of the rds-combined-ca-bundle.pem file on the driver node, for example:
spark-shell --conf spark.mongodb.ssl.caFile=path/to/rds-combined-ca-bundle.pem
Check the Spark cluster logs to see whether the init script is being executed correctly or whether it is encountering any errors.
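If you build the session in a notebook instead of starting spark-shell, the same option can be set on the session builder. A minimal sketch, assuming the connector honours the spark.mongodb.ssl.caFile option shown above and that the bundle sits at a hypothetical /dbfs path:
from pyspark.sql import SparkSession

# The /dbfs path is a placeholder; use wherever your init script or upload
# actually placed rds-combined-ca-bundle.pem.
spark = (
    SparkSession.builder
    .appName("documentdb-tls-check")
    .config("spark.mongodb.ssl.caFile", "/dbfs/FileStore/rds-combined-ca-bundle.pem")
    .getOrCreate()
)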
I'm getting the following error: SQLException: No suitable driver
I have the Spark submit arguments set in my code as follows:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/driver/jars --jars /file/path/to/jars'
And when I call spark-submit I specify the jars to make sure that they're found.
spark-submit --driver-class-path /Users/my_user/tools/spark-3.12-bin-hadoop3.2/jars/postgresql-42.2.23.jar --jars /Users/my_user/tools/spark-3.12-bin-hadoop3.2/jars/postgresql-42.2.23.jar postgres_elt.py
I then put the driver and executor class paths in my spark-defaults template, but I'm still getting the same error.
I'm trying to connect to AWS from my local machine. Not sure what I'm doing wrong at this point.
To correct this issue, I added .option("driver", "org.postgresql.Driver") to my JDBC read/write call. This seemed to fix my issue and I was able to connect without a problem.
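For reference, a minimal sketch of where that option goes in a PySpark JDBC read; the host, database, table, and credentials below are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("postgres-elt").getOrCreate()

# Naming the driver class explicitly avoids "No suitable driver" when the
# JDBC URL alone is not enough for driver discovery.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-host:5432/my_db")
    .option("dbtable", "public.my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "org.postgresql.Driver")
    .load()
)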
I am learning Spark fundamentals and, in order to test my PySpark application, created an EMR cluster with Spark, Yarn, Hadoop, and Oozie on AWS. I am able to successfully execute a simple PySpark application from the driver node using spark-submit. I have the default /etc/spark/conf/spark-default.conf file created by AWS, which uses the Yarn resource manager. Everything runs fine and I can monitor the Tracking URL as well.
But I am not able to differentiate between whether the spark job is running in 'client' mode or 'cluster' mode. How do I determine that?
Excerpts from /etc/spark/conf/spark-default.conf
spark.master yarn
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address ip-xx-xx-xx-xx.ec2.internal:18080
spark.history.ui.port 18080
spark.shuffle.service.enabled true
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.executor.memory 4743M
spark.executor.cores 2
spark.yarn.executor.memoryOverheadFactor 0.1875
spark.driver.memory 2048M
Excerpts from my pyspark job:
import os.path
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from boto3.session import Session
conf = SparkConf().setAppName('MyFirstPySparkApp')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
....# access S3 bucket
....
....
Is there a deployment mode called 'yarn-client' or is it just 'client' and 'cluster'?
Also, why is "num-executors" not specified in the config file by AWS? Is that something I need to add?
Thanks
It is determined by the --deploy-mode option you pass when you submit the job; see the documentation.
Once you access the Spark history server from the EMR console or its web UI, you can find the spark.submit.deployMode option in the Environment tab. In my case, it is client mode.
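You can also check it from inside the job itself. A minimal sketch, relying on spark.submit.deployMode being set by spark-submit:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Prints "client" or "cluster" for the running application.
print(spark.sparkContext.getConf().get("spark.submit.deployMode"))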
By default a Spark application runs in client mode, i.e. the driver runs on the node you submit the application from. Details about these deployment configurations can be found here. One easy way to verify it is to kill the running process by pressing Ctrl+C in the terminal after the job reaches the RUNNING state. If it's running in client mode, the app will die. If it's running in cluster mode, it will continue to run, because the driver is running on one of the worker nodes in the EMR cluster. A sample spark-submit command to run the job in cluster mode would be:
spark-submit --master yarn \
--py-files my-dependencies.zip \
--num-executors 10 \
--executor-cores 2 \
--executor-memory 5g \
--name sample-pyspark \
--deploy-mode cluster \
package.pyspark.main
By default, the number of executors is set to 1. You can check the default values for all Spark configs here.
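If you want to set it from code rather than the command line, spark.executor.instances is the configuration property behind --num-executors. A minimal sketch; the value 10 is just an example:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MyFirstPySparkApp")
    .config("spark.executor.instances", "10")  # equivalent of --num-executors 10
    .getOrCreate()
)

# Shows what value is actually in effect (falls back to "not set" if absent).
print(spark.sparkContext.getConf().get("spark.executor.instances", "not set"))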
Hello, I am writing Spark using Python and trying to write a dataframe into a table; the table is a Hive external table stored on AWS S3.
Below is the command:
sqlContext.sql(selectQuery).write.mode("overwrite").format(trgFormat).option("compression", trgCompression).save(trgDataFileBase)
Below is the error:
ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetection.level=advanced' or call ResourceLeakDetector.setLevel() See http://netty.io/wiki/reference-counted-objects.html for more information.
spark-submit:
spark-submit --master yarn --queue default --deploy-mode client --num-executors 10 --executor-memory 12g --executor-cores 2 --conf spark.debug.maxToStringFields=100 --conf spark.yarn.executor.memoryOverhead=2048
Create a temp table, say trgDataFileBasetmp, then using the same definition create the table on S3. You will need all the parameters in the definition, such as SERDEPROPERTIES and TBLPROPERTIES. The difference here is saveAsTable:
sqlContext.sql(selectQuery).write.mode("overwrite").format(trgFormat).option("compression", trgCompression).saveAsTable(trgDataFileBase)
If this does not work then you can start with:
sqlContext.sql(selectQuery).write.mode("overwrite").saveAsTable(trgDataFileBase)
I am currently running a real-time Spark Streaming job on a cluster with 50 nodes on Spark 1.3 and Python 2.7. The Spark streaming context reads from a directory in HDFS with a batch interval of 180 seconds. Below is the configuration for the Spark job:
spark-submit --master yarn-client --executor-cores 5 --num-executors 10 --driver-memory 10g --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.yarn.driver.memoryOverhead=2048 --conf spark.network.timeout=300 --executor-memory 10g
The job runs fine for the most part. However, it throws a Py4j Exception after around 15 hours citing it cannot obtain a communication channel.
I tried reducing the batch interval, but then it creates an issue where the processing time is greater than the batch interval.
Below is a screenshot of the error (Py4jError).
I did some research and found that it might be an issue with socket descriptor leakage; see SPARK-12617.
However, I am not able to work around the error and resolve it. Is there a way to manually close the open connections that might be preventing ports from being provided? Or do I have to make specific changes in the code to resolve this?
TIA