spark job leverage all nodes - amazon-web-services

So my setup on AWS is 1 master node and 2 executor nodes.
I'd expect both executor nodes to work on my task, but I can see that only one registers normally; the other registers as the ApplicationMaster. I can also see that 16 partitions at a time are processed.
I'm using spark-shell for now, with all the default settings on EMR 4.3. Command to start the shell:
export SPARK_EXECUTOR_MEMORY=20g
export SPARK_DRIVER_MEMORY=4g
spark-shell --num-executors 2 --executor-cores 16 --packages com.databricks:spark-redshift_2.10:0.6.0 --driver-java-options "-Xss100M" --conf spark.driver.maxResultSize=0
Any ideas where to start debugging this? Or is it correct behaviour?

I think the issue is that you are running in 'cluster' mode: the Spark driver is running inside an ApplicationMaster on one of the executor nodes and using 1 core. Because your executors require 16 cores, that node only has 15 cores available and does not have the resources to launch a second executor. You can verify this by looking at "Nodes" in the YARN UI.
The solution may be to launch the spark shell in client mode (--deploy-mode client) or to reduce the number of executor cores.
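For example, a minimal sketch of the same launch in client mode, keeping the memory settings from the question and leaving one core per node free for the ApplicationMaster (the value 15 is illustrative, not prescriptive):
export SPARK_EXECUTOR_MEMORY=20g
export SPARK_DRIVER_MEMORY=4g
spark-shell --deploy-mode client \
  --num-executors 2 --executor-cores 15 \
  --packages com.databricks:spark-redshift_2.10:0.6.0 \
  --driver-java-options "-Xss100M" \
  --conf spark.driver.maxResultSize=0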

Related

My GKE pods stopped with error "no command specified: CreateContainerError"

Everything was OK and the nodes were fine for months, but suddenly some pods stopped with this error.
I tried deleting the pods and nodes, but the issue persists.
Try the possible solutions below to resolve your issue:
Solution 1 :
Check for a malformed character in your Dockerfile that could be causing the crash.
The first thing to check when you encounter CreateContainerError is that you have a valid ENTRYPOINT in the Dockerfile used to build your container image. However, if you don't have access to the Dockerfile, you can configure your pod object by setting a valid command in the command attribute of the object, as sketched below.
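As a rough sketch, assuming you create the pod yourself (the pod name, image and command below are placeholders, not from the question), an explicit command can be set like this:
# Hypothetical pod spec; the image and command are placeholders
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: registry.example.com/my-app:latest
    # Explicit command, used when the image has no usable ENTRYPOINT
    command: ["/bin/sh", "-c", "exec /app/start.sh"]
EOF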
A workaround is to not specify any workerConfig explicitly, which makes the workers inherit all configs from the master.
Refer to Troubleshooting the container runtime, the similar SO1 and SO2 threads, and this similar GitHub link for more information.
Solution 2 :
The kubectl describe pod <podname> command provides detailed information about each of the pods that make up your Kubernetes infrastructure. With its help you can check for clues; if the events report Insufficient CPU, continue with the solution below.
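For instance (the pod name and namespace are placeholders):
# Describe the failing pod and read the Events section at the bottom
kubectl describe pod my-pod -n my-namespace
# Or list recent events in the namespace, sorted by time
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp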
The solution is to either:
1) Upgrade the boot disk: if you are using a pd-standard disk, it's recommended to upgrade to pd-balanced or pd-ssd.
2) Increase the disk size.
3) Use a node pool with a machine type that has more CPU cores.
See Adjust worker, scheduler, triggerer and web server scale and performance parameters for more information.
If you still have the issue, you can then update the GKE version for your cluster by manually upgrading the control plane to one of the fixed versions, for example as sketched below.
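A sketch of such an upgrade (the cluster name, zone and target version are placeholders to replace with your own values and one of the fixed versions):
# Upgrade only the control plane to a specific GKE version
gcloud container clusters upgrade my-cluster \
  --master --cluster-version <fixed-version> \
  --zone us-central1-a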
Also check whether you have updated it in the last year to use the new kubectl authentication plugin coming in GKE v1.26.
Solution 3 :
If you have a pipeline on GitLab that deploys an image to a GKE cluster: check the version of the GitLab runner that handles your pipeline's jobs.
It turns out that every image built by a GitLab runner running an old version causes this issue at container start. Simply deactivate the old runners, keep only runners on the latest version in the pool, and replay all pipelines.
Also check whether the GitLab CI script uses an old Docker image like docker:19.03.5-dind; updating to docker:dind helps Kubernetes start the pod again.

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through SSH), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the Spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}

Dataproc Initialization Script not running on master node

I'm setting up a new Dataproc cluster and using an initialization action to run a custom script. The script runs fine on the 2 data nodes but does not execute on the master node.
I tried looking for logs under /var/log/dataproc-initialization-*.log but could not find the file on the master node.
Has anyone else faced this issue before?
Thanks in advance!!
gcloud command:
gcloud dataproc clusters create test-cluster \
--region=us-central1 --zone=us-central1-a \
--master-machine-type=n1-standard-4 --master-boot-disk-size=200 \
--initialization-actions=gs://dp_init_data/init2.sh --initialization-action-timeout="2m" \
--num-workers=2 --worker-machine-type=n1-standard-8 --worker-boot-disk-size=200
DataNode error log:
2019-07-11 03:29:22,123 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-268987178-10.32.1.248-1562675355441 (Datanode Uuid 71664f82-1d23-4184-b19b-28f86b01a251) service to exp-gcp-kerberos-m.c.exp-cdh-prod.internal/10.32.1.248:8051 Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(10.32.1.60:9866, datanodeUuid=71664f82-1d23-4184-b19b-28f86b01a251, infoPort=0, infoSecurePort=9865, ipcPort=9867, storageInfo=lv=-57;cid=CID-aee57974-1706-4b8c-9654-97da47ad0464;nsid=128710770;c=1562675355441)
According to your DataNode error log, it seems you are expecting the init action to run first on the master and then on the workers. But init actions run in parallel on all nodes, so you have to add logic to synchronize the master and the workers. You could simply add some wait time in the workers, or, if you want something more reliable, write a flag file to GCS when the master init is done and have the workers check for that file, as sketched below.
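A rough sketch of that flag-file approach inside the init action script (the metadata lookup and the flag object name are assumptions to adapt to your setup):
#!/bin/bash
# Runs on every node; branch on the node's role
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
FLAG=gs://dp_init_data/flags/master-init-done   # hypothetical flag object

if [[ "${ROLE}" == "Master" ]]; then
  # ... master-side setup here ...
  echo done | gsutil cp - "${FLAG}"             # publish the flag when the master is ready
else
  # Workers wait until the master has published the flag
  until gsutil -q stat "${FLAG}"; do
    sleep 10
  done
  # ... worker-side setup that depends on the master ...
fi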

Is there a way to configure and change Yarn scheduler at runtime?

Currently I am using the default Yarn scheduler but would like to do something like -
Run Yarn using the default scheduler
If (number of jobs in queue > X) {
Change the Yarn scheduler to FIFO
}
Is this even possible through code?
Note that I am running Spark jobs on an AWS EMR cluster with YARN as the resource manager.
Well, it can be done by having a poller check the current queue (using the RM REST API), update yarn-site.xml, and then restart the RM; a rough sketch follows below. However, restarting the RM can impact your queue, because the currently running jobs will be killed or shut down (and probably retried later).
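A sketch of such a poller (the RM address, threshold, config path and restart command are assumptions that depend on your EMR release):
# Read the number of pending applications from the RM REST API
PENDING=$(curl -s http://localhost:8088/ws/v1/cluster/metrics \
  | python -c 'import sys, json; print(json.load(sys.stdin)["clusterMetrics"]["appsPending"])')

if [ "${PENDING}" -gt 50 ]; then
  # Switch the scheduler class to FIFO in yarn-site.xml, then restart the RM
  sudo sed -i 's/capacity\.CapacityScheduler/fifo.FifoScheduler/' /etc/hadoop/conf/yarn-site.xml
  sudo stop hadoop-yarn-resourcemanager && sudo start hadoop-yarn-resourcemanager
fi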
If you need a more efficient switch between the Capacity and FIFO schedulers, you may need to extend those classes and design your own scheduler that can do the job of your pseudocode.
EMR by default uses the Capacity Scheduler with the DefaultResourceCalculator and runs jobs in the default queue. For example, EMR keeps its YARN configuration at paths like the following:
/home/hadoop/.versions/2.4.0-amzn-6/etc/hadoop/yarn-site.xml
<property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value></property>
and in
/home/hadoop/.versions/2.4.0-amzn-6/etc/hadoop/capacity-scheduler.xml
the resource calculator is set to org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.

Spark Streaming. Issues with Py4j: Error while obtaining a new communication channel

I am currently running a real-time Spark Streaming job on a cluster with 50 nodes, on Spark 1.3 and Python 2.7. The Spark streaming context reads from a directory in HDFS with a batch interval of 180 seconds. Below is the configuration for the Spark job:
spark-submit --master yarn-client --executor-cores 5 --num-executors 10 --driver-memory 10g --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.yarn.driver.memoryOverhead=2048 --conf spark.network.timeout=300 --executor-memory 10g
The job runs fine for the most part. However, it throws a Py4j exception after around 15 hours, saying it cannot obtain a communication channel.
I tried reducing the batch interval, but then the processing time becomes greater than the batch interval.
Below is a screenshot of the error:
Py4jError
I did some research and found that it might be an issue with socket descriptor leakage; see SPARK-12617.
However, I am not able to work around the error and resolve it. Is there a way to manually close the open connections that might be preventing ports from being provided? Or do I have to make specific changes in the code to resolve this?
TIA