Spark UI shows zero executors - amazon-web-services

I am running Spark 2.0.1 with the options below, passed to
SparkSession.builder().master(master).appName(appName).config(conf).getOrCreate();
opts.put("spark.serializer","org.apache.spark.serializer.KryoSerializer");
opts.put("spark.executor.extraJavaOptions","-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC -Djava.security.egd=file:///dev/urandom");
opts.put("spark.driver.maxResultSize","0");
opts.put("spark.sql.shuffle.partitions","200");
opts.put("spark.sql.warehouse.dir","/opt/astra/spark-warehouse");
opts.put("spark.scheduler.mode","FAIR");
opts.put("spark.executor.memory","5g");
opts.put("spark.executor.cores","2");
opts.put("spark.kryoserializer.buffer.max","1g");
opts.put("spark.parquet.block.size","134217728");
The Spark master is running on an EC2 instance in AWS. In the Spark master UI I can see the memory and cores, but when I run a job, the job UI shows no executors, as below.
Also, while looking at a thread dump, I see lots of connection-related threads waiting.
Can someone please point me to what's happening and where to look? As mentioned in the comments, here is a snapshot of the Spark master showing the allocated resources.
In the logs, the system seems to be waiting for resources, as shown below:
16/11/08 12:46:37 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
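For reference, here is a rough PySpark equivalent of the session setup above; this is only a sketch (the master URL and app name are placeholders), with the config keys copied from the Java snippet:
# Sketch only: roughly the same session setup as the Java snippet above,
# showing how the option map feeds SparkSession.builder. Master URL and app
# name are placeholders; config keys are copied from the question.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")   # placeholder master URL
         .appName("my-app")                    # placeholder app name
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.executor.memory", "5g")
         .config("spark.executor.cores", "2")
         .config("spark.sql.shuffle.partitions", "200")
         .config("spark.scheduler.mode", "FAIR")
         .config("spark.kryoserializer.buffer.max", "1g")
         .getOrCreate())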

Related

Why is the cluster not provisioned when running a data pipeline in Data Fusion?

I am using Data Fusion Enterprise.
Under Data Fusion > System Admin > Configuration > System Compute Profiles > Create New Profile, I set the configuration values for the master node and worker nodes.
I also set the configuration (executor, driver) for each data pipeline.
Now, while deploying and running the data pipeline, the provisioning state does not move on to the next startup phase.
The issue is as follows.
1. Dataproc CreateCluster
asia-northest3:cdap-eventgmkt-7e769e35-182d-11b-9d9d-ce8dcdf883
service-125051400193@gcp-sa-datafusion.iam.gserviceaccount.com
Multiple Errors: - Timeout waiting for instance cdap-eventgmkt-7e769e35-182d-11b-9d9d-ce8dcdf88803-m to report in. - Timeout waiting for instance cdap-eventgmt-7769-9d-1d-1d1
2. Dataproc DeleteCluster
asia-northest3:cdap-eventgmkt-7e769e35-182d-11b-9d9d-ce8dcdf883
service-125051400193@gcp-sa-datafusion.iam.gserviceaccount.com
Cannot delete cluster 'cdap-eventgmkt-7e769e35-182d-11eb-9d9d-ce8dcdf88803' when it has other pending delete operations.
In short, the data pipeline is running, but it's no longer being provisioned.
How do you solve these problems?
Thank you.

Cloud Composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But some mornings I see that a few tasks failed during the night without any apparent reason.
When checking the logs in the UI, I see no log, and there is no log either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled", but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email on failure, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush its logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that the affected tasks have no start date, job id, or worker (hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
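As a hedged illustration of the "reduce concurrency" advice at the DAG level (the DAG id, schedule, and task below are made up; the alternative is lowering the Celery worker concurrency in the Airflow config or using larger worker machines):
# Sketch of a throttled Airflow 1.9-style DAG: fewer simultaneous task
# instances means each worker pod needs less memory, making eviction less
# likely. DAG id, schedule, and the task itself are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="nightly_example",        # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 3 * * *",   # 3:00 AM, as in the question
    concurrency=3,                   # cap on simultaneously running task instances
    max_active_runs=1,               # one DAG run at a time
    default_args={"retries": 5},     # the question already uses 5 retries
)

work = BashOperator(task_id="do_work", bash_command="echo work", dag=dag)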
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

How to relaunch a Spark executor after it crashes (in YARN client mode)?

Is it possible to relaunch a Spark executor after it crashes? I understand that the failed tasks are re-run in the existing working Spark executors, but I hope there is a way to relaunch the crashed Spark executor.
I am running PySpark 1.6 on YARN, in client mode.
No, it is not possible to do it yourself. Spark takes care of it: when an executor dies, it will request a new one the next time it asks for "resource containers" for executors.
If the executor was close to the data it was processing, Spark will request a new executor taking the locality preferences of the task(s) into account, and chances are that the host where the executor died will be used again to run the new one.
An executor is a JVM process that spawns threads for tasks and, honestly, does not do much. If you're concerned about the shuffle data blocks, you should consider using Spark's external shuffle service.
Consider reading the document Job Scheduling in the official documentation.
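Not from the answer itself, but a hedged sketch of the configuration it points at: with the external shuffle service (and dynamic allocation, which depends on it, plus the service being enabled on the YARN NodeManagers), replacement executors can be requested without losing shuffle output. The values and the app name below are placeholders.
# Sketch (PySpark 1.6-style SparkConf). The property names are standard Spark
# settings; the values and app name are placeholders, and the external shuffle
# service must also be running on the YARN NodeManagers.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")                        # YARN client mode, as in the question
        .setAppName("executor-recovery-demo")            # hypothetical app name
        .set("spark.shuffle.service.enabled", "true")    # keep shuffle files outside the executor
        .set("spark.dynamicAllocation.enabled", "true")  # let Spark request replacement executors
        .set("spark.yarn.max.executor.failures", "16"))  # tolerated executor failures before the app fails

sc = SparkContext(conf=conf)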

Could an HDFS read/write process be suspended/resumed?

I have one question regarding the HDFS read/write process:
Assume we have a client (for the sake of the example, let's say the client is a Hadoop map process) that requests to read a file from HDFS and/or to write a file to HDFS. Which process actually does the read/write from/to HDFS?
I know that there is a process for the NameNode and a process for each DataNode, and what their responsibilities to the system are in general, but I am confused about this scenario.
Is it the client's process itself, or is there another process in HDFS, created and dedicated to this specific client, that accesses and reads/writes from/to HDFS?
Finally, if the latter is true, can this process be suspended for a while?
I have done some research, and the most relevant solutions I found were Oozie and the JobControl class from the Hadoop API.
But because I am not sure about the above workflow, I am not sure which process I would be suspending and resuming with these tools.
Is it the client's process, or a process that runs in HDFS to serve the client's request?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, the questions above also explain DataNode failure scenarios:
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
A single DataNode failure triggers corrective actions by the framework.
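To make the first point concrete: the read/write happens in the client process itself, which asks the NameNode for metadata (block locations, leases) and then streams the bytes directly to/from the DataNodes. Here is a hedged sketch using the third-party `hdfs` (WebHDFS) Python package, with host, port, and paths as placeholders:
# Sketch using the third-party `hdfs` package (WebHDFS). The process running
# this script is the client doing the actual I/O; the NameNode only serves
# metadata and the DataNodes store the streamed blocks. Host/port/paths are
# placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:50070", user="hadoop")

# Write: the client streams the bytes; HDFS pipelines them across DataNodes.
with client.write("/tmp/example.txt", overwrite=True) as writer:
    writer.write(b"hello hdfs")

# Read: the client again pulls the blocks directly from the DataNodes.
with client.read("/tmp/example.txt") as reader:
    print(reader.read())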
Regarding your second query:
You have two types of schedulers:
FairScheduler
CapacityScheduler
Have a look at this article on suspend and resume:
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower priority than jobs running outside Hadoop YARN, like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, the already allocated and running task containers will continue to run until they complete or are actively preempted by other means. But no new containers will be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from their previous progress and have new task containers allocated to complete the rest of the job.
So, as far as I understand, a DataNode process receives the data from the client's process (which requests to store some data in HDFS) and stores it. This DataNode then forwards the exact same data to another DataNode (to achieve replication), and so on. When replication finishes, an acknowledgement goes back to the NameNode, which finally informs the client that its write request has completed.
Based on the above flow, it is impossible to suspend an HDFS write operation in order to serve a second client's write request (let's assume the second client has higher priority), because if we suspend the DataNode itself it will remain suspended for everyone who wants to write to it, and as a result that part of HDFS will remain blocked. Finally, if I suspend a job via the JobControl class, I actually suspend the client's process (if I manage to catch it before its request is done). Please correct me if I am wrong.

Cannot do simple task on ec2 spark cluster from local pyspark

I am trying to run pyspark from my Mac to do computation on an EC2 Spark cluster.
If I login to the cluster, it works as expected:
$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark
Then do a simple task
>>> data = sc.parallelize(range(1000), 10)
>>> data.count()
Works as expected:
14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000
But now, if I try the same thing from my local machine,
$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark
it can't seem to connect to the cluster
14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077...
14/06/26 09:45:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...
File "/Users/anthony1/git/incubator-spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.collect.
: org.apache.spark.SparkException: Job aborted: Spark cluster looks down
14/06/26 09:53:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I thought the problem was the EC2 security settings, but it does not help even after adding inbound rules to both the master and slave security groups to accept all ports.
Any help will be greatly appreciated!
Others are asking the same question on the mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-td4758.html#a8465
The spark-ec2 script configures the Spark cluster in EC2 as standalone, which means it cannot work with remote submits. I struggled with this same error you described for days before figuring out it's not supported. The error message is unfortunately misleading.
So you have to copy your code over and log into the master to execute your Spark task.
In my experience, "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" usually means you have accidentally set the cores too high, or set the executor memory too high, i.e. higher than what your nodes actually have.
Other, less likely causes could be that you got the URI wrong and you're not really connecting to the master. I also once saw this problem when the /run partition was 100% full.
Even less likely, your cluster may actually be down and you need to restart your Spark workers.
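Not part of the original answers, but a hedged sketch of the kind of resource capping suggested above, so the request fits what the workers advertise in the standalone master UI (the master URL and numbers are placeholders):
# Sketch: cap the application's resource request so it fits what the workers
# actually advertise in the standalone master UI. Master URL and values are
# placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://ec2-master-host:7077")  # placeholder master URL
        .setAppName("resource-cap-demo")            # hypothetical app name
        .set("spark.executor.memory", "2g")         # at or below the memory shown per worker
        .set("spark.executor.cores", "1")           # at or below the cores shown per worker
        .set("spark.cores.max", "4"))               # total cores this app may claim

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000), 10).count())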