Why would Oozie fail a job with error LimitExceededException when YARN reports that the Oozie launcher & MapReduce job have completed successfully?

There are a few similar questions on SO; however, nothing has worked for me, so I am posting this question.
I am using CDH 6.2.1.
I have a workflow with a map-reduce action. The map-reduce job creates a lot of counters (I think the job produces ~300 counters).
I have set the mapreduce.job.counters.max property to 8192 in the CDH YARN configuration.
I have also set the property in:
YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml
YARN Service MapReduce Advanced Configuration Snippet (Safety Valve)
MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml
If I run the map-reduce job as a stand-alone YARN job (using the yarn jar command on the command line), the job completes successfully.
When I run the job as part of the workflow:
On the YARN "All Applications" page I see that the Oozie launcher job completes successfully.
On the YARN "All Applications" page I see that the map/reduce job completes successfully.
However, Oozie fails the job, reporting: LimitExceededException: Too many counters: 121 max=120
The configuration for the MapReduce job & Oozie launcher, as reported by YARN, has the setting:
<property>
<name>mapreduce.job.counters.max</name>
<value>8192</value>
<final>true</final>
<source>yarn-site.xml</source>
</property>
The Oozie web interface (System-Info/OS-Env) reports the following HADOOP_CONF_DIR: /var/run/cloudera-scm-agent/process/459-oozie-OOZIE_SERVER/yarn-conf/
In that folder I can see that mapred-site.xml also has:
<!--'mapreduce.job.counters.max', originally set to '8192' (final), is overridden below by a safety valve-->
<property>
<name>mapreduce.job.counters.max</name>
<value>8192</value>
<final>true</final>
</property>
However, I cannot find that property in yarn-site.xml.
I am not sure what else I can do at this point...

This is an Oozie issue which has since been resolved. However, the fix is not available in the current version of Cloudera.
I am posting this here, in case anyone else has the same issue.

Related

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through SSH), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them the master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being created on the executors with a Python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the Spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}
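For anyone reproducing this: EMR notebooks typically run on a sparkmagic/Livy kernel, so a JSON block like the one above is normally supplied in a %%configure -f cell before any Spark code runs (that is an assumption on my part; the cell magic isn't shown above). Once the session has restarted with the new configuration, a quick sanity check that the executors picked up a working interpreter is a trivial job such as this sketch, using the sc variable the notebook already provides:
import sys

# Run a tiny two-partition job and report which Python binary each task used.
# After the fix this should show Python 3.7 interpreters (possibly inside the
# per-executor virtualenv created from /usr/bin/python3.7).
print(sc.parallelize(range(2), 2).map(lambda _: sys.executable).collect())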

Dataflow process hanging

I am running a batch job on Dataflow, querying from BigQuery. When I use the DirectRunner, everything works, and the results are written to a new BigQuery table. Things seem to break when I change to DataflowRunner.
The logs show that 30 worker instances are spun up successfully. The graph diagram in the web UI shows the job has started. The first 3 steps show "Running", the rest show "not started". None of the steps show any records transformed (i.e. output collections all show '-'). The logs show many messages that look like this, which may be the issue:
skipping: failed to "StartContainer" for "python" with CrashLoopBackOff: "Back-off 10s restarting failed container=python pod=......
I took a step back and just ran the minimal wordcount example, and that completed successfully. So all the necessary APIs seem to be enabled for Dataflow runner. I'm just trying to get a sense of what is causing my Dataflow job to hang.
I am executing the job like this:
python2.7 script.py --runner DataflowRunner --project projectname --requirements_file requirements.txt --staging_location gs://my-store/staging --temp_location gs://my-store/temp
I'm not sure if the error pasted above was caused by the same problem my solution fixed, but fixing dependency problems (which were not showing up as errors in the log at all!) did solve the hanging Dataflow processes.
So if you have a hanging process, make sure your workers have all their necessary dependencies. You can provide them through the --requirements_file argument, or through a custom setup.py script.
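As a rough sketch (the package name and dependency list below are made up for illustration), a minimal setup.py for bundling worker dependencies might look like the following; it would then be passed to the pipeline with --setup_file ./setup.py instead of (or in addition to) --requirements_file:
# setup.py -- illustrative only; replace the name and install_requires
# with whatever your DoFns actually import on the workers.
import setuptools

setuptools.setup(
    name="my-dataflow-job",
    version="0.0.1",
    install_requires=[
        "requests",  # example third-party dependency
    ],
    packages=setuptools.find_packages(),
)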
Thanks to the help I received in this post, the pipeline appears to be operating, albeit VERY SLOWLY.

Apache Beam Deploying on DataFlow

Hi, I have created an Apache Beam pipeline, tested it, and run it from inside Eclipse, both locally and using the Dataflow runner. I can see in the Eclipse console that the pipeline is running; I also see the details, i.e. the logs, on the console.
Now, how do I deploy this pipeline to GCP so that it keeps working irrespective of the state of my machine? For example, if I run it using mvn compile exec:java, the console shows it is running, but I cannot find the job in the Dataflow UI.
Also, what will happen if I kill the process locally? Will the job on the GCP infrastructure also be stopped? How do I know a job has been triggered on the GCP infrastructure, independent of my machine's state?
The output of mvn compile exec:java with arguments is as follows:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/ThakurG/.m2/repository/org/slf4j/slf4j-jdk14/1.7.14/slf4j-jdk14-1.7.14.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/ThakurG/.m2/repository/org/slf4j/slf4j-nop/1.7.25/slf4j-nop-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
Jan 08, 2018 5:33:22 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ main
INFO: starting the process...
Jan 08, 2018 5:33:25 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ createStream
INFO: pipeline created::Pipeline#73387971
Jan 08, 2018 5:33:27 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ main
INFO: pie crated::Pipeline#73387971
Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
INFO: Message received::1884408,16/09/2017,A,2007156,CLARK RUBBER FRANCHISING PTY LTD,A ,5075,6,Y,296,40467910,-34.868095,138.683535,66 SILKES RD,,,PARADISE,5075,0,7.4,5.6,18/09/2017 2:09,0.22
Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
INFO: Payload from msg::1884408,16/09/2017,A,2007156,CLARK RUBBER FRANCHISING PTY LTD,A ,5075,6,Y,296,40467910,-34.868095,138.683535,66 SILKES RD,,,PARADISE,5075,0,7.4,5.6,18/09/2017 2:09,0.22
Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
This is the Maven command I'm using from the cmd prompt:
`mvn compile exec:java -Dexec.mainClass=com.trial.apps.gcp.df.ReceiveAndPersistToBQ -Dexec.args="--project=analyticspoc-XXX --stagingLocation=gs://analytics_poc_staging --runner=DataflowRunner --streaming=true"`
This is the piece of code I'm using to create the pipeline and set the options on it:
PipelineOptions options = PipelineOptionsFactory.create();
DataflowPipelineOptions dfOptions = options.as(DataflowPipelineOptions.class);
dfOptions.setRunner(DataflowRunner.class);
dfOptions.setJobName("gcpgteclipse");
dfOptions.setStreaming(true);
// Then create the pipeline.
Pipeline pipeL = Pipeline.create(dfOptions);
Can you clarify what exactly you mean by "console shows it is running" and by "can not find the job using Dataflow UI"?
If your program's output prints the message:
To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/.../dataflow/job/....
Then your job is running on the Dataflow service. Once it's running, killing the main program will not stop the job - all the main program does is periodically poll the Dataflow service for the status of the job and new log messages. Following the printed link should take you to the Dataflow UI.
If this message is not printed, then perhaps your program is getting stuck somewhere before actually starting the Dataflow job. If you include your program's output, that will help debugging.
To deploy a pipeline to be executed by Dataflow, you specify the runner and project execution parameters through the command line or via the DataflowPipelineOptions class. runner must be set to DataflowRunner (Apache Beam 2.x.x) and project is set to your GCP project ID. See Specifying Execution Parameters. If you do not see the job in the Dataflow Jobs UI list, then it is definitely not running in Dataflow.
If you kill the process that deploys a job to Dataflow, then the job will continue to run in Dataflow. It will not be stopped.
This is trivial, but to be absolutely clear, you must call run() on the Pipeline object in order for it to be executed (and therefore deployed to Dataflow). The return value of run() is a PipelineResult object which contains various methods for determining the status of a job. For example, you can call pipeline.run().waitUntilFinish(); to force your program to block execution until the job is complete. If your program is blocked, then you know the job was triggered. See the PipelineResult section of the Apache Beam Java SDK docs for all of the available methods.
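To make that last point concrete, here is a minimal sketch in Java of the blocking pattern described above (the project ID and bucket are placeholders, not values from the question, and the transform steps are omitted):
// Build the options, run the pipeline, and block until the Dataflow job finishes.
DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-gcp-project");                   // placeholder project ID
options.setStagingLocation("gs://my-bucket/staging");   // placeholder bucket

Pipeline pipeline = Pipeline.create(options);
// ... apply transforms here ...

PipelineResult result = pipeline.run();                 // this is what actually submits the job
PipelineResult.State state = result.waitUntilFinish();  // blocks until the job completes
System.out.println("Final job state: " + state);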

Structured streaming kafka driver relaunch fails with HDFS file rename errors since new name file already exists

We are testing restarts and failover with structured streaming in Spark 2.1.
We have a stripped-down Kafka structured streaming driver that only performs an event count. When we relaunch the driver a second time gracefully (i.e. kill the driver with yarn application -kill and resubmit with the same checkpoint dir), the driver fails due to aborted jobs that cannot commit the state in HDFS, with errors like:
"Failed to rename /user/spark/checkpoints/StructuredStreamingSignalCount/ss_signal_count/state/0/11/temp-1769618528278028159 to /user/spark/checkpoints/StructuredStreamingSignalCount/ss_signal_count/state/0/11/128.delta"
When I look in HDFS, 128.delta already existed before the error. HDFS fundamentally does not allow a rename when the target file name already exists. Any insight greatly appreciated!
We are using:
spark 2.1.0
HDFS/YARN 2.7.3
Kafka 0.10.1
Heji
This is caused by a bug in Spark: the state file is not deleted before renaming:
https://issues.apache.org/jira/browse/SPARK-19677

Error starting Spark in EMR 4.0

I created an EMR 4.0 instance in AWS with all available applications, including Spark. I did it manually, through the AWS Console. I started the cluster and SSHed to the master node when it was up. There I ran pyspark. I am getting the following error when pyspark tries to create the SparkContext:
2015-09-03 19:36:04,195 ERROR Thread-3 spark.SparkContext
(Logging.scala:logError(96)) - -ec2-user, access=WRITE,
inode="/user":hdfs:hadoop:drwxr-xr-x at
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
I haven't added any custom applications, nor bootstrapping and expected everything to work without errors. Not sure what's going on. Any suggestions will be greatly appreciated.
Log in as the user "hadoop" (http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-connect-master-node-ssh.html). It has all the proper environment and related settings for working as expected. The error you are receiving is due to logging in as "ec2-user".
I've been working with Spark on EMR this week, and found a few weird things relating to user permissions and relative paths.
It seems that running Spark from a directory which you don't 'own', as a user, is problematic. In some situations Spark (or some of the underlying Java pieces) wants to create files or folders, and it assumes that pwd - the current directory - is the best place to do that.
Try going to the home directory
cd ~
then running pyspark.