We are able to run our Spark programs on EMR 5.9.0 without any issues, but we get the error below when running on EMR 5.13.0.
19/11/12 07:09:43 ERROR SparkContext: Error initializing SparkContext.
javax.xml.parsers.FactoryConfigurationError: Provider for class javax.xml.parsers.DocumentBuilderFactory cannot be created
I have added the dependency below in Maven, but I still get the same error. Can anyone please help me fix this?
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
</dependency>
Thanks
EMR 5.13.0 ships Spark 2.3.0, while EMR 5.9.0 ships Spark 2.2.0. Try upgrading the Spark version that your jar is built against.
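If you build with Maven, one way to confirm which Spark version your jar is actually compiled against is to inspect the dependency tree (a sketch, assuming the standard org.apache.spark artifacts):

mvn dependency:tree -Dincludes=org.apache.spark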
I was able to fix the issue after adding the following to my spark-submit command.
--jars xercesImpl-2.11.0.jar,xml-apis-1.4.01.jar
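For context, a sketch of how the flag fits into a full spark-submit invocation (the application class, jar names, and local paths are placeholders):

spark-submit \
  --class com.example.MyApp \
  --jars xercesImpl-2.11.0.jar,xml-apis-1.4.01.jar \
  my-spark-app.jar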
Thanks
I am facing some issues while trying to integrate a Hadoop 3.x version with a Flink cluster. My goal is to use HDFS as persistent storage and to store checkpoints there. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error I get when trying to submit a job is that HDFS is not supported as a file system.

In the standalone setup, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step, I applied the same solution on all the machines used in my cluster, and in standalone mode I managed to successfully submit my jobs on all of them without any issues. However, when I started modifying the configuration to set up my cluster (by specifying the IPs of my machines), the problem came up once again. What am I missing?
For Hadoop 2.x there are pre-bundled jar files on the official Flink download page that would have solved similar issues in the past, but that is not the case for Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
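In practice that usually means exporting the variable from the Hadoop installation on each node, for example (a sketch, assuming the hadoop launcher is on the PATH of every machine):

export HADOOP_CLASSPATH=$(hadoop classpath)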
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail I was missing was in how the environment variables were defined.
In my initial attempts, I was using the .bashrc script to permanently define my environment variables. This works for a standalone cluster, but not for a distributed cluster, because of the scope of that script. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile.
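A sketch of what the /etc/profile entries might look like (the Hadoop installation path is an assumption; adjust it to your layout):

export HADOOP_HOME=/opt/hadoop-3.3.1          # assumed install location
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)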
I also managed to find another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled jar files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the Maven repository page below, and after testing the existing jars I managed to find one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 jar worked (placed in $FLINK_HOME/lib). While it is not an "official" solution, it seems to solve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos
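For reference, a sketch of dropping the downloaded uber jar into Flink's lib directory (the jar file name follows the usual artifactId-version naming convention and is an assumption):

cp flink-shaded-hadoop-3-uber-3.1.1.7.2.8.0-224-9.0.jar $FLINK_HOME/lib/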
I started an EMR cluster in order to test out Sqoop, but it turns out it doesn't seem to be installed on the latest version of EMR (5.19.0), as I didn't find it in the directory /usr/lib/sqoop. I tried 5.18.0 as well, but it was missing there too.
According to the application versions page, Sqoop 1.4.7 should be installed on the cluster.
The EMR console gives me a list of four "installations". I chose the Core Hadoop package, which has Hive, Hue, etc. installed in /usr/lib. Am I missing something here? It's my first time using EMR or Sqoop.
I did not see the "Advanced Options" link at the top of the "Create Cluster" page, where you can select individual applications to install.
When creating an EMR cluster, use the Advanced Options link, which allows you to select Sqoop.
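Alternatively, a sketch of doing the same from the AWS CLI by listing Sqoop explicitly in --applications (the key pair name and instance settings are placeholders):

aws emr create-cluster \
  --release-label emr-5.19.0 \
  --applications Name=Hadoop Name=Hive Name=Hue Name=Sqoop \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --instance-type m4.large \
  --instance-count 3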
I am trying to spin up a cluster via Cloudera Director on AWS. Cloudera Manager installs fine; however, when tailing the installation logs I find this error.
[2016-06-06 17:16:24] ERROR [pipeline-thread-31] - c.c.l.p.DatabasePipelineRunner: Pipeline '4e04f8e6-5dfc-4603-b58b-9474e054bca6' failed
Any ideas?
Thanks in advance.
That error is a high-level error. Earlier in the log there should be an error indicating the root cause of the failure.
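One quick way to look for the underlying error is to search backwards from the pipeline-failure line with some surrounding context (the log path is an assumption based on a default Cloudera Director server install):

grep -B 40 "DatabasePipelineRunner: Pipeline" /var/log/cloudera-director-server/application.log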
I am trying to view the logs of my running application on Bluemix using the "cf logs my-cool-app" command (CF CLI version 6.11).
It fails with:
FAILED
Loggregator endpoint missing from config file
Has anyone seen this issue?
The problem appears to stem from the combination of the 6.11 CF CLI codebase and the version of Cloud Foundry that Bluemix is currently running. The good news is that an upcoming upgrade will alleviate the problem. We're investigating potential workarounds.
This is just an issue with the CF CLI version 6.11.
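A quick way to confirm which CLI build you are on (a trivial check, nothing Bluemix-specific assumed):

cf --version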
I was following this tutorial for installing Cascading on EMR:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html
But it failed because of the bootstrap action that installs the cascading-sdk. The corresponding logs are here: http://pastebin.com/jybHssTQ. As seen from the logs, it failed because apt-get was not found. Seriously?
I also checked the SDK installation script and found an option to disable installing screen with --no-screen. It still failed, with a different error: http://pastebin.com/T6CvA2H1
And now it is because of a permission denied error. What?
It's the official guide, but I can't seem to get it to run. Any ideas?
Rather than changing the script first, try a different EMR AMI version.
AMI versions up to 2.4.8 use Debian, where apt-get will work, but those run Hadoop 1.x. AMI versions 3.0.x run Hadoop 2.2 on Amazon Linux, which uses yum instead.
See below:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
Also, try adding the "--tmpdir" option to get around the "Permission denied" error.
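A sketch of how the bootstrap-action arguments might be passed when creating the cluster from the AWS CLI (the SDK script location and the temp directory are placeholders; use the script path given in the tutorial above):

aws emr create-cluster \
  --ami-version 3.0.4 \
  --instance-type m1.large \
  --instance-count 3 \
  --bootstrap-actions Path=s3://your-bucket/install-cascading-sdk.sh,Args=["--no-screen","--tmpdir","/home/hadoop/tmp"]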