HDFS directory on a MAPR cluster - hdfs

I need to save my Spark Streaming checkpoint files to an HDFS directory. I can access a remote cluster which has MapR installed on it.
But I am not sure which path MapR uses for an HDFS directory.
Is it /opt/mapr/..?

When you are connected to your MapR cluster you can run the following command:
hadoop fs -ls /
This will list the files/folders just as on any HDFS cluster, so you will not see anything special here.
So if your Spark job is running on the MapR cluster, you just have to point to the folder you want, for example:
yourRdd.saveAsTextFile("/apps/output");
You can do exactly the same from your development environment, but you have to install and configure the MapR Client.
Note that you can also access the MapR File System (MapR-FS) using NFS, which should be running on your cluster; by default the mount point is /mapr.
So you can see the content of your FS using:
cd /mapr/your-cluster-name/apps/output
/opt/mapr is the folder that contains the installed MapR product.
So from a pure Spark point of view nothing changes: you just save/read data from a folder; if you are running on MapR, this will be done in MapR-FS.
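The NFS mapping above is mechanical: the MapR-FS path is simply appended to /mapr/<cluster-name>. A minimal sketch (the cluster name and paths here are placeholders, not values from your cluster):

```python
# Sketch: map an HDFS-style MapR-FS path to its NFS view under /mapr.
# "my.cluster.com" is a placeholder -- use the cluster name from
# /opt/mapr/conf/mapr-clusters.conf on your own cluster.
def mapr_nfs_path(cluster_name: str, fs_path: str) -> str:
    """Return the NFS mount-point view of a MapR-FS path."""
    return f"/mapr/{cluster_name}{fs_path}"

print(mapr_nfs_path("my.cluster.com", "/apps/output"))
```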

Related

Add pyspark script as AWS step

I have a pyspark script to read an XML file (present in S3). I need to add this as a step in AWS EMR. I have used the following command:
aws emr add-steps --cluster-id <cluster_id> --steps Type=spark,Name=POI,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,<s3 location of pyspark script>],ActionOnFailure=CONTINUE
I have downloaded the spark-xml jar to the master node during bootstrap and it is present under the
/home/hadoop
location. Also, in the python script I have included:
conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
But it still shows:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
You have set master as yarn and deploy-mode as cluster. That means your Spark driver will run on one of the CORE nodes.
By default, EMR is configured to create the application master on one of the CORE nodes, and in cluster mode the application master contains the driver.
Please refer to this article for more info.
So you have to put your jar on all CORE nodes (not on the MASTER node) and refer to it as file:///home/hadoop/spark-xml_2.11-0.4.1.jar.
Or, better, put it in HDFS (let's say under hdfs:///user/hadoop) and refer to it as hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar.
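Putting that together, one possible shape (a sketch only; the jar name and paths are taken from the question, and a --jars argument is an assumption about how you want to wire it into the step) is to stage the jar in HDFS once and pass it via --jars in the step arguments, so YARN distributes it to whichever node runs the driver and executors:

```shell
# Stage the jar in HDFS once (paths taken from the question):
hadoop fs -mkdir -p /user/hadoop
hadoop fs -put /home/hadoop/spark-xml_2.11-0.4.1.jar /user/hadoop/

# Reference it in the EMR step via --jars so every node can fetch it:
aws emr add-steps --cluster-id <cluster_id> --steps \
  Type=spark,Name=POI,ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,--master,yarn,--jars,hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar,<s3 location of pyspark script>]
```

With --jars handled by spark-submit, the spark.jars / spark.driver.extraClassPath settings in the script itself are no longer needed.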

while running hadoop fs -ls getting error no cldb entries

While running hadoop fs -ls I am getting an error; please see the screenshot below.
When running "hadoop fs -ls", the list of CLDB servers will be read from the /opt/mapr/conf/mapr-clusters.conf file. That file should be populated based on the CLDB addresses provided when running the /opt/mapr/server/configure.sh command. Either, the contents of mapr-clusters.conf do not specify the correct CLDB hosts (or any servers), or the CLDB service is not running on the specified hosts. I'd suggest ensuring the MapR services are running on the nodes where they've been installed and reviewing the installation/configuration documentation.
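For reference, each line of mapr-clusters.conf names one cluster followed by its CLDB host:port entries (7222 is the default CLDB port). A hypothetical example of what a populated file might look like (cluster and host names are placeholders):

```
my.cluster.com secure=false cldb1.example.com:7222 cldb2.example.com:7222
```

An empty or stale file here is exactly what produces the "no cldb entries" error, which is why re-running configure.sh with the correct -C host list is the usual fix.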

disk full issue in Hadoop yarn

I am using EMR, running a Spark Streaming job with YARN as the resource manager on Hadoop 2.7.3-amzn-0, and I am facing a disk full issue. I set two properties in yarn-site.xml: yarn.nodemanager.localizer.cache.cleanup.interval-ms to 600000 and yarn.nodemanager.localizer.cache.target-size-mb to 1024. Even so, my disk becomes full after some time and crosses the limit I configured. It seems my configured properties are not working, and I have to clean the disk manually using this command: rm -rf filecache/ usercache/
Every Spark job creates a directory under filecache, like /mnt/yarn/usercache/hadoop/filecache/5631/__spark_libs__8149715201399895593.zip, containing all the .jar files.
What do I have to do to clean the filecache and usercache automatically? In which file and location do I have to make the change?
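For reference, these are the settings as they would appear in yarn-site.xml on each node manager (values copied from the question). One caveat worth knowing: the localizer cache cleanup only reclaims resources of applications that have finished, so files localized for a long-running streaming job are held until that job ends, regardless of these limits.

```xml
<!-- yarn-site.xml (values from the question) -->
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>1024</value>
</property>
```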

How to run pyspark on EC2 with IPython starting from the spark-ec2 launch process?

Three steps and I have a spark context in my IPython notebook:
1.) Launch Spark on EC2 using these instructions.
2.) Install anaconda and py4j on every node (set PATH accordingly).
3.) Login to master, cd to the spark folder, then run:
MASTER=spark://<public DNS>:7077 PYSPARK_PYTHON=~/anaconda2/bin/python PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="*"' ./bin/pyspark
This process makes the IPython notebook available on <master public DNS>:8888, which is great, but ... I am currently using a csshX-style solution to accomplish step 2.
Question:
How can I set install requirements (on AWS or elsewhere) so that the spark-ec2 script spins up machines with the desired setup?
If that's not possible or simply clunky, what would you suggest? (command line only solutions are preferred)
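Step 3 above can be sketched programmatically. The helper below (an illustration only; the DNS name and anaconda path are placeholders, and pyspark_notebook_env is a hypothetical name) just assembles the environment variables from that one-liner, which makes it easy to script the launch over SSH once step 2 is automated:

```python
import os

def pyspark_notebook_env(master_public_dns: str,
                         python_path: str = "~/anaconda2/bin/python") -> dict:
    """Build the environment for launching ./bin/pyspark with an
    IPython-notebook driver, mirroring the one-liner in step 3."""
    env = dict(os.environ)
    env.update({
        "MASTER": f"spark://{master_public_dns}:7077",
        "PYSPARK_PYTHON": python_path,
        "PYSPARK_DRIVER_PYTHON": "ipython",
        "PYSPARK_DRIVER_PYTHON_OPTS": 'notebook --ip="*"',
    })
    return env

# Usage (on the master, from the spark folder):
# subprocess.run(["./bin/pyspark"], env=pyspark_notebook_env("<public DNS>"))
```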

apache spark saveAsObjectFile writes to hdfs by default

When I run Spark locally (non-HDFS), RDD saveAsObjectFile writes the file to the local file system (e.g. path /data/temp.txt).
When I run Spark on a YARN cluster, RDD saveAsObjectFile writes the file to HDFS (e.g. path /data/temp.txt).
Is there a way to explicitly specify the local file system instead of HDFS when running Spark on a YARN cluster?
You can explicitly specify the "file:///" prefix in the path argument:
yourRDD.saveAsObjectFile("file:///path/to/local/filesystem")
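The prefix works because Hadoop treats the path as a URI and selects the filesystem from its scheme; with no scheme it falls back to the configured default filesystem (HDFS on a YARN cluster, the local FS in local mode). A quick illustration of the URI structure using Python's urllib (just to show the parsing, not Spark itself):

```python
from urllib.parse import urlparse

# The scheme component is what selects the filesystem implementation.
local = urlparse("file:///data/temp.txt")
hdfs = urlparse("hdfs:///data/temp.txt")

print(local.scheme, local.path)  # file /data/temp.txt
print(hdfs.scheme, hdfs.path)    # hdfs /data/temp.txt
```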