apache spark saveAsObjectFile writes to hdfs by default - hdfs

When I am running spark locally(non hdfs), RDD saveAsObjectFile writes the file to local file system (ex : path /data/temp.txt)
when I am running spark on YARN cluster, RDD saveAsObjectFile writes the file to hdfs. (ex : path /data/temp.txt )
Is there a way to explictly mention local file system instead of hdfs when running spark on YARN cluster.

You can explicitly specify "file:///" prefix in the argument.
yourRDD. saveAsObjectFile("file:///path/to/local/filesystem")

Related

Add pyspark script as AWS step

I have a pyspark script to read an xml file(present in S3). I need to add this as a step in aws. I have used the following command
aws emr add-steps — cluster-id <cluster_id> — steps Type=spark,Name=POI,Args=[ — deploy-mode,cluster, — master,yarn, — conf,spark.yarn.submit.waitAppCompletion=true,<s3 location of pyspark script>],ActionOnFailure=CONTINUE
I have downloaded the spark-xml jar to the master node during bootstrap and its present under
/home/hadoop
location. Also in the python script I have included
conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
But still its showing
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
You have set master as yarn and deploy-mode as cluster. That means your spark driver will be in one of CORE nodes.
Anyway, EMR by default is configured to create Application master on one of the CORE node and application master will have the driver in it.
Please refer this article for more info.
So you have to put your jar in all CORE nodes (Not in MASTER) and refer the file file:///home/hadoop/spark-xml_2.11-0.4.1.jar in this manner.
Or there is better way to put it in HDFS (Lets say under hdfs:///user/hadoop) and refer that hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar

Unable to copy from local file system to hdfs

I am trying to copy a text file on my Mac Desktop to hdfs, for that purpose I am using this code
hadoop fs -copyFromLocal Users/Vishnu/Desktop/deckofcards.txt /user/gsaikiran/cards1
But it is throwing an Error
copyFromLocal: `deckofcards.txt': No such file or directory
It sure exists on the desktop
Your command is missing a slash / at the source file path. It should be:
hadoop fs -copyFromLocal /Users/Vishnu/Desktop/deckofcards.txt /user/gsaikiran/cards1
more correctly/efficiently,
hdfs dfs -put /Users/Vishnu/Desktop/deckofcards.txt /user/gsaikiran/cards1
Also, if you are dealing with HDFS specifically, better to use hdfs dfs syntax instead of hadoop fs [1]. (It doesn't change the output in your case, but hdfs dfs command is designed for interacting with HDFS whereas hadoop fs is the deprecated one)

import mysql data into hdfs contents file not read CCA

I want to see contents of the hdfs file which I have import mysql data using sqoop.
I ran the command hadoop dfs -cat /user/cloudera/products/part-m-00000.
I am getting error:
cat: Zero blocklocations for /user/cloudera/products/part-m-00000. Name node is in safe mode.
It s not possible to read hdfs data when namenode is in safe mode. Leave from safe mode, run the below command,
hadoop dfsadmin -safemode leave
Then cat the file in hdfs.

HDFS directory on a MAPR cluster

I need to save my Spark Streaming checkpoint files on a HDFS directory. I can access a remote cluster which has MAPR installed on it.
But, I am not sure on which path MAPR denoting to a HDFS directory
is it opt/mapr/..?
When you are connected to your MapR cluster you can run the following command:
hadoop fs -ls /
This will list, like inside any HDFS cluster the list of files/folders, so you see here anything special.
So if your Spark job is running on MapR cluster you just have to point to the folder your want, for example:
yourRdd.saveAsTextFile("/apps/output");
You can do exactly the same from your development environment, but you have to install and configure the MapR-Client
Note that you can also access MapR File System (FS) using NFS, that should run on your cluster, by default the mount point is /mapr
So you can see the content of your FS using:
cd /mapr/you-cluster-name/apps/output
/mapr/opt is the folder that contains the MapR installed product.
So if you look at it from a pure Spark point of view: nothing change just save/read data from a folder, if you are running in MapR this will be done in MapR-FS.

flume getting error while updating data in HDFS

I need to run flume in separate machine which is not part of HDFS data node or name node and it has to read data from Kafka and store it in HDFS running in a separate cluster. Can it be done? I am getting errors related to hadoop jar files.
Apache Flume requires Hadoop jars for HDFS Sink since you are reading data from kafka and storing back in HDFS.
Please add all hadoop related jars in the classpath and then rerun it.