Flume getting error while updating data in HDFS

I need to run Flume on a separate machine that is neither an HDFS data node nor a name node. It has to read data from Kafka and store it in HDFS running on a separate cluster. Can this be done? I am getting errors related to Hadoop jar files.

The HDFS sink in Apache Flume requires the Hadoop jars, since you are reading data from Kafka and writing it to HDFS.
Add all the Hadoop-related jars to Flume's classpath and then rerun the agent.
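To make "add the jars to the classpath" concrete, here is a minimal sketch that collects the jars into one classpath string. It assumes a Hadoop 2.x tarball unpacked on the Flume machine (the /opt/hadoop location and the helper name build_flume_classpath are hypothetical) and that you export the result as FLUME_CLASSPATH in conf/flume-env.sh. Your distribution may lay out its jars differently, so adjust the subdirectories as needed.

```python
import glob
import os


def build_flume_classpath(hadoop_home):
    """Join the Hadoop common and HDFS jars under hadoop_home into a classpath string."""
    # Jar directories of a typical Apache Hadoop 2.x tarball (assumption; vendor
    # distributions such as CDH may place the jars elsewhere).
    subdirs = [
        "share/hadoop/common",
        "share/hadoop/common/lib",
        "share/hadoop/hdfs",
        "share/hadoop/hdfs/lib",
    ]
    jars = []
    for sub in subdirs:
        # Missing directories simply contribute no jars.
        jars.extend(sorted(glob.glob(os.path.join(hadoop_home, sub, "*.jar"))))
    return os.pathsep.join(jars)


if __name__ == "__main__":
    # /opt/hadoop is a hypothetical install location.
    classpath = build_flume_classpath("/opt/hadoop")
    print('export FLUME_CLASSPATH="%s"' % classpath)
```

The printed line can be pasted into conf/flume-env.sh. The agent will most likely also need the remote cluster's core-site.xml and hdfs-site.xml, or a full hdfs://namenode:port/... URI in the sink path, to find the name node.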

Related

Add pyspark script as AWS step

I have a PySpark script that reads an XML file (stored in S3). I need to add it as a step in AWS EMR. I used the following command:
aws emr add-steps --cluster-id <cluster_id> --steps Type=spark,Name=POI,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,<s3 location of pyspark script>],ActionOnFailure=CONTINUE
I downloaded the spark-xml jar to the master node during bootstrap, and it is located under /home/hadoop. In the Python script I have also included:
conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
But it still shows:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
You have set master to yarn and deploy-mode to cluster. That means your Spark driver will run on one of the CORE nodes.
By default, EMR is configured to create the application master on one of the CORE nodes, and in cluster mode the application master hosts the driver.
Please refer to this article for more info.
So you have to put your jar on all CORE nodes (not on the MASTER) and refer to it as file:///home/hadoop/spark-xml_2.11-0.4.1.jar.
A better way is to put it in HDFS (let's say under hdfs:///user/hadoop) and refer to it as hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar.
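One way to "refer" to the HDFS copy is to pass it through spark-submit's --jars option in the step arguments. The answer does not say which mechanism to use, so treat this as a sketch under that assumption. The helper name build_add_steps_command, the cluster id and the S3 script location below are hypothetical placeholders.

```python
import shlex


def build_add_steps_command(cluster_id, script_uri, jar_uri):
    """Build the `aws emr add-steps` argument list with the jar passed via --jars."""
    spark_args = [
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--conf", "spark.yarn.submit.waitAppCompletion=true",
        # Assumption: ship the spark-xml jar from HDFS instead of a node-local path.
        "--jars", jar_uri,
        script_uri,
    ]
    step = "Type=spark,Name=POI,Args=[{}],ActionOnFailure=CONTINUE".format(
        ",".join(spark_args)
    )
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


if __name__ == "__main__":
    cmd = build_add_steps_command(
        "j-XXXXXXXXXXXXX",               # hypothetical cluster id
        "s3://example-bucket/poi.py",    # hypothetical script location
        "hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar",
    )
    # Quote each part so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The jar has to be copied into HDFS first (for example with hdfs dfs -put in a bootstrap action or an earlier step) before this step runs.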

Pyspark error reading file. Flume HDFS sink imports file with user=flume and permissions 644

I'm using Cloudera Quickstart VM 5.12
I have a Flume agent moving CSV files from a spooldir source into an HDFS sink. The operation works, but the imported files have:
User=flume
Group=cloudera
Permissions=-rw-r--r--
The problem starts when I use Pyspark and get:
PriviledgedActionException as:cloudera (auth:SIMPLE)
cause:org.apache.hadoop.security.AccessControlException: Permission denied:
user=cloudera, access=EXECUTE,
inode=/user/cloudera/flume/events/small.csv:cloudera:cloudera:-rw-r--r--
(Ancestor /user/cloudera/flume/events/small.csv is not a directory).
If I use "hdfs dfs -put ..." instead of Flume, the user and group are "cloudera", the permissions are 777, and there is no Spark error.
What is the solution? I cannot find a way to change the file's permissions from Flume. Maybe my approach is fundamentally wrong.
Any ideas?
Thank you

Structured streaming kafka driver relaunch fails with HDFS file rename errors since new name file already exists

We are testing restarts and failover with structured streaming in Spark 2.1.
We have a stripped-down Kafka structured streaming driver that only performs an event count. When we gracefully relaunch the driver a second time (i.e. kill the driver with yarn application -kill and resubmit with the same checkpoint dir), the driver fails due to aborted jobs that cannot commit the state in HDFS, with errors like:
"Failed to rename /user/spark/checkpoints/StructuredStreamingSignalCount/ss_signal_count/state/0/11/temp-1769618528278028159 to /user/spark/checkpoints/StructuredStreamingSignalCount/ss_signal_count/state/0/11/128.delta"
When I look in HDFS, 128.delta already existed before the error. The HDFS rename command fundamentally does not allow a rename when the target file name already exists. Any insight is greatly appreciated!
We are using:
spark 2.1.0
HDFS/YARN 2.7.3
Kafka 0.10.1
Heji
This is a bug in Spark, which does not delete the existing state file before renaming:
https://issues.apache.org/jira/browse/SPARK-19677

HDFS directory on a MAPR cluster

I need to save my Spark Streaming checkpoint files in an HDFS directory. I can access a remote cluster which has MapR installed on it.
But I am not sure which path on MapR corresponds to an HDFS directory.
Is it opt/mapr/..?
When you are connected to your MapR cluster you can run the following command:
hadoop fs -ls /
This lists the files and folders just as on any HDFS cluster, so you will see nothing special here.
So if your Spark job is running on the MapR cluster, you just have to point to the folder you want, for example:
yourRdd.saveAsTextFile("/apps/output");
You can do exactly the same from your development environment, but you have to install and configure the MapR Client.
Note that you can also access the MapR File System (MapR-FS) over NFS, which should be running on your cluster. By default the mount point is /mapr.
So you can see the content of your FS using:
cd /mapr/your-cluster-name/apps/output
/opt/mapr is the folder that contains the installed MapR product.
So from a pure Spark point of view nothing changes: you just save/read data from a folder, and if you are running on MapR this is done in MapR-FS.
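To illustrate how the two views of the same folder relate, here is a small sketch that maps a MapR-FS path (as seen by hadoop fs -ls or Spark) to its location under the NFS mount. The helper name mapr_nfs_path and the cluster name my.cluster.com are hypothetical, and /mapr is just the default mount point mentioned above.

```python
import posixpath


def mapr_nfs_path(cluster_name, fs_path, mount_point="/mapr"):
    """Return the NFS-side path for a MapR-FS path on the given cluster."""
    # Strip the leading slash so join() does not discard the mount point.
    return posixpath.join(mount_point, cluster_name, fs_path.lstrip("/"))


if __name__ == "__main__":
    # The folder Spark wrote to with saveAsTextFile("/apps/output"),
    # as seen from a machine that has the cluster mounted over NFS.
    print(mapr_nfs_path("my.cluster.com", "/apps/output"))
```

For checkpointing you would still give Spark the plain MapR-FS path (for example /apps/checkpoints, a hypothetical folder). The NFS path is only useful for browsing the files with ordinary shell tools.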

Spark - Writing into HDFS does not complete successfully

My question is similar to "Spark writing to hdfs not working with the saveAsNewAPIHadoopFile method". I am using Spark 1.1.0 on CDH 5.2.1.
I am trying to save a file to HDFS through Spark's saveAsTextFile method. The job completes successfully, but when I look in the output path, I see a _temporary folder with data files inside it, spread across various task and attempt folders. This tells me Spark is marking the job as succeeded before the files are completely moved into the right output folder in HDFS. The same issue occurs with the saveAsParquetFile method. Please let me know if you have any idea about this.
Thanks