zookeeper jar not found in HBase MR job - mapreduce

I have a web UI that tries to spawn an MR job on an HBase table. I keep getting this error, though:
java.io.FileNotFoundException: File /opt/hadoop/mapreduce/system/job_201205251929_0007/libjars/zookeeper-3.3.2.jar does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
I am running HBase 0.90.4, and HBase manages its own ZooKeeper. I confirmed that /opt/hadoop/mapreduce/system/job_201205251929_0007/libjars/zookeeper-3.3.2.jar exists in my HDFS. Is it looking in the local FS?

I found that core-site.xml was not on my classpath, so fs.default.name resolved to the local FS instead of HDFS. The jar existed in HDFS, but the job client was looking for it on the local FS.
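A quick way to confirm this is to load the cluster configuration files into the job's Configuration explicitly and check what fs.default.name resolves to. A minimal Scala sketch, assuming the Hadoop config lives under /opt/hadoop/conf (adjust the paths to your installation):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Load the cluster config explicitly so the job does not silently fall back
// to the local filesystem. The config file locations below are assumptions.
val conf = new Configuration()
conf.addResource(new Path("/opt/hadoop/conf/core-site.xml"))
conf.addResource(new Path("/opt/hadoop/conf/hdfs-site.xml"))

// Should print an hdfs:// URI rather than file:/// if the config was picked up.
println(conf.get("fs.default.name"))
println(FileSystem.get(conf).getUri)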

Any jar files accessed in the mapper or reducer need to be in the local filesystem on all the nodes in the cluster. Check your local FS.
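Alternatively, instead of copying the jars onto every node by hand, HBase's TableMapReduceUtil.addDependencyJars can ship the HBase and ZooKeeper jars with the job through the distributed cache (which is where the libjars path in the stack trace comes from). A rough Scala sketch; the job name is a placeholder:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
import org.apache.hadoop.mapreduce.Job

object DependencyJarsExample {
  def main(args: Array[String]): Unit = {
    // Ship hbase, zookeeper, guava, etc. with the job via the distributed
    // cache instead of installing them on every task node.
    val conf = HBaseConfiguration.create()
    val job = new Job(conf, "hbase-mr-example")
    job.setJarByClass(getClass)
    TableMapReduceUtil.addDependencyJars(job)
  }
}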

Related

Apache Spark - Write Parquet Files to S3 with both Dynamic Partition Overwrite and S3 Committer

I'm currently building an application with Apache Spark (pyspark), and I have the following use case:
Run pyspark with local mode (using spark-submit local[*]).
Write the results of my spark job to S3 in the form of partitioned Parquet files.
Ensure that each job overwrites the particular partition it is writing to, in order to keep jobs idempotent.
Ensure that spark-staging files are written to local disk before being committed to S3, as staging in S3 and then committing via a rename operation is very expensive.
For various internal reasons, all four of the above bullet points are non-negotiable.
I have everything but the last bullet point working. I'm running a pyspark application, and writing to S3 (actually an on-prem Ceph instance), ensuring that spark.sql.sources.partitionOverwriteMode is set to dynamic.
However, this means that my spark-staging files are being staged in S3, and then committed by using a delete-and-rename operation, which is very expensive.
I've tried using the Spark Directory Committer in order to stage files on my local disk. This works great unless spark.sql.sources.partitionOverwriteMode is set to dynamic.
After digging through the source code, it looks like the PathOutputCommitter does not support Dynamic Partition Overwriting.
At this point, I'm stuck. I want to be able to write my staging files to local disk, and then commit the results to S3. However, I also need to be able to dynamically overwrite a single partition without overwriting the entire Parquet table.
For reference, I'm running pyspark=3.1.2, and using the following spark-submit command:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
I get the following error when spark.sql.sources.partitionOverwriteMode is set to dynamic:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
My spark config is as follows:
self.spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "magic")
self.spark.conf.set("spark.sql.sources.commitProtocolClass",
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class",
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
self.spark.conf.set(
"spark.sql.sources.partitionOverwriteMode", "dynamic"
)
Afraid the S3A committers don't support the dynamic partition overwrite feature. That feature actually works by doing lots of renaming, so it misses the entire point of zero-rename committers.
The "partitioned" committer was written by Netflix for their use case of updating/overwriting single partitions in an active table. It should work for you, as it is the same use case.
Consult the documentation.
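For illustration, here is a rough Scala sketch of switching to the partitioned staging committer. The fs.s3a.committer.name and conflict-mode settings come from the S3A committer documentation; the conflict-mode value and the commit-protocol classes (reused from the question) should be verified against your Hadoop/Spark versions:

import org.apache.spark.sql.SparkSession

// Sketch: the "partitioned" staging committer buffers task output on local
// disk and, on commit, only touches the partitions the job actually wrote.
val spark = SparkSession.builder()
  .appName("partitioned-committer-sketch")
  .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
  // "replace" overwrites just the partitions generated by this job.
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// Note: leave spark.sql.sources.partitionOverwriteMode at its default;
// PathOutputCommitProtocol rejects "dynamic", and the committer's
// conflict-mode handles the per-partition overwrite instead.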

How does the Hadoop History Server work?

There are two properties in the configuration files that I am confused about:
The property yarn.nodemanager.remote-app-log-dir in yarn-site.xml:
a.) Does this property control where the logs of map/reduce tasks are written?
b.) Is this the responsibility of the NodeManager (NM)?
The property mapreduce.jobhistory.done-dir in mapred-site.xml:
a.) Are job-related files, such as configurations, stored in this location?
b.) Is this the responsibility of the Application Master (AM)?
Does the History Server (HS) combine both of these and show consolidated information in its UI?
Assuming you have enabled log-aggregation,
1.a. This is the log-aggregation dir, usually HDFS where NMs aggregate container-logs to.
1.b. Yes.
2.a. Yes.
2.b. No. The MR JobHistory Server does that, by deleting the JobSummary file and moving the other files from ${mapreduce.jobhistory.intermediate-done-dir} to ${mapreduce.jobhistory.done-dir}.
3. Yes. The MR JobHistory Server web UI combines job info (from ${mapreduce.jobhistory.done-dir}) and container logs (from ${yarn.nodemanager.remote-app-log-dir}).
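If it helps to see what these resolve to on your cluster, here is a small Scala sketch that just reads the relevant keys from the config files; /etc/hadoop/conf is an assumed location:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Print the effective values of the directories discussed above.
val conf = new Configuration()
conf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"))
conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"))

// Where the NodeManagers aggregate container logs to (usually on HDFS).
println(conf.get("yarn.nodemanager.remote-app-log-dir"))
// Where finished-job history files land first, and where the JobHistory
// Server moves them afterwards.
println(conf.get("mapreduce.jobhistory.intermediate-done-dir"))
println(conf.get("mapreduce.jobhistory.done-dir"))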

Real-time use case scenarios of moveToLocal in HDFS

There is a moveToLocal command in HDFS (hadoop fs -moveToLocal). If we use this command, the data is removed from HDFS and copied to the local filesystem. To my knowledge, data should not be modified once it is loaded into HDFS. In which cases/real-time scenarios do we need to use this command? Thanks.

SolrJ CloudSolrClient incompatible with Hadoop fs Path

The goal is to upload index config files from HDFS to Solr via Oozie workflow (Java action).
CloudSolrClient.uploadConfig() accepts a java.nio.file.Path object as its first parameter.
Running on HDFS (through the Oozie launcher), it cannot resolve the config path unless I use org.apache.hadoop.fs.Path.
How can I get around this? Is there a version of CloudSolrClient that accepts a hadoop.fs.Path as a parameter?
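One possible workaround, sketched under assumptions (the HDFS path, local temp directory, ZooKeeper hosts, and config name below are all placeholders, and the builder API varies between SolrJ versions), is to copy the config directory from HDFS to the launcher's local disk and then hand a java.nio.file.Path for the local copy to uploadConfig():

import java.nio.file.Paths

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path => HdfsPath}
import org.apache.solr.client.solrj.impl.CloudSolrClient

// Pull the config set out of HDFS onto the launcher's local filesystem,
// then upload it with SolrJ, which only understands java.nio.file.Path.
val fs = FileSystem.get(new Configuration())
fs.copyToLocalFile(new HdfsPath("hdfs:///user/solr/myindex/conf"),
                   new HdfsPath("file:///tmp/solr-conf"))

val solr = new CloudSolrClient.Builder().withZkHost("zk1:2181,zk2:2181").build()
solr.uploadConfig(Paths.get("/tmp/solr-conf"), "myindex-config")
solr.close()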

Load Data using Apache-Spark on AWS

I am using Apache Spark on Amazon Web Services (AWS) EC2 to load and process data. I've created one master and two slave nodes. On the master node, I have a directory data containing all the CSV data files to be processed.
Now, before submitting the driver program (my Python code) to run, we need to copy the data directory from the master to all slave nodes. My understanding is that this is because each slave node needs to know the data file location in its own local filesystem, so that it can load the data files. For example,
from pyspark import SparkConf, SparkContext
### Initialize the SparkContext
conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf = conf)
### Create a RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data") ### Read data directory
### Collect files from the RDD
datafile.collect()
When each slave node runs the task, it loads data file from its local file system.
However, before submitting my application to run, we also have to put the data directory into the Hadoop Distributed File System (HDFS) using $ ./ephemeral-hdfs/bin/hadoop fs -put /root/data/ ~.
Now I am confused about this process. Does each slave node load data files from its own local file system or from HDFS? If it loads data from the local file system, why do we need to put the data into HDFS? I would appreciate it if anyone could help me.
Just to clarify for others that may come across this post.
I believe your confusion is due to not providing a protocol in the file location. When you do the following line:
### Create a RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data") ### Read data directory
Spark assumes the file path /root/data is in HDFS. In other words it looks for the files at hdfs:///root/data.
You only need the files in one location, either locally on every node (not the most efficient in terms of storage) or in HDFS that is distributed across the nodes.
If you wish to read files from local, use file:///path/to/local/file. If you wish to use HDFS use hdfs:///path/to/hdfs/file.
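To make the difference concrete, here is a short Scala sketch (the paths are just examples, and the bare-path behaviour assumes fs.defaultFS points at HDFS, as on a standard cluster):

import org.apache.spark.{SparkConf, SparkContext}

// The scheme prefix decides which filesystem Spark reads from.
val sc = new SparkContext(new SparkConf().setAppName("path-scheme-example"))

// Explicit HDFS read (equivalent to the bare "/root/data" on such a cluster).
val fromHdfs = sc.wholeTextFiles("hdfs:///root/data")

// Local filesystem read; the directory must exist on every node that runs
// a task (or the job must run in local mode).
val fromLocal = sc.wholeTextFiles("file:///root/data")

println(fromHdfs.count())
println(fromLocal.count())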
Hope this helps.
One quick suggestion is to load the CSV files from S3 instead of keeping them on local disk.
Here is a sample Scala snippet which can be used to load a bucket from S3:
val csvs3Path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"
val dataframe = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(csvs3Path)