The goal is to upload index config files from HDFS to Solr via Oozie workflow (Java action).
CloudSolrClient.uploadConfig() takes a java.nio.file.Path as its first parameter.
When the action runs through the Oozie launcher, the config files live on HDFS, so the path cannot be resolved as a local java.nio path; it can only be addressed with an org.apache.hadoop.fs.Path.
How can I work around this? Is there a version of CloudSolrClient that accepts an org.apache.hadoop.fs.Path as a parameter?
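One possible workaround (a sketch only, assuming a SolrJ 6.x-era client API and placeholder HDFS/ZooKeeper locations) is to copy the config set from HDFS to the launcher's local disk first, then hand uploadConfig() the local java.nio.file.Path:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class UploadConfigFromHdfs {
    public static void main(String[] args) throws Exception {
        // Placeholder locations -- adjust to the workflow's actual paths.
        Configuration conf = new Configuration();
        Path hdfsConfigDir = new Path("hdfs:///user/oozie/solr/myconf");
        java.nio.file.Path localDir = Files.createTempDirectory("solr-conf");

        // Pull the config set down to the launcher's local disk.
        FileSystem fs = hdfsConfigDir.getFileSystem(conf);
        fs.copyToLocalFile(hdfsConfigDir, new Path(localDir.toUri()));

        // uploadConfig() wants a local java.nio.file.Path, which we now have.
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr")
                .build()) {
            solr.uploadConfig(Paths.get(localDir.toString(), "myconf"), "myconf");
        }
    }
}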
I'm currently building an application with Apache Spark (pyspark), and I have the following use case:
- Run pyspark in local mode (spark-submit with --master local[*]).
- Write the results of my Spark job to S3 in the form of partitioned Parquet files.
- Ensure that each job overwrites only the particular partition it is writing to, so that jobs are idempotent.
- Ensure that Spark staging files are written to local disk before being committed to S3, since staging in S3 and then committing via a rename operation is very expensive.
For various internal reasons, all four of the above bullet points are non-negotiable.
I have everything but the last bullet point working. I'm running a pyspark application, and writing to S3 (actually an on-prem Ceph instance), ensuring that spark.sql.sources.partitionOverwriteMode is set to dynamic.
However, this means that my spark-staging files are being staged in S3, and then committed by using a delete-and-rename operation, which is very expensive.
I've tried using the Spark Directory Committer in order to stage files on my local disk. This works great unless spark.sql.sources.partitionOverwriteMode is set to dynamic.
After digging through the source code, it looks like the PathOutputCommitter does not support Dynamic Partition Overwriting.
At this point, I'm stuck. I want to be able to write my staging files to local disk, and then commit the results to S3. However, I also need to be able to dynamically overwrite a single partition without overwriting the entire Parquet table.
For reference, I'm running pyspark=3.1.2, and using the following spark-submit command:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
I get the following error when spark.sql.sources.partitionOverwriteMode is set to dynamic:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
My spark config is as follows:
self.spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "magic")
self.spark.conf.set("spark.sql.sources.commitProtocolClass",
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class",
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
self.spark.conf.set(
"spark.sql.sources.partitionOverwriteMode", "dynamic"
)
I'm afraid the S3A committers don't support the dynamic partition overwrite feature. That feature actually works by doing lots of renaming, so it misses the entire point of zero-rename committers.
The "partitioned" committer was written by Netflix for their use case of updating/overwriting single partitions in an active table. It should work for you, as it is the same use case.
Consult the documentation.
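As a rough illustration of that suggestion (a sketch only, written in Java rather than the asker's pyspark, and not verified against this exact setup): switch fs.s3a.committer.name from "magic" to "partitioned" and let its conflict mode handle the per-partition replacement, instead of relying on spark.sql.sources.partitionOverwriteMode=dynamic.

import org.apache.spark.sql.SparkSession;

public class PartitionedCommitterSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partitioned-committer-sketch")
                // Staging committers buffer task output on local disk, not in S3.
                .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
                // "replace" overwrites only the partitions this job writes into.
                .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
                .config("spark.sql.sources.commitProtocolClass",
                        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
                .config("spark.sql.parquet.output.committer.class",
                        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
                .getOrCreate();

        // Leave spark.sql.sources.partitionOverwriteMode at its default ("static");
        // PathOutputCommitProtocol rejects dynamic mode, as the error above shows.
        spark.stop();
    }
}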
I am running Hive from a container (this image: https://hub.docker.com/r/bde2020/hive/) on my local computer.
I am trying to create a Hive table stored as a CSV in S3 with the following command:
CREATE EXTERNAL TABLE local_test (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 's3://mybucket/local_test/';
However, I am getting the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException No FileSystem for scheme: s3)
What is causing it?
Do I need to set up something else?
Note:
I am able to run aws s3 ls mybucket and also to create Hive tables in another directory, like /tmp/.
The problem is discussed here:
https://github.com/ramhiser/spark-kubernetes/issues/3
You need to add the AWS SDK jars to the Hive library path so that it can recognize the file schemes
s3, s3n, and s3a.
Hope it helps.
EDIT1:
hadoop-aws-2.7.4 contains the implementations for interacting with those file systems; inspecting the jar shows it has everything needed to handle those schemes.
Hadoop looks at the URI scheme to decide which file system implementation under org.apache.hadoop.fs it needs to load.
The following classes are implemented in that jar:
org.apache.hadoop.fs.[s3|s3a|s3native]
The only thing still missing is that the library is not being added to the Hive library path. Is there any way you can verify that it is on the Hive library path?
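As a side note on how that scheme lookup works, here is a small standalone sketch (not from the original answer; the bucket and credentials are placeholders). It is exactly this resolution that fails with "No FileSystem for scheme: s3" when the connector jar is missing from the classpath:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Explicitly map the s3a scheme to the implementation shipped in hadoop-aws.
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        conf.set("fs.s3a.access.key", "<access-key>");
        conf.set("fs.s3a.secret.key", "<secret-key>");

        // Throws "No FileSystem for scheme" if no implementation is on the classpath.
        FileSystem fs = FileSystem.get(URI.create("s3a://mybucket/"), conf);
        System.out.println("Resolved implementation: " + fs.getClass().getName());
    }
}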
EDIT2:
For a reference on setting the library path, see:
How can I access S3/S3n from a local Hadoop 2.6 installation?
I fail to understand how to simply list the contents of an S3 bucket on EMR during a spark job.
I wanted to do the following
Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false))
This always fails with the following error
java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020
In the hadoopConfiguration, fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020
The way I understand it, if I don't use a protocol (just /myfolder/myfile instead of e.g. hdfs://myfolder/myfile), it will default to fs.defaultFS.
But I would expect that if I specify s3://mybucket/, fs.defaultFS should not matter.
How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the correct tool.
PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001. There is quite a long delay before those show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs?
UPDATE: I just found them under /mnt/var/log/hadoop-yarn/containers, however they are owned by yarn:yarn and as the hadoop user I cannot read them. :( Ideas?
In my case I needed to read Parquet files generated by prior EMR jobs, so I was looking for the list of files for a given S3 prefix. The nice thing is that we don't need to do all that; we can simply do this:
spark.read.parquet(bucket+prefix_directory)
URI.create() should be used to point it at the correct FileSystem:
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dirPaths = FileSystem.get(URI.create("<s3-path>"), fs.getConf).listStatus(new Path("<s3-path>"))
I don't think you are picking up the FS right; use the static FileSystem.get(URI, Configuration) method, or Path.getFileSystem().
Try something like:
Path p = new Path("s3://bucket/subdir");
FileSystem fs = p.getFileSystem(conf);
FileStatus[] status = fs.listStatus(p);
Regarding logs, the YARN UI should let you get at them via the node managers.
I'm a first time EMR/Hadoop user and first time Apache Nutch user. I'm trying to use Apache Nutch 2.1 to do some screen scraping. I'd like to run it on hadoop, but don't want to setup my own cluster (one learning curve at a time). So I'm using EMR. And I'd like S3 to be used for output (and whatever input I need).
I've been reading the setup wikis for Nutch:
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/NutchHadoopTutorial
And they've been very helpful in getting me up to speed on the very basics of Nutch. I realize I can build Nutch from source, preconfigure some regexes, and then be left with a Hadoop-friendly job file:
$NUTCH_HOME/runtime/deploy/apache-nutch-2.1.job
Most of the tutorials culminate in a crawl command being run. In the Hadoop examples, it's:
hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5
And in the local deployment example it's something like:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
My question is as follows: what do I have to do to get my apache-nutch-2.1.job to run on EMR? What arguments do I pass it? For the Hadoop crawl example above, the "urls" file is already on HDFS with seed URLs. How do I do this on EMR? Also, what do I specify on the command line to have my final output go to S3 instead of HDFS?
So to start off, this cannot really be done using the GUI. Instead, I got Nutch working using the AWS Java API.
I have my seed files located in S3, and I transfer my output back to S3.
I use the s3distcp jar to copy the data from S3 to HDFS.
Here is my basic step config. MAINCLASS is going to be the fully qualified main class of your Nutch crawl, something like org.apache.nutch.crawl.Crawl.
String HDFSDIR = "/user/hadoop/data/";

stepconfigs.add(new StepConfig()
    .withName("Add Data")
    .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
    .withHadoopJarStep(new HadoopJarStepConfig(prop.getProperty("S3DISTCP"))
        .withArgs("--src", prop.getProperty("DATAFOLDER"), "--dest", HDFSDIR)));

stepconfigs.add(new StepConfig()
    .withName("Run Main Job")
    .withActionOnFailure(ActionOnFailure.CONTINUE)
    .withHadoopJarStep(new HadoopJarStepConfig("nutch-1.7.jar")
        .withArgs("org.apache.nutch.crawl.Crawl", prop.getProperty("CONF"), prop.getProperty("STEPS"), "-id=" + jobId)));

stepconfigs.add(new StepConfig()
    .withName("Pull Output")
    .withActionOnFailure(ActionOnFailure.TERMINATE_JOB_FLOW)
    .withHadoopJarStep(new HadoopJarStepConfig(prop.getProperty("S3DISTCP"))
        .withArgs("--src", HDFSDIR, "--dest", prop.getProperty("DATAFOLDER"))));
new AmazonElasticMapReduceClient(
        new PropertiesCredentials(new File("AwsCredentials.properties")),
        proxy ? new ClientConfiguration()
                    .withProxyHost("dawebproxy00.americas.nokia.com")
                    .withProxyPort(8080)
              : null)
    .runJobFlow(new RunJobFlowRequest()
        .withName("Job: " + jobId)
        .withLogUri(prop.getProperty("LOGDIR"))
        .withAmiVersion(prop.getProperty("AMIVERSION"))
        .withSteps(getStepConfig(prop, jobId))
        .withBootstrapActions(getBootStrap(prop))
        .withInstances(getJobFlowInstancesConfig(prop)));
I have a web UI that tries to spawn an MR job on an HBase table. I keep getting this error though:
java.io.FileNotFoundException: File /opt/hadoop/mapreduce/system/job_201205251929_0007/libjars/zookeeper-3.3.2.jar does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
I am running HBase 0.90.4. HBase manages its own ZooKeeper, and I confirmed that /opt/hadoop/mapreduce/system/job_201205251929_0007/libjars/zookeeper-3.3.2.jar exists in my HDFS. Is it looking in the local FS?
I found that I did not have core-site.xml on my classpath, so fs.default.name resolved to the local FS instead of HDFS. The jar existed in HDFS, but the client was looking in the local FS.
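A minimal sketch of that fix (the /opt/hadoop/conf paths are assumptions): load the cluster's site files into the Configuration explicitly before submitting the job, so fs.default.name points at HDFS rather than the local filesystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAwareConfig {
    public static Configuration load() throws Exception {
        Configuration conf = new Configuration();
        // Pull in the cluster settings that would normally come from the classpath.
        conf.addResource(new Path("/opt/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/opt/hadoop/conf/hdfs-site.xml"));

        // Should now print hdfs://<namenode>:<port> rather than file:///
        System.out.println(FileSystem.get(conf).getUri());
        return conf;
    }
}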
Any jar files accessed in the mapper or reducer need to be in the local filesystem on all the nodes in the cluster. Check your local FS.