Output Without Additional File From a MapReduce Job in Hadoop - mapreduce

How to generate output from a MapReduce job without having the additional _SUCCESS file in the output directory?

Set this property in your Java program:
conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
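For reference, here is a minimal sketch of a driver that applies this setting before the job is submitted; the class name, paths, and key/value types are placeholders, not from the original question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NoSuccessFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Suppress the _SUCCESS marker file in the output directory.
        conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");

        Job job = Job.getInstance(conf, "no-success-file-example");
        job.setJarByClass(NoSuccessFileDriver.class);
        // Mapper and reducer classes omitted; set them as usual for your job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}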

Related

No such file exists while running Hadoop pipes using c++

While running a Hadoop MapReduce program using Hadoop pipes, a file that is present in HDFS is not found by the MapReduce job. If the program is executed without Hadoop pipes, the file is found without problems via the libhdfs library, but when running the program with the
hadoop pipes -input i -output o -program p
command, the file is not found by libhdfs and a java.io.IOException is thrown. I have tried to include the -fs parameter in the command, but with the same results. I have also prefixed the files with hdfs://localhost:9000/, still with no results. The file parameter is inside the C code as:
file="/path/to/file/in/hdfs" or "hdfs://localhost:9000/path/to/file"
hdfsFS fs = hdfsConnect("localhost", 9000);
hdfsFile input=hdfsOpenFile(fs,file,O_RDONLY,0,0,0);
Found the problem: the files in HDFS are not available to the MapReduce task nodes. Instead, the files had to be passed to the distributed cache through the archive option, after compressing them into a single tar file. This can also be achieved by writing a custom InputFormat class and providing the files in the input parameter.
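As a rough illustration of the distributed cache mechanism (my own sketch, not the exact command used above): if the job were submitted from a Java driver, a tar archive already uploaded to HDFS could be shipped to every task node as below; the archive path and the #files fragment name are placeholders. For a pipes job the same thing can typically be requested with the generic -archives option on the command line.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheArchiveSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-archive-sketch");
        // The tar file is copied to every task node and unpacked into a
        // directory named after the URI fragment ("files"), so task code
        // can open e.g. "files/some_input_file" from its working directory.
        job.addCacheArchive(new URI("hdfs://localhost:9000/path/to/files.tar#files"));
        // ... configure mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}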

How to tar a folder in HDFS?

Just like the Unix command tar -czf xxx.tgz xxx/, is there a way to do the same thing in HDFS? I have a folder in HDFS with over 100k small files, and I want to download it to the local file system as fast as possible. hadoop fs -get is too slow, and I know hadoop archive can output a har, but it does not seem to solve my problem.
From what I see here,
https://issues.apache.org/jira/browse/HADOOP-7519
it is not possible to perform a tar operation using Hadoop commands. This has been filed as an improvement, as mentioned above, but is not yet resolved or available to use.
Hope this answers your question.
Regarding your scenario: having 100k small files in HDFS is not good practice. You can find a way to merge them all (for example by creating tables over this data through Hive or Impala), or move all the small files into a single folder in HDFS and use hadoop fs -copyToLocal <HDFS_FOLDER_PATH> to get the whole folder, along with all the files in it, to your local file system.
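One hedged way to do the "merge them all" part (my own sketch, not from the answer above) is to pack the small files into a single Hadoop SequenceFile, keyed by file name, and then copy that one file to the local machine; the paths below are placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // HDFS folder with many small files
        Path outputFile = new Path(args[1]); // single SequenceFile to create

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                // Assumes each file is small enough to fit in memory.
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(content);
                }
                // Key = original file name, value = raw file contents.
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(content));
            }
        }
    }
}

After packing, a single hadoop fs -copyToLocal of the SequenceFile avoids the per-file overhead of fetching 100k tiny objects.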

Build AWS Java Lambda with Gradle, use shadowJar or buildZip for archive to upload?

Description
I am developing AWS Java Lambdas, with Gradle as my build tool.
AWS requires a "self-contained" Java archive (.jar, .zip, ...) to be uploaded, which has to include everything: my source code, the dependencies, etc.
There is the Gradle Shadow plugin for this purpose; it can be included like this:
import com.github.jengelman.gradle.plugins.shadow.transformers.Log4j2PluginsCacheFileTransformer
...
shadowJar {
    archiveName = "${project.name}.jar"
    mergeServiceFiles()
    transform(Log4j2PluginsCacheFileTransformer)
}
build.dependsOn shadowJar
gradle build produces a file somefunction.jar, in my case it is 9.5MB in size.
The AWS documentation suggests
putting your dependency .jar files in a separate /lib directory
There are specific instructions on how to do this in Creating a ZIP Deployment Package for a Java Function.
task buildZip(type: Zip) {
    archiveName = "${project.name}.zip"
    from compileJava
    from processResources
    into('lib') {
        from configurations.runtimeClasspath
    }
}
build.dependsOn buildZip
gradle build produces a file build/distributions/somefunction.zip, in my case it is 8.5MB in size.
Both archives, zip and jar, can be uploaded to AWS and run fine. Performance seems to be the same.
Question
Which archive should be favored, Zip or (shadow)Jar?
More specific questions that come to my mind:
The AWS documentation says "This [putting your dependency .jar files in a separate /lib directory] is faster than putting all your function’s code in a single jar with a large number of .class files." Does anyone know what exactly is faster? Build time? Cold/warm start? Execution time?
When building the Zip, I am not using the shadowJar features mergeServiceFiles() and Log4j2PluginsCacheFileTransformer. Not using mergeServiceFiles should in the worst case decrease the execution time. As long as I omit the Log4j2 plugins, I can omit Log4j2PluginsCacheFileTransformer. Right?
Are there any performance considerations using the one or the other?

Compiling SQL Archive Tool

I am using an SQLite database and I want to archive it using the SQLite Archive tool (SQLAR), but I don't know how to compile it. I found only the document given below. Could you help me compile and use the SQLAR tool?
https://www.sqlite.org/sqlar/tree?ci=trunk&expand
After the steps below are done, I can compress the database.
1. Download the source files (https://www.sqlite.org/sqlar/vinfo?name=15adeb2f9a1b0b8c).
2. Copy the files to your server.
3. Run "make" in the sqlar directory that contains the source files.
4. Run "./sqlar archive_name database_name" to archive your database.

Question on Build scripts & RTC Build

I have a batch file which calls CMake and also does some other work.
I want to call this batch file for the build.
If CMake fails and throws an error for some reason, this is not reported as a failure in RTC. If my understanding is correct, RTC calls the batch file and the batch file calls CMake; the execution of the batch file itself is successful, and hence it is reported as a success.
But I want RTC to report that CMake, which is called via the batch file, has failed.
How can I achieve this?
I was looking at creating Ant tasks but don't have a proper example.
Thank you,
Karthik
You will want to use the Ant exec task.
http://ant.apache.org/manual/Tasks/exec.html
There is an example in the documentation of calling a .bat file. You will also want to use the failonerror="true" attribute to ensure you fail the RTC build if the bat file fails. Additionally, you need to ensure that your bat file is indeed failing (returning a non-zero return code) if the CMake command fails.