Chaining Hadoop MapReduce with Pipes (C++)

Does anyone know how to chain two MapReduce jobs with the Pipes API?
I have already chained two MapReduce jobs in a previous project with Java, but today I need to use C++. Unfortunately, I haven't seen any examples in C++.
Has someone already done it? Is it impossible?

Use Oozie Workflow. It allows you to use Pipes along with usual MapReduce jobs.

I finally managed to make Hadoop Pipes work. Here are the steps I followed to get the wordcount examples in src/examples/pipes/impl/ running.
I have a working Hadoop 1.0.4 cluster, configured following the steps described in the documentation.
To write a Pipes job I had to include the Pipes library, which ships precompiled in the initial package. It can be found in the c++ folder for both 32-bit and 64-bit architectures. However, I had to recompile it, which can be done with the following steps:
# cd /src/c++/utils
# ./configure
# make install
# cd /src/c++/pipes
# ./configure
# make install
These two sets of commands compile the library for our architecture and create an 'install' directory in /src/c++ containing the compiled files.
Moreover, I had to add the -lssl and -lcrypto link flags to compile my program. Without them I encountered an authentication exception at run time.
Thanks to those steps I was able to run wordcount-simple, which can be found in the src/examples/pipes/impl/ directory.
However, to run the more complex wordcount-nopipe example, I had to take a few extra steps. Because of how its record reader and record writer are implemented, the job reads from and writes to the local file system directly. That is why the input and output paths have to be specified with file://. Moreover, we have to use a dedicated InputFormat component. Thus, to launch this job I had to use the following command:
# bin/hadoop pipes -D hadoop.pipes.java.recordreader=false -D hadoop.pipes.java.recordwriter=false -libjars hadoop-1.0.4/build/hadoop-test-1.0.4.jar -inputformat org.apache.hadoop.mapred.pipes.WordCountInputFormat -input file:///input/file -output file:///tmp/output -program wordcount-nopipe
Furthermore, if we look at org.apache.hadoop.mapred.pipes.Submitter.java in version 1.0.4, the current implementation disables the ability to specify a non-Java record reader when you use the -inputformat option.
Thus you have to comment out the line setIsJavaRecordReader(job, true); to make this possible, and recompile the core sources to pick up the change (http://web.archiveorange.com/archive/v/RNVYmvP08OiqufSh0cjR).
if(results.hasOption("-inputformat")) {
  setIsJavaRecordReader(job, true);
  job.setInputFormat(getClass(results, "-inputformat", job, InputFormat.class));
}
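Coming back to the original question of chaining: one simple option, offered here as an assumption and not something the answers above describe, is a small shell driver that runs the hadoop pipes command twice and hands the first job's output directory to the second job as input. All paths and program names below are hypothetical, and the sketch uses the Java record reader and writer so that ordinary HDFS paths can be used.

```shell
#!/bin/sh
# Sketch only: paths and program names are hypothetical placeholders.
HADOOP="${HADOOP:-bin/hadoop}"

run_pipes_job() {
    # $1 = input path, $2 = output path, $3 = C++ executable to run
    "$HADOOP" pipes \
        -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -input "$1" -output "$2" -program "$3"
}

chain_two_jobs() {
    # $1 = input, $2 = intermediate dir, $3 = final output,
    # $4 = first program, $5 = second program.
    # The second job only starts if the first one exits successfully.
    run_pipes_job "$1" "$2" "$4" && run_pipes_job "$2" "$3" "$5"
}
```

A call such as chain_two_jobs /in /tmp/stage1 /out bin/first bin/second would run the two jobs back to back. For anything beyond two or three steps, a workflow engine like Oozie (suggested above) is likely the more robust choice.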

Related

"Embedding" a folder into a C/C++ program

I have a script library stored in .../lib/ that I want to embed into my program. So far, that sounds simple: on Windows, I'd use Windows Resource Files; on MacOS, I'd put them into a Resource folder and use the proper API to access the current bundle and its resources. On plain Linux, I am not too sure how to do it... But I want to be cross-platform anyway.
Now, I know that there are tools like IncBin (https://github.com/graphitemaster/incbin) and the like, but they are best used for single files. What I have, however, might even require some kind of file system abstraction.
So here are the few ideas I came up with. I'd like to know if there is possibly a better solution, or other solutions in general.
Create a Zip file and use MiniZ to read its contents off a char array. Basically, run the zip file through IncBin and pass it as a buffer to MiniZ so I can work on that.
Use an abstracted FS layer like PhysicsFS or TTVFS and add the possibility to work off a Zip file or any other kind of archive.
Are there other solutions? Thanks!

I had this same issue, and I solved it by locating the library relative to argv[0]. But that only works if you invoke the program by its absolute path -- i.e., not via $PATH in the shell. So I invoke my program by a one-line script in ~/bin, or any other directory that's in your search path:
exec /wherever/bin/program "$@"
When the program is run, argv[0] is set to "/wherever/bin/program", and it knows to look in "/wherever/lib" for the related scripts.
Of course if you're installing directly into standard locations, you can rely on the standard directory structure, such as /usr/local/bin/program for the executable and /etc/program for related scripts & config files. The technique above is just when you want to be able to install a whole bundle in an arbitrary place.
EDIT: If you don't want the one-line shell script, you can also say:
alias program=/wherever/bin/program
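The lookup itself is just path arithmetic on argv[0]. Below is a minimal shell sketch of the same idea (in a script, $0 plays the role of argv[0]); the function name lib_dir_for is made up for illustration, and it assumes the bin/ and lib/ directories sit side by side as in the example above.

```shell
#!/bin/sh
lib_dir_for() {
    # $1 = the path the program was invoked with (argv[0] in C, $0 in a script)
    bin_dir=$(dirname "$1")              # e.g. /wherever/bin
    echo "$(dirname "$bin_dir")/lib"     # sibling lib/ directory
}

lib_dir_for /wherever/bin/program   # prints /wherever/lib
```

If the program is found via $PATH, the invoked name may carry no directory part at all, which is exactly why the wrapper script or alias with the absolute path is needed.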

Building CLI scripts in Clojure

What are the common/standard ways to build CLI scripts in Clojure?
In my view such a method should include the following characteristics:
A way of easily dealing with arguments, stdin/out/err.
Not taking too long to boot (ideally having some sort of JIT), otherwise one loses the point of hacking things together in one's shell.
Also, it is reasonable to expect an easy way of including one-off dependencies without setting up a project (maybe by installing them globally).
Ideally, providing a simple example of the solution usage would be much appreciated. Somewhat equivalent to:
#!/bin/bash
echo "$@"
cat /dev/stdin
Note: I'm aware that a similar question was asked previously here. But that question is incomplete, and its answers neither reach a consensus nor cover a significant proportion of the solutions that seem to exist.

Now that there is new CLI tooling it is possible to create a standalone Clojure script without using third party tools. Once you've got the clj command line tool installed, a script like the one below should just work.
In terms of the original question, this can be as good as any Clojure/JVM CLI program at dealing with command line arguments and system input/output, depending on what libraries you :require. I haven't benchmarked it, so I won't comment on performance, but if it worries you then please experiment yourself to see if the startup time is acceptable to you. I would say this scores highly on dependency management though, as the script is entirely standalone (apart from the clj tool, which is now the recommended way to run Clojure anyway).
File: ~/bin/script.sh
#!/bin/sh
"exec" "clj" "-Sdeps" "{:deps,{hiccup,{:mvn/version,\"1.0.5\"}}}" "$0" "$@"
(ns my-script
  (:require
   [clojure.string]
   [hiccup.core :as hiccup]))

(println
 (hiccup/html
  [:div
   [:span "Command line args: " (clojure.string/join ", " *command-line-args*)]
   [:span "Stdin: " (read-line)]]))
Then ensure it is executable:
$ chmod +x ~/bin/script.sh
And run it:
$ echo "stdin" | script.sh command line args
<div><span>Command line args: command, line, args</span><span>Stdin: stdin</span></div>
NB. This is primarily a shell script, which treats the strings on the exec line as a command to execute. That subsequent execution runs the clj command line tool with the given arguments; clj evaluates those strings as strings (without side effects) and then proceeds to evaluate the Clojure code below.
Note also that dependencies are specified as a map passed to clj on the exec line. You can read more about how that works on the Clojure website. The tokens in the dependency map are separated by commas, which Clojure treats as whitespace but which most shells do not.
Thanks to the good folk on the #tools-deps channel of the "clojurians" Slack group whence this solution came.

An option would be Planck which runs on MacOS and Linux. It uses self-hosted ClojureScript, has fast startup and targets JavaScriptCore.
It has a nice SDK and mimics some things from Clojure which you do not have in ClojureScript, e.g. planck.io resembles clojure.java.io. It supports loading dependencies via tools.deps.alpha/deps.edn.
Echoing stdin is as easy as:
(require '[planck.core :refer [*in* slurp]])
(print (slurp *in*))
and printing the command line arguments:
(println *command-line-args*)
...
$ echo "foo" | planck stdin.cljs 1 2 3
foo
(1 2 3)
An example of a standalone script, i.e. not a project, with dependencies: the tree command line tool in Planck.
One caveat is that Planck doesn't support using npm dependencies. So if you need those, go for Lumo which targets NodeJS.
A third option would be joker which is a Clojure interpreter written in Go.

I know you asked for non project creating methods to accomplish this but as this specific issue has been on my mind for quite some time I figured I would throw in another alternative.
TLDR: jump to the "Creating an Executable CLI Command" section below
Background
I had pretty much the same list of requirements as you do a while back and landed on creating executable jar files. I'm not talking about executable via java -jar myfile.jar, but rather self-contained uber-jars which you can execute directly as you would with any other binary file.
If you read the zip file specification (which jar files adhere to, as a jar file is a zip file), it turns out this is actually possible. The short version is that you need to:
build a fat jar with the stuff you need
insert a bash / bat / shell script into the binary jar content at the beginning of your file
chmod +x the uber jar file (or if on windows, check the executable box)
rewrite the jar file meta data records so that the inserted script text does not invalidate the zip file internal offsets
It should be noted that this is actually supported by the zip file specification. This is how self extracting zip files etc work and the resulting fat jar (after the above process) is still a valid jar file and a valid zip archive. All relevant commands such as java -jar still work and the file is now also executable directly from the command line.
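As a rough illustration of the preamble and chmod steps from the list above, here is a minimal shell sketch. The file names and both function names are hypothetical, and the offset rewriting from the last step is deliberately left out, so treat this as a picture of the file layout and not a complete build step.

```shell
#!/bin/sh
# Sketch only: this does NOT rewrite the zip offsets.

write_preamble() {
    # $1 = path of the preamble script to create.
    # The quoted delimiter keeps "$0" and "$@" literal in the generated file.
    cat > "$1" << 'EOF'
#!/bin/sh
exec java -jar "$0" "$@"
EOF
}

prepend_preamble() {
    # $1 = preamble script, $2 = uber jar, $3 = output executable
    cat "$1" "$2" > "$3" && chmod +x "$3"
}
```

When the resulting file is run, the shell executes the preamble, which hands the very same file to java -jar along with any arguments.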
In addition, following the above pattern it is also possible to add support for things like the drip jvm launcher which greatly accelerates the startup times of your cli scripts.
As it turns out, when I started looking into this about a year ago, a library for the last point of rewriting the jar file meta data did not exist. Not just in clojure but on the JVM as a whole. This still blows my mind: the central deployment unit of all languages on the jvm is the jar file and there was no library out there that actually read the internals of jar files. Internals as in the actual zip file structure, not just what java's ZipFile and friends do.
Furthermore, I could not find a library for clojure which dealt with the kind of binary structure the zip file specification required in a clean way.
Solution:
octet has what I consider the cleanest interface of the available binary libraries for clojure, so I wrote a pull request for octet adding support for the features required by the zip file specification.
I then created a new library clj-zip-meta which reads and interprets the zip file meta data and is capable of the offset rewriting described in the last point above.
I then created a pull request to an existing clojure lib lein-binplus to add support for the zip meta rewriting implemented by clj-zip-meta and also add support for custom preamble scripts to be able to create real executable jars without the need for java -jar.
After all this I created a leiningen template cli-cmd to support creating cli command projects which support all the above bells and whistles and have a well-structured command line parsing setup... or what I considered well structured : ). Comments welcomed.
Creating an Executable CLI Command
So with all that, you can create a new command line clojure app with leiningen and run it using:
~> lein new cli-cmd mycmd
~> cd mycmd
~> lein bin
Compiling mycmd.core
Compiling mycmd.core
Created /home/mbjarland/tmp/clj-cmd/mycmd/target/mycmd-0.1.0-SNAPSHOT.jar
Created /home/mbjarland/tmp/clj-cmd/mycmd/target/mycmd-0.1.0-SNAPSHOT-standalone.jar
Creating standalone executable: /home/mbjarland/tmp/clj-cmd/mycmd/target/mycmd
Re-aligning zip offsets
~> target/mycmd
---- debug output, remove for production code ----
options {:port 80, :hostname "localhost", :verbosity 0}
arguments []
errors nil
summary
-p, --port PORT 80 Port number
-H, --hostname HOST localhost Remote host
--detach Detach from controlling process
-v Verbosity level; may be specified multiple times to increase value
-h, --help
--------------------------------------------------
This is my program. There are many like it, but this one is mine.
Usage: mycmd [options] action
Options:
-p, --port PORT 80 Port number
-H, --hostname HOST localhost Remote host
--detach Detach from controlling process
-v Verbosity level; may be specified multiple times to increase value
-h, --help
Actions:
start Start a new server
stop Stop an existing server
status Print a server's status
Please refer to the manual page for more information.
Error: invalid action '' specified!
Where the output from the command is just the boilerplate sample command line parsing I've added to the leiningen template.
The custom preamble script is located at boot/jar-preamble.sh and it has support for drip. In other words, if you have drip on your path, the generated executable will use it, otherwise it will fall back to standard java -jar way of launching the uber jar internally.
The source for the command line parsing and the code for the cli app live under the src directory as per normal.
If you feel like hacking, it is possible to change the preamble script and re-run lein bin and the new preamble will be inserted into your executable by the build process.
Also it should be noted that this method still does java -jar under the covers so you do need java on your path.
Anyway, long-winded explanation, but hopefully it will be of some use to somebody with this problem.

Consider Lumo, a ClojureScript environment which was specially designed for scripting.
Note that while it supports both ClojureScript (JAR) and NPM dependencies, the dependency support is still under development.

I write a number of Clojure (JVM) scripts, and use the CLI-matic library https://github.com/l3nz/cli-matic/ to abstract most of the boilerplate that goes with command-line parsing, creation and maintenance of help, errors, etc.

flatpak-builder with local sources and dependencies

How can I build local sources and dependencies with flatpak-builder?
I can build local sources:
flatpak build ../dictionary ./configure --prefix=/app
I can extract and build an application with its dependencies from a .json manifest:
flatpak-builder --repo=repo dictionary2 org.gnome.Dictionary.json
But is there no way to build dependencies and local sources together? I can't find a source type like dir or similar, only archive, git (no hg?) ...

flatpak-builder is meant to automate the whole build process, with a single entry-point: the JSON manifest.
Everything else it obtains from Git, Bazaar or tarballs. Note that for these the "url" property may be a local URL starting with file://.
(There is indeed no support for Hg. If that's important for you, feel free to request it.)
In addition to that, there are a few more source types (see the flatpak-manifest(5) manpage), which can be used to modify the extracted sources:
file, which points to a local file to copy somewhere in the extracted sources;
patch, which points to a local patch file to apply to the extracted sources;
script, which creates a script in the extracted sources from an array of commands;
shell, which modifies the extracted sources by running an array of commands.
Adding a dir source type might be useful.
However (and I only flatpaked a few apps, and contributed 2 or 3 patches to the code, so I might be completely wrong) care must be taken as this would easily make builds completely unreproducible, which is one thing flatpak-builder tries very hard to enable.
For example, when using a local file source, flatpak-builder will base64-encode the content of that file and use it as a data:text/plain;charset=utf8;base64,<content> URL for the file, which it stores in the manifest included inside the final build.
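To make the shape of such a URL concrete, here is a small shell sketch that builds one from a local file. It only illustrates the format described above and is not flatpak-builder's own code; the function name file_to_data_url is made up for the example.

```shell
#!/bin/sh
file_to_data_url() {
    # $1 = local file to embed.
    # tr strips the line wrapping that some base64 tools add to long output.
    printf 'data:text/plain;charset=utf8;base64,%s\n' \
        "$(base64 < "$1" | tr -d '\n')"
}
```

For a file containing just the five bytes hello, this prints data:text/plain;charset=utf8;base64,aGVsbG8=.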
Something similar might be needed for a dir source (tar the folder then base64-encode the content of the tar?), otherwise it would be impossible to reproduce the build. I've just been told (after submitting this answer) that this changed in Git master, in favour of a new flatpak-builder --bundle-sources option. This would probably make it easier to support reproducible builds with a dir source type.
In any case, feel free to start the conversation around a new dir source type in the upstream bug tracker. :)

There's an experimental CLI tool, if you want to use it: https://gitlab.com/csoriano/flatpak-dev-cli
You can read the docs:
http://docs.flatpak.org/en/latest/building-simple-apps.html
http://docs.flatpak.org/en/latest/flatpak-builder.html
In a nutshell, this is what you need to use flatpak as a development workbench:
https://github.com/albfan/gnome-builder/wiki/flatpak

Running NotePad++ from Command line with Compare Plugin showing compare result

I am trying to find a way to call Notepad++ from the command line, passing the names of two files I want to compare, and have the Compare plugin show the comparison result.
Think of a batch file that does some work and, as a result, opens Notepad++ showing the two files in compare mode. (Yes, the Compare plugin is installed.)
Suggestions for using any other editor or software are also welcome.

tl;dr:
The command is Notepad++\plugins\ComparePlugin\compare.exe file1 file2.
Details:
Download the compare plugin https://bitbucket.org/uph0/compare/downloads/ComparePlugin.v1.5.6.6.bin.zip. Installing the compare plugin from the plugin manager within Notepad++ does not install the requisite exe. I assume you could also build from source to obtain the exe.
Follow the manual installation instructions in the readme:
To install manually, copy ComparePlugin.dll and ComparePlugin subfolder
into the plugins directory C:\Program Files\Notepad++\Plugins.
For a portable Notepad++ installation, you need to run the command from a directory above the notepad++ directory (or with absolute path of exe), otherwise you get an error that Notepad++.exe is not found.
The commands look like this:
>cd C:\portapps\Notepad++
>cd ..
>Notepad++\plugins\ComparePlugin\compare.exe C:\files\file1.txt C:\files\file2.txt
ufo's answer put me on the right track but it did not contain the commands to run.

There's a tool called NppCompareLoader that does exactly what you want. Simply drop it in the N++ installation folder. I've been using it for many years as a diff viewer for TortoiseSVN and TortoiseGit, so you should certainly be able to call it right from the command line.
/EDIT
Since version 1.5.6.6 of the (unofficial) Compare plug-in, the additional loader mentioned above isn't required anymore. There's already one included in the plug-in. Here's the relevant change-log fragment:
NEW: Loader for using N++ as an external diff viewer (e.g. in TortoiseSVN, TortoiseGit, ..)

How to bundle C/C++ code with C-shell-script?

I have a C shell script that calls two C programs, one after the other, with some file handling before, in between and afterwards.
Now, as such I have three different files - one C shell script and 2 .c files.
I need to give this script to other users. The problem is that I have to distribute three files - which the users must keep in the same folder and then execute the script.
Is there some better way to do this?
[I know I can make one C code file out of those two... but I will still be left with a shell script and a C code. Actually, the two C codes do entirely different things... so I want them to be separate]

Sounds like you're worried that your users aren't savvy enough to figure out how to resolve issues like command not found errors and the like. If you absolutely MUST hide the "complexity" of a collection of files, you could have your script create the other files. In most other circumstances I would suggest that this approach is only going to increase your support workload, since semi-experienced users are less likely to know how to troubleshoot the process.
If you choose to rely on the presence of a compiler on the system that you are running on, you can store the C code as a collection of echo "$STRING" >> file.c commands to create your two C files, which you then compile and use.
If you would rather use pre-compiled programs, the same basic process can be used, except that you use xxd both to generate the strings in your script and to reverse the conversion to give you working binaries. Note: Remember to chmod the binary so that it is executable.

Use the shar command to create a self-extracting archive.
Or better yet, use unzipsfx with the AUTORUN option.
This provides users with ONE file, and only ONE command to execute (as opposed to one for untarring and one for execution).
NOTE: The unzip command to run should use the "-n" option; that way only the first run extracts the files and subsequent runs skip the extraction.

Use a zip or tar file? And you do realize that .c files aren't executable, you need to compile & link them first?

You can include the c code inside the shell script as a here document:
#!/bin/bash
cat > code.c << EOF
line #1
line #2
...
EOF
# compile
# execute
If you want to get fancy, you can test for the existence of the executables and skip compiling them if they exist.
If you are doing much shell programming, the rest of the Advanced Bash-Scripting Guide is worth looking at as well.
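A slightly fuller sketch of the here-document approach, including the "skip compiling if the executable exists" check, might look like the following. The file names, function names and the tiny C program are placeholders, and it assumes a C compiler is available as cc.

```shell
#!/bin/sh
# Sketch only: hello.c stands in for one of the real C programs.

write_source() {
    # $1 = directory to write into.
    # Quoting the delimiter ('EOF') stops the shell from expanding $ and
    # backticks that may appear inside the C code.
    cat > "$1/hello.c" << 'EOF'
#include <stdio.h>

int main(void) {
    printf("hello from embedded C\n");
    return 0;
}
EOF
}

build_if_missing() {
    # $1 = working directory. Compile only when the executable is absent.
    if [ ! -x "$1/hello" ]; then
        write_source "$1" && cc -o "$1/hello" "$1/hello.c"
    fi
}
```

The main script would then call build_if_missing . before running ./hello, once for each embedded program.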
