Can an OpenMP C++ program be used as a mapper/reducer function in Hadoop? - c++

Can we combine OpenMP and MapReduce like this:
MapReduce is used to distribute the data set among different computers.
Each computer then runs a mapper/reducer function that takes advantage of multiprocessing
using OpenMP.
Is this possible? (I couldn't find anything substantial in a Google search.)
If it is possible, would there be any advantage to it?
P.S. I'm using the Hadoop Streaming utility.

The point of Hadoop is to have processing nodes deal with data locality automatically and transparently for you.
If I understand you correctly, you want to use Hadoop just for storage, and then do your Map/Reduce work in OpenMP. While this should be possible, you would end up losing one of the main design advantages of Hadoop.
This approach does not make a lot of sense. I suggest either sticking to the Hadoop framework, or looking at one of the alternatives if you don't like it.
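That said, Hadoop Streaming itself imposes only a minimal contract: the mapper/reducer is any executable that reads lines from stdin and writes tab-separated key/value pairs to stdout, so a compiled OpenMP C++ binary can fill the role just as a script can. A minimal sketch of that contract, shown here as a Python word-count mapper (a C++ OpenMP program would follow the same stdin/stdout protocol):

```python
import sys

def map_stream(lines):
    """Emit (word, 1) pairs in Hadoop Streaming's tab-separated format."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

if __name__ == "__main__":
    # Hadoop Streaming feeds input splits on stdin and collects stdout.
    for pair in map_stream(sys.stdin):
        print(pair)
```

Whatever parallelism the executable uses internally (OpenMP threads, for instance) is invisible to Hadoop; the framework only schedules the process and moves the data.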

Related

Can I mix several streams into one using modules in Wowza?

I need to know whether it is possible to mix several RTMP streams into one using modules, either Picture-in-Picture style, or with one main stream while additional streams add their audio. It is the mixing I need, so that all the streamers can be heard at the same time.
I want to find out whether this is possible in principle, so that I don't spend time investigating something that can't be done. Also, what skills would a specialist need in order to create a module like this?
Wowza Streaming Engine does not support compositing today (4.8.5), so I would recommend merging the streams into a single source before ingesting them.

Distributed computation in Clojure

My new project assignment is to extend existing distributed architecture with a new module for some mathematical calculations with a REST API front-end. The system is written in Java, and ZeroMQ is used for inter-process communication.
I would like to write at least parts of the new module in Clojure. Technically, it will consist of at least two submodules: one for the calculations themselves, another for sorting and filtering the results of those calculations. The basic requirement is that the system support distributed computation, so that it can run on as many machines as required for good performance. The initial advice was to use Apache Storm.
Would Storm work for designing a system with many submodules executing different types of tasks? What other libraries exist to make this possible with Clojure-based computation nodes?
If possible, I'd also be very happy to hear your general advice on how to approach this kind of application design with Clojure.
Thanks!

Internet connection speed vs. Programming language speed for HTTP Requests?

I know how to program in Python, but I am also interested in learning C++. I have heard that it is much faster than Python, and for the programs I am currently writing I would prefer them to run as quickly and efficiently as possible. I know a lot of that comes from just writing good code, but I was also wondering whether using another language, such as C++, would help.
While pondering this, I realized that since most of my programs will mainly be using the internet (implementing Google APIs and using the information from them to submit data to other websites), maybe the speed of the language doesn't matter much if the speed of my internet connection is always going to be roughly the same. I connect to the internet in two ways: Selenium (or some kind of automated browser) for things that require a browser, and plain HTTP requests.
How much difference would I see between Python and a different language, given that the major focus of my programs is the internet?
Thanks.
Scenarios
The main benefit you would get from using a language that is compiled to machine code is that you can do lots of byte- and bit-level magic: say, modifying image data, transforming audio, or analysing the indices of a genomic sequence database.
Typical tasks
When serving web pages you typically have problems of a completely different sort: you load a resource from disk and serve it directly if it is an image or audio file, or you run a series of transformation steps on a text resource until it becomes the final HTML document. The latter involves template engines, database queries, and so on.
If you look at that, you can see that most of the work, say 90-99%, is pretty high-level stuff: in Python you will use an API that has been optimized by many, many users for performance (meaning: time and space). "Open a file" is almost as fast in C as in Python, and so is reading from it and serving it to a socket. Transforming text data could be a bit faster in C++ than in Python, but... how fast does it have to be? A user is very likely willing to wait 200 ms, isn't he? And that is a lot of time for a nice high-level template engine to transform a bit of text.
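As a rough, machine-dependent illustration of that budget: even a naive pure-Python substitution over a few hundred kilobytes of template text finishes in a small fraction of 200 ms (the template and sizes here are hypothetical, chosen only to make the point):

```python
import time
from string import Template

# A hypothetical page fragment, repeated to build a few hundred KB of template.
page = Template("<li>Hello, $name! You have $count messages.</li>\n" * 10_000)

start = time.perf_counter()
html = page.substitute(name="Alice", count=42)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"rendered {len(html)} bytes in {elapsed_ms:.1f} ms")
```

Exact numbers vary by machine, but the rendering time is dwarfed by a single network round-trip, which is the point being made above.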
What C++ and Python can do for you
A typical Python web service is much faster to write and easier to deploy than a server written in C++. If you did it in C++, you would first need to handle sockets and connections, and for those you would either use an existing library or write your own handling. If you use an existing library (which I strongly recommend), you are basically not doing anything differently than Python does. If you write your own handling, there are many, many low-level things you can get wrong that will burn the performance you wish for. No, that is not an option.
If you need speed, and Python with a server and template framework is not enough, you should rethink your architectural approach. Then take a look at the C10k problem and write tiny pieces in C. But I cannot see many reasons not to use a high-level language like Python if you are only after performance in a medium-complexity web-serving application.
Summary: The difference
If you are just serving files from the hard drive, I would guess your Python program would even be faster than a hand-crafted C++ server. If you use a framework written in C or C++ and just drop in your static pages, I would guess you get something like a 2-5x boost over Python. Then again, if your web application is a bit more complex than serving static content, I estimate the difference will diminish very quickly and you will get at most a 1-2x speed gain.
It's not all about speed...
One note about another difference between C++ and Python one should not forget: since C++ is actually compiled and not as dynamic as Python, you gain a lot of static error analysis by using C++. Writing correct code is always difficult, but it can be done in both C++ and Python with good tests and static analysis; the latter is simpler in C++ (my opinion). If that is an issue for you, you may want to think again, but you asked about speed.

Concurrency model: Erlang vs Clojure

We are going to write a concurrent program in Clojure that will extract keywords from a huge amount of incoming mail and cross-check them against a database.
One of my teammates has suggested to use Erlang to write this program.
I should note that I am new to functional programming, so I am in some doubt about whether Clojure is a good choice for writing this program, or whether Erlang is more suitable.
Do you really mean concurrent or distributed?
If you mean concurrent (multi-threaded, multi-core etc.), then I'd say Clojure is the natural solution.
Clojure's STM model is well designed for multi-core concurrency, since it is very efficient at storing and managing shared state between threads. If you want to understand more, it is well worth watching this excellent video.
Clojure's STM allows safe mutation of data by concurrent threads. Erlang sidesteps this problem by making everything immutable, which is fine in itself but doesn't help when you genuinely need shared mutable state. If you want shared mutable state in Erlang, you have to implement it with a set of message interactions, which is neither efficient nor convenient (that's the price of a shared-nothing model...).
You will get inherently better performance with Clojure in a concurrent setting on a large machine, since Clojure doesn't rely on message passing and communication between threads can therefore be much more efficient.
If you mean distributed (i.e. many different machines sharing work over a network which are effectively running as isolated processes) then I'd say Erlang is the more natural solution:
Erlang's immutable, shared-nothing, message-passing style forces you to write code in a way that can be distributed. Idiomatic Erlang can therefore automatically be distributed across multiple machines and run in a distributed, fault-tolerant setting.
Erlang is therefore very well optimised for this use case, so would be the natural choice and would certainly be the quickest to get working.
Clojure could do it as well, but you will need to do much more work yourself (i.e. you'd either need to implement or choose some form of distributed computing framework) - Clojure does not currently come with such a framework by default.
In the long term, I hope that Clojure develops a distributed computing framework that matches Erlang - then you can have the best of both worlds!
The two languages and runtimes take different approaches to concurrency:
Erlang structures programs as many lightweight processes communicating with one another. In this case, you will probably have a master process sending jobs and data to many workers, plus further processes to handle the resulting data.
Clojure favors a design where several threads share data and state using common data structures. It sounds particularly suitable for cases where many threads access the same data (read-only) and share little mutable state.
You need to analyze your application to determine which model suits you best. This may also depend on the external tools you use -- for example, the ability of the database to handle concurrent requests.
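To make the contrast concrete, here is a hedged sketch in Python (standing in for both languages, with a lock as a crude substitute for Clojure's STM and a queue for Erlang's mailboxes): four threads mutating one shared counter versus four workers that only send messages to a consumer.

```python
import threading
import queue

# Model 1: shared mutable state (Clojure-style; a lock stands in for STM).
counter = 0
lock = threading.Lock()

def shared_state_worker(n):
    global counter
    for _ in range(n):
        with lock:               # every thread mutates the same value
            counter += 1

# Model 2: message passing (Erlang-style); workers own no shared data,
# they post results to a mailbox and a single consumer aggregates them.
mailbox = queue.Queue()

def message_passing_worker(n):
    mailbox.put(n)               # send a message instead of touching shared state

threads = [threading.Thread(target=shared_state_worker, args=(1000,)) for _ in range(4)]
threads += [threading.Thread(target=message_passing_worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_via_messages = sum(mailbox.get() for _ in range(4))
print(counter, total_via_messages)  # both models arrive at 4000
```

Both styles compute the same result; the difference is where the coordination cost lives, which is exactly the trade-off discussed above.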
Another practical consideration is that Clojure runs on the JVM, where many open-source libraries are available.
Clojure is a Lisp running on the JVM. Erlang is designed from the ground up to be highly fault-tolerant and concurrent.
I believe the task is doable with either of these languages and many others as well. Your experience will depend on how well you understand the problem and how well you know the language. If you are new to both, I'd say the problem will be challenging no matter which one you choose.
Have you thought about something like Lucene/Solr? It's great software for indexing and searching documents. I don't know what "cross checking" means for your context, but this might be a good solution to consider.
My approach would be to write a simple test in each language and measure the performance of each. Both languages are somewhat different from C-style languages, and if you aren't used to them (and you don't have a team that is), you may end up with a maintenance nightmare.
I'd also look at using something like Groovy 1.8. Groovy now includes GPars to enable parallel computing. String and file manipulation in Groovy is very easy indeed.
It depends on what you mean by huge.
Strings in Erlang are painful...
but:
If huge means tens of distributed machines, then go with Erlang and write the workers in text-friendly languages (Python? Perl?). You will have a distributed layer on top with highly concurrent local workers, each worker represented by an Erlang process. If you need more performance, rewrite a worker in C; from Erlang it is very easy to talk to other languages.
If huge still means one powerful machine, go with the JVM. That is not huge.
If huge means hundreds of machines, I think you will need something stronger and Google-like (Bigtable, MapReduce), probably on a C++ stack. Erlang is still OK, but you will need good developers to code it.

Something similar to ParallelPython for C++?

I need to do some extensive searching and string comparison, and for this I figure that a compiled program is much better than an interpreted one, especially after seeing some comparison studies. I came across ParallelPython, which was beautiful: it has auto-discovery for clusters and can pretty much do all the load balancing for me as well.
My first question is: is it a good idea to just go ahead with Python on a cluster of 20 nodes, or should I switch to C++? If I need to switch, is there a good alternative to ParallelPython for C++ that provides features like load balancing and auto-discovery of nodes?
I would suggest Open MPI. I do not know exactly what ParallelPython does, but Open MPI is an open-source implementation of the MPI standard for cluster computing, and I imagine it will provide the requested functionality.
You can always use ParallelPython for your high-level work and call into C++ code for the "hard-core" processing as needed.
That being said, there are options in the C++ world. The most common cluster technology is MPI. Some implementations provide load balancing and auto-discovery, though these features are not in the core spec.
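For a sense of what the per-node work unit looks like regardless of framework: the string comparison splits naturally into independent tasks, each returning a small result. A single-machine sketch using Python's standard multiprocessing pool (a hypothetical stand-in for what ParallelPython or MPI would distribute across nodes; the data and function names are illustrative):

```python
from multiprocessing import Pool

# Hypothetical corpus; on a cluster each node would hold its own shard.
HAYSTACKS = [
    "the quick brown fox",
    "pack my box",
    "lazy dogs sleep",
    "quick silver",
]

def count_matches(needle):
    """Per-task work unit: count corpus strings containing the needle."""
    return needle, sum(needle in h for h in HAYSTACKS)

if __name__ == "__main__":
    # The pool plays the role of the cluster scheduler: it farms tasks out
    # to workers and gathers the results in order.
    with Pool(processes=2) as pool:
        results = dict(pool.map(count_matches, ["quick", "box", "cat"]))
    print(results)  # {'quick': 2, 'box': 1, 'cat': 0}
```

Whether the workers are local processes, ParallelPython nodes, or MPI ranks, the shape of the computation (independent tasks, small aggregated results) is the same, which is why the language choice matters less than the decomposition.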