C/C++ Framework for distributed computing in a dynamic cluster

C/C++ Framework for distributed computing in a dynamic cluster - c++

I am looking for a framework to be used in a C++ distributed number crunching application.
The setup looks as follows:
There is a master node which divides the problem domain into small independent tasks. The tasks are distibuted to worker nodes of different capability (e.g. CPU type/GPU-enabled).
Worker nodes are dynamically added to the compute grid, as they become available. It may also happen that a worker node dies, without saying good bye.
I am searching for a fast C/C++ framework to accomplish this setup.
To summarize, my main requirements are:
Worker/Task-scheduling paradigm
Dynamically add/remove nodes
Target network: 1G - 10G ethernet (corporate network, good performance over internet not required)
Optional: Encrypted and authenticated communication

You can certainly do what you want with MPI. MPI-2 added dynamic process management features, and I think most of the currently widely-used implementations offer these.
One of the advantages of using C++ + MPI is that the combination is quite widely used in scientific and technical computing, though my impression is that within this niche dynamic process management is not used very much. Since MPI is used on the very largest supercomputers tackling the bleeding-edge problems of computational science, one might hazard a guess that it would be fast enough for your purposes.
One of the disadvantages of using C++ + MPI is that MPI was not designed to tolerate failure of processes during execution. There is debate on SO about whether or not the dynamic process management features allow you to program your own fault tolerance. But no debate that it might be difficult.
You would get the first 3 of your requirements 'out-of-the-box'. As for encrypted and authenticated communication, you'd have to do most of that yourself, MPI just passes messages around. I'd guess that for most MPI users, running parallel applications on clusters or supercomputers with private interconnects (often themselves isolated from corporate or enterprise networks), encryption and authentication are matters of little concern.

Related

Erlang concurrency/distribution - then vs. now

As to the question of how many nodes can be in an erlang system on a practical (not theoretical) level, I've seen answers ranging from 100 in most cases, to one answer which stated "150-200 max."
I was surprised on seeing this, because wasn't erlang designed for massive concurrency and distribution in order to implement telecom networks, phone switches, etc? If so, wouldn't you assume (I know I did) that this would entail more than 100 nodes in a system (I always assumed in the hundreds, possibly thousands)?
I guess my question is: What was considered "massive concurrency/distribution" back when these old telecoms used erlang? How many machines would they typically have connected together, running erlang and doing concurrency?
Just curious, and thanks for any answers.

you got the answer, for a cluster of node, with current technology, a practical limit is from 100 to 200 nodes: because we are speaking of almost transparent distribution. The reason for this limitation are explained in the documentation and in few words are due to the mutual survey of each nodes, so the bandwidth and resources available for your application are decreasing faster and faster.
To have more nodes, you must program the cooperation between cluster and/or single nodes. The libraries offer some facilities to do that but of course it is not transparent, and not erlang specific.
It is also recommended for security reason to avoid huge cluster: today in an erlang cluster you can do what you want in any other node without restriction.

It depends. It depends on lots of things you didn't specify or define, and I suspect that if you specified enough that a "real" answer was possible you would be disappointed because it wouldn't be useful. That's why these sorts of questions are generally discouraged.
You don't say what date range you mean by "when these old telecoms used Erlang". They still use it (it's never had traction outside of Ericsson and there was never a time when Ericsson used it significantly more than the present). Here's a video of them talking about using Erlang on their SGSN-MME: http://vimeo.com/44718243
You don't say what you mean by "an Erlang system". Is that a single machine? Erlang did not have SMP support when it started (is that the time frame you're asking about?). Do you mean concurrent processes?
Is that a single cluster using net_kernel:connect_node/1? How are you defining a cluster? Erlang clusters, by default, are a complete mesh. That limits the maximum size based on the performance limits of the network and the machine's interfaces. But you can connect nodes in a chain and then there's no limit. But if you count that as a cluster, why not count it when you use your own TCP connections instead of just net_kernel's. There are lots of Ericsson routers in use on the Internet, so we could think of the Internet as one "system" where many of its component routers are using Erlang.
In the video I linked, you can see that in the early 2000s, Ericsson's SGSN product was a single box (containing multiple machines) that could serve maybe a few thousand mobile phones simultaneously. We might assume that each connected phone had one Erlang process managing it, plus a negligible number of system processes.

c++ Distributed computing of an executable program

I was wondering if it is possible to run an executable program without adding to its source code, like running any game across several computers. When i was programming in c# i noticed a process method, which lets you summon or close any application or process, i was wondering if there was something similar with c++ which would let me transfer the processes of any executable file or game to other computers or servers minimizing my computer's processor consumption.
thanks.

Everything is possible, but this would require a huge amount of work and would almost for sure make your program painfully slower (I'm talking about a factor of millions or billions here). Essentially you would need to make sure every layer that is used in the program allows this. So you'd have to rewrite the OS to be able to do this, but also quite a few of the libraries it uses.
Why? Let's assume you want to distribute actual threads over different machines. It would be slightly more easy if it were actual processes, but I'd be surprised many applications work like this.
To begin with, you need to synchronize the memory, more specifically all non-thread-local storage, which often means 'all memory' because not all language have a thread-aware memory model. Of course, this can be optimized, for example buffer everything until you encounter an 'atomic' read or write, if of course your system has such a concept. Now can you imagine every thread blocking for synchronization a few seconds whenever a thread has to be locked/unlocked or an atomic variable has to be read/written?
Next to that there are the issues related to managing devices. Assume you need a network connection: which device will start this, how will the ip be chosen, ...? To seamlessly solve this you probably need a virtual device shared amongst all platforms. This has to happen for network devices, filesystems, printers, monitors, ... . And as you kindly mention games: this should happen for a GPU as well, just imagine how this would impact performance in only sending data from/to the GPU (hint: even 16xpci-e is often already a bottleneck).
In conclusion: this is not feasible, if you want a clustered application, you have to build it into the application from scratch.

I believe the closest thing you can do is MapReduce: it's a paradigm which hopefully will be a part of the official boost library soon. However, I don't think that you would want to apply it to a real-time application like a game.
A related question may provide more answers: https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
But as KillianDS pointed out, there is no automagical way to do this, nor does it seem like is there a feasible way to do it. So what is the exact problem that you're trying to solve?

The current state of research is into practical means to distribute the work of a process across multiple CPU cores on a single computer. In that case, these processors still share RAM. This is essential: RAM latencies are measured in nanoseconds.
In distributed computing, remote memory access can take tens if not hundreds of microseconds. Distributed algorithms explicitly take this into account. No amount of magic can make this disappear: light itself is slow.

The Plan 9 OS from AT&T Bell Labs supports distributed computing in the most seamless and transparent manner. Plan 9 was designed to take the Unix ideas of breaking jobs into interoperating small tasks, performed by highly specialised utilities, and "everything is a file", as well as the client/server model, to a whole new level. It has the idea of a CPU server which performs computations for less powerful networked clients. Unfortunately the idea was too ambitious and way beyond its time and Plan 9 remained largerly a research project. It is still being developed as open source software though.
MOSIX is another distributed OS project that provides a single process space over multiple machines and supports transparent process migration. It allows processes to become migratable without any changes to their source code as all context saving and restoration are done by the OS kernel. There are several implementations of the MOSIX model - MOSIX2, openMosix (discontinued since 2008) and LinuxPMI (continuation of the openMosix project).
ScaleMP is yet another commercial Single System Image (SSI) implementation, mainly targeted towards data processing and Hight Performance Computing. It not only provides transparent migration between the nodes of a cluster but also provides emulated shared memory (known as Distributed Shared Memory). Basically it transforms a bunch of computers, connected via very fast network, into a single big NUMA machine with many CPUs and huge amount of memory.
None of these would allow you to launch a game on your PC and have it transparently migrated and executed somewhere on the network. Besides most games are GPU intensive and not so much CPU intensive - most games are still not even utilising the full computing power of multicore CPUs. We have a ScaleMP cluster here and it doesn't run Quake very well...

OpenMP & MPI explanation

A few minutes ago I stumbled upon some text, which reminded me of something that has been wondering my mind for a while, but I had nowhere to ask.
So, in hope this may be the place, where people have hands on
experience with both, I was wondering if someone could explain
what is the difference between OpenMP and MPI ?
I've read the Wikipedia articles in whole, understood them in
segments, but am still pondering;
for a Fortran programmer who wishes one day to enter the world of
paralellism (just learning the basics of OpenMP now), what is the more
future-proof way to go ?
I would be grateful on all your comments

OpenMP is primarily for tightly coupled multiprocessing -- i.e., multiple processors on the same machine. It's mostly for things like spinning up a number of threads to execute a loop in parallel.
MPI is primarily for loosely couple multiprocessing -- i.e., a cluster of computers talking to each other via a network. It can be used on a single machine as kind of a degenerate form of a network, but it does relatively little to take advantage of its being a single machine (e.g., having extremely high bandwidth communication between the "nodes").
Edit (in response to comment): for a cluster of 24 machines, MPI becomes the obvious choice. As noted above (and similar to #Mark's comments) OpenMP is primarily for multiple processors that share memory. When you don't have shared memory, MPI becomes the clear choice.
At the same time, assuming you're going to be using multiprocessor machines (is there anything else anymore?) you might want to use OpenMP to spread the load in each machine across all its processors.
Keep in mind, however, that OpenMP is generally quite a lot quicker/easier to put into use than MPI. Depending on how much speedup you need, scaling up instead of out (i.e. a fewer machines with more processors each) can make the software development enough quicker/cheaper that it can be worthwhile even though it rarely gives the lowest price per core.

Another view, not inconsistent with what #Jerry has already written is that OpenMP is for shared-memory parallelisation and MPI is for distributed-memory parallelisation. Emulating shared-memory on distributed systems is rarely convincing or successful, but it's a perfectly reasonable approach to use MPI on a shared-memory system.
Of course, all (?) multicore PCs and servers are shared-memory systems these days so the execution model for OpenMP is widely applicable. MPI tends to come into its own on clusters on which processors communicate with each other over a network (which is sometimes called an interconnect and is often of a higher-spec than office Ethernet).
In terms of applications I would estimate that a large proportion of parallel programs can be successfully implemented with either OpenMP or MPI and that your choice between the two is probably best driven by the availability of hardware. Most of us (parallel-ists) would regard OpenMP as easier to get into than MPI, and it is certainly (I assert) easier to incrementally parallelise an existing program with OpenMP than with MPI.
However, if you need to use more processors than you can get in one box (and how many processors that is is increasing steadily) then MPI is your better choice. You may also stumble across the idea of hybrid programming -- for example if you have a cluster of multicore PCs you might use MPI between PCs, and OpenMP within a PC. I've not seen any evidence that the additional complexity of programming is rewarded by improved performance, and I've seen some evidence that it is definitely not worth the effort.
And, as one of the comments has already stated, I think that Fortran is future-proof enough in the domain of parallel, high-performance, scientific and engineering applications. The latest (2008) edition of the standard incorporates co-arrays (ie arrays which themselves are distributed across a memory system with non-local and local access) right into the language. There are even one or two early implementations of this feature. I don't yet have any experience of them and expect that there will be teething issues for a few years.
EDIT to pick up on a number of points in OP's comments ...
No, I don't think that it's a bad idea to approach parallel computing via OpenMP. I think that OpenMP and MPI (or, more accurately, the models of parallel computing that they implement) are complementary. I certainly use both, and I suspect that most professional parallel programmers do too. I hadn't done much OpenMP since leaving university about 6 years ago until about 2 years ago when multicores really started popping up everywhere. Now I probably do about equal amounts of both.
In terms of your further (self-)education I think that the book Using OpenMP by Chapman et al is better than the one by Chandra, if only because it is much more up to date. I think that the Chandra book pre-dates OpenMP 2, and the Chapman book pre-dates OpenMP 3 which is worth learning.
On the MPI side the books by Gropp et al, Using MPI and Using MPI-2 are indispensable; this is perhaps because they are (as far as I have found) the only tutorial introductions to MPI rather than because of their excellence. I don't think that they are bad, mind you but they don't have a lot of competition. I like Parallel Scientific Computing in C++ and MPI by Karniadakis and Kirby too; depending on your level of scientific computing knowledge though you may find much of the material too basic.
But what I think the field lacks entirely (hope someone can prove me wrong here ?) is a good textbook (or handful of textbooks) on the design of programs for parallel execution, something to help the experienced Fortran (in our case) programmer make the jump from serial to parallel program design. Lots of info on how to parallelise a loop or nest of loops, not so much on options for parallelising computations on structured positive semi-definite matrices (or whatever). For that level of information we have to dig quite hard into the research papers (ACM and IEEE digital libraries are well worth the modest annual costs -- if you are at an academic institution your library probably has subscriptions to these and a lot more, I'm lucky in that my employers pay for my professional society memberships and add-ons, but if they didn't I would pay myself).
As to your plans for a new lab with, say, 24 processors (CPUs ? or cores ?, doesn't really matter, just asking) then the route you take should depend on the depths of your pocket. If you can afford it I'd suggest:
-- Consider a shared-memory computer, certainly a year ago Sun, SGI and IBM could all supply a shared-memory system with that sort of number of cores, I'm not sure of the current state of the market but since you have until Feb to decide it's worth looking into. A shared-memory system gives you the shared-memory parallelism option, which a cluster doesn't, and message-passing on a shared-memory platform should run at light speed. (By the way, if you go this route, benchmark this aspect of the system, there have been some bad MPI implementations on shared-memory computers.) A good MPI implementation on a shared-memory computer (my last experience of this was on a 512 processor SGI Altix) doesn't send any messages, it just moves a few pointers around and is, consequently, blisteringly fast. Trouble with the Altix was that beyond 128 processors the memory bus tended to get overwhelmed by all the traffic; that was the time to switch to MPI on a cluster or an MPP box.
-- Again, if you can afford it, I'd recommend having a system integrator deliver you a working system, and avoid building a cluster (or whatever) yourself. If, like me, you are a programmer first and a reluctant system integrator way second, this is an easier approach and will deliver you a working system on which you can start programming far sooner.
If you can't afford the expensive options, then I'd go for as many rack-mounted servers with 4 or 8 cores per box (choice is price dependent, and maybe even 16 cores per box is worth considering today) and, today, I'd be planning for at least 4GB RAM per core. Then you need the fastest interconnect you can afford; GB Ethernet is fine, but Infiniband (or the other one whose name I forget) is finer, though the jump in price is noticeable. And you'll need a PC to act as head node for your new cluster, running the job management system and other stuff. There's a ton of excellent material on the Internet on building and running clusters, often under the heading of Beowulf, which was the name of what is considered to have been the first 'home-brew' cluster.
Now, since you have until February to get your lab up and running, fire 2 of your colleagues and turn their PCs into a mini-Beowulf. Download and install a likely-looking MPI installation (OpenMPI is good but there are others to consider and your o/s might dictate another choice). Now you can start getting ready for when the lab is ready.
PS You don't have to fire 2 people if you can scavenge 2 PCs some other way. And the PCs can be old and inadequate for desktop use they are just going to be a training platform for you and your colleagues (if you have any left). The more nearly identical they are the better.

As said above, OpenMP is certainly the easier way to program as compared to MPI because of incremental parallelization. OpenMP has been used mostly for fine grain parallelism (loop level), while MPI more of a coarse-grain parallelism (domain decomposition). Both are good ways to obtain parallel performance.
We have an OpenMP and an MPI versions of our software (Fortran) and Customers use both depending on their needs.
With the current trends in multi-core architecture, hybrid OpenMP-MPI is another viable approach.

What challenges promote the use of parallel/concurrent architectures?

I am quite excited by the possibility of using languages which have parallelism / concurrency built in, such as stackless python and erlang, and have a firm belief that we'll all have to move in that direction before too long - or will want to because it will be a good/easy way to get to scalability and performance.
However, I am so used to thinking about solutions in a linear/serial/OOP/functional way that I am struggling to cast any of my domain problems in a way that merits using concurrency. I suspect I just need to unlearn a lot, but I thought I would ask the following:
Have you implemented anything reasonably large in stackless or erlang or other?
Why was it a good choice? Was it a good choice? Would you do it again?
What characteristics of your problem meant that concurrent/parallel was right?
Did you re-cast an exising problem to take advantage of concurrency/parallelism? and
if so, how?
Anyone any experience they are willing to share?

in the past when desktop machines had a single CPU, parallelization only applied to "special" parallel hardware. But these days desktops have usually from 2 to 8 cores, so now the parallel hardware is the standard. That's a big difference and therefore it is not just about which problems suggest parallelism, but also how to apply parallelism to a wider set of problems than before.
In order to be take advantage of parallelism, you usually need to recast your problem in some ways. Parallelism changes the playground in many ways:
You get the data coherence and locking problems. So you need to try to organize your problem so that you have semi-independent data structures which can be handled by different threads, processes and computation nodes.
Parallelism can also introduce nondeterminism into your computation, if the relative order in which the parallel components do their jobs affects the results. You may need to protect against that, and define a parallel version of your algorithm which is robust against different scheduling orders.
When you transcend intra-motherboard parallelism and get into networked / cluster / grid computing, you also get the issues of network bandwidth, network going down, and the proper management of failing computational nodes. You may need to modify your problem so that it becomes easier to handle the situations where part of the computation gets lost when a network node goes down.

Before we had operating systems people building applications would sit down and discuss things like:
how will we store data on disks
what file system structure will we use
what hardware will our application work with
etc, etc
Operating systems emerged from collections of 'developer libraries'.
The beauty of an operating system is that your UNWRITTEN software has certain characteristics, it can:
talk to permanent storage
talk to the network
run in a command line
be used in batch
talk to a GUI
etc, etc
Once you have shifted to an operating system - you don't go back to the status quo ante...
Erlang/OTP (ie not Erlang) is an application system - it runs on two or more computers.
The beauty of an APPLICATION SYSTEM is that your UNWRITTEN software has certain characteristics, it can:
fail over between two machines
work in a cluster
etc, etc...
Guess what, once you have shifted to an Application System - you don't go back neither...
You don't have to use Erlang/OTP, Google have a good Application System in their app engine, so don't get hung up about the language syntax.
There may well be good business reasons to build on the Erlang/OTP stack not the Google App Engine - the biz dev guys in your firm will make that call for you.

The problems will stay almost the same inf future, but the underlying hardware for the realization is changing. To use this, the way of compunication between objects (components, processes, services, how ever you call it) will change. Messages will be sent asynchronously without waiting for a direct response. Instead after a job is done the process will call the sender back with the answer. It's like people working together.
I'm currently designing a lightweighted event-driven architecture based on Erlang/OTP. It's called Tideland EAS. I'm describing the ideas and principles here: http://code.google.com/p/tideland-eas/wiki/IdeasAndPrinciples. It's not ready, but maybe you'll understand what I mean.
mue

Erlang makes you think of the problem in parallel. You won't forget it one second. After a while you adapt. Not a big problem. Except the solution become parallel in every little corner. All other languages you have to tweak. To be concurrent. And that doesn't feel natural. Then you end up hating your solution. Not fun.
The biggest advantages Erlang have is that it got no global garbage collect. It will never take a break. That is kind of important, when you have 10000 page views a second.

MPI or Sockets?

I'm working on a loosely coupled cluster for some data processing. The network code and processing code is in place, but we are evaluating different methodologies in our approach. Right now, as we should be, we are I/O bound on performance issues, and we're trying to decrease that bottleneck. Obviously, faster switches like Infiniband would be awesome, but we can't afford the luxury of just throwing out what we have and getting new equipment.
My question posed is this. All traditional and serious HPC applications done on clusters is typically implemented with message passing versus sending over sockets directly. What are the performance benefits to this? Should we see a speedup if we switched from sockets?

MPI MIGHT use sockets. But there are also MPI implementation to be used with SAN (System area network) that use direct distributed shared memory. That of course if you have the hardware for that. So MPI allows you to use such resources in the future. On that case you can gain massive performance improvements (on my experience with clusters back at university time, you can reach gains of a few orders of magnitude). So if you are writting code that can be ported to higher end clusters, using MPI is a very good idea.
Even discarding performance issues, using MPI can save you a lot of time, that you can use to improve performance of other parts of your system or simply save your sanity.

I would recommend using MPI instead of rolling your own, unless you are very good at that sort of thing. Having wrote some distributed computing-esque applications using my own protocols, I always find myself reproducing (and poorly reproducing) features found within MPI.
Performance wise I would not expect MPI to give you any tangible network speedups - it uses sockets just like you. MPI will however provide you with much the functionality you would need for managing many nodes, i.e. synchronisation between nodes.

Performance is not the only consideration in this case, even on high performance clusters. MPI offers a standard API, and is "portable." It is relatively trivial to switch an application between the different versions of MPI.
Most MPI implementations use sockets for TCP based communication. Odds are good that any given MPI implementation will be better optimized and provide faster message passing, than a home grown application using sockets directly.
In addition, should you ever get a chance to run your code on a cluster that has InfiniBand, the MPI layer will abstract any of those code changes. This is not a trivial advantage - coding an application to directly use OFED (or another IB Verbs) implementation is very difficult.
Most MPI applications include small test apps that can be used to verify the correctness of the networking setup independently of your application. This is a major advantage when it comes time to debug your application. The MPI standard includes the "pMPI" interfaces, for profiling MPI calls. This interface also allows you to easily add checksums, or other data verification to all the message passing routines.

Message Passing is a paradigm not a technology. In the most general installation, MPI will use sockets to communicate. You could see a speed up by switching to MPI, but only in so far as you haven't optimized your socket communication.
How is your application I/O bound? Is it bound on transferring the data blocks to the work nodes, or is it bound because of communication during computation?
If the answer is "because of communication" then the problem is you are writing a tightly-coupled application and trying to run it on a cluster designed for loosely coupled tasks. The only way to gain performance will be to get better hardware (faster switches, infiniband, etc. )... maybe you could borrow time on someone else's HPC?
If the answer is "data block" transfers then consider assigning workers multiple data blocks (so they stay busy longer) & compress the data blocks before transfer. This is a strategy that can help in a loosely coupled application.

MPI has the benefit that you can do collective communications. Doing broadcasts/reductions in O(log p) /* p is your number of processors*/ instead of O(p) is a big advantage.

I'll have to agree with OldMan and freespace. Unless you know of a specific and improvement to some useful metric (performance, maintainability, etc.) over MPI, why reinvent the wheel. MPI represents a large amount of shared knowledge regarding the problem you are trying to solve.
There are a huge number of issues you need to address which is beyond just sending data. Connection setup and maintenance will all become your responsibility. If MPI is the exact abstraction (it sounds like it is) you need, use it.
At the very least, using MPI and later refactoring it out with your own system is a good approach costing the installation and dependency of MPI.
I especially like OldMan's point that MPI gives you much more beyond simple socket communication. You get a slew of parallel and distributed computing implementation with a transparent abstraction.

I have not used MPI, but I have used sockets quite a bit. There are a few things to consider on high performance sockets. Are you doing many small packets, or large packets? If you are doing many small packets consider turning off the Nagle algorithm for faster response:
setsockopt(m_socket, IPPROTO_TCP, TCP_NODELAY, ...);
Also, using signals can actually be much slower when trying to get a high volume of data through. Long ago I made a test program where the reader would wait for a signal, and read a packet - it would get a bout 100 packets/sec. Then I just did blocking reads, and got 10000 reads/sec.
The point is look at all these options, and actually test them out. Different conditions will make different techniques faster/slower. It's important to not just get opinions, but to put them to the test. Steve Maguire talks about this in "Writing Solid Code". He uses many examples that are counter-intuitive, and tests them to find out what makes better/faster code.

MPI uses sockets underneath, so really the only difference should be the API that your code interfaces with. You could fine tune the protocol if you are using sockets directly, but thats about it. What exactly are you doing with the data?

MPI Uses sockets, and if you know what you are doing you can probably get more bandwidth out of sockets because you need not send as much meta data.
But you have to know what you are doing and it's likely to be more error prone. essentially you'd be replacing mpi with your own messaging protocol.

For high volume, low overhead business messaging you might want to check out
OAMQ with several products. The open source variant OpenAMQ supposedly runs the trading at JP Morgan, so it should be reliable, shouldn't it?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js