Running MPI on a cluster of hosts with different platforms [duplicate] - c++

In my lab, we have several servers used for simulation programs, but they work independently. Now I want to combine them into a cluster using MPICH so that they can communicate. But there is a problem: these servers run different OSs. Some of them are Red Hat, and some of them are Ubuntu. On the MPICH homepage I saw that the downloads for these two operating systems are different, so is it possible to set up a cluster with different operating systems, and if so, how?
The reason I don't want to reinstall these servers is that they hold too much data and they are in use as I ask this question.

It is not feasible to get this working properly. You should be able to get the same version of an MPI implementation manually installed on the different distributions, and they might even talk to each other properly. But as soon as you try to run actual applications with dynamic libraries, you will get into trouble with different versions of shared libraries, glibc, and so on. You will be tempted to link everything statically or to build different binaries for the different distributions. At the end of the day, you will just chase one issue after another.
As a side note, combining some servers with MPI does not make a High Performance Computing cluster. For instance, an HPC system has a sophisticated high performance interconnect and a high performance parallel file system.
Also note that your typical HPC application is going to run poorly on heterogeneous hardware (as in each node has different CPU / memory configurations).
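If you decide to try it anyway, a useful first sanity check is to have every rank report which host it is on and which MPI library it was linked against, so version mismatches between the distributions show up immediately. This is only a minimal sketch and assumes an MPI 3 implementation (for MPI_Get_library_version), e.g. a recent MPICH:

    // Minimal sketch: each rank reports the host it runs on and the MPI
    // library version it was linked against, which makes mismatched installs
    // across distributions visible immediately. Assumes an MPI 3 library
    // (MPI_Get_library_version), e.g. a recent MPICH.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char host[MPI_MAX_PROCESSOR_NAME];
        int host_len = 0;
        MPI_Get_processor_name(host, &host_len);

        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int version_len = 0;
        MPI_Get_library_version(version, &version_len);

        std::printf("rank %d on %s: %.60s\n", rank, host, version);

        MPI_Finalize();
        return 0;
    }

Build it with mpicxx on each distribution and launch it with a hostfile listing all servers; if the reported library versions differ between hosts, that is the first mismatch you will have to sort out.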

Related

Running two IncrediBuild builds on the same machine?

Is it possible to use IncrediBuild to simultaneously build two different clones on the same machine?
Usually building (not rebuilding) one clone on my system takes 1.5 hours, and most of the time I work with two clones for two different bugs/features. If I could run the builds simultaneously on two different clones, it would be very helpful.
Yes, you can use IncrediBuild to build simultaneously from two different clones (this behavior works out-of-the-box and there is nothing special you'll need to do in order to achieve that).
If you have multiple IncrediBuild Helpers (remote machines that contribute their idle CPU cycles to accelerate the builds running on the machines that initiated them), each clone will try to use as many Helper machines as it requires in order to optimize build performance.
In your scenario, the available Helper machines will split their idle resources between the two clones from which you initiate your builds.
We've recently introduced a beta Enterprise version of IncrediBuild that allows you (among other features) to execute two builds in parallel from the same initiating machine (as opposed to two different clones, which was always supported in the product). In a scenario where multiple builds are executed simultaneously from the same initiating machine, each build can potentially use hundreds of cores and gigabytes of memory you already own across your network to accelerate both builds.
Disclaimer: the writer of this answer is part of the IncrediBuild team.

Is there an issue in Unix with running one C++ binary multiple times in parallel?

I have a compiled binary of a server written in C++ on my Unix system. I want to dynamically start servers using a small script. I can specify the port the server will be running on as an argument. My question is whether there is a problem with starting one binary multiple times and letting the instances run in parallel, instead of copying the binary. In my test it worked as expected, but I want to be sure there are no problems.
First, you can run multiple instances of the same binary on almost all operating systems; you do not need to copy it. However, there is a deeper issue.
It all depends on how the application is written. In a perfect world you would not have any issue, but the world isn't perfect. An application might use a system-wide resource and assume it has exclusive use of that resource. This is not unheard of for larger applications such as servers. You already mentioned one such resource, the port, and as you said you can change that, but are you sure it is the only one? If you are sure, then you can run multiple instances without an issue. However, there are other resources the application might assume it has exclusive use over, files being one example, and running multiple copies breaks that assumption. The application would then most likely not behave as expected.
Most operating systems allow you to run as many instances of the same program as you wish. It is the program's responsibility to enforce any limit on the number of instances, if there is one.
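To make that concrete, here is a minimal sketch (assuming a POSIX system; the names and structure are illustrative, not the asker's actual server) of the pattern in the question: the same binary started several times, each instance given its own port on the command line. Any other resource the server touches, such as lock files, temp paths, or a shared cache, has to be separated per instance in the same way.

    // Sketch of the pattern from the question: the same binary started several
    // times, each instance told on the command line which port to use. The port
    // is the per-instance resource; bind() fails if another instance already
    // owns it, which is exactly the kind of conflict to watch out for.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char **argv) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s <port>\n", argv[0]);
            return 1;
        }
        int port = std::atoi(argv[1]);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { std::perror("socket"); return 1; }

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(static_cast<uint16_t>(port));

        if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
            std::perror("bind");  // e.g. "Address already in use"
            return 1;
        }
        listen(fd, 16);
        std::printf("instance %d listening on port %d\n",
                    static_cast<int>(getpid()), port);

        pause();   // stands in for the real server loop
        close(fd);
        return 0;
    }

Starting two instances with the same port makes the second bind() fail immediately, which is the benign version of the exclusive-resource problem; clashes over files are usually silent and therefore worse.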

c++ Distributed computing of an executable program

I was wondering if it is possible to run an executable program, without adding to its source code, across several computers, like running any game. When I was programming in C#, I noticed a Process method that lets you start or close any application or process. I was wondering if there is something similar in C++ that would let me transfer the processes of any executable file or game to other computers or servers, minimizing my own computer's processor consumption.
Thanks.
Everything is possible, but this would require a huge amount of work and would almost certainly make your program painfully slow (I'm talking about a factor of millions or billions here). Essentially, you would need to make sure every layer the program uses allows this. So you'd have to rewrite the OS to be able to do this, but also quite a few of the libraries it uses.
Why? Let's assume you want to distribute actual threads over different machines. It would be slightly easier if they were actual processes, but I'd be surprised if many applications work like that.
To begin with, you need to synchronize the memory, more specifically all non-thread-local storage, which often means 'all memory' because not all languages have a thread-aware memory model. Of course, this can be optimized, for example by buffering everything until you encounter an 'atomic' read or write, if your system has such a concept. Now, can you imagine every thread blocking for a few seconds of synchronization whenever a thread has to be locked/unlocked or an atomic variable has to be read/written?
Next to that, there are the issues related to managing devices. Assume you need a network connection: which machine opens it, how is the IP address chosen, and so on? To solve this seamlessly you would probably need a virtual device shared amongst all platforms. This has to happen for network devices, filesystems, printers, monitors, and so on. And since you mention games: it would have to happen for the GPU as well; just imagine how this alone would impact performance when sending data to and from the GPU (hint: even 16x PCIe is often already a bottleneck).
In conclusion: this is not feasible. If you want a clustered application, you have to build that into the application from scratch.
I believe the closest thing you can do is MapReduce: it's a paradigm that will hopefully become part of the official Boost library soon. However, I don't think you would want to apply it to a real-time application like a game.
A related question may provide more answers: https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
But as KillianDS pointed out, there is no automagical way to do this, nor does there seem to be a feasible way to do it. So what is the exact problem that you're trying to solve?
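For a feel of what the map-reduce paradigm looks like, here is a tiny single-machine sketch in plain C++ (the canonical word count). It is only an illustration of the programming model; a real framework distributes the map and reduce phases across machines, which is exactly the part you cannot bolt onto an existing executable:

    // Single-machine sketch of the map-reduce idea: "map" turns each input
    // record into key/value pairs, "reduce" combines the values per key.
    // A real framework distributes both phases across machines; this only
    // illustrates the shape of the computation (word count).
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::string> documents = {
            "the quick brown fox", "the lazy dog", "the quick dog"};

        // Map phase: emit (word, 1) for every word in every document.
        std::vector<std::pair<std::string, int>> pairs;
        for (const std::string &doc : documents) {
            std::istringstream words(doc);
            std::string word;
            while (words >> word) pairs.emplace_back(word, 1);
        }

        // Reduce phase: sum the counts per key (word).
        std::map<std::string, int> counts;
        for (const auto &kv : pairs) counts[kv.first] += kv.second;

        for (const auto &kv : counts)
            std::cout << kv.first << ": " << kv.second << '\n';
        return 0;
    }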
The current state of research focuses on practical means of distributing the work of a process across multiple CPU cores on a single computer. In that case, the processors still share RAM. This is essential: RAM latencies are measured in nanoseconds.
In distributed computing, remote memory access can take tens if not hundreds of microseconds. Distributed algorithms explicitly take this into account. No amount of magic can make this disappear: light itself is slow.
The Plan 9 OS from AT&T Bell Labs supports distributed computing in the most seamless and transparent manner. Plan 9 was designed to take the Unix ideas of breaking jobs into interoperating small tasks, performed by highly specialised utilities, and "everything is a file", as well as the client/server model, to a whole new level. It has the idea of a CPU server which performs computations for less powerful networked clients. Unfortunately the idea was too ambitious and way ahead of its time, and Plan 9 remained largely a research project. It is still being developed as open source software though.
MOSIX is another distributed OS project that provides a single process space over multiple machines and supports transparent process migration. It allows processes to become migratable without any changes to their source code as all context saving and restoration are done by the OS kernel. There are several implementations of the MOSIX model - MOSIX2, openMosix (discontinued since 2008) and LinuxPMI (continuation of the openMosix project).
ScaleMP is yet another commercial Single System Image (SSI) implementation, mainly targeted towards data processing and High Performance Computing. It not only provides transparent migration between the nodes of a cluster but also provides emulated shared memory (known as Distributed Shared Memory). Basically it transforms a bunch of computers, connected via a very fast network, into a single big NUMA machine with many CPUs and a huge amount of memory.
None of these would allow you to launch a game on your PC and have it transparently migrated and executed somewhere on the network. Besides, most games are GPU-intensive and not so much CPU-intensive; most games still don't even utilise the full computing power of multicore CPUs. We have a ScaleMP cluster here and it doesn't run Quake very well...

Possible to distribute an MPI (C++) program across the internet rather than within a LAN cluster?

I've written some MPI code which works flawlessly on large clusters. Each node in the cluster has the same CPU architecture and has access to a networked (i.e. 'common') file system (so that each node can execute the actual binary). But consider this scenario:
I have a machine in my office with a dual core processor (Intel).
I have a machine at home with a dual core processor (AMD).
Both machines run Linux, and both machines can successfully compile and run the MPI code locally (i.e. using 2 cores).
Now, is it possible to link the two machines together via MPI, so that I can utilise all 4 cores, bearing in mind the different architectures, and bearing in mind the fact that there are no shared (networked) filesystems?
If so, how?
Thanks,
Ben.
It's possible to do this. Most MPI implementations allow you to specify the location of the binary to be run on different machines. Alternatively, make sure that it is in your path on both machines. Since both machines have the same byte order, that shouldn't be a problem. You will have to make sure that any input data the individual processes read is available in both locations.
There are lots of complications with doing this. You need to make sure that the firewalls between the systems allow process startup and communication. Communication between the machines is going to be much slower, so if your code is communication-heavy or latency-intolerant, it will probably be quite slow. Most likely your execution time on all 4 cores will be longer than just running with 2 on a single machine.
There is no geographical limitation on where the processes are located. And as KeithB said, there is no need to have a common path or even the same binary on both machines. Depending on which MPI implementation you are using, you don't even need the same endianness.
You can specify the exact path to the binary on each machine and have two independent binaries as well. However, you should note that the program will run slowly if the communication infrastructure between the two nodes is not fast enough.
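If you want to know up front whether the internet link will dominate, a quick ping-pong test tells you the latency MPI actually sees between the two machines. A rough sketch, meant to be run with one process on each machine:

    // Rough ping-pong sketch: measures the round-trip latency between rank 0
    // and rank 1, which is the number that decides whether spreading the job
    // across the internet is worthwhile. Run with exactly two processes, one
    // on each machine.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) std::fprintf(stderr, "run with at least 2 processes\n");
            MPI_Finalize();
            return 1;
        }

        const int iterations = 1000;
        char byte = 0;

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iterations; ++i) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            std::printf("average round trip: %.3f ms\n",
                        1000.0 * elapsed / iterations);

        MPI_Finalize();
        return 0;
    }

On a LAN the round trip is typically in the tens to hundreds of microseconds; over a home internet connection it is easily tens of milliseconds, several orders of magnitude more, which your 4-core run has to amortise.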

Best hardware/software solution for parallel makes?

We've got a Linux-based build system in which a build consists of many different embedded targets (with correspondingly different drivers and feature sets enabled), each one built from a single main source tree.
Rather than try to convert our make-based system to something more multiprocess-friendly, we want to just find the best way to fire off builds for all of these targets simultaneously. What I'm not sure about is how to get the best performance.
I've considered the following possible solutions:
Lots of individual build machines. Downsides: lots of copies of the shared code, or working from a (slow) shared drive. More systems to maintain.
A smaller number of multiprocessor machines (dual quadcores, perhaps), with fast striped RAID local storage. Downsides: I'm unsure of how it will scale. It seems that the volume would be the bottleneck, but I don't know how well Linux handles SMP these days.
A similar SMP machine, but with a hypervisor or Solaris 10 running VMware. Is this silly, or would it provide some scheduling benefits? Downsides: Doesn't address the storage bottleneck issue.
I intend to just sit down and experiment with these possibilities, but I wanted to check to see if I've missed anything. Thanks!
As far as software solutions go, I can recommend Icecream. It is maintained by SUSE and builds on distcc.
We used it very successfully at my previous company, which had similar build requirements to what you describe.
If you're interested in fast incremental performance, then the cost of working out which files need to be rebuilt will dominate the actual compile time, and this puts higher demands on efficient I/O between the machines.
However, if you're mostly interested in fast full rebuilds (nightly builds, for example), then you may be better off rsyncing the source tree out to each build slave, or even having each build slave check out its own copy from source control. A CI system such as Hudson would help to manage each of the slave build servers.
If your makefiles are sufficiently complete and well-structured, the -j flag may also be useful in overcoming I/O bottlenecks, if your build machine(s) have enough memory. This lets make run multiple independent tasks in parallel, so that your CPUs will ideally never block waiting on I/O. Generally, I've found good results by allowing several more tasks than I have CPUs in a machine.
It's not clear from your question if your current makefiles are not amenable to this, or if you just don't want to jump to something entirely different than make.