Why does mpirun duplicate the program by default? - c++

I am new to Open MPI and have trouble understanding the concepts. (I found this pretty helpful)
1. Could anyone briefly explain why we use Open MPI? To my understanding, Open MPI is used to parallelize those sections of the code which can run in parallel.
2. Why does mpirun duplicate a single program? Simply because my laptop is dual core?
3. What changes do I need to apply to the code to make it run correctly? I mean ONE program parallelized across the two available cores, not 2 similar threads of the same program.

MPI is primarily of benefit in a multiple-machine environment, where you must run multiple processes. The duplication is by design, not because your laptop is dual core: MPI follows the SPMD (single program, multiple data) model, in which every process runs the same executable and decides what to do based on its rank; the number of copies is set with mpirun -np N.
It requires heavy modification of the program.

Related

How to run a parallel program from within a fortran code already launched with srun on SLURM?

I think my question is pretty specific and niche, and I couldn't find an answer anywhere else.
I have a parallel code in Fortran (using MPI), and I would like a subroutine on each individual processor to call another (in principle serial) program during runtime. I do this with EXECUTE_COMMAND_LINE. Now it turns out the other code I'm calling is also parallelized, with no possibility of producing a purely serial version without MPI. In my SLURM file, the cluster is set up such that I have to use srun, so
srun ./mycode < input.in > output.out
calls my code. In the 3rd party code, however, the easiest way to specify the number of cores is to use the provided launcher, which itself uses mpirun to launch the right number of nodes.
In principle, it is possible to run the 3rd party code without mpirun, in which case it should launch a "serial" version (parallel version but on a single core). However, as my code is already being run with srun, it looks like this is triggering the parallel version of the 3rd party software to run on multiple processors, which is ruining what I'm trying to do with this. If I use the normal launcher that calls mpirun to invoke the 3rd party code, everything hangs because mpirun is waiting for the first instance of srun to complete, which it never will.
Is there any way I can tell the 3rd party code (which doesn't have a flag to specify this explicitly without invoking mpirun) to run on a single processor? Perhaps an environment variable I can set, or a way of using EXECUTE_COMMAND_LINE that would specify the number of cores to run the command on? Or even a way to make multiple mpirun commands interact without preventing each other from running?
I use Intel compilers and MPI versions for everything.
A colleague found one way to do this, for anyone struggling:
call execute_command_line("bash -lc 'env -i PATH=/usr/bin/:/bin mpirun -n 2 ./bin/slave &> slave.out &'", wait=.false.)
executed from within the calling Fortran code. The env -i strips the inherited SLURM/MPI environment, so the inner mpirun starts cleanly instead of attaching itself to the outer srun job.

Detecting not using MPI when running with mpirun/mpiexec

I am writing a program (in C++11) that can optionally be run in parallel using MPI. The project uses CMake for its configuration, and CMake automatically disables MPI if it cannot be found and displays a warning message about it.
However, I am worried about a perfectly plausible use case whereby a user configures and compiles the program on an HPC cluster, forgets to load the MPI module, and does not notice the warning. That same user might then try to run the program, notice that mpirun is not found, load the MPI module, but forget to recompile. If the user then runs the program with mpirun, this will work, but the program will just run a number of times without any parallelization, as MPI was disabled at compile time. To prevent the user from thinking the program is running in parallel, I would like to make the program display an error message in this case.
My question is: how can I detect that my program is being run in parallel without using MPI library functions (as MPI was disabled at compile time)? mpirun just launches the program a number of times, but does not tell the processes it launches about them being run in parallel, as far as I know.
I thought about letting the program write some test file, and then check if that file already exists, but apart from the fact that this might be tricky to do due to concurrency problems, there is no guarantee that mpirun will even launch the various processes on nodes that share a file system.
I also considered using a system variable to communicate between the two processes, but as far as I know, there is no system independent way of doing this (and again, this might cause concurrency issues, as there is no way to coordinate system calls between the various processes).
So at the moment, I have run out of ideas, and I would very much appreciate any suggestions that might help me achieve this. Preferred solutions should be operating-system independent, although a UNIX-only solution would already be of great help.
Basically, you want to run a detection of whether you are being run by mpirun etc. in your non-MPI code path. There is a very similar question: How can my program detect, whether it was launch via mpirun that already presents one non-portable solution.
Check for environment variables that are set by mpirun. See e.g.:
http://www.open-mpi.org/faq/?category=running#mpi-environmental-variables
As another option, you could get the process id of the parent process and its process name and compare it with a list of known MPI launcher binaries such as orted, slurmstepd, hydra¹. Everything about that is unfortunately again non-portable.
Since launching itself is not clearly defined by the MPI standard, there cannot be a standard way to detect it.
¹ Only from my memory; please don't take the list literally.
From a user experience point of view, I would argue that always showing a clear message how the program is being run, such as:
Running FancySimulator serially. If you see this as part of mpirun, rebuild FancySimulator with FANCYSIM_MPI=True.
or
Running FancySimulator in parallel with 120 MPI processes.
would "solve" the problem. A user getting 120 garbled messages will hopefully notice.

parallel processing library

I would like to know which parallel processing library to be best used under these configurations:
A single quad-core machine. I would like to execute four functions of the same type, one on each core. The same function takes different arguments.
A cluster of 4 machines, each with multiple cores. I would like to execute the same functions, but n-parallel (4 machines × number of cores in each machine). So I want it to scale.
Program details :
C++ program. There is no dependency between functions. The same function gets executed with different sets of inputs, more than 100 times in total.
There is no shared memory, as each function takes its own data and its own inputs.
No function needs to wait for the others to complete; there is no need for join or fork.
For the above scenarios, which parallel libraries are best: MPI, Boost.MPI, OpenMP, or others?
My preference would be Boost.MPI, but I want some recommendations. I am not sure whether MPI can even be used on a single multi-core machine.
Thanks.
What you have here is an embarrassingly parallel problem (http://en.wikipedia.org/wiki/Embarrassingly_parallel). While MPI can definitely be used on a multi-core machine, it could be overkill for the problem at hand. If your tasks are completely separate, you could just compile them into separate executables, or a single executable run with different inputs, and use "make -j [n]" (see http://www.gnu.org/software/make/manual/html_node/Parallel.html) to execute them in parallel.
If MPI comes naturally to you, by all means, use it. OpenMP probably won't cut it if you want to control computing on separate computers within a cluster.

CUDA: running programs with OpenMP

Is it possible to run a program with OpenMP on a GPU using CUDA or something else?
I have a concurrent program, but my computer has only 2 cores.
I need to test the program on 8 or more cores.
Thanks for help!
There is OpenACC which is kind of similar to OpenMP, although of course adapted to the very different asymmetric situation of CPU+GPU.
If your purpose however is to test OpenMP code, the answer is a definite NO. You can't take the same program and run it on a GPU, and it would not execute the same way anyway.
Your best bet probably is to execute the OpenMP program with OMP_NUM_THREADS=8, which will start 8 threads even if only 2 cores are available. Some aspects (e.g. lock contention) will still differ from a real 8-core system, though.

Executing C++ program on multiple processor machine

I developed a program in C++ for research purposes. It takes several days to complete.
Now I am executing it on our lab's 8-core server machine to get results quickly, but I see the machine assigns only one processor to my program and it stays at 13% processor usage (even though I set the process priority to high and the affinity to all 8 cores).
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
Thanks in advance.
Partition your code into chunks you can execute in parallel.
You need to go read about data parallelism and task parallelism. Then you can use OpenMP or MPI to break up your program.
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
By using more threads. No matter how powerful the computer is, it cannot spread a thread across more than one processor. Find independent portions of your program and run them in parallel.
C++0x threads
Boost threads
OpenMP
I personally consider OpenMP a toy. You should probably go with one of the other two.
You have to exploit parallelism explicitly by splitting your code into multiple tasks that can be executed independently, and then use either thread primitives directly or a higher-level parallelization framework such as OpenMP.
If you don't want to make your program itself use multithreaded libraries or techniques, you might be able to break your work up into several independent chunks. Then run multiple copies of your program, each assigned a different chunk via different command-line parameters.
As for generally improving a program's performance: there are profiling tools that can help you find bottlenecks in memory usage, I/O, and CPU:
https://stackoverflow.com/questions/tagged/c%2b%2b%20profiling
Profiling won't help split your work across cores, but an 8× speedup in an algorithm might help more than multithreading would on 8 cores. Just something else to consider.