I am trying to write a hybrid OpenMP/MPI program and want to understand the relationship between the number of OpenMP threads and the number of MPI processes, so I wrote a small test program:
#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>

int main(int args, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&args, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();
    }

    MPI_Finalize();
    return 0;
}
which is compiled using
mpicxx -fopenmp -std=c++17 -o omp_mpi source/main.cpp -lgomp
with gcc 9.3.1 and Open MPI 3.
Now, when executing it on an i7-6700 (4c/8t) via ./omp_mpi, I get the following output
I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
i.e. as expected.
When executing it using mpirun -n 1 omp_mpi I would expect the same, but instead I get
I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
Where are the other threads? When running it with two MPI processes instead, I get
I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
i.e. still only two OpenMP threads, but when running it with four MPI processes, I get
I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
Now suddenly I get eight OpenMP threads per MPI process. Where does that change come from?
You are observing an interaction between a peculiarity of Open MPI and the GNU OpenMP runtime, libgomp.
First, the number of threads in OpenMP is controlled by the num-threads ICV (internal control variable), and the way to set it is either to call omp_set_num_threads() or to set OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is not called, the runtime is free to choose whatever it deems reasonable as a default. In the case of libgomp, the manual says:
OMP_NUM_THREADS
Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined one thread per CPU is used.
What it fails to mention is that it uses various heuristics to determine the right number of CPUs. On Linux and Windows, the process affinity mask is used for that (if you like to read code, the one for Linux is right here). If the process is bound to a single logical CPU, you only get one thread:
$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
If you bind it to several logical CPUs, their count is used:
$ taskset -c 0,2,5 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
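To see this from inside a process, here is a minimal, Linux-specific sketch (my own illustration, not part of the question's program) that prints the number of logical CPUs in the inherited affinity mask next to the default thread count the OpenMP runtime will use; running it under taskset or mpirun shows both numbers move together. Build with g++ -fopenmp.
#include <sched.h>
#include <omp.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        std::perror("sched_getaffinity");
        return 1;
    }
    // Logical CPUs the process is allowed to run on (e.g. restricted by taskset or mpirun).
    std::printf("logical CPUs in affinity mask: %d\n", CPU_COUNT(&mask));
    // With OMP_NUM_THREADS unset, libgomp derives its default from that mask.
    std::printf("omp_get_max_threads():         %d\n", omp_get_max_threads());
    return 0;
}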
This behaviour specific to libgomp interacts with another behaviour specific to Open MPI. Back in 2013, Open MPI changed its default binding policy. The reasons are a mix of technical considerations and politics, and you can read more on Jeff Squyres' blog (Jeff is a core Open MPI developer).
The moral of the story is:
Always set the number of OpenMP threads and the MPI binding policy explicitly. With Open MPI, the way to set environment variables is with -x:
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
Note that I have hyperthreading enabled and so --bind-to core and --bind-to hwthread produce different results without explicitly setting OMP_NUM_THREADS:
mpiexec -n 2 --map-by node:PE=3 --bind-to core ./ompi_mpi
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
vs
mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
--map-by node:PE=3 gives each MPI rank three processing elements (PEs) per node. When binding to core, a PE is a core. When binding to hardware threads, a PE is a thread and one should use --map-by node:PE=#cores*#threads, i.e., --map-by node:PE=6 in my case.
Whether the OpenMP runtime respects the affinity mask set by MPI and whether it maps its own thread affinity onto it, and what to do if not, is a completely different story.
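As a starting point for that story, the standard OpenMP environment variables OMP_PLACES and OMP_PROC_BIND can be exported through mpiexec in the same way as OMP_NUM_THREADS; how well a given libgomp version honours them inside an externally imposed mask is something to verify on your own system, so treat this only as a sketch:
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core \
    -x OMP_NUM_THREADS=3 -x OMP_PROC_BIND=close -x OMP_PLACES=cores ./ompi_mpi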
The man page for mpirun explains:
If you are simply looking for how to run an MPI application,
you probably want to use a command line of the following form:
% mpirun [ -np X ] [ --hostfile <filename> ] <program>
This will run X copies of <program> in your current
run-time environment (...)
Please note that mpirun automatically binds processes as of the
start of the v1.8 series. Three binding patterns are used in
the absence of any further directives:
Bind to core: when the number of processes is <= 2
Bind to socket: when the number of processes is > 2
Bind to none: when oversubscribed
If your application uses threads, then you probably want to ensure
that you are either not bound at all
(by specifying --bind-to none), or bound to multiple cores
using an appropriate binding level or specific number
of processing elements per application process.
Now, if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core. On your hyper-threaded CPU each core has two hardware threads, so each process's affinity mask contains two logical CPUs and you get 2 OpenMP threads per MPI process.
If, however, you specify 4 MPI processes, mpirun defaults to --bind-to socket and you get 8 threads per process, as your machine has a single socket. I tested it on a laptop (1s/2c/4t) and a workstation (2 sockets, 12 cores per socket, 2 threads per core), and the program (with no -np argument) behaves as described above: on the workstation there are 24 MPI processes with 24 OpenMP threads each.
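To check which binding mpirun actually applied in a given run, Open MPI can print it with the --report-bindings option (the exact output format differs between versions), for example:
mpirun --report-bindings -n 4 ./omp_mpi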
We have written a small C++ application which mainly supervises other processes via ZeroMQ. Most of the time the application idles and periodically sends and receives some requests.
We built a Docker image based on Ubuntu which contains just this application, some dependencies, and an entrypoint.sh. The entrypoint is basically a /bin/bash script which manipulates some configuration files based on environment variables and then starts the application via exec.
Now here's the strange part. When we start the application manually without docker, we get a CPU usage of nearly 0%. When we start the same application as docker image, the CPU usage goes up to 100% and blocks exactly one CPU core.
To find out what was happening, we set the entrypoint of our image to /bin/yes (just to make sure the container keeps running) and then started a bash inside the running container. From there we started entrypoint.sh manually and the CPU again was at 0%.
So we are wondering, what could cause this situation. Is there anything we need to add to our Dockerfile to prevent this?
Here is some output generated with strace. I used strace -p <pid> -f -c and waited five minutes to collect some insights.
1. Running with docker run (100% CPU)
strace: Process 12621 attached with 9 threads
strace: [ Process PID=12621 runs in x32 mode. ]
[...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.26   17.866443         144    124127           nanosleep
 14.40    3.610578       55547        65        31 futex
 14.07    3.528224        1209      2918           epoll_wait
  0.10    0.024760        4127         6         1 restart_syscall
  0.10    0.024700           0     66479           poll
  0.05    0.011339           4      2902         3 recvfrom
  0.02    0.005517           2      2919           write
  0.01    0.001685           1      2909           read
  0.00    0.000070          70         1         1 connect
  0.00    0.000020          20         1           socket
  0.00    0.000010           1        18           epoll_ctl
  0.00    0.000004           1         6           sendto
  0.00    0.000004           1         4           fcntl
  0.00    0.000000           0         1           close
  0.00    0.000000           0         1           getpeername
  0.00    0.000000           0         1           setsockopt
  0.00    0.000000           0         1           getsockopt
------ ----------- ----------- --------- --------- ----------------
100.00   25.073354                202359        36 total
2. Running with a dummy entrypoint and docker exec (0% CPU)
strace: Process 31394 attached with 9 threads
[...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.32   12.544007         102    123355           nanosleep
 14.94    2.784310       39216        71        33 futex
 14.01    2.611210         869      3005           epoll_wait
  2.01    0.373797           6     66234           poll
  1.15    0.213487          71      2999           recvfrom
  0.41    0.076113       15223         5         1 restart_syscall
  0.09    0.016295           5      3004           write
  0.08    0.014458           5      3004           read
------ ----------- ----------- --------- --------- ----------------
100.00   18.633677                201677        34 total
Note that in the first case I started strace slightly earlier, so there are some additional calls which can all be traced back to initialization code.
The only difference I could find is the line Process PID=12621 runs in x32 mode. when using docker run. Could this be an issue?
Also note that in both measurements the total time attributed to system calls is only about 20 seconds, while the process was running for five minutes.
Some further investigations on the 100% CPU case: I checked the process with top -H -p <pid>, and only the parent process was using 100% CPU while the child threads were all mostly idling. But when calling strace -p <pid> on the parent process I could verify that it was not doing anything (no output was generated).
So I do have a process which is using one whole core of my CPU doing exactly nothing.
As it turned out, some legacy part of the software was waiting for console input in a while loop:
while (!finished) {
    std::cin >> command;
    processCommand(command);
}
This worked fine when running locally and with docker exec. But since the executable was started as a Docker service, there was no console attached, so std::cin hit end-of-file and the extraction returned immediately instead of blocking. This way we created an endless loop without any sleeps, which naturally caused 100% CPU usage.
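A minimal sketch of one way to avoid the busy loop (not the actual patch we applied; finished and processCommand are the names from the snippet above): only process a command when the read actually succeeded, so the loop blocks while a console is attached and exits immediately on EOF instead of spinning.
#include <iostream>
#include <string>

// ... inside the supervising code; `finished` and `processCommand`
// are the names used in the snippet above.
std::string command;
while (!finished && std::getline(std::cin, command)) {
    processCommand(command);   // getline blocks while a console is attached...
}                              // ...and fails immediately on EOF, ending the loop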
Thanks to @Botje for guiding us through the debugging process.
I feel like this must have a simple answer, but I really don't know how to approach this.
For background, the stack of things is like this:
Python script -> C++ binary -(fork)-> actual thing we want to measure.
Essentially, we have a python script that simulates an environment by using tmp directories and running multiple instances of this network software stack we're developing. The script calls a host binary (which is unimportant here), and then, after it loads, a helper binary. The helper binary can be passed a parameter to daemonize, and when it does this, it forks in the usual way.
What we need to do is measure the daemon's CPU utilization, but I don't really know how. What I have done is read the stat file periodically, but since the process daemonizes, I can't use echo $! to get its PID. Using ps aux | grep 'thing' works fine, but I think this is giving me the parent process, because the stat information looks like this:
1472582561 9455 (nlsr) S 1 9455 9455 0 -1 4218944 394 0 0 0 13 0 0 0 20 0 2 0 909820 184770560 3851 18446744073709551615 4194304 5318592 140734694817376 140734694810512 140084250723843 0 0 16781312 0 0 0 0 17 0 0 0 0 0 0 7416544 7421528 16224256 140734694825496 140734694825524 140734694825524 140734694825962 0
I know that the parent process should not be PID1, and definitely the utime field and similar should be greater than 13 clock ticks. This is what is leading me to conclude that this process is really the parent process, and not the forked child that's doing all the work.
I can modify pretty much any file necessary, but because of code review constraints, design specs., etc., the less I have to change along many files, the better.
Get the PID of the child reliably: fork() returns the PID of the child to the parent.
Get the CPU stats from /proc/[PID]/stat: field 14, utime, is the CPU time spent in user code, measured in clock ticks (see the parsing sketch below).
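For the second step, a small parsing sketch (illustrative only, error handling omitted) that reads utime and stime for a given PID and converts them to seconds. Field 2 (comm) may contain spaces, so everything up to the last ')' is skipped before counting fields.
#include <fstream>
#include <sstream>
#include <string>
#include <iostream>
#include <unistd.h>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: cpustat <pid>\n"; return 1; }
    std::ifstream statFile("/proc/" + std::string(argv[1]) + "/stat");
    std::string line;
    std::getline(statFile, line);
    // Everything after the last ')' is a fixed sequence of fields, starting with field 3 (state).
    std::istringstream rest(line.substr(line.rfind(')') + 2));
    std::string field;
    unsigned long utime = 0, stime = 0;
    for (int i = 3; i <= 15 && rest >> field; ++i) {
        if (i == 14) utime = std::stoul(field);   // user-mode CPU time in clock ticks
        if (i == 15) stime = std::stoul(field);   // kernel-mode CPU time in clock ticks
    }
    long hz = sysconf(_SC_CLK_TCK);               // clock ticks per second
    std::cout << "utime: " << utime << " ticks, stime: " << stime
              << " ticks (" << (utime + stime) / static_cast<double>(hz)
              << " s of CPU time)\n";
    return 0;
}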
I am currently working on a project which seeks to analyze the vibrations of a laundry machine throughout its washing cycle.
The program runs on a Raspberry Pi 3 and gets the vibration information from a vibration sensor, which reports its state as a digital signal (as opposed to an analog signal).
Our current idea is to implement a program which "machine learns" the cycles by recording the first 10 times the laundry machine runs, and stores that information in a text file.
The program does this by running the following loop every 10 milliseconds.
for (;;) {
    read the state of /sys/class/gpio/export/
    write the state into a vector<bool>
    analyze the bool vector to figure out when the machine is turned on or off
        (compare against a template vector read from an external txt file)
    if the machine just got turned on, cout a message
    if the machine just got turned off, cout a message
}
A few problems with our current analysis model:
Because we are recording data every 10 milliseconds, there are always going to be small lags which would make analyzing the live boolean vector against the template very difficult:
for example:
live boolean vector
0 1 0 1 1 1 0 0 1 1 0
template:
0 0 0 1 0 1 1 1 0 0 1 1 0
Even a minute shift in the live data would render the analysis useless. Therefore, what kind of method could I use to shift the data so that the live data
0 1 0 1 1 1 0 0 1 1 0
can be compared against the shifted template
0 0 0 1 0 1 1 1 0 0 1 1 0
My guess is that we would need an algorithm which checks multiple shifts in both directions and picks the shift that is statistically most likely to be the true position of the vector, as sketched below.
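To make that idea concrete, a brute-force version of such a shift search could look roughly like this (a sketch only; all names are illustrative, and a proper cross-correlation would be a statistically better option):
#include <vector>
#include <cstddef>

// Returns the offset (in samples) at which `live` best matches `templ`,
// scored simply by the number of samples that agree.
int bestShift(const std::vector<bool>& live,
              const std::vector<bool>& templ,
              int maxShift) {
    int best = 0;
    std::size_t bestScore = 0;
    for (int shift = -maxShift; shift <= maxShift; ++shift) {
        std::size_t score = 0;
        for (std::size_t i = 0; i < live.size(); ++i) {
            long j = static_cast<long>(i) + shift;          // index into the template
            if (j >= 0 && j < static_cast<long>(templ.size()) && live[i] == templ[j])
                ++score;
        }
        if (score > bestScore) { bestScore = score; best = shift; }
    }
    return best;
}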
Thank you
I'm having trouble adjusting my thinking to suit OpenMP's way of doing things.
Roughly, what I want is:
for (int i = 0; i < 50; i++)
{
    doStuff();
    thread t;
    t.start(callback(i)); // each time around the loop create a thread to execute callback
}
I think I know how this would be done with C++11 threads, but I need to accomplish something similar with OpenMP.
The closest thing to what you want is OpenMP tasks, available in compilers that comply with OpenMP 3.0 and later. It goes like this:
#pragma omp parallel
{
    #pragma omp single
    for (int i = 0; i < 50; i++)
    {
        doStuff();
        #pragma omp task
        callback(i);
    }
}
This code will make the loop execute in one thread only and it will create 50 OpenMP tasks that will call callback() with different parameters. Then it will wait for all tasks to finish before exiting the parallel region. Tasks will be picked (possibly at random) by idle threads to be executed. OpenMP imposes an implicit barrier at the end of each parallel region since its fork-join execution model mandates that only the main thread runs outside of parallel regions.
Here is a sample program (ompt.cpp):
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

void callback (int i)
{
    printf("[%02d] Task started with thread %d\n", i, omp_get_thread_num());
    sleep(1);
    printf("[%02d] Task finished\n", i);
}

int main (void)
{
    #pragma omp parallel
    {
        #pragma omp single
        for (int i = 0; i < 10; i++)
        {
            #pragma omp task
            callback(i);
            printf("Task %d created\n", i);
        }
    }
    printf("Parallel region ended\n");
    return 0;
}
Compilation and execution:
$ g++ -fopenmp -o ompt.x ompt.cpp
$ OMP_NUM_THREADS=4 ./ompt.x
Task 0 created
Task 1 created
Task 2 created
[01] Task started with thread 3
[02] Task started with thread 2
Task 3 created
Task 4 created
Task 5 created
Task 6 created
Task 7 created
[00] Task started with thread 1
Task 8 created
Task 9 created
[03] Task started with thread 0
[01] Task finished
[02] Task finished
[05] Task started with thread 2
[04] Task started with thread 3
[00] Task finished
[06] Task started with thread 1
[03] Task finished
[07] Task started with thread 0
[05] Task finished
[08] Task started with thread 2
[04] Task finished
[09] Task started with thread 3
[06] Task finished
[07] Task finished
[08] Task finished
[09] Task finished
Parallel region ended
Note that tasks are not executed in the same order they were created in.
GCC versions older than 4.4 do not support OpenMP 3.0. Unrecognised OpenMP directives are silently ignored, and the resulting executable runs that code section serially:
$ g++-4.3 -fopenmp -o ompt.x ompt.cpp
$ OMP_NUM_THREADS=4 ./ompt.x
[00] Task started with thread 3
[00] Task finished
Task 0 created
[01] Task started with thread 3
[01] Task finished
Task 1 created
[02] Task started with thread 3
[02] Task finished
Task 2 created
[03] Task started with thread 3
[03] Task finished
Task 3 created
[04] Task started with thread 3
[04] Task finished
Task 4 created
[05] Task started with thread 3
[05] Task finished
Task 5 created
[06] Task started with thread 3
[06] Task finished
Task 6 created
[07] Task started with thread 3
[07] Task finished
Task 7 created
[08] Task started with thread 3
[08] Task finished
Task 8 created
[09] Task started with thread 3
[09] Task finished
Task 9 created
Parallel region ended
For example, have a look at http://en.wikipedia.org/wiki/OpenMP.
#pragma omp for
is your friend. OpenMP does not require you to think about threading: you just declare(!) what you want to be run in parallel, and an OpenMP-compatible compiler performs the necessary transformations in your code at compile time.
The OpenMP specifications are also very enlightening. They explain quite well what can be done and how: http://openmp.org/wp/openmp-specifications/
Your sample could look like:
#pragma omp parallel for
for (int i = 0; i < 50; i++)
{
    doStuff();
    callback(i); // iterations are distributed across the thread team, so no manual thread is needed
}
Everything in the for loop is run in parallel. You have to pay attention to data dependencies. The doStuff() function is run sequentially in your pseudo code, but would be run in parallel in my sample. You also need to specify which variables are thread-private and so on, which also goes into the #pragma statement, as in the sketch below.
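As a hedged illustration of such data-sharing clauses (the variable names are made up for the example, not taken from your code): firstprivate gives every thread its own initialized copy of a variable, and reduction combines the per-thread partial results safely. Compile with -fopenmp.
#include <cstdio>

int main() {
    int total = 0;
    int scale = 2;
    #pragma omp parallel for firstprivate(scale) reduction(+: total)
    for (int i = 0; i < 50; ++i) {
        total += scale * i;   // each thread accumulates into its own copy of total
    }
    std::printf("total = %d\n", total);  // 2 * (0 + 1 + ... + 49) = 2450, regardless of thread count
    return 0;
}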