MPI-size and number of OpenMP-Threads - c++

I am trying to write a hybrid OpenMP/MPI program, and am therefore trying to understand the relationship between the number of OpenMP threads and the number of MPI processes. To that end, I created a small test program:
#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>
int main(int args, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&args, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        // Build the whole line first so that each thread prints with a single call.
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();
    }
    MPI_Finalize();
    return 0;
}
which is compiled using
mpicxx -fopenmp -std=c++17 -o omp_mpi source/main.cpp -lgomp
with gcc-9.3.1 and OpenMPI 3.
Now, when executing it on an i7-6700 with 4c/8t with ./omp_mpi, I get the following output
I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
i.e. as expected.
When executing it using mpirun -n 1 omp_mpi I would expect the same, but instead I get
I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
Where are the other threads? When executing it on two MPI-processes instead, I get
I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
i.e. still only two OpenMP-threads, but when executing it on four MPI-processes, I get
I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
Now I suddenly get eight OpenMP threads per MPI process. Where does that change come from?

You are observing an interaction between a peculiarity of Open MPI and the GNU OpenMP Runtime libgomp.
First, the number of threads in OpenMP is controlled by the nthreads-var ICV (internal control variable), which is set either by calling omp_set_num_threads() or by setting OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is not called, the runtime is free to choose whatever it deems reasonable as a default. In the case of libgomp, the manual says of OMP_NUM_THREADS:
Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined one thread per CPU is used.
What it fails to mention is that it uses various heuristics to determine the right number of CPUs. On Linux and Windows, the process affinity mask is used for that (if you like to read code, the one for Linux is right here). If the process is bound to a single logical CPU, you only get one thread:
$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
If you bind it to several logical CPUs, their count is used:
$ taskset -c 0,2,5 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
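This behaviour, which is specific to libgomp, interacts with another behaviour specific to Open MPI. Back in 2013, Open MPI changed its default binding policy, so mpirun now binds each rank unless told otherwise. The affinity mask that libgomp inspects, and with it the default thread count, therefore depends on how the ranks happen to be bound. The reasons for the change were a mix of technical considerations and politics, and you can read more on Jeff Squyres' blog (Jeff is a core Open MPI developer).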
The moral of the story is:
Always set the number of OpenMP threads and the MPI binding policy explicitly. With Open MPI, the way to set environment variables is with -x:
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
Note that I have hyperthreading enabled and so --bind-to core and --bind-to hwthread produce different results without explicitly setting OMP_NUM_THREADS:
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core ./ompi_mpi
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
$ mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
--map-by node:PE=3 gives each MPI rank three processing elements (PEs) per node. When binding to core, a PE is a core. When binding to hardware threads, a PE is a thread and one should use --map-by node:PE=#cores*#threads, i.e., --map-by node:PE=6 in my case.
Whether the OpenMP runtime respects the affinity mask set by MPI and whether it maps its own thread affinity onto it, and what to do if not, is a completely different story.

The man page for mpirun explains:
If you are simply looking for how to run an MPI application,
you probably want to use a command line of the following form:
% mpirun [ -np X ] [ --hostfile <filename> ] <program>
This will run X copies of <program> in your current
run-time environment (...)
Please note that mpirun automatically binds processes as of the
start of the v1.8 series. Three binding patterns are used in
the absence of any further directives:
Bind to core: when the number of processes is <= 2
Bind to socket: when the number of processes is > 2
Bind to none: when oversubscribed
If your application uses threads, then you probably want to ensure
that you are either not bound at all
(by specifying --bind-to none), or bound to multiple cores
using an appropriate binding level or specific number
of processing elements per application process.
Now, if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core. Each core of your CPU has two hardware threads, so the OpenMP runtime sees two logical CPUs and starts 2 threads per MPI process.
If, however, you specify 4 MPI processes, mpirun defaults to --bind-to socket and you have 8 threads per process, as your machine is a single-socket one. I tested it on a laptop (1s/2c/4t) and a workstation (2 sockets, 12 cores per socket, 2 threads per core) and the program (with no np argument) behaves as specified above: for the workstation there are 24 MPI processes with 24 OpenMP threads each.


Why do I get nothing with PCM (Intel Performance Counter Monitor)? -- C++ API

root#dellr740:/ycsb_build# sudo ./ycsb
IBRS and IBPB supported : yes
STIBP supported : yes
Spec arch caps supported : yes
Number of physical cores: 20
Number of logical cores: 40
Number of online logical cores: 40
Threads (logical cores) per physical core: 2
Num sockets: 1
Physical cores per socket: 20
Core PMU (perfmon) version: 4
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2100000000 Hz
IBRS enabled in the kernel : yes
STIBP enabled in the kernel : no
The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS : yes
Package thermal spec power: 125 Watt; Package minimum power: 63 Watt; Package maximum power: 307 Watt;
ERROR: Secure Boot detected. Recompile PCM with -DPCM_USE_PERF or disable Secure Boot.
Socket 0: 2 memory controllers detected with total number of 6 channels. 0 QPI ports detected. 2 M2M (mesh to memory) blocks detected.
This is the result: all the PCM metrics report 0 bytes.
WARNING: Custom counter 0 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Custom counter 1 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Custom counter 2 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Custom counter 3 is in use. MSR_PERF_GLOBAL_INUSE on core 0: 0x800000000000000f
WARNING: Core 0 IA32_PERFEVTSEL0_ADDR are not zeroed 1245244
Error opening PCM: 2
Zeroed PMU registers
Cleaning up
Zeroed uncore PMU registers
PCM Metrics:
L2 HitRatio: 1
L3 HitRatio: -1
L2 misses: 0
L3 misses: 0
DRAM Reads (bytes): 0
DRAM Writes (bytes): 0
NVM Reads (bytes): 0
NVM Writes (bytes): 0

Why does the loop in openmp run sequentially?

I am trying to run an OpenMP scheduling example, but it runs sequentially.
#pragma omp parallel for schedule(static, 3)
for (int i = 0; i < 20; i++)
printf("Thread %d is running number %d\n", omp_get_thread_num(), i);
Thread 0 is running number 0
Thread 0 is running number 1
Thread 0 is running number 2
Thread 0 is running number 3
Thread 0 is running number 4
Thread 0 is running number 5
Thread 0 is running number 6
Thread 0 is running number 7
Thread 0 is running number 8
Thread 0 is running number 9
Thread 0 is running number 10
Thread 0 is running number 11
Thread 0 is running number 12
Thread 0 is running number 13
Thread 0 is running number 14
Thread 0 is running number 15
Thread 0 is running number 16
Thread 0 is running number 17
Thread 0 is running number 18
Thread 0 is running number 19
How can I get the code to work in parallel?
I am using Microsoft Visual Studio 2017.

In Microsoft Visual Studio, OpenMP support is disabled by default, so the compiler ignores the #pragma omp line and the loop runs on a single thread. You can enable it with the /openmp compiler option.
This option can be enabled in the project properties, under C/C++ -> Language -> Open MP Support.

Neither 'MPI_Barrier' nor 'BLACS_Barrier' stops a process from executing its commands

I'm working with ScaLAPACK and trying to get used to the BLACS routines, which are essential for using ScaLAPACK.
I've taken an elementary course on MPI, so I have a rough idea of MPI_COMM_WORLD and the like, but no deep understanding of how it works internally.
Anyway, I'm trying the following code to say hello using BLACS routines.
program hello_from_BLACS
use MPI
implicit none
integer :: info, nproc, nprow, npcol, &
myid, myrow, mycol, &
ctxt, ctxt_sys, ctxt_all
call BLACS_PINFO(myid, nproc)
! get the internal default context
call BLACS_GET(0, 0, ctxt_sys)
! set up a process grid for the process set
ctxt_all = ctxt_sys
call BLACS_GRIDINIT(ctxt_all, 'c', nproc, 1)
call BLACS_BARRIER(ctxt_all, 'A')
! set up a process grid of size 3*2
ctxt = ctxt_sys
call BLACS_GRIDINIT(ctxt, 'c', 3, 2)
if (myid .eq. 0) then
write(6,*) ' myid myrow mycol nprow npcol'
end if
(**) call BLACS_BARRIER(ctxt_sys, 'A')
! all processes not belonging to 'ctxt' jump to the end of the program
if (ctxt .lt. 0) goto 1000
! get the process coordinates in the grid
call BLACS_GRIDINFO(ctxt, nprow, npcol, myrow, mycol)
write(6,*) 'hello from process', myid, myrow, mycol, nprow, npcol
1000 continue
! return all BLACS contexts
call BLACS_EXIT(0)
end program
and the output with 'mpirun -np 10 ./exe' looks like this:
hello from process 0 0 0 3 2
hello from process 4 1 1 3 2
hello from process 1 1 0 3 2
myid myrow mycol nprow npcol
hello from process 5 2 1 3 2
hello from process 2 2 0 3 2
hello from process 3 0 1 3 2
Everything seems to work fine except for the 'BLACS_BARRIER' line, which I marked with (**) on the left-hand side of the code.
I added that line so that the title line is always printed at the top of the output, like this:
myid myrow mycol nprow npcol
hello from process 0 0 0 3 2
hello from process 4 1 1 3 2
hello from process 1 1 0 3 2
hello from process 5 2 1 3 2
hello from process 2 2 0 3 2
hello from process 3 0 1 3 2
So my questions are:
I've tried BLACS_BARRIER on 'ctxt_sys', 'ctxt_all', and 'ctxt', but none of them produces output in which the title line is printed first. I've also tried MPI_Barrier(MPI_COMM_WORLD,info), but that didn't work either. Am I using the barriers in the wrong way?
In addition, I got a SIGSEGV when I used BLACS_BARRIER on 'ctxt' and ran mpirun with more than 6 processes. Why does the SIGSEGV occur in this case?
Thank you for reading this question.

To answer your 2 questions (in future it is best to give them separate posts):
1) MPI_Barrier, BLACS_Barrier, and any barrier in any parallel programming methodology I have come across synchronise only the set of processes that call them. However, I/O is not handled by the calling process alone: at least one, and quite possibly more, processes within the OS actually carry out the I/O request. These are NOT synchronised by your barrier, so a simple barrier does not ensure the ordering of I/O. The only standard-conforming ways that I can think of to ensure ordering of I/O are:
Have 1 process do all the I/O or
Better is to use MPI I/O either directly, or indirectly, via e.g. NetCDF or HDF5
2) Your second call to BLACS_GRIDINIT
call BLACS_GRIDINIT(ctxt, 'c', 3, 2)
creates a context for a 3 by 2 process grid, which holds 6 processes. If you call it with more than 6 processes, only 6 will return with a valid context; for the others, ctxt should be treated as an uninitialised value. So, for instance, if you call it with 8 processes, 6 will return with a valid ctxt and 2 will return with ctxt having no valid value. If these 2 now try to use ctxt, anything is possible, and in your case you are getting a seg fault. You do seem to see that this is an issue, as later you have
! all processes not belonging to 'ctxt' jump to the end of the program
if (ctxt .lt. 0) goto 1000
but I see nothing in the description of BLACS_GRIDINIT that ensures ctxt will be less than zero for non-participating processes. It says:
This routine creates a simple NPROW x NPCOL process grid. This process
grid will use the first NPROW x NPCOL processes, and assign them to
the grid in a row- or column-major natural ordering. If these
process-to-grid mappings are unacceptable, BLACS_GRIDINIT's more
complex sister routine BLACS_GRIDMAP must be called instead.
There is no mention of what ctxt will be if the process is not part of the resulting grid - this is the kind of problem I find regularly with the BLACS documentation. Also please don't use goto, for your own sake. You WILL regret it later. Use If ... End If. I can't remember when I last used goto in Fortran, it may well be over 10 years ago.
Finally good luck in using BLACS! In my experience the documentation is often incomplete, and I would suggest only using those calls that are absolutely necessary to use ScaLAPACK and using MPI, which is much, much better defined, for the rest. It would be so much nicer if ScaLAPACK just worked with MPI nowadays.

Strange behaviour of Parallel Boost Graph Library example code

I have set up simple tests with Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour I would like to explain.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 million vertices and 100 million edges);
Build a modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp.
The example was slightly extended to run the eager, Crauser and delta-stepping algorithms.
Note: building the example required an obscure workaround for the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (the example code is in fact incompatible with the new mutable_queue).
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
mpiexec -n 1 mytest.exe
mpiexec -n 2 mytest.exe
mpiexec -n 4 mytest.exe
mpiexec -n 8 mytest.exe
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 device thread (processor load 12.5%)
delta stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
crash in the stage of data load.
-n 4:
mem usage: 40+ GB in roughly equal parts in 4 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged in the margins of observation error.
-n 8:
mem usage: 44+ GB in roughly equal parts in 8 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged in the margins of observation error.
So, apart from the inappropriate memory usage and the very low overall performance, the only changes I observe when more MPI processes are running are a slight increase in total memory consumption and a linear rise in processor load.
It is nevertheless evident that the initial graph is somehow partitioned between the processes (probably by vertex number ranges).
What is wrong with this test (and, probably, with my idea of MPI usage as a whole)?
My environment:
- one Win 10 PC with 64 GB of RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset 141;
- boost 1.71
N.B. See original example code here.

How does this fork() work?

Can you tell me why the output of this program is this:
And give a quick explanation of why it is like this? Thanks

The output of your program, assuming no calls to fork fail, should be thought of like this:
1
2           2
    3           3
    4   4       4   4
5   5   5   5   5   5
Each column represents the output of one process. They all get serialized onto stdout in some random order, subject only to the following constraints: within a column, each character cannot appear before the character immediately above it; the topmost character in each column cannot appear before the character above and to the left of it.
Note that right now your program is relying on the C library noticing that stdout is a terminal and therefore setting it to line-buffered. If you run the program with stdout redirected to a file or pipe you are likely to get rather different output, e.g.
$ ./a.out | tr '\n' ' '
1 2 5 1 2 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
... because in that case all of the output is buffered until returning from main, and the buffers are copied into each child process. Adding
setvbuf(stdout, 0, _IONBF, 0);
before the first printf statement will prevent duplication of output. (In this case you could get away with _IOLBF instead, but _IONBF is safer for code like this.)

It would be a little easier to show graphically, but every time you call fork() you get another process that continues through the same code. So:
Process 1 (original process): prints 1, then creates Process 2, prints 2, then creates Process 3 (fork() does not return 0 here, so the if body is skipped), and prints 5.
Process 2: prints 2, then creates Process 4 (again fork() does not return 0), and prints 5.
Process 3: prints 3, then creates Process 5, prints 4, prints 5
Process 4: prints 3, then creates Process 6, prints 4, prints 5
Process 5: prints 4, prints 5
Process 6: prints 4, prints 5
All of these processes run at roughly the same time, which is why you get all those numbers interleaved.
Hope that helps. First time answering!

On some systems the parent may get to run first and then the child, while on others the child may run first, and the order of the output you see depends on that. This has nothing to do with printf itself, but we can predict how many times each part of the function body will execute.
When 1 is printed there is only one process. After the first fork there are two processes, and each of them executes the fork in your if statement. Each of those calls again creates a new process, but only the newly created one enters the if body. There fork executes again and yet another process is created. In total there are six processes. (The count is not simply a function of the number of fork() calls: n unconditional calls would give 2^n processes, but here only two of the four processes reach the third call.) The numbers 2, 3 and 4 are printed only by the processes that reach the corresponding statement, but 5 is common to all, so 5 is printed six times.
I hope this helps.
asif aftab
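It's because you are not always testing the result of the fork() calls. fork() returns zero in the child process and the child's PID in the parent. Wherever you do not test the result, all code following the fork() call is executed in both processes.

The output is not deterministic because of the order of execution and the inheritance of the output buffers. One way to make it deterministic is to have each parent wait for its child (the printf calls below are reconstructed from the walk-through above, since the question's code is not shown):
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void) {
    printf("1\n"); if (fork()) wait(0);   /* the parent waits for its child */
    printf("2\n");
    if (fork() == 0) { printf("3\n"); if (fork()) wait(0); printf("4\n"); }
    else wait(0);
    printf("5\n");
}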
Using fork() we create a child process, and there is no guaranteed execution order between the parent and the child, as discussed here.
If you want a particular execution order, it is better to check the return value of fork() to tell the parent [pid is not 0] from the child [pid is 0], and make one of them sleep so that the scheduler runs the other one first.
You can find more information here.
Micheal, I would need to see the code inside the fork() method to know for sure, but since it is printing extra numbers, the only explanation I can think of is that your fork() method has print statements of its own.